abdullahkhilji opened this issue 4 years ago
Did you find a solution for this? I have the exact same doubt.
Thanks for asking @nirav0999.

Yes, the `vocab.txt` needs to be exhaustive. You can manually add the words that are not present in the pretrained version of `vocab.txt` (I would be happy if someone disagrees with me on this). But concatenating the two files directly is not a good idea: `vocab.txt` contains blank placeholder entries (the `[unused]` tokens) which can be filled with the newer values instead.
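Something like this is what I have in mind — a minimal sketch, assuming the standard one-token-per-line `vocab.txt` layout where the blanks appear as `[unused0]`, `[unused1]`, and so on (the domain terms and file names below are just hypothetical examples):

```python
# Sketch: overwrite the [unusedN] placeholder lines in vocab.txt with new
# domain tokens, keeping the file length and all other token positions
# unchanged so the pretrained embedding rows still line up.

def fill_unused_slots(vocab_path, new_tokens, out_path):
    with open(vocab_path, encoding="utf-8") as f:
        vocab = [line.rstrip("\n") for line in f]

    existing = set(vocab)
    # Drop duplicates within new_tokens and anything the vocab already covers.
    pending = [t for t in dict.fromkeys(new_tokens) if t not in existing]

    for i, token in enumerate(vocab):
        if not pending:
            break
        if token.startswith("[unused"):
            vocab[i] = pending.pop(0)  # fill a blank slot with a new token

    if pending:
        print(f"Warning: {len(pending)} tokens left over; no [unused] slots remain.")

    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")

# Hypothetical usage:
# fill_unused_slots("vocab.txt", ["immunoglobulin", "nephropathy"], "vocab_new.txt")
```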
I have the same question. The `[unused]` slots are not enough. Given that the embedding matrix has a fixed size, I don't think you can use any extra vocabulary (beyond the exhausted `[unused]` slots) together with the pre-trained model.
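One possible way around the fixed size, if you are on the HuggingFace Transformers port rather than the original TF repo: add the tokens and grow the embedding matrix. The new rows are randomly initialized, so they still need further pretraining on the domain corpus. A rough sketch (the domain terms are hypothetical):

```python
# Sketch assuming the HuggingFace Transformers port of BERT, not the
# original TF checkpoint format discussed above.
from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

new_tokens = ["immunoglobulin", "nephropathy"]  # hypothetical domain terms
added = tokenizer.add_tokens(new_tokens)
# Resize so the extra tokens get (randomly initialized) embedding rows.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {added} tokens; embedding matrix now has {len(tokenizer)} rows.")
```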
Pretraining BERT from `base` requires the vocabulary file `vocab.txt`. Does this `vocab.txt` need to be the exhaustive, combined vocabulary of `base` and the domain-specific corpus we would be training on, or only the vocab from `base`? If the former is the case, is it a good idea to generate a list of new vocabulary with https://github.com/kwonmha/bert-vocab-builder, concatenate it with `base`'s version of `vocab.txt` for pretraining, and check for and remove duplicates, if any?
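Something like the following is what I mean by concatenating and deduplicating — a minimal sketch, assuming both files are one token per line (`new_vocab.txt` is a placeholder name for the bert-vocab-builder output). Per the caveats above, the merged file changes the vocabulary size, so the pretrained embedding matrix would no longer line up unless the `[unused]` slots are filled instead:

```python
# Sketch: concatenate the base vocab with a domain vocab, preserving the
# base token order and dropping any duplicates on the way.

def merge_vocabs(base_path, extra_path, out_path):
    seen = set()
    merged = []
    for path in (base_path, extra_path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                token = line.rstrip("\n")
                if token and token not in seen:
                    seen.add(token)
                    merged.append(token)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(merged) + "\n")

# merge_vocabs("vocab.txt", "new_vocab.txt", "merged_vocab.txt")
```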