google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Generating vocabulary file for or after pretraining BERT from base #1006

Open abdullahkhilji opened 4 years ago

abdullahkhilji commented 4 years ago

Pretraining BERT from the base checkpoint requires the vocabulary file vocab.txt. Does this vocab.txt need to be the exhaustive vocabulary combining the base model and the domain-specific corpus we will be training on, or only the vocab from the base? If the former is the case, is it reasonable to generate a new vocab.txt with https://github.com/kwonmha/bert-vocab-builder, concatenate it with the base model's version of vocab.txt for pretraining, and check for and remove any duplicates?
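For reference, here is a minimal sketch of that concatenate-and-dedupe idea, assuming a base vocab.txt and a separate domain vocab file (the file names are placeholders, not from this thread). Note that appending tokens grows the vocab beyond the size the pretrained embedding table was built for, which is the issue discussed further down.

```python
# Minimal sketch: merge the base vocab with a domain vocab, keeping the base
# ordering intact and appending only genuinely new tokens at the end.
# File names below are placeholders.

def load_vocab(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

base_vocab = load_vocab("vocab.txt")            # shipped with the BERT checkpoint
domain_vocab = load_vocab("domain_vocab.txt")   # e.g. output of bert-vocab-builder

seen = set(base_vocab)
merged = list(base_vocab)
for token in domain_vocab:
    if token and token not in seen:   # skip empty lines and duplicates
        merged.append(token)
        seen.add(token)

with open("merged_vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(merged) + "\n")

print(f"base tokens: {len(base_vocab)}, appended: {len(merged) - len(base_vocab)}")
```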

nirav0999 commented 4 years ago

Did you find a solution for this? I have the exact same doubt.

abdullahkhilji commented 4 years ago

Thanks for asking @nirav0999. Yes, the vocab.txt needs to be exhaustive. You can manually add the words that are not present in the pretrained vocab.txt (I would be happy if someone disagrees with me on this). But concatenating directly is not a good idea: vocab.txt contains placeholder [unused] entries that can be filled with the new tokens instead.
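A minimal sketch of that approach, assuming the new domain terms sit in a plain-text file (one token per line, file names are placeholders): the [unused] slots are overwritten in place, so the total vocab size, and therefore the shape of the pretrained word-embedding table, stays unchanged.

```python
# Minimal sketch: overwrite [unused] placeholder entries in the base vocab
# with new domain tokens, keeping the vocab size and all existing token
# indices unchanged. File names below are placeholders.

def load_vocab(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

vocab = load_vocab("vocab.txt")
existing = set(vocab)
new_tokens = [t for t in load_vocab("domain_terms.txt") if t and t not in existing]

unused_slots = [i for i, tok in enumerate(vocab) if tok.startswith("[unused")]
if len(new_tokens) > len(unused_slots):
    raise ValueError(
        f"{len(new_tokens)} new tokens but only {len(unused_slots)} [unused] slots")

for slot, token in zip(unused_slots, new_tokens):
    vocab[slot] = token   # reuse the placeholder's row in the embedding table

with open("vocab_patched.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```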

Crescentz commented 4 years ago

Same question here. The [unused] slots are not enough.

abdullahkhilji commented 4 years ago

Given the fixed embedding size, I don't think you can use extra vocabulary (beyond the [unused] slots) together with the pre-trained model.
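To see why, one can check that the first dimension of the checkpoint's word-embedding table equals the number of lines in vocab.txt; if the vocab grows past that, the checkpoint no longer loads as-is. A small sketch (paths are placeholders, written against the TF 1.x API this repo uses):

```python
# Minimal sketch: compare vocab.txt length with the word-embedding table
# stored in the pretrained checkpoint. Paths below are placeholders.
import tensorflow as tf

reader = tf.train.load_checkpoint("uncased_L-12_H-768_A-12/bert_model.ckpt")
emb_shape = reader.get_variable_to_shape_map()["bert/embeddings/word_embeddings"]

with open("vocab.txt", encoding="utf-8") as f:
    vocab_size = sum(1 for _ in f)

print(f"embedding table shape: {emb_shape}, vocab entries: {vocab_size}")
# If vocab_size != emb_shape[0], only the [unused] replacement approach above
# keeps the pretrained weights usable without resizing the embedding matrix.
```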