google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Questions about pretraining #171

Open htw2012 opened 5 years ago

htw2012 commented 5 years ago

Hi, I have some questions about pre-training as follows:

  1. I want to train my own model from scratch and build the vocab.txt character by character. There are some low-frequency characters; should low-frequency entries be deleted from the vocabulary?
  2. If I continue training from Chinese BERT-Base, should I add new words from my corpus to the vocab.txt released with BERT-Base?
  3. During pre-training, with everything else equal, does a larger number of training samples give better accuracy for MLM and NSP?

Thank you in advance.

dvector89 commented 5 years ago
  1. If you delete low-frequency words from the vocab, you should add a [UNK] token to your vocab (a rough sketch of building such a vocab follows below).
  2. If you change the vocab, you cannot reuse the released model.
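
A minimal sketch of one way to build a character-level vocab.txt with a frequency cutoff, as discussed in point 1. The corpus path, the `MIN_COUNT` threshold, and the special-token list are illustrative assumptions, not something prescribed by this repo:

```python
# Illustrative sketch: build a character-level vocab.txt with a frequency cutoff.
# The corpus path, MIN_COUNT, and special tokens below are assumptions for this example.
from collections import Counter

MIN_COUNT = 5  # characters seen fewer than MIN_COUNT times fall back to [UNK]
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

counts = Counter()
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.strip().replace(" ", ""))

vocab = SPECIAL_TOKENS + [ch for ch, c in counts.most_common() if c >= MIN_COUNT]

with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```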
artemisart commented 5 years ago

In fact, you can reuse the released model if you add words by replacing the unused tokens (but you will have to train the embeddings for those new words yourself).
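
A minimal sketch of that replacement, assuming the released vocab.txt is in the working directory; the new-word list and output file name are made up for illustration. Overwriting the `[unusedN]` placeholders in place keeps every token id the same, so the released checkpoint still lines up with the vocab:

```python
# Illustrative sketch: overwrite [unusedN] placeholders in the released vocab.txt
# with new domain words, keeping token ids (and hence the checkpoint) unchanged.
# The file names and new_words list are assumptions for this example.
new_words = ["mynewword1", "mynewword2"]

with open("vocab.txt", encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]

replacements = iter(new_words)
for i, token in enumerate(vocab):
    if token.startswith("[unused"):
        try:
            vocab[i] = next(replacements)
        except StopIteration:
            break

with open("vocab_custom.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```

The replaced tokens still carry their original (effectively untrained) embedding rows, which is why they need further pre-training or fine-tuning on your own corpus, as noted above.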

murray-z commented 5 years ago

@artemisart After I add new words by replacing the unused tokens, how should I train these new words? Thank you!

xwzhong commented 5 years ago

@zhangfazhan There is some advice in https://github.com/google-research/bert/issues/155