google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Questions about pretraining #171

Open htw2012 opened 5 years ago

htw2012 commented 5 years ago

Hi, I have some questions about pre-training as follows:

  1. I want to train my own model from scratch and build the vocab.txt character by character. There are some low-frequency characters; should low-frequency entries be deleted from the vocabulary?
  2. If I continue training from Chinese BERT-Base, should I add new words from my corpus to the vocab.txt released with BERT-Base?
  3. During pre-training, with everything else equal, does a larger number of training samples give better accuracy for MLM and NSP?

Thank you in advance.

dvector89 commented 5 years ago
  1. If you delete low-frequency words from the vocab, you should add a [UNK] token to your vocab (a rough sketch of building such a vocab follows below).
  2. If you change the vocab, you cannot reuse the released model.
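
A minimal sketch of one way to build a character-level vocab.txt with a frequency cutoff, as discussed in point 1. The corpus path, the `MIN_COUNT` threshold, and the special-token list are illustrative assumptions, not something prescribed by this repo:

```python
# Illustrative sketch: build a character-level vocab.txt with a frequency cutoff.
# The corpus path, MIN_COUNT, and special tokens below are assumptions for this example.
from collections import Counter

MIN_COUNT = 5  # characters seen fewer than MIN_COUNT times fall back to [UNK]
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

counts = Counter()
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.strip().replace(" ", ""))

vocab = SPECIAL_TOKENS + [ch for ch, c in counts.most_common() if c >= MIN_COUNT]

with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```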
artemisart commented 5 years ago

In fact, you can reuse the released model if you add words by replacing the unused tokens (but you will have to train the embeddings for those new words yourself).
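
A minimal sketch of that replacement, assuming the released vocab.txt is in the working directory; the new-word list and output file name are made up for illustration. Overwriting the `[unusedN]` placeholders in place keeps every token id the same, so the released checkpoint still lines up with the vocab:

```python
# Illustrative sketch: overwrite [unusedN] placeholders in the released vocab.txt
# with new domain words, keeping token ids (and hence the checkpoint) unchanged.
# The file names and new_words list are assumptions for this example.
new_words = ["mynewword1", "mynewword2"]

with open("vocab.txt", encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]

replacements = iter(new_words)
for i, token in enumerate(vocab):
    if token.startswith("[unused"):
        try:
            vocab[i] = next(replacements)
        except StopIteration:
            break

with open("vocab_custom.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```

The replaced tokens still carry their original (effectively untrained) embedding rows, which is why they need further pre-training or fine-tuning on your own corpus, as noted above.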

murray-z commented 5 years ago

@artemisart After I add new words by replacing the unused tokens, how should I train these new words? Thank you!

xwzhong commented 5 years ago

@zhangfazhan There is some advice in https://github.com/google-research/bert/issues/155