Closed WenTingTseng closed 4 years ago
You can't really change the vocabulary without re-training the whole model. Is there some overlap between the BERT vocabulary and your custom vocabulary? If so, you can add the 20k+ extra tokens using add_tokens (which will probably slow things down, as that's a lot of added tokens).
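If you go that route, a minimal sketch with the transformers BertTokenizer / BertForMaskedLM API would look roughly like this (the custom tokens below are just placeholders):

from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# Placeholder list of custom tokens that are not already in the BERT vocab.
new_tokens = ["自定义词一", "自定义词二"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so it covers the added tokens;
# the new rows are randomly initialized and learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))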
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
You can now do this. Keras (in the tf-nightly build) has added a new utility, keras.utils.warmstart_embedding_matrix. With it you can keep training your model as the vocabulary changes: https://www.tensorflow.org/api_docs/python/tf/keras/utils/warmstart_embedding_matrix
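A rough sketch of how it could be used (the vocabularies and embedding size here are made-up placeholders, not anything from this issue):

import tensorflow as tf

# Old (pre-trained) vocabulary and the new, larger vocabulary.
base_vocab = tf.constant(["the", "cat", "sat"])
new_vocab = tf.constant(["the", "cat", "sat", "mat"])

# Pre-trained embedding matrix, shape (len(base_vocab), embedding_dim).
base_embeddings = tf.random.normal((3, 8))

# Tokens shared between the two vocabularies keep their old vectors;
# rows for brand-new tokens come from new_embeddings_initializer.
new_embeddings = tf.keras.utils.warmstart_embedding_matrix(
    base_vocabulary=base_vocab,
    new_vocabulary=new_vocab,
    base_embeddings=base_embeddings,
    new_embeddings_initializer="uniform",
)
print(new_embeddings.shape)  # (4, 8)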
The original bert-base-chinese vocab.txt has 21128 tokens, while my own vocab.txt has 44900. When I try to fine-tune the BERT model with BertForMaskedLM, I get a size-mismatch error. I tried changing self.word_embeddings in BertEmbeddings from config.vocab_size to 44900, like this:
self.word_embeddings = nn.Embedding(44900, config.hidden_size, padding_idx=0)
But I still get the same size-mismatch error and I'm not sure how to fix it. Do I also need to change vocab_size in the pre-trained BERT config.json from 21128 to 44900?
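To make the situation concrete, this is roughly what I am attempting (a simplified sketch using BertConfig / BertForMaskedLM; the exact calls are only for illustration):

from transformers import BertConfig, BertForMaskedLM

# Bump vocab_size from 21128 to 44900 so the embedding layer matches my vocab.txt.
config = BertConfig.from_pretrained("bert-base-chinese")
config.vocab_size = 44900

# Loading the pre-trained checkpoint then fails with a size mismatch:
# the checkpoint's word_embeddings weight has 21128 rows, but the model
# built from this config expects 44900.
model = BertForMaskedLM.from_pretrained("bert-base-chinese", config=config)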