huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Question about training BERT after changing the vocab.txt size #3097

Closed WenTingTseng closed 4 years ago

WenTingTseng commented 4 years ago

The original bert-base-chinese vocab.txt has 21128 entries; my own vocab.txt has 44900. When I try to fine-tune a BERT model using BertForMaskedLM, I get a size-mismatch error. I tried changing self.word_embeddings in BertEmbeddings from config.vocab_size to 44900, like this: self.word_embeddings = nn.Embedding(44900, config.hidden_size, padding_idx=0), but loading still fails:

RuntimeError: Error(s) in loading state_dict for BertForMaskedLM:
        size mismatch for bert.embeddings.word_embeddings.weight: copying a param with shape torch.Size([21128, 768]) from checkpoint, the shape in current model is torch.Size([44900, 768]).
        size mismatch for cls.predictions.bias: copying a param with shape torch.Size([21128]) from checkpoint, the shape in current model is torch.Size([44900]).
        size mismatch for cls.predictions.decoder.weight: copying a param with shape torch.Size([21128, 768]) from checkpoint, the shape in current model is torch.Size([44900, 768]).

I am not sure how to fix this. Should I change vocab_size in the pre-trained BERT config.json from 21128 to 44900?
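For reference, a minimal sketch of how I hit the mismatch (standard from_pretrained API; the only change from the defaults is the config edit):

```python
from transformers import BertConfig, BertForMaskedLM

# Bumping vocab_size in the config does not resize the weights stored in the
# checkpoint, so loading fails with the size-mismatch error shown above.
config = BertConfig.from_pretrained("bert-base-chinese")
config.vocab_size = 44900

model = BertForMaskedLM.from_pretrained("bert-base-chinese", config=config)
# RuntimeError: size mismatch for bert.embeddings.word_embeddings.weight: ...
```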

LysandreJik commented 4 years ago

You can't really change the vocabulary without re-training the whole model. Is there some overlap between the BERT vocabulary and your custom vocabulary? If so, you can add the 20k+ missing tokens using add_tokens (which will probably slow things down, as that's a lot of added tokens).
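A minimal sketch of that suggestion, assuming the tokenizer's add_tokens plus the model's resize_token_embeddings to grow the embedding matrix (the new-token list below is just a placeholder):

```python
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# Add only the tokens that are not already in the original vocabulary.
new_tokens = ["token_a", "token_b"]  # placeholder for the ~20k custom tokens
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix (and the tied MLM decoder) to the new vocab size;
# pre-trained rows are kept, newly added rows are randomly initialized.
model.resize_token_embeddings(len(tokenizer))
```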

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

divyashreepathihalli commented 2 years ago

You can now do this. Keras (in tf-nightly) has added a new utility, keras.utils.warmstart_embedding_matrix, which lets you keep training a model while the vocabulary changes. https://www.tensorflow.org/api_docs/python/tf/keras/utils/warmstart_embedding_matrix
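A minimal sketch of that utility, assuming the documented signature (base_vocabulary, new_vocabulary, base_embeddings, new_embeddings_initializer); the vocabularies and base embedding matrix below are placeholders:

```python
import tensorflow as tf

base_vocabulary = ["the", "cat", "sat"]
new_vocabulary = ["the", "cat", "sat", "on", "mat"]
base_embeddings = tf.random.normal((len(base_vocabulary), 768))

# Rows for tokens shared with the base vocabulary are copied over;
# rows for new tokens are filled by the initializer.
new_matrix = tf.keras.utils.warmstart_embedding_matrix(
    base_vocabulary=base_vocabulary,
    new_vocabulary=new_vocabulary,
    base_embeddings=base_embeddings,
    new_embeddings_initializer="uniform",
)
print(new_matrix.shape)  # (5, 768)
```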