google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Can one expand the vocabulary for fine-tuning by replacing foreign unicode characters? #419

Open · bsugerman opened this issue 5 years ago

bsugerman commented 5 years ago

I am fine-tuning the BERT model but need to add a few thousand words. I know that one can replace the ~1000 [unused#] lines at the top of vocab.txt, but I also notice there are thousands of single foreign (unicode) characters in the file, which I will never use. For fine-tuning, is it possible to replace those with my words, fine-tune, and still have the model work correctly?
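
For concreteness, here is a minimal sketch of the in-place swap being described, assuming a standard one-token-per-line vocab.txt; the word list, file names, and the single-character heuristic are illustrative only:

```python
# Replace [unused#] slots (and, optionally, single foreign characters you
# are certain will never occur in your data) with domain-specific words,
# keeping the total vocabulary size unchanged.
import re

new_words = ["myword1", "myword2"]  # placeholder domain vocabulary

with open("vocab.txt", encoding="utf-8") as f:
    vocab = f.read().splitlines()

# Candidate slots: [unused#] entries, plus (crude heuristic) any single
# character outside the basic Latin/punctuation range.
slots = [i for i, tok in enumerate(vocab)
         if re.fullmatch(r"\[unused\d+\]", tok)
         or (len(tok) == 1 and ord(tok) > 0x2000)]

assert len(new_words) <= len(slots), "not enough replaceable slots"
for i, word in zip(slots, new_words):
    vocab[i] = word

assert len(vocab) == len(set(vocab)), "duplicate tokens after replacement"

with open("vocab_custom.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```

One caveat: removing a single-character token means any word containing that character will tokenize to [UNK] afterwards, so only replace characters that truly never occur in your data.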

gaphex commented 5 years ago

As long as the size of the vocabulary remains unchanged, you will be able to continue training from a saved checkpoint. Of course, you will have to re-generate the pre-training data after modifying the vocabulary, then continue pre-training long enough for the model to learn the new word embeddings. I obtained better results on a number of tasks using this approach.
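
To make "re-generate the pre-training data, then continue from the checkpoint" concrete, here is a hedged sketch of the two steps using this repo's scripts; all paths and hyperparameter values are placeholders (the flags themselves are the ones documented in the repo README):

```python
# Step 1: rebuild pre-training TFRecords with the modified vocab.
# Step 2: continue pre-training from the released checkpoint via
# --init_checkpoint, rather than training from scratch.
import subprocess

subprocess.run([
    "python", "create_pretraining_data.py",
    "--input_file=corpus.txt",        # your domain text, one sentence per line
    "--output_file=pretrain.tfrecord",
    "--vocab_file=vocab_custom.txt",  # modified vocab, same size as the original
    "--do_lower_case=True",
    "--max_seq_length=128",
    "--max_predictions_per_seq=20",
    "--masked_lm_prob=0.15",
    "--dupe_factor=5",
], check=True)

subprocess.run([
    "python", "run_pretraining.py",
    "--input_file=pretrain.tfrecord",
    "--output_dir=pretrain_out",
    "--do_train=True",
    "--bert_config_file=uncased_L-12_H-768_A-12/bert_config.json",
    "--init_checkpoint=uncased_L-12_H-768_A-12/bert_model.ckpt",
    "--train_batch_size=32",
    "--max_seq_length=128",
    "--max_predictions_per_seq=20",
    "--num_train_steps=100000",       # "long enough" is corpus- and task-dependent
    "--num_warmup_steps=10000",
    "--learning_rate=2e-5",
], check=True)
```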

qiu-nian commented 5 years ago

@gaphex Hello, can I ask you a question? If I want to do additional pre-training on my own corpus, starting from the checkpoint provided by Google, do I need to do sentence splitting and word segmentation on the corpus myself and then generate a new vocabulary? I am not working with an English corpus. I hope I can get your answer.

ali4friends71 commented 4 years ago

Hi @gaphex, I'm new to BERT and I want to add domain-specific vocabulary to the BERT vocabulary. I know I have to replace the ~1000 [unused#] lines with my vocabulary. After adding my domain-specific words in those unused slots, how do I train the model? Can you please share the code?

gaphex commented 4 years ago

@ali4friends71 you could use the code from the Colab notebook, beginning from step 5. Check out the article for further instructions.

ali4friends71 commented 4 years ago

@gaphex Thanks a lot. When running the code, I got an error that I don't have access to cloud storage. Should I create a GCS bucket and use it while running the code? And is there any other way to save and load the model besides a GCS bucket?

Yiwen-Yang-666 commented 3 years ago

@bsugerman Hi, in addition to replacing those unused tokens, you can also append your new tokens to the end of the original vocab file and concatenate a newly created embedding tensor for the new tokens onto the original embedding tensor; see the linked comment and the sketch below.

https://github.com/google-research/bert/issues/82#issuecomment-921613967
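
A minimal sketch of that grow-the-vocabulary approach, assuming the TF1-style checkpoints this repo ships; the checkpoint path, output path, and N_NEW_TOKENS are placeholders:

```python
# Append new tokens to the END of vocab.txt, then grow the word-embedding
# table to match by concatenating freshly initialized rows, and write a
# new checkpoint with every other variable copied unchanged.
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

CKPT = "uncased_L-12_H-768_A-12/bert_model.ckpt"  # placeholder path
N_NEW_TOKENS = 2000                               # tokens appended to vocab.txt

reader = tf.train.load_checkpoint(CKPT)
old_emb = reader.get_tensor("bert/embeddings/word_embeddings")  # [vocab, hidden]

# BERT initializes embeddings with a truncated normal (stddev 0.02);
# a plain normal is close enough for a sketch.
rng = np.random.RandomState(0)
new_rows = (rng.standard_normal((N_NEW_TOKENS, old_emb.shape[1])) * 0.02
            ).astype(old_emb.dtype)
new_emb = np.concatenate([old_emb, new_rows], axis=0)

with tf.Graph().as_default():
    for name, _ in tf.train.list_variables(CKPT):
        # If your checkpoint also stores optimizer slots for the embedding
        # (e.g. .../adam_m, .../adam_v), resize those rows the same way.
        value = (new_emb if name == "bert/embeddings/word_embeddings"
                 else reader.get_tensor(name))
        tf.get_variable(name, initializer=value)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver.save(sess, "./bert_model_resized.ckpt")
```

Remember to also raise `vocab_size` in `bert_config.json` to match the new table before loading the resized checkpoint.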