bsugerman opened this issue 5 years ago (Open)
As long as the size of the vocabulary remains unchanged, you will be able to continue training from a saved checkpoint. Of course, you will have to re-generate the pre-training data after modifying the vocabulary. Then continue pre-training long enough that the model learns the new word embeddings. I obtained better results on a number of tasks using this approach.
> As long as the size of the vocabulary remains unchanged, you will be able to continue training from a saved checkpoint. Of course, you will have to re-generate the pre-training data after modifying the vocabulary. Then continue pre-training long enough that the model learns the new word embeddings. I obtained better results on a number of tasks using this approach.

@gaphex Hello, may I ask you a question? If I want to do additional pre-training on my own corpus, starting from the checkpoint provided by Google, do I need to run tokenization and sentence segmentation on my corpus myself and then generate a new vocabulary? I am not using an English corpus. I hope I can get your answer.
Hi @gaphex, I'm new to BERT and I want to add domain-specific vocabulary to the BERT model's vocabulary. I know I have to replace the ~1000 [unused#] lines near the top of vocab.txt with my words. After adding my domain-specific words on those unused lines, how do I train the model after that? Can you please share the code?
@ali4friends71 you could use the code from the Colab notebook, beginning from step 5. Check out the article for further instructions.
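For reference, the vocab-replacement step mentioned above can be sketched in plain Python. The function name and the example word list are illustrative, not from the notebook; the real vocab.txt has one token per line, and the `[unused#]` placeholders are swapped out in order:

```python
# Sketch: swap BERT's [unused#] placeholder tokens for domain-specific words.
# The helper name and sample words are illustrative assumptions.

def replace_unused_tokens(vocab_lines, new_words):
    """Return vocab lines with [unused#] entries replaced by new_words, in order."""
    words = iter(new_words)
    out = []
    for token in vocab_lines:
        if token.startswith("[unused"):
            # Keep the placeholder if we run out of replacement words.
            out.append(next(words, token))
        else:
            out.append(token)
    return out

# Tiny stand-in for vocab.txt contents (real file has ~30k lines).
vocab = ["[PAD]", "[unused0]", "[unused1]", "[CLS]", "[SEP]", "the"]
print(replace_unused_tokens(vocab, ["genomics", "proteome"]))
# For a real run: read vocab.txt, apply this, write the result back,
# then re-generate the pre-training data with the modified file.
```

Because only placeholder lines are overwritten, the vocabulary size (and hence the embedding matrix shape) is unchanged, which is what allows continuing from the saved checkpoint.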
@gaphex Thanks a lot. When running the code, I got an error that I don't have access to cloud storage. Should I create a GCS bucket and use it while running the code? And is there any other way to save and load the model other than a GCS bucket?
I am fine-tuning the BERT model but need to add a few thousand words. I know that one can replace the ~1000 [unused#] lines at the top of vocab.txt, but I also notice there are thousands of single foreign (Unicode) characters in the file that I will never use. For fine-tuning, is it possible to replace those with my words, fine-tune, and have the model still work correctly?
@bsugerman Hi, in addition to replacing those unused tokens, you can append your new tokens to the original vocab file and concatenate a newly created embedding tensor for the new tokens with the original embedding tensor for the original tokens.
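The embedding-concatenation idea above can be sketched with NumPy. The function name, the initialization scale (BERT initializes weights with a truncated normal, stddev 0.02), and the shapes (BERT-base: vocab 30522, hidden 768) are assumptions for illustration; in a real checkpoint you would apply the same concatenation to the loaded `word_embeddings` variable:

```python
import numpy as np

# Sketch: append freshly initialized rows to the original token-embedding
# matrix, one row per new vocabulary entry. Init scale and shapes are
# illustrative assumptions.

def extend_embeddings(old_embeddings, num_new_tokens, stddev=0.02, seed=0):
    """Concatenate randomly initialized rows onto the embedding matrix."""
    rng = np.random.default_rng(seed)
    hidden = old_embeddings.shape[1]
    new_rows = rng.normal(0.0, stddev, size=(num_new_tokens, hidden))
    return np.concatenate([old_embeddings, new_rows], axis=0)

orig = np.zeros((30522, 768))        # BERT-base vocab size x hidden size
extended = extend_embeddings(orig, 2000)
print(extended.shape)                # rows for the 2000 new tokens were added
```

Unlike overwriting [unused#] lines, this changes the vocabulary size, so the checkpoint's embedding variable must be resized before the rest of the weights can be restored.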
https://github.com/google-research/bert/issues/82#issuecomment-921613967
> I am fine-tuning the BERT model but need to add a few thousand words. I know that one can replace the ~1000 [unused#] lines at the top of vocab.txt, but I also notice there are thousands of single foreign (Unicode) characters in the file that I will never use. For fine-tuning, is it possible to replace those with my words, fine-tune, and have the model still work correctly?