google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Losing Knowledge for Language Model in Fine-Tuning #651

Open PetreanuAndi opened 5 years ago

PetreanuAndi commented 5 years ago

Hello. I'm looking to fine-tune BERT Multilingual on a specific closed-domain context in one of the languages already covered by the multilingual model (Romanian). My plan is to:

1) fine-tune the Romanian language model (in order to include more words in the Romanian vocab);
2) fine-tune the resulting "Romanian++" language model on an intent-classification and NER task.

Note: the portion of the vocab Google used for this specific language seems to be quite small. I've looked up specific words in "vocab.txt" and they do not show up, e.g. the Romanian equivalent of "bill" -> "factura", a base word with no prefix or suffix.
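For reference, a quick way to see what the multilingual WordPiece vocab actually does with such a word is to run it through the tokenizer shipped in this repo. A minimal sketch (the checkpoint directory path is just an example, adjust to your local copy):

```python
# Check how the multilingual WordPiece vocab handles a Romanian word.
# Uses tokenization.py from this repo; the model directory below is an example.
import tokenization

tokenizer = tokenization.FullTokenizer(
    vocab_file="multi_cased_L-12_H-768_A-12/vocab.txt",
    do_lower_case=False)

tokens = tokenizer.tokenize("factura")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens, ids)
# Whole words missing from vocab.txt are not mapped to [UNK]; they get split
# into sub-word pieces (e.g. something like ['fact', '##ura']), each of which
# has its own learned embedding.
```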

Also, I'm thinking that in order to leverage the knowledge of the pre-trained model (and the contextual embeddings down the road), I need to use the SAME input word embeddings for my vocabulary, i.e. my input word embeddings must match the ones Google used during pre-training. Otherwise, I'm basically just adding new randomly initialised embeddings to the system and I'm not actually fine-tuning anything!
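One way to sanity-check what is actually inside the released checkpoint (i.e. whether a trained embedding table ships with it or not) is to list its variables. A minimal sketch, assuming TensorFlow 1.x as used by this repo; the checkpoint path is an example:

```python
# List the embedding-related variables stored in a released BERT checkpoint.
# Point the path at your downloaded model directory.
import tensorflow as tf

ckpt = "multi_cased_L-12_H-768_A-12/bert_model.ckpt"
for name, shape in tf.train.list_variables(ckpt):
    if "embeddings" in name:
        print(name, shape)
# Expect entries such as bert/embeddings/word_embeddings,
# bert/embeddings/position_embeddings and bert/embeddings/token_type_embeddings.
```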

Is my assumption correct? :dagger:

The resources found online, including the official page (https://github.com/google-research/bert#pre-training-with-bert), all refer to a "pre-training method" (my understanding is that this pre-training method basically amounts to training a language model from scratch on a specific closed-domain corpus, using the masking technique). But if that is true, why all the fuss about being able to FINE-TUNE BERT? Is it really an ImageNet moment for NLP? From this point of view, not really... It's as if they had used specific pixel embeddings and didn't plan on releasing them to the public: if you come with plain RGB input for your pixels, you're not fine-tuning, you're re-training. :)

It seems like without the original word embeddings (WordPiece/byte-pair or not), fine-tuning is actually just training from scratch. :)
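For context, if I read run_classifier.py correctly, the mechanism it uses to load the released weights is roughly the following (a condensed sketch, TF 1.x, paths are examples):

```python
# Condensed from run_classifier.py: build the model, then map every variable
# whose name matches the released checkpoint onto it. Paths are examples.
import tensorflow as tf
import modeling

bert_config = modeling.BertConfig.from_json_file(
    "multi_cased_L-12_H-768_A-12/bert_config.json")
input_ids = tf.placeholder(tf.int32, shape=[None, 128], name="input_ids")

model = modeling.BertModel(
    config=bert_config,
    is_training=False,
    input_ids=input_ids,
    use_one_hot_embeddings=False)

init_checkpoint = "multi_cased_L-12_H-768_A-12/bert_model.ckpt"
tvars = tf.trainable_variables()
assignment_map, initialized_names = modeling.get_assignment_map_from_checkpoint(
    tvars, init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
# Variables found in the checkpoint (including bert/embeddings/word_embeddings)
# are restored from it; only genuinely new, task-specific variables start out
# randomly initialised.
```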

Guys, I'd appreciate any insight into this issue. Here's a good article on the pitfalls of the multilingual vocab and WordPiece tokenization: https://medium.com/omnius/hallo-multilingual-bert-c%C3%B3mo-funcionas-2b3406cc4dc2

tholor commented 5 years ago

Hey, let's distinguish between

a) adapting a model to a new national language (e.g. Romanian)
b) adapting to a certain domain (e.g. legal or healthcare)

For b): It might be enough to take a pretrained model (e.g. English or multilingual) and continue training on your domain corpus. We have implemented this in the PyTorch repository (https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/lm_finetuning). If the vocab really doesn't fit your corpus, you can either replace the unused tokens in the current vocabulary (quick & easy!) or extend the vocab to a larger size (quite some effort in the model architecture!). In both cases you would need to learn the embeddings for these new tokens from scratch in your language model adaptation phase. Don't expect crazy performance boosts from adjusting the vocab; so far we got 2-3% out of it. However, the fine-tuning on a domain corpus as a whole can have quite a good impact, especially if your domain language is very different from Wikipedia style.
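A minimal sketch of the "replace unused tokens" option (the released vocabs reserve a number of [unusedNN] slots for exactly this; file names and the word list below are placeholders, and the embeddings for the swapped-in tokens still have to be learned during the LM adaptation step):

```python
# Swap some of the reserved [unusedNN] slots in vocab.txt for new whole-word
# domain tokens, keeping the vocab size (and thus the embedding matrix shape)
# unchanged. Paths and the word list are placeholders.
new_words = ["factura", "scadenta", "deviz"]

with open("vocab.txt", encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]

unused_slots = [i for i, tok in enumerate(vocab) if tok.startswith("[unused")]
for slot, word in zip(unused_slots, new_words):
    vocab[slot] = word

with open("vocab_custom.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
# The vectors at these positions are still the original (effectively untrained)
# [unused*] embeddings, so they need to be learned in the LM fine-tuning phase.
```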

For a): We had a similar problem for the German language. We finally decided to train from scratch and open-source the model (see https://deepset.ai/german-bert). It definitely works better than the multilingual model, but in some cases (e.g. NER) the multilingual model also seems to benefit from transfer across languages. Pretraining from scratch only makes sense if you have 10+ GB of training data in your language, but we are happy to share our learnings if you plan to train a Romanian model.
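If you do go down the from-scratch route for Romanian, note that this repo does not ship a vocab builder; one common approach is to learn a sub-word vocabulary with the sentencepiece library and then convert it into BERT's vocab.txt format. A rough sketch (corpus path and vocab size are placeholders):

```python
# Learn a sub-word vocabulary for a new-language corpus with sentencepiece.
# Converting the resulting .vocab file into BERT's vocab.txt format (adding
# [PAD], [UNK], [CLS], [SEP], [MASK] and the ## continuation-prefix convention)
# is a separate step not shown here.
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    "--input=romanian_corpus.txt "
    "--model_prefix=ro_subword "
    "--vocab_size=32000 "
    "--model_type=bpe")
# Produces ro_subword.model and ro_subword.vocab in the working directory.
```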