neuralmind-ai / portuguese-bert

Portuguese pre-trained BERT models

Initiating from multilingual BERT weights #1

Closed · manueltonneau closed 4 years ago

manueltonneau commented 4 years ago

Hi and thanks for the cool contribution :)

I'm in the process of pre-training BERT as well and am wondering how you managed to initialize the training with the weights of Multilingual BERT. Based on your paper, you created a 30K-token vocabulary from scratch and added all punctuation characters from the Multilingual vocabulary. Still, your final vocabulary size differs from the Multilingual vocabulary size, right? In that case, how did you manage to initialize from the Multilingual weights with a vocabulary that is not the same size as the Multilingual one?

Thanks a lot in advance for the insights!

fabiocapsouza commented 4 years ago

Hi Manuel,

Yes, the Portuguese and Multilingual vocabularies are different. If I am not mistaken, the vocabulary size V affects only the input embedding weights of shape (V, H) and the MLM prediction bias of shape (V,). All other weights, such as the positional embeddings, token type embeddings and all Transformer encoder layers, are borrowed from the Multilingual checkpoint, and only these two variables are initialized randomly.
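A minimal sketch of this idea for the TF 1.x checkpoints used by the original BERT codebase. The checkpoint path, new vocabulary size and initializer below are illustrative assumptions, not values from the paper:

```python
import numpy as np
import tensorflow as tf  # TF 1.x, as used by the original BERT codebase

MBERT_CKPT = "multi_cased_L-12_H-768_A-12/bert_model.ckpt"  # assumed path
NEW_VOCAB_SIZE = 30000  # illustrative; use the size of the new vocabulary
HIDDEN_SIZE = 768

reader = tf.train.load_checkpoint(MBERT_CKPT)
new_vars = []
for name, _ in tf.train.list_variables(MBERT_CKPT):
    if name == "bert/embeddings/word_embeddings":
        # vocabulary-dependent: re-initialize randomly
        # (BERT itself uses a truncated normal with stddev 0.02)
        value = np.random.normal(
            0.0, 0.02, (NEW_VOCAB_SIZE, HIDDEN_SIZE)).astype(np.float32)
    elif name == "cls/predictions/output_bias":
        # vocabulary-dependent MLM prediction bias: start at zero
        value = np.zeros((NEW_VOCAB_SIZE,), dtype=np.float32)
    else:
        # everything else is borrowed unchanged from the mBERT checkpoint
        value = reader.get_tensor(name)
    new_vars.append(tf.Variable(value, name=name))

# Write a new checkpoint that can be passed to BERT's --init_checkpoint
saver = tf.train.Saver(new_vars)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, "bert_pt_init/bert_model.ckpt")
```

In the released BERT checkpoints, `bert/embeddings/word_embeddings` and `cls/predictions/output_bias` are the only variables whose shape depends on V (the MLM output projection is tied to the input embeddings), so everything else carries over with its shape unchanged.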

manueltonneau commented 4 years ago

Cool, thanks for the swift reply :) Is there an available implementation of how to do this in practice? Also, did you benchmark a model trained from scratch on your data against a model initialized from mBERT and trained on the same data? I can see how it could improve accuracy, but I am wondering to what extent it does!

Thanks a lot in advance!

manueltonneau commented 4 years ago

Hi Fabio! I managed to initialize the input embeddings and the MLM prediction bias randomly, but I am now getting the following error: "Key bert/embeddings/LayerNorm/beta/adam_m not found in checkpoint". Looking at the variable lists of both the mBERT checkpoint and my past checkpoint, it seems that the Adam optimizer variables are missing from the mBERT checkpoint. Did you find a workaround for this? Thanks a lot in advance :)
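For context, this error typically appears when the full training state (including Adam's `adam_m`/`adam_v` slot variables) is restored from a checkpoint that contains only model weights, e.g. when the mBERT checkpoint sits in the Estimator's output directory as if a run were being resumed. One possible workaround, sketched here for the TF 1.x BERT codebase (the checkpoint path is an assumption), is to restore only the variables present on both sides; this is also what `run_pretraining.py` effectively does when the checkpoint is passed via `--init_checkpoint`:

```python
import tensorflow as tf  # TF 1.x

CKPT = "multi_cased_L-12_H-768_A-12/bert_model.ckpt"  # assumed path

# Call this while building the training graph, before variables are
# initialized. The released mBERT checkpoint has no optimizer slots,
# so names like ".../adam_m" and ".../adam_v" are absent from it.
ckpt_var_names = {name for name, _ in tf.train.list_variables(CKPT)}

# Map only the variables that exist on both sides; the Adam slot
# variables are skipped and keep their fresh initialization.
assignment_map = {
    v.op.name: v.op.name
    for v in tf.global_variables()
    if v.op.name in ckpt_var_names
}
tf.train.init_from_checkpoint(CKPT, assignment_map)
```

Since the Adam slots never appear in the assignment map, they simply start from their default zero initialization when training begins.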

fabiocapsouza commented 4 years ago

Hi Manuel, sorry for the late response. Regarding the benchmark against training from scratch, we did not try it, since initializing from Multilingual BERT worked well. I agree it would be nice to know whether it helped or not. Let us know if you try it :)

As for the missing variables in the checkpoint, have you found a solution? I did not face any similar errors.

manueltonneau commented 4 years ago

Thanks for your reply! I trained from scratch on a smaller corpus (not brWAC), but the results were not yet satisfactory. Regarding the missing variables, no, unfortunately I didn't. Let me know when you open-source your code! :)

manueltonneau commented 4 years ago

May I ask which mBERT checkpoint you used? Is that the one mentioned in the repo's README from Nov 23rd 2018?

fabiocapsouza commented 4 years ago

Yes, that is correct.