iPieter / RobBERT

A Dutch RoBERTa-based language model
https://pieter.ai/robbert/
MIT License

How to train RobBERT-base? #1

Closed · nikhilno1 closed this issue 4 years ago

nikhilno1 commented 4 years ago

Thanks for sharing. I want to train a language model for a different language (Hindi). How did you train your RobBERT-base model? Are those steps covered anywhere?

twinters commented 4 years ago

Hi Nikhil! Thanks for your interest!

The steps are somewhat covered in the paper. @iPieter can probably tell you more details about our exact pre-training steps, but we used a similar procedure to the one laid out in https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md
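
This is not our exact fairseq pipeline, but roughly, the same masked-language-model pre-training objective looks like this with the HuggingFace Transformers Trainer; the file paths, tokenizer directory, and hyperparameters below are just placeholders (the tokenizer itself is assumed to have been trained already, see the sketch further down):

```python
# Sketch: RoBERTa-style masked-language-model pre-training with the
# HuggingFace Transformers Trainer. Paths, hyperparameters and the
# tokenizer directory are assumptions, not the values we used.
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

# Assumes a byte-level BPE tokenizer was already trained and saved
# to ./hindi-tokenizer (see the tokenizer sketch below).
tokenizer = RobertaTokenizerFast.from_pretrained("./hindi-tokenizer", model_max_length=512)

# A fresh RoBERTa-base-sized model, randomly initialised.
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=514,
    num_hidden_layers=12,
    num_attention_heads=12,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)

# oscar_hi.txt: plain-text Hindi corpus, one document per line (assumed path).
dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="oscar_hi.txt", block_size=512)

# Dynamic masking: 15% of tokens are masked on the fly, as in RoBERTa.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./roberta-hi",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=16,
    save_steps=10_000,
)

Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
).train()
```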

For Hindi, it is probably best to also change the tokenizer RoBERTa uses, for which the new HuggingFace tokenizers repository might be really useful! I also see that there is an 8.9 GB dataset ready to train on in the OSCAR corpus, so you might want to use that one. We used the Dutch section of that corpus for our RobBERT model; it is basically Common Crawl, but filtered automatically using language detection: https://traces1.inria.fr/oscar/
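
For example, with the tokenizers library a byte-level BPE tokenizer for Hindi could be trained along these lines; the corpus path and vocabulary size are just placeholders, not values we used:

```python
# Sketch: train a byte-level BPE tokenizer for Hindi with the HuggingFace
# `tokenizers` library. Corpus file and vocab size below are assumptions.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# oscar_hi.txt: the Hindi portion of OSCAR, one document per line (assumed path).
tokenizer.train(
    files=["oscar_hi.txt"],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt, which RobertaTokenizerFast can load later.
tokenizer.save_model("hindi-tokenizer")
```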

Keep in mind though that pre-training takes a huge amount of computational power & resources!

Hope this helps!

nikhilno1 commented 4 years ago

Thank you so much for the details. It helps a lot.