google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0
38.23k stars 9.62k forks

[Question] What were the sizes of the English, Nepali, and Hindi data that multilingual BERT cased was trained on? #1364

Open mani-rai opened 2 years ago

mani-rai commented 2 years ago

I am writing a thesis that references mBERT heavily, and it would be really helpful to know the sizes of the English, Nepali, and Hindi data used to train it. Other papers report either ranges or a total size covering all languages, but I need the figures for these three specifically. Also, Wikipedia mentions that its English data is 300 GB in size, which I don't think mBERT was trained on. Does anybody know the sizes?