I am writing a thesis that references mBERT heavily, and it would be really helpful to know the sizes of the English, Nepali, and Hindi data used to train it. Other papers report either ranges or a total size that covers all of the languages, but I only need the figures for these three. Also, Wikipedia mentions that its English data is 300 GB in size, which I don't think mBERT was trained on. Does anybody know the sizes?