Open smtnkc opened 6 months ago
Hi @smtnkc ,
that's a very interesting question. I quickly analyzed the pretraining corpus of the BERTurk model (trained on the 35GB corpus). BERTurk was pretrained from scratch, not initialized from any English model.
It has 299_245_100 training instances (one training instance = one line of the training corpus).
I used fasttext language detection (this model) and counted the number of training instances where English has the highest probability: 618_394. So roughly 0.2% of the corpus consists of "real" English training instances.
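The counting procedure above can be sketched as follows. Note this is a minimal illustration of the idea, not the actual script: `detect_language` here is a stand-in stub, whereas the real analysis would load fasttext's pretrained language-ID model and call its `predict` method on each line.

```python
# Sketch: count training instances (lines) whose top-scoring language is English.
# detect_language is a toy placeholder; a real run would load fasttext's
# language-ID model and use model.predict(line) instead.

def detect_language(line: str) -> tuple[str, float]:
    """Placeholder detector: returns (language_label, probability).
    Flags a line as English if enough common English function words appear."""
    english_markers = {"the", "and", "is", "of", "to"}
    tokens = line.lower().split()
    hits = sum(t in english_markers for t in tokens)
    if tokens and hits / len(tokens) > 0.2:
        return "en", 0.9
    return "tr", 0.9  # toy fallback; real model distinguishes many languages

def count_english_lines(lines) -> int:
    """Count lines where English is the highest-probability language."""
    return sum(1 for line in lines if detect_language(line)[0] == "en")

corpus = [
    "Bu bir Türkçe cümledir.",
    "The quick brown fox jumps over the lazy dog.",
    "Merhaba dünya, nasılsın?",
]
print(count_english_lines(corpus))  # 1 English line out of 3
```

With the real detector, the same loop over the 299_245_100 corpus lines yields the 618_394 English-instance count reported above.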
Hi Stefan,
When I use the Turkish model on an English dataset for classification, it works surprisingly well. So, I have two questions:
1) Does the training corpus contain English text? 2) Was the model trained from scratch, or initialized from the English model's weights?
Thanks!