stefan-it / turkish-bert

Turkish BERT/DistilBERT, ELECTRA and ConvBERT models

How is this model bilingual? #38

Open smtnkc opened 6 months ago

smtnkc commented 6 months ago

Hi Stefan,

When I use the Turkish model on an English dataset for classification, it works surprisingly well. So, I have two questions:

1. Does the training corpus contain English texts?
2. Was it trained from scratch, or initialized from the English model's weights?

Thanks!

stefan-it commented 6 months ago

Hi @smtnkc ,

That's a very interesting question. BERTurk was pretrained from scratch, not initialized from the English model's weights. I also quickly analyzed the pretraining corpus of the BERTurk model (the variant trained on the 35GB corpus).

It contains 299_245_100 training instances (one training instance corresponds to one line in the training corpus).

I used fastText language detection (this model) and counted the number of training instances where English has the highest probability: 618_394. So roughly 0.2% of the corpus consists of "real" English training instances.
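
For anyone who wants to reproduce this kind of check, here is a minimal sketch of the counting procedure. The corpus path `berturk_corpus.txt` and the LID model filename `lid.176.bin` are assumptions for illustration, not the actual setup used above; the sketch only assumes one training instance per line.

```python
# Minimal sketch: count lines a fastText LID model classifies as English.
# Assumptions: the corpus is a plain-text file with one instance per line,
# and a pretrained fastText language-ID model (e.g. lid.176.bin) is on disk.
import fasttext

model = fasttext.load_model("lid.176.bin")  # pretrained language-ID model

total = 0
english = 0

with open("berturk_corpus.txt", encoding="utf-8") as f:  # hypothetical path
    for line in f:
        line = line.rstrip("\n")  # fastText's predict() rejects newlines
        if not line:
            continue
        labels, probs = model.predict(line)  # top-1 label by default
        total += 1
        if labels[0] == "__label__en":  # English had the highest probability
            english += 1

print(f"{english} of {total} lines detected as English ({english / total:.2%})")
```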