mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

English to Serbian has low quality of the teacher models #765

Open eu9ene opened 3 months ago

eu9ene commented 3 months ago

After a quick investigation, I see that the original parallel corpus was filtered from 70M to 35M sentences.

Serbian is digraphic and uses both Latin and Cyrillic scripts.

I see that datasets like NLLB include translations in both scripts: https://opus.nlpl.eu/sample/en&sr/NLLB&v1/sample

Our final training corpus also includes sentences in both scripts, which means the fastText language identification filter recognizes both of them as Serbian:

shuf -n 20 public%2Fbuild%2Fcorpus.sr
Како се наводи, ова група је позната под називима "Стронциум", "Фенци бир" и "АПТ28".
Кад се све тачно поравна, сензор ће радити исправно.
-Sudbine gore od smrti, rekao bih, iz ruka potpuno æelavih malih zelenim ljudi upravo došlih iz svemira i željnih bijelih žena.
Како ћеш другачије да уђеш унутра?
Susedne države su Kostarika i Kolumbija.
HIV je nešto sa čime se živi.
Imam par dobrih ovamo, ovamo naprijed samo za vas.
Уосталом, већина земаља, посебно оне које се ослањају на туризам, снабдијевају туристе.
Odmah sam je i pročitao supruzi.
Kako da prestaneš da sabotiraš sebe
Ne laži Angelinu.
Особа са обманом величином може бити одбацујућа од оних који не прихватају њихово обмањујуће веровање или уверења.
Nema ništa loše u tišini.
"Ne, onaj koji vam je to uradio."
-Je li moguæe da je bila bolesnija nego što ste zakljuèili?
Bojiš se, èega se bojiš?
Па ипак, ако такви проблеми заиста постоје, онда може помоћи само доктор.
Neke od mojih najstarijih eksperimentalnih plantaža sada imaju preko 30 godina.
Voleo bi da nije.
Možete pokušati odbiti jednu maminu želju.
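
To quantify the mix instead of eyeballing a 20-line sample, something like the following could be run over the corpus. This is only a sketch: `detect_script` and `script_histogram` are hypothetical helper names, not part of the pipeline, and the classification simply relies on Unicode character names.

```python
import unicodedata
from collections import Counter


def detect_script(sentence: str) -> str:
    """Classify a sentence as 'cyrillic', 'latin', 'mixed' or 'other'.

    Only alphabetic characters are counted; digits and punctuation are ignored.
    """
    seen = set()
    for ch in sentence:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            seen.add("cyrillic")
        elif name.startswith("LATIN"):
            seen.add("latin")
    if not seen:
        return "other"
    if len(seen) == 2:
        return "mixed"
    return seen.pop()


def script_histogram(lines) -> Counter:
    """Count how many lines fall into each script class."""
    return Counter(detect_script(line) for line in lines)
```

Usage would be along the lines of `script_histogram(open(path, encoding="utf-8"))`, where `path` points at the Serbian side of the corpus.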

According to Wikipedia, Cyrillic is the more official script. Google Translate also translates into Cyrillic.

I think we should implement conversion of the training data from Latin to Cyrillic similar to Chinese (#741).
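
A minimal sketch of what such a Latin-to-Cyrillic conversion could look like, assuming a plain character table with greedy digraph matching. `latin_to_cyrillic` is a hypothetical name, and this is not how #741 or any existing library does it. The greedy digraph rule is known to be wrong for words where "nj", "lj" or "dž" span a morpheme boundary (for example "injekcija"), and characters outside the Serbian alphabet are passed through unchanged, so a real implementation would need an exception list and extra cleaning.

```python
import unicodedata

# Serbian Latin digraphs that map to a single Cyrillic letter.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}

# Single-letter mapping (lowercase only; case is restored in the function).
SINGLE = {
    "a": "а", "b": "б", "v": "в", "g": "г", "d": "д", "đ": "ђ", "e": "е",
    "ž": "ж", "z": "з", "i": "и", "j": "ј", "k": "к", "l": "л", "m": "м",
    "n": "н", "o": "о", "p": "п", "r": "р", "s": "с", "t": "т", "ć": "ћ",
    "u": "у", "f": "ф", "h": "х", "c": "ц", "č": "ч", "š": "ш",
}


def latin_to_cyrillic(text: str) -> str:
    """Naively transliterate Serbian Latin text into Cyrillic."""
    # Compose combining carons/acutes so that "s" + caron becomes "š".
    text = unicodedata.normalize("NFC", text)
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair.lower() in DIGRAPHS:
            cyr = DIGRAPHS[pair.lower()]
            out.append(cyr.upper() if pair[0].isupper() else cyr)
            i += 2
            continue
        ch = text[i]
        cyr = SINGLE.get(ch.lower())
        if cyr is None:
            out.append(ch)  # digits, punctuation, unknown letters
        else:
            out.append(cyr.upper() if ch.isupper() else cyr)
        i += 1
    return "".join(out)
```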

eu9ene commented 3 months ago

Also, Flores is in Cyrillic and mtdata_Neulab-tedtalks is in Latin :) So we should not use the latter for evals.

ZJaume commented 1 week ago

Serbian, Bosnian, Croatian and Montenegrin into English can all be handled by a single Latin-script model for Serbo-Croatian, using cyrtranslit to transliterate Cyrillic into Latin. But if transliteration can't happen at run time in the browser, I would advise training two models: hbs_cyr->eng and hbs_lat->eng. Having different variants in the decoder is not advised, though, unless they are tagged with special tokens so the model can learn the differences.
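
For the tagged-decoder case, a rough sketch of how training pairs might be labelled is shown below. The token strings `<2cyrl>` and `<2latn>` and the helper names are made up for illustration; whatever tokens are used would also have to be registered as special tokens in the vocabulary so they are not split into subwords.

```python
# Hypothetical tag tokens, one per target script.
TAGS = {"cyrl": "<2cyrl>", "latn": "<2latn>"}


def dominant_script(text: str) -> str:
    """Return 'cyrl' if Cyrillic letters outnumber Latin ones, else 'latn'."""
    cyr = sum("\u0400" <= ch <= "\u04ff" for ch in text)
    # Letters below U+0250 cover Basic Latin through Latin Extended-B.
    lat = sum(ch.isalpha() and ch < "\u0250" for ch in text)
    return "cyrl" if cyr > lat else "latn"


def tag_pair(src_en: str, tgt: str) -> tuple[str, str]:
    """Prefix the English source with a token naming the target's script."""
    return f"{TAGS[dominant_script(tgt)]} {src_en}", tgt
```

At inference time the same token would be prepended to the source to request a particular script in the output.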

In MaCoCu, we released data for the four variants, separated by script. That could be useful.

gregtatum commented 1 week ago

I think we can port cyrtranslit to JavaScript and add it to the Gecko implementation. It's MIT licensed, and looks fairly straightforward.

gregtatum commented 1 week ago

I guess the other issue is that the web is pretty messy, and it's hard to know what script a page is using, especially if it's mixed. If we can transliterate, I think that's the safer option, provided we still get good results.

ZJaume commented 1 week ago

That transliterator can work with mixed scripts perfectly.
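
The reason mixed input is unproblematic in the Cyrillic-to-Latin direction is that the mapping is per character and anything already in Latin simply passes through. The following is a self-contained sketch of that idea, not cyrtranslit's actual code; `cyrillic_to_latin` is a hypothetical name, and the casing rule for digraphs (Љ becomes "Lj" or "LJ" depending on its neighbours) is an assumption about what output is wanted.

```python
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def cyrillic_to_latin(text: str) -> str:
    """Transliterate Serbian Cyrillic to Latin, leaving other text untouched."""
    out = []
    for i, ch in enumerate(text):
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)  # Latin letters, digits and punctuation pass through
        elif ch.isupper():
            prev = text[i - 1] if i > 0 else ""
            nxt = text[i + 1:i + 2]
            # Treat the letter as part of an all-caps word if its neighbours are.
            all_caps = nxt.isupper() or (not nxt.islower() and prev.isupper())
            out.append(lat.upper() if all_caps else lat.capitalize())
        else:
            out.append(lat)
    return "".join(out)
```

A page that switches script mid-sentence, such as "Imam par добрих ovamo", would come out entirely in Latin, so the model only ever sees one script.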