eu9ene opened this issue 3 months ago
Also, Flores is in Cyrillic and mtdata_Neulab-tedtalks is in Latin :) So we should not use the latter for evals.
Serbian, Bosnian, Croatian and Montenegrin, all into English, can be handled with a single Latin-script model for all of Serbo-Croatian, adding cyrtranslit to transliterate Cyrillic into Latin. But if transliteration can't happen in the browser at execution time, I would advise building two models: hbs_cyr->eng and hbs_lat->eng. Having different variants in the decoder is not advised, though, unless they are tagged with special tokens so the model can learn the differences.
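For illustration, a minimal sketch of the tagging idea: prepend a script token to each source sentence so a single hbs->eng model can see which script it is reading. The tag strings (`>>cyr<<`, `>>lat<<`) and the per-character script check are assumptions for the example, not an existing convention in this project.

```python
import unicodedata

# Hypothetical script tags; any reserved tokens the vocabulary knows about would do.
CYR_TAG = ">>cyr<<"
LAT_TAG = ">>lat<<"

def script_tag(sentence: str) -> str:
    """Return the Cyrillic tag if the sentence contains any Cyrillic letters, else the Latin tag."""
    for ch in sentence:
        if ch.isalpha() and "CYRILLIC" in unicodedata.name(ch, ""):
            return CYR_TAG
    return LAT_TAG

def tag_source(sentence: str) -> str:
    return f"{script_tag(sentence)} {sentence}"

print(tag_source("Добар дан"))  # >>cyr<< Добар дан
print(tag_source("Dobar dan"))  # >>lat<< Dobar dan
```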
In MaCoCu, we released data for the four variants, separated by script. That can be useful.
I think we can port cyrtranslit to JavaScript and add it to the Gecko implementation. It's MIT licensed, and looks fairly straightforward.
I guess the other issue is that the web is pretty messy, and it's hard to know what script a page is using, especially if it's mixed. If we can transliterate, I think that's the safer option, as long as we still get good results.
That transliterator can work with mixed scripts perfectly.
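For example, with the Python cyrtranslit package only Cyrillic characters are mapped and everything else passes through, so mixed-script input comes out as consistent Latin (the sample sentence below is just an illustration):

```python
import cyrtranslit

# Mixed Cyrillic/Latin input: only the Cyrillic characters are transliterated;
# Latin letters, digits and punctuation pass through unchanged.
mixed = "Ово је реченица written u oba pisma."
print(cyrtranslit.to_latin(mixed, "sr"))
# Expected output: "Ovo je rečenica written u oba pisma."
```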
After a quick investigation, I see that the original parallel corpus was filtered from 70M to 35M sentences.
Serbian is digraphic and uses both Latin and Cyrillic scripts.
I see that datasets like NLLB include translations in both scripts: https://opus.nlpl.eu/sample/en&sr/NLLB&v1/sample. Our final training corpus also includes sentences in both scripts, which means the fastText language identification filter recognizes both of them.
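A quick way to double-check what the identifier does with each script (assuming the fasttext Python bindings and the standard lid.176.bin model, which is an assumption about what our filter uses):

```python
import fasttext

# lid.176.bin is fastText's public language-identification model;
# the path here is a placeholder.
model = fasttext.load_model("lid.176.bin")

for sent in ["Ово је тест реченица.", "Ovo je test rečenica."]:
    labels, probs = model.predict(sent)
    print(sent, "->", labels[0].replace("__label__", ""), round(float(probs[0]), 3))
```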
Based on Wikipedia, the Cyrillic script has the more official status. Google Translate also translates into the Cyrillic script.
I think we should implement conversion of the training data from Latin to Cyrillic, similar to what we did for Chinese (#741).
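A minimal sketch of what that preprocessing step could look like, using cyrtranslit to normalize the Serbian side of the corpus to Cyrillic. The file names and the line-by-line approach are placeholders, not the actual pipeline code:

```python
import cyrtranslit

def normalize_to_cyrillic(in_path: str, out_path: str) -> None:
    """Convert every line of a Latin/mixed-script Serbian file to Cyrillic."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            # to_cyrillic maps Latin characters to Cyrillic and leaves
            # characters that are already Cyrillic unchanged.
            fout.write(cyrtranslit.to_cyrillic(line.rstrip("\n"), "sr") + "\n")

normalize_to_cyrillic("corpus.sr", "corpus.cyr.sr")
```

One caveat with a naive character-level conversion: embedded English words, URLs and code would get transliterated too, so we may need to protect or filter those segments, just like with the mixed-script pages mentioned above.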