mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

Investigate multilingual models for similar language groups #684

Open gregtatum opened 1 week ago

gregtatum commented 1 week ago

Bosnian, Croatian, Montenegrin, and Serbian form a group of very similar, mutually intelligible languages. Per Wikipedia:

It is a pluricentric language with four mutually intelligible standard varieties, namely Serbian, Croatian, Bosnian, and Montenegrin.

We should investigate shipping multilingual models for other language groups like this. Valencian and Catalan are fairly similar and could use the same trick. I'd be curious whether combining several Iberian Peninsula languages, such as Galician, Valencian, Spanish, and Catalan, would improve results, and possibly deliver higher quality for lower-resource languages like Valencian and Galician.

I would imagine the architecture change needed for this would be a control token that signals which language we want for the translation result. Perhaps when decoding the sentence, the first token could be a [bosnian] or [croatian] control token, which would signal how to decode the rest of the sentence. It would be good to investigate Marian and other sources in the literature on how to accomplish this. A quick look at the Marian docs didn't turn up anything related.
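To make the control-token idea concrete, here is a minimal Python sketch of the data-preparation side. It is not something this pipeline or Marian provides. It assumes the variant often described in the multilingual NMT literature, where the token is prepended to the source sentence rather than forced as the first decoded token. The names `CONTROL_TOKENS`, `tag_source`, and `tag_corpus` are hypothetical, and the sketch does not cover making sure the tokenizer keeps each tag as a single vocabulary entry.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical tag set: the bracketed names follow the examples in this issue.
CONTROL_TOKENS = {
    "bs": "[bosnian]",
    "hr": "[croatian]",
    "cnr": "[montenegrin]",
    "sr": "[serbian]",
}


def tag_source(sentence: str, target_lang: str) -> str:
    """Prepend the control token for the desired target language."""
    try:
        token = CONTROL_TOKENS[target_lang]
    except KeyError:
        raise ValueError(f"No control token for language {target_lang!r}") from None
    return f"{token} {sentence.strip()}"


def tag_corpus(
    examples: Iterable[Tuple[str, str, str]],
) -> Iterator[Tuple[str, str]]:
    """Turn (source, target, target_lang) triples into tagged training pairs.

    Only the source side is changed; the target sentence is left as-is.
    """
    for source, target, target_lang in examples:
        yield tag_source(source, target_lang), target
```

With this layout, the same tagging would be applied at inference time: the caller picks the variant by choosing the tag, and the model sees it as the first source token.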

gregtatum commented 1 week ago

Oh, and perhaps this could help solve #681 by having something like [sr-Cyrl] and [sr-Latn] as control tokens as well.