Investigate multilingual models for similar language groups

Bosnian, Croatian, Montenegrin, and Serbian form a group of very similar, mutually intelligible languages. Per Wikipedia:

It is a pluricentric language with four mutually intelligible standard varieties, namely Serbian, Croatian, Bosnian, and Montenegrin.

We should investigate shipping a single multilingual model for language groups like this. Valencian and Catalan are also very similar to each other and could use the same approach. I'd be curious whether combining several Iberian Peninsula languages, such as Galician, Valencian, Spanish, and Catalan, would improve results overall, and possibly deliver higher quality for lower-resource languages like Valencian and Galician.

I would imagine the main architecture change needed for this is a control token that signals which language we want for the translation result. Perhaps when decoding the sentence, the first token could be a [bosnian] or [croatian] control token that signals how to decode the rest of the sentence. It would be good to investigate Marian and other sources in the literature on how to accomplish this. A quick look at the Marian docs didn't turn up anything related.
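
To make the control-token idea concrete, here is a minimal sketch of how training data could be tagged. It is an illustration, not how Marian or this repo does it. It uses a variant of the idea that is common in the multilingual NMT literature: the tag is prepended to the source sentence rather than forced as the first decoded token, which needs no decoder changes. The function names, the tag spellings, and the placeholder target strings are all hypothetical.

```python
# Hypothetical tag set; these names are assumptions for illustration only.
SUPPORTED_TAGS = {"bosnian", "croatian", "montenegrin", "serbian"}


def add_control_token(source: str, target_lang: str) -> str:
    """Prefix a source sentence with a token naming the desired output variety."""
    if target_lang not in SUPPORTED_TAGS:
        raise ValueError(f"unknown target language: {target_lang}")
    return f"[{target_lang}] {source.strip()}"


def tag_corpus(triples):
    """Turn (source, target, target_lang) triples into tagged (source, target) pairs."""
    return [(add_control_token(src, lang), tgt) for src, tgt, lang in triples]


if __name__ == "__main__":
    # Placeholder targets; real data would hold the actual translations.
    examples = [
        ("Where is the train station?", "<croatian translation>", "croatian"),
        ("Where is the train station?", "<serbian translation>", "serbian"),
    ]
    for tagged_source, target in tag_corpus(examples):
        print(tagged_source, "->", target)
```

At inference time the same prefix would select the output variety. The tag would presumably also need to be reserved as a single vocabulary entry so the tokenizer does not split it into pieces.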

mozilla / firefox-translations-training

Investigate multilingual models for similar language groups #684