We should investigate shipping a multilingual model for other languages like this. Valencian and Catalan are both fairly similar and could use the same trick. I'd be curious if we combined a bunch of Iberian peninsula languages like Galician, Valencian, Spanish, and Catalan if we could improve results, and possibly deliver higher quality results for lower resource languages like Valencian and Galician.
I would imagine the architecture changes needed for this would be to provide a control token to signal which languages we would want for the translation result. Perhaps when decoding the sentence, the first token could be a [bosnian] or [croation] control token which would signal how to decode the rest of the sentence. It would be good to investigate Marian and other sources in the literature on how to accomplish this. A quick look at Marian docs doesn't have anything related.
Bosnian-Croatian-Montenegrin-Serbian are a language group that are very similar, and mutually intelligible. Per Wikipedia:
We should investigate shipping a multilingual model for other languages like this. Valencian and Catalan are both fairly similar and could use the same trick. I'd be curious if we combined a bunch of Iberian peninsula languages like Galician, Valencian, Spanish, and Catalan if we could improve results, and possibly deliver higher quality results for lower resource languages like Valencian and Galician.
I would imagine the architecture changes needed for this would be to provide a control token to signal which languages we would want for the translation result. Perhaps when decoding the sentence, the first token could be a
[bosnian]
or[croation]
control token which would signal how to decode the rest of the sentence. It would be good to investigate Marian and other sources in the literature on how to accomplish this. A quick look at Marian docs doesn't have anything related.