mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0
154 stars, 33 forks

Support multilingual models #726

Closed: eu9ene closed this issue 3 months ago

eu9ene commented 3 months ago

Multilingual models would be useful for groups of closely related languages such as Bosnian-Croatian-Montenegrin-Serbian (BCMS: bs, hr, cnr, sr), which are usually trained jointly so that more training data is available.

Required pipeline modifications:

A partial workaround is to train on one language and add data for the other languages as custom datasets on GCP. However, this skips the language-specific evals and the language-targeted bicleaner model. Skipping the targeted bicleaner model is likely fine, since the new multilingual bicleaner model is supposed to be very strong.
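
To illustrate the data-pooling idea behind this workaround, here is a minimal sketch. It is not code from the training pipeline: the function name `pool_parallel_data`, the in-memory corpus layout, and the sample sentences are all hypothetical, and a real run would read dataset files rather than Python lists. It only shows how parallel data from several related source languages might be merged into one deduplicated training set.

```python
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def pool_parallel_data(corpora: Dict[str, Iterable[Pair]]) -> List[Pair]:
    """Merge per-language parallel corpora into one deduplicated list.

    Hypothetical helper for illustration; not part of the pipeline.
    """
    seen = set()
    pooled: List[Pair] = []
    # Iterate languages in sorted order so the output is deterministic.
    for lang in sorted(corpora):
        for src, tgt in corpora[lang]:
            pair = (src.strip(), tgt.strip())
            # Drop pairs with an empty side.
            if not pair[0] or not pair[1]:
                continue
            # Closely related languages often share identical sentences,
            # so exact duplicates are kept only once.
            if pair in seen:
                continue
            seen.add(pair)
            pooled.append(pair)
    return pooled


# Toy data, made up for this example.
corpora = {
    "hr": [("Dobar dan", "Good day"), ("Hvala", "Thank you")],
    "bs": [("Dobar dan", "Good day"), ("Kako si?", "How are you?")],
    "sr": [("Hvala", "Thank you"), ("Dobro jutro", "Good morning")],
}
pooled = pool_parallel_data(corpora)
print(len(pooled))
```

Note that pooling like this loses the per-language split, which is the same reason the workaround skips language-specific evals.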

eu9ene commented 3 months ago

Actually, this is a dupe of #684.