It can be useful for groups of closely related languages such as Bosnian-Croatian-Montenegrin-Serbian (BCMS: bs, hr, cnr, sr), which are usually trained jointly so that more data is available.
Required pipeline modifications:
using an alias instead of a language pair, with the member languages listed in the config
handling evals per language, since Flores provides a separate evaluation dataset for each language
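The two modifications above can be sketched roughly as follows. This is only an illustration: the config keys (`aliases`, `evals`), the alias name `bcms`, and the helper functions are hypothetical and do not reflect the pipeline's actual config schema. The sketch also does not check whether Flores actually covers every listed language.

```python
# Hypothetical config shape: these keys are assumptions for illustration,
# not the pipeline's real schema.
config = {
    "src": "bcms",  # alias used in place of a single source language
    "trg": "en",
    "aliases": {"bcms": ["bs", "hr", "cnr", "sr"]},
    "evals": ["flores_devtest"],
}


def expand_languages(config, side):
    """Return the concrete language codes behind config[side].

    If the code is a known alias, return its member languages;
    otherwise treat it as a single ordinary language code.
    """
    code = config[side]
    return list(config.get("aliases", {}).get(code, [code]))


def eval_jobs(config):
    """Build one (dataset, src, trg) evaluation job per concrete language pair."""
    jobs = []
    for src in expand_languages(config, "src"):
        for trg in expand_languages(config, "trg"):
            for dataset in config["evals"]:
                jobs.append((dataset, src, trg))
    return jobs


if __name__ == "__main__":
    for job in eval_jobs(config):
        print(job)
```

With a layout like this, training would see the alias as one joint "language", while evaluation fans out into a separate job for each member language.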
A partial workaround is to train one language and add the data for the other languages as custom datasets on GCP, but this skips the language-specific evals and the targeted bicleaner model. The latter is likely fine since the new multilingual bicleaner model is supposed to be very strong.