Support multilingual models

Multilingual models can be useful for groups of similar languages such as Bosnian-Croatian-Montenegrin-Serbian (BCMS: bs, hr, cnr, sr), which are usually trained jointly so that more data is available.

Required pipeline modifications:

using an alias instead of a language pair and listing the member languages in the config
gathering OPUS datasets for each member language
adjusting cleaning to use the joint bicleaner model: https://huggingface.co/bitextor/bicleaner-ai-full-en-hbs
adjusting evals, since Flores provides different evaluation datasets for each language
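
A rough sketch of how the alias idea above could work, in Python. Everything here is an assumption for illustration: the `LANGUAGE_ALIASES` table, the function names, and the `hbs` alias (borrowed from the bicleaner model name) are hypothetical and not part of the existing pipeline or its config schema. The set of languages with evals is a placeholder, not a statement about what Flores actually covers.

```python
# Hypothetical alias table: maps a group alias to its member languages.
# "hbs" is taken from the bicleaner model name; the mapping itself is an assumption.
LANGUAGE_ALIASES = {
    "hbs": ["bs", "hr", "cnr", "sr"],
}


def expand_language(code: str) -> list[str]:
    """Return the member languages of an alias, or the code itself if it is not an alias."""
    return list(LANGUAGE_ALIASES.get(code, [code]))


def expand_pair(src: str, trg: str) -> list[tuple[str, str]]:
    """Expand an aliased pair such as en-hbs into concrete language pairs."""
    return [(s, t) for s in expand_language(src) for t in expand_language(trg)]


def pairs_with_evals(
    pairs: list[tuple[str, str]], eval_languages: set[str]
) -> list[tuple[str, str]]:
    """Keep only the pairs where both sides have an evaluation dataset."""
    return [(s, t) for s, t in pairs if s in eval_languages and t in eval_languages]


if __name__ == "__main__":
    pairs = expand_pair("en", "hbs")
    print(pairs)
    # Placeholder availability set; the real one depends on the eval datasets.
    print(pairs_with_evals(pairs, {"en", "hr", "sr"}))
```

Dataset gathering and evals could then loop over the expanded pairs, while training still sees a single aliased pair.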

A somewhat workable workaround is to train a model for one language and add data for the other languages as custom datasets on GCP, but this skips language-specific evals and the targeted bicleaner model. Skipping the latter is likely fine, since the new multilingual bicleaner model is supposed to be very strong.

mozilla/translations, issue #726