mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

[meta] Train easy to segment LTR languages #524

Open gregtatum opened 6 months ago

gregtatum commented 6 months ago

In the short term we are focusing on building up our language list by training easy-to-segment left-to-right (LTR) languages, as they don't require changes to the training pipeline and are immediately supported in Firefox. These are broken into 3 groups, based on the sentence count available in the OPUS datasets.

| Data Availability | Sentence Count |
| --- | --- |
| High Resource | > 80 million |
| Med Resource | 20 - 80 million |
| Low Resource | < 20 million |
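
As a rough illustration of the grouping above, here is a minimal sketch that buckets a language by its sentence count. The function name `resource_group` and the constants are hypothetical and not part of the training pipeline, and the sample counts in the usage lines are made-up placeholders, not real OPUS numbers. The table does not say which group the exact boundaries fall into, so this sketch assumes 20 and 80 million both count as medium.

```python
# Thresholds taken from the table above (sentence counts).
HIGH_THRESHOLD = 80_000_000
LOW_THRESHOLD = 20_000_000


def resource_group(sentence_count: int) -> str:
    """Return "high", "medium", or "low" for a given sentence count.

    Assumption: the boundary values (exactly 20M and 80M) are treated
    as medium, since the table lists the medium range as "20 - 80 million".
    """
    if sentence_count < 0:
        raise ValueError("sentence_count must be non-negative")
    if sentence_count > HIGH_THRESHOLD:
        return "high"
    if sentence_count >= LOW_THRESHOLD:
        return "medium"
    return "low"


# Placeholder counts for illustration only.
example_counts = {"xx": 120_000_000, "yy": 45_000_000, "zz": 3_000_000}
groups = {lang: resource_group(count) for lang, count in example_counts.items()}
print(groups)
```

A real implementation would presumably read the counts from OPUS rather than hard-coding them.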

Assuming that resource availability is roughly equivalent to the quality we will be able to achieve yields the following table:

| High Quality | Medium Quality | Low Quality |
| --- | --- | --- |
| Russian (en-ru) | Vietnamese | Norwegian (Bokmål) |
| Indonesian | Slovak | Basque |
| Czech (en-cs) | Ukrainian (en-uk) | Galician |
| Hungarian (en-hu) | Slovenian (en-sl) | Norwegian (Nynorsk) |
| Turkish (en-tr) | Catalan (ready to ship) | |
| Greek (en-el) | Lithuanian | |
| Finnish (en-fi) | Croatian | |
| Swedish | Serbian | |
| Romanian | Latvian | |
| Danish | Valenciano | |
| Bosnian | | |

We will focus on the potentially "high quality" languages first, and follow up with "medium quality". It's unclear how well the "low quality" languages will perform and whether they will meet our shippable criteria, but that can be evaluated.

Native Speakers

If you are a native (L1) speaker of any of these languages and want to help out, feel free to leave a comment on this issue or join us in Firefox Translations on Matrix. We can always use help with qualitative model evaluation and with questions about the languages.

gregtatum commented 5 months ago

For our upcoming training run, this table should summarize what monolingual data is available.

| Name | Difficulty | To en | From en | Newscrawl |
| --- | --- | --- | --- | --- |
| Russian | ready to train | Released | Nightly | yes |
| Indonesian | ready to train | | | yes |
| Czech | ready to train | Nightly | Nightly | yes |
| Hungarian | ready to train | Released | Nightly | yes |
| Turkish | ready to train | | | yes |
| Greek | ready to train | | | yes |
| Finnish | ready to train | Released | Nightly | yes |
| Romanian | ready to train | | | yes |
| Ukrainian | medium resource | Released | Nightly | yes |
| Lithuanian | medium resource | Nightly | | yes |
| Croatian | medium resource | | | yes |
| Serbian | medium resource | | | yes |
| Latvian | medium resource | | | yes |
| Bosnian | ready to train | | | yes |
| Vietnamese | medium resource | | | no |
| Swedish | ready to train | | | no |
| Slovak | medium resource | | | no |
| Danish | ready to train | | | no |
| Slovenian | medium resource | | | no |
| Valenciano | medium resource | | | no |

marco-c commented 5 months ago

MaCoCu has monolingual data for some of these languages: https://macocu.eu/#corpora-section