[meta] Train easy to segment LTR languages

gregtatum commented 6 months ago

In the short term we are focusing on building up our language list by training easy to segment LTR languages, as they don't require changes to the training pipeline, and are immediately supported in Firefox. These are broken into 3 groups, based on the number of sentences available in the OPUS datasets.

Data Availability	Sentence Count
High Resource	> 80 million
Med Resource	20 - 80 million
Low Resource	< 20 million
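
As a rough sketch, the grouping above could be expressed as a small helper. The function name `resource_group` and the constants are hypothetical and not part of the training pipeline. The boundaries follow the table as written, so exactly 20 million and exactly 80 million both land in "Med Resource".

```python
# Hypothetical helper, not taken from the mozilla/translations codebase.
# Thresholds are the ones listed in the table above.
HIGH_THRESHOLD = 80_000_000  # "High Resource" is strictly more than this
MED_THRESHOLD = 20_000_000   # "Low Resource" is strictly less than this


def resource_group(sentence_count: int) -> str:
    """Map an OPUS sentence count to one of the three resource groups."""
    if sentence_count > HIGH_THRESHOLD:
        return "High Resource"
    if sentence_count >= MED_THRESHOLD:
        return "Med Resource"
    return "Low Resource"
```

For example, `resource_group(50_000_000)` falls in the 20 - 80 million band and returns `"Med Resource"`. The count used here is made up for illustration and is not a real OPUS figure.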

Assuming that resource availability is roughly equivalent to the quality we will be able to achieve yields the following table:

High Quality	Medium Quality	Low Quality
Russian (en-ru)	Vietnamese	Norwegian (Bokmål)
Indonesian	Slovak	Basque
Czech (en-cs)	Ukrainian (en-uk)	Galician
Hungarian (en-hu)	Slovenian (en-sl)	Norwegian (Nynorsk)
Turkish (en-tr)	Catalan (ready to ship)
Greek (en-el)	Lithuanian
Finnish (en-fi)	Croatian
Swedish	Serbian
Romanian	Latvian
Danish	Valenciano
Bosnian

We will focus on potentially "high quality" languages first, and follow up with "medium quality". It's unclear how well the "low quality" languages will perform and whether they will meet our shippable criteria, but that can be evaluated.

Native Speakers

If you are a native speaker (L1 language) of any of these languages and want to help out, feel free to leave a comment on this issue or join us in Firefox Translations on Matrix. We can always use help with qualitative model evaluation and with questions about the languages.

gregtatum commented 5 months ago

For our upcoming training run, this table should summarize what monolingual data is available.

Name	Difficulty	To `en`	From `en`	Newscrawl
Russian	ready to train	Released	Nightly	yes
Indonesian	ready to train			yes
Czech	ready to train	Nightly	Nightly	yes
Hungarian	ready to train	Released	Nightly	yes
Turkish	ready to train			yes
Greek	ready to train			yes
Finnish	ready to train	Released	Nightly	yes
Romanian	ready to train			yes
Ukrainian	medium resource	Released	Nightly	yes
Lithuanian	medium resource	Nightly		yes
Croatian	medium resource			yes
Serbian	medium resource			yes
Latvian	medium resource			yes
Bosnian	ready to train			yes
Vietnamese	medium resource			no
Swedish	ready to train			no
Slovak	medium resource			no
Danish	ready to train			no
Slovenian	medium resource			no
Valenciano	medium resource			no

marco-c commented 5 months ago

MaCoCu has monolingual data for some of these languages: https://macocu.eu/#corpora-section.

mozilla / translations

[meta] Train easy to segment LTR languages #524
