Open gregtatum opened 6 months ago
For our upcoming training run, this table should summarize what monolingual data is available.
Name | Difficulty | To en | From en | Newscrawl |
---|---|---|---|---|
Russian | ready to train | Released | Nightly | yes |
Indonesian | ready to train | | | yes |
Czech | ready to train | Nightly | Nightly | yes |
Hungarian | ready to train | Released | Nightly | yes |
Turkish | ready to train | | | yes |
Greek | ready to train | | | yes |
Finnish | ready to train | Released | Nightly | yes |
Romanian | ready to train | | | yes |
Ukrainian | medium resource | Released | Nightly | yes |
Lithuanian | medium resource | Nightly | yes | |
Croatian | medium resource | | | yes |
Serbian | medium resource | | | yes |
Latvian | medium resource | | | yes |
Bosnian | ready to train | | | yes |
Vietnamese | medium resource | | | no |
Swedish | ready to train | | | no |
Slovak | medium resource | | | no |
Danish | ready to train | | | no |
Slovenian | medium resource | | | no |
Valenciano | medium resource | | | no |
MaCoCu has monolingual data for some of these languages: https://macocu.eu/#corpora-section.
In the short term we are focusing on building up our language list by training easy-to-segment LTR languages, as they don't require changes to the training pipeline and are immediately supported in Firefox. These are broken into 3 groups, based on resource count from the OPUS datasets.
Assuming that resource availability is roughly equivalent to the quality we will be able to achieve yields the following table:
We will focus on potentially "high quality" languages first, and follow up with "medium quality". It's unclear how well the "low quality" languages will perform and whether they will meet our criteria for shipping, but that can be evaluated.
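The three-way grouping by OPUS resource count described above could be sketched roughly as follows. This is only an illustration: the thresholds, the function names, and the language codes and counts in the example are all hypothetical, since the issue does not state the actual cut-offs or numbers.

```python
from typing import Dict, List

# Hypothetical cut-offs in sentence pairs; the issue does not give the real ones.
HIGH_THRESHOLD = 50_000_000
MEDIUM_THRESHOLD = 10_000_000


def resource_group(sentence_pairs: int) -> str:
    """Map a parallel-sentence count to one of three resource groups."""
    if sentence_pairs >= HIGH_THRESHOLD:
        return "high"
    if sentence_pairs >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"


def group_languages(counts: Dict[str, int]) -> Dict[str, List[str]]:
    """Bucket language codes by resource group, largest corpus first."""
    groups: Dict[str, List[str]] = {"high": [], "medium": [], "low": []}
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    for lang, pairs in ordered:
        groups[resource_group(pairs)].append(lang)
    return groups


# Placeholder counts for made-up language codes, not real OPUS numbers.
example_counts = {"xx": 80_000_000, "yy": 20_000_000, "zz": 3_000_000}
print(group_languages(example_counts))
```

Sorting within each group by corpus size would give a natural training order inside a tier, though whether resource count actually predicts quality is the assumption the issue itself flags.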
More links
Native Speakers
If you are a native (L1) speaker of any of these languages and want to help out, feel free to leave a comment on this issue or join us in Firefox Translations on Matrix. We can always use help with qualitative model evaluation and questions regarding language.