mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

[meta] Train RTL languages like Arabic and Hebrew #525

Open gregtatum opened 7 months ago

gregtatum commented 7 months ago

RTL languages shouldn't affect training itself, but supporting them will require some work on the Firefox side. This meta bug tracks that work. We should first complete a subset of the easier-to-segment LTR languages in #524, as they do not require Firefox changes. The RTL languages will require a bit more work.

### Tasks
- [ ] [Bug 1876099 - Audit and fix any issues around translating bidirectional language pairs (RTL to LTR, or LTR to RTL)](https://bugzilla.mozilla.org/show_bug.cgi?id=1876099)

There might be some tokenization/segmentation work around Arabic as well.
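
For anyone unfamiliar with the bidirectional side of this, here is a rough sketch of how the base direction of a piece of text can be guessed from its first strongly directional character. This is only an illustration, not code from this repository or from Firefox; the function name `base_direction` is made up for the example, and it uses nothing beyond Python's standard `unicodedata` module.

```python
import unicodedata

# Unicode bidi classes for "strong" characters:
#   "R"  = right-to-left (e.g. Hebrew letters)
#   "AL" = right-to-left Arabic letters
#   "L"  = left-to-right (e.g. Latin letters)
RTL_CLASSES = {"R", "AL"}
LTR_CLASSES = {"L"}


def base_direction(text: str) -> str:
    """Guess the direction from the first strong character (hypothetical helper).

    Returns "rtl", "ltr", or "neutral" when the text has no strong
    characters at all (digits, spaces, and punctuation only).
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            return "rtl"
        if bidi_class in LTR_CLASSES:
            return "ltr"
    return "neutral"
```

A first-strong heuristic like this is a simplification: it ignores directional isolates and embeddings, and text that mixes both directions in one sentence needs the full Unicode Bidirectional Algorithm.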

### Native Speakers

If you are a native (L1) speaker of any of these languages and want to help out, feel free to leave a comment on this issue or join us in Firefox Translations on Matrix. We can always use help with qualitative model evaluation and with questions about the languages.

BynariStar commented 2 weeks ago

I've found some resources I believe could be useful for training Hebrew models.

At the moment, the best language pair datasets available for Hebrew are the large multilingual ones available on OPUS:

Here are a few more not on OPUS that might be worth checking as well:

Some extra resources: