[meta] Train RTL languages like Arabic and Hebrew #525

mozilla / translations

The code, training pipeline, and models that power Firefox Translations

Mozilla Public License 2.0

RTL languages shouldn't affect training itself, but supporting them will require some work on the Firefox side. This meta bug tracks any work that is needed. We should first complete a subset of the easier-to-segment LTR languages in #524, as they do not require Firefox changes. The RTL languages will require a bit more work.

### Tasks
- [ ] [Bug 1876099 - Audit and fix any issues around translating bidirectional language pairs (RTL to LTR, or LTR to RTL)](https://bugzilla.mozilla.org/show_bug.cgi?id=1876099)
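
To give a rough idea of what a bidirectional pair involves, here is a minimal Python sketch that guesses the dominant direction of a string from the Unicode bidi class of its characters. The helper name `dominant_direction` is hypothetical, and this is not how Firefox or the training pipeline handles direction; it only illustrates that Hebrew and Arabic letters carry a strong right-to-left class while Latin letters carry a strong left-to-right one.

```python
import unicodedata


# Hypothetical helper for illustration only; not part of Firefox
# or the translations pipeline.
def dominant_direction(text: str) -> str:
    """Return "rtl", "ltr", or "neutral" based on strong bidi characters."""
    rtl = 0
    ltr = 0
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew letters are "R", Arabic letters are "AL"
            rtl += 1
        elif bidi_class == "L":  # Latin and other strong left-to-right letters
            ltr += 1
    if rtl == ltr:
        # No strong characters at all (empty, digits, punctuation) or an exact tie.
        return "neutral"
    return "rtl" if rtl > ltr else "ltr"
```

Mixed strings, such as a Hebrew sentence containing a Latin product name, are the cases an audit would need to look at most closely, since a simple majority count like this one says nothing about how the text should be laid out.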

There might be some tokenization/segmentation work around Arabic as well.

### Native Speakers

If you are a native (L1) speaker of any of these languages and want to help out, feel free to leave a comment on this issue or join us in the Firefox Translations room on Matrix. We can always use help with qualitative model evaluation and with questions about the languages.

I've found some resources I believe could be useful for training Hebrew models.

At the moment, the best language-pair datasets for Hebrew are the large multilingual ones on OPUS:

- NLLB
- XLEnt

Here are a few more that are not on OPUS and might be worth checking as well:

- HebNLI - Manually verified machine translation
- HebWiki QA - Manually verified machine translation
- word2word - A simple word-to-word pair dataset
- Hebrew WordNet (Archived) - The website seems to be down. I will try to see if they have a backup.

Some extra resources:

- HebSpacy - An NER model used by Azure AI Language for TA4H in Hebrew. Made in collaboration between Microsoft, the Israeli Ministry of Health, an Israeli HMO, and others.
- The ONLP Lab - An NLP research lab at Bar-Ilan University in Israel. They have a lot of Hebrew NLP resources and models.
- Hebrew NLP Resources - A list of Hebrew NLP resources compiled by NNLP, an Israeli government project for advancing Hebrew NLP.
