Open gregtatum opened 7 months ago
I've found some resources I believe could be useful for training Hebrew models.
At the moment, the best language pair datasets available for Hebrew are the large multilingual ones available on OPUS:
Here are a few more not on OPUS that might be worth checking as well:
Some extra resources:
Right-to-left (RTL) languages shouldn't affect training, but supporting them will require some work on the Firefox side. This meta bug tracks any work that is needed. We should first complete a subset of the easier-to-segment LTR languages in #524, since they do not require Firefox changes. The RTL languages will require a bit more work.
There might be some tokenization/segmentation work around Arabic as well.
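As a rough illustration of why RTL text shouldn't matter on the training side: Unicode strings are stored in logical (reading) order, and direction only comes into play at display time. The sketch below uses Python's standard `unicodedata` module to flag RTL-dominant lines in a corpus. The helper names `rtl_ratio` and `is_rtl` and the 0.5 threshold are made up for this example and are not part of the training pipeline.

```python
import unicodedata

# Strong right-to-left bidi classes: "R" (e.g. Hebrew) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def rtl_ratio(text: str) -> float:
    """Fraction of strongly directional characters that are right-to-left."""
    classes = [unicodedata.bidirectional(ch) for ch in text]
    # Keep only strong classes; spaces, digits and punctuation are ignored.
    strong = [c for c in classes if c == "L" or c in RTL_CLASSES]
    if not strong:
        return 0.0
    return sum(c in RTL_CLASSES for c in strong) / len(strong)


def is_rtl(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical helper: True when most strong characters are RTL."""
    return rtl_ratio(text) > threshold


if __name__ == "__main__":
    hebrew = "\u05e9\u05dc\u05d5\u05dd"  # the Hebrew word "shalom"
    # Index 0 is the first letter in reading order, regardless of display direction.
    print(is_rtl(hebrew), is_rtl("hello"), hebrew[0] == "\u05e9")
```

A check like this could help sanity-check that a Hebrew or Arabic dataset actually contains the expected script before training, but it says nothing about segmentation quality.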
Native Speakers
If you are a native (L1) speaker of any of these languages and want to help out, feel free to leave a comment on this issue or join us in Firefox Translations on Matrix. We can always use help with qualitative model evaluation and with questions about the languages.