mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Experiment with using more monolingual data #181

Closed by eu9ene 4 months ago

eu9ene commented 1 year ago

The context is that, for example, the English-to-Russian model is not great and sometimes produces incorrect grammar. Adding more monolingual data for back-translation could help by showing the model more good examples in the target language.
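For readers less familiar with back-translation, here is a minimal sketch of the idea, not the pipeline's actual implementation. It assumes a reverse (target-to-source) model is available. `back_translate` and `toy_ru_to_en` are hypothetical names, and the lookup table is only a stand-in for a real ru-en model. Each real target-language sentence is paired with a machine-generated source sentence, so the forward model always trains on fluent, human-written output.

```python
from typing import Callable, Iterable, List, Tuple


def back_translate(
    target_sentences: Iterable[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each real target-language sentence with a synthetic source.

    Returns (synthetic_source, real_target) tuples that can be mixed into
    the parallel corpus used to train the forward (source-to-target) model.
    """
    pairs = []
    for target in target_sentences:
        target = target.strip()
        if not target:
            continue  # skip empty lines in the monolingual corpus
        synthetic_source = reverse_translate(target)
        pairs.append((synthetic_source, target))
    return pairs


def toy_ru_to_en(sentence: str) -> str:
    # Hypothetical stand-in for a real ru->en model (assumption for illustration).
    lookup = {"Привет, мир.": "Hello, world.", "Спасибо.": "Thank you."}
    return lookup.get(sentence, "<unk>")


# Monolingual Russian text (the target side for an en->ru model).
mono_ru = ["Привет, мир.", "", "Спасибо."]
synthetic_corpus = back_translate(mono_ru, toy_ru_to_en)
```

The target side stays untouched, which is why more monolingual data in the target language is expected to help with grammar: any noise lands on the synthetic source side only.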

eu9ene commented 1 year ago

We can also try the new monolingual datasets from HPLT, which are based on web crawls: https://hplt-project.org/datasets/v1

eu9ene commented 1 year ago

More inspiration can be found on the WMT23 translation task page: https://www2.statmt.org/wmt23/translation-task.html
For example, the Leipzig Corpora Collection downloads for Russian: https://wortschatz.uni-leipzig.de/en/download/Russian

eu9ene commented 4 months ago

We're already doing this in our big training runs. We've been using HPLT and NLLB monolingual data where appropriate, and we currently see an increase in quality for en-ru, for example, compared to our old models.