Closed eu9ene closed 4 months ago
We can also try the new monolingual datasets from HPLT, which are based on web crawls: https://hplt-project.org/datasets/v1
There's more inspiration here: https://www2.statmt.org/wmt23/translation-task.html, for example the Leipzig corpora for Russian: https://wortschatz.uni-leipzig.de/en/download/Russian
We're already doing this in our big training runs. We've been using HPLT and NLLB monolingual data where appropriate, and we're currently seeing a quality increase for en-ru compared to our old models, for example.
The context is that, for example, the English-to-Russian model is not great and sometimes produces incorrect grammar. Adding more monolingual data for backtranslation can help show the model more good examples in the target language.
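To make the backtranslation idea concrete, here is a minimal sketch. The `translate_ru_to_en` stub is hypothetical and just a placeholder for a trained ru->en reverse model; the point is that the target side of each synthetic pair stays genuine human-written Russian, which is what gives the forward en->ru model more examples of correct target-language grammar.

```python
def translate_ru_to_en(sentence: str) -> str:
    """Hypothetical stand-in for a trained ru->en reverse model.

    In a real pipeline this would run inference (e.g. beam search)
    with the reverse model; here it only tags the input so the
    sketch is runnable.
    """
    return f"<en translation of: {sentence}>"


def backtranslate(mono_ru: list[str]) -> list[tuple[str, str]]:
    """Turn target-side monolingual Russian text into synthetic
    (English source, Russian target) training pairs.

    The source side is machine-generated and may be noisy, but the
    target side is authentic Russian, so the en->ru model trained on
    these pairs sees more well-formed target sentences.
    """
    return [(translate_ru_to_en(ru), ru) for ru in mono_ru]


mono_ru = ["Это предложение.", "Ещё одно предложение."]
pairs = backtranslate(mono_ru)
for src, tgt in pairs:
    print(src, "->", tgt)
```

The synthetic pairs would then be mixed with the real parallel data when training the forward en->ru model.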