Filter monolingual data based on fluency scores

mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models

https://mozilla.github.io/firefox-translations-training/

Mozilla Public License 2.0

Filter monolingual data based on fluency scores #789

Open gregtatum opened 1 month ago

gregtatum commented 1 month ago

The HPLT dataset includes a fluency score. We should look at filtering our own data by this fluency metric and see whether it improves the results.

https://hplt-project.org/datasets/v1.2

I assume this would be useful for synthesizing back translations, and less useful for synthesizing distillation data.

https://aclanthology.org/2024.lrec-main.100.pdf

fluency score, computed with a 7-gram modified Kneser-Ney character language model

eu9ene commented 1 month ago

I filtered the HPLT data at fluency-score thresholds of 0.8 and 0.9, chosen after manual inspection of the data. I used 0.8 for distillation to keep more data, and 0.9 for back-translations, on the assumption that target sentences for back-translation should not contain any noise, so that the models don't learn to reproduce it.

It would be interesting to try this model for other monolingual data.
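
For illustration, the thresholding described above could look roughly like the sketch below. It is not the pipeline's actual code: the function and constant names are hypothetical, the input is assumed to be `(sentence, score)` pairs with higher scores meaning more fluent text, the sample scores are made up, and whether the cutoff is inclusive (`>=`) is also an assumption.

```python
from typing import Iterable, Iterator, Tuple

# Thresholds taken from the comment above; the constant names are hypothetical.
DISTILLATION_THRESHOLD = 0.8
BACK_TRANSLATION_THRESHOLD = 0.9


def filter_by_fluency(
    scored_lines: Iterable[Tuple[str, float]], threshold: float
) -> Iterator[str]:
    """Yield sentences whose fluency score is at least `threshold`.

    Assumes each item is a (sentence, score) pair and that an inclusive
    cutoff is wanted; neither detail is confirmed by the thread.
    """
    for sentence, score in scored_lines:
        if score >= threshold:
            yield sentence


# Made-up sample data, only to show the effect of the two thresholds.
sample = [
    ("A clean, fluent sentence.", 0.95),
    ("Mostly fine sentence with minor noise", 0.85),
    ("cl1ck h3re >>> buy now!!!", 0.40),
]

# The looser cutoff keeps more data; the stricter one keeps only the cleanest lines.
for_distillation = list(filter_by_fluency(sample, DISTILLATION_THRESHOLD))
for_back_translation = list(filter_by_fluency(sample, BACK_TRANSLATION_THRESHOLD))
```

Because the filter is a generator, it can be applied to a streamed corpus without loading everything into memory, which matters for monolingual datasets of this size.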