Filter monolingual data based on fluency scores

mozilla / translations

The code, training pipeline, and models that power Firefox Translations

https://mozilla.github.io/translations/

Mozilla Public License 2.0


Filter monolingual data based on fluency scores #789

Open gregtatum opened 2 months ago

gregtatum commented 2 months ago

The HPLT dataset includes a fluency score. We should look at filtering our own monolingual data by this fluency metric, and see whether it improves the resulting models.

https://hplt-project.org/datasets/v1.2

I assume this would be useful for synthesizing back translations, and less useful for synthesizing distillation data.

https://aclanthology.org/2024.lrec-main.100.pdf

fluency score, computed with a 7-gram modified Kneser-Ney character language model

eu9ene commented 2 months ago

I filtered HPLT data at fluency score thresholds of 0.8 and 0.9, chosen after manual data inspection. I used 0.8 for distillation to keep more data, and 0.9 for back-translations, assuming that target sentences for back-translations should not include any noise, so that the models don't learn to reproduce it.

It would be interesting to try this model for other monolingual data.
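
For illustration, the threshold-based filtering described above could be sketched roughly as follows. This is only a minimal sketch, not the pipeline's actual code: the `ScoredSentence` type, the `filter_by_fluency` helper, and the sample sentences and scores are all hypothetical, and it assumes each sentence already carries a fluency score where higher means more fluent. Only the 0.8 and 0.9 thresholds come from the comment above.

```python
from typing import Iterable, Iterator, NamedTuple


class ScoredSentence(NamedTuple):
    """Hypothetical record: a sentence paired with its fluency score."""
    text: str
    fluency: float  # assumption: higher means more fluent


# Thresholds taken from the comment above: a looser cut-off for
# distillation (more data), a stricter one for back-translations.
THRESHOLDS = {"distillation": 0.8, "back-translation": 0.9}


def filter_by_fluency(
    sentences: Iterable[ScoredSentence], min_score: float
) -> Iterator[str]:
    """Yield the text of sentences whose fluency is at least min_score."""
    for sentence in sentences:
        if sentence.fluency >= min_score:
            yield sentence.text


# Made-up sample data, purely for demonstration.
sample = [
    ScoredSentence("A clean, fluent sentence.", 0.95),
    ScoredSentence("Mostly fine sentence with odd bits", 0.85),
    ScoredSentence("click here >>> home | login | cart", 0.40),
]

distillation_data = list(filter_by_fluency(sample, THRESHOLDS["distillation"]))
back_translation_data = list(
    filter_by_fluency(sample, THRESHOLDS["back-translation"])
)

print(len(distillation_data), len(back_translation_data))
```

Writing the filter as a generator keeps memory use flat on large monolingual corpora, since sentences are streamed rather than loaded all at once.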