Open gregtatum opened 2 months ago
After manually inspecting the data, I filtered the HPLT data with score thresholds of 0.8 and 0.9. I used 0.8 for distillation to keep more data, and 0.9 for back-translations, on the assumption that the target sentences for back-translations should not include any noise, so that the models don't learn to reproduce it.
It would be interesting to try this model on other monolingual data.
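As a rough illustration of the two thresholds described above, here is a minimal sketch. It assumes each sentence already comes paired with a score between 0 and 1. The names `THRESHOLDS` and `filter_by_score`, the task labels, and the inclusive `>=` comparison are all hypothetical and are not taken from the HPLT tooling or from the actual filtering run.

```python
from typing import Iterable, Iterator, Tuple

# Thresholds from the comment above. The task labels are illustrative names.
THRESHOLDS = {
    "distillation": 0.8,      # looser cut-off, keeps more data
    "back-translation": 0.9,  # stricter cut-off, keeps the targets cleaner
}


def filter_by_score(
    scored_sentences: Iterable[Tuple[str, float]], task: str
) -> Iterator[str]:
    """Yield the sentences whose score meets the threshold for `task`.

    Assumption: a higher score means cleaner text, and a sentence sitting
    exactly on the threshold is kept.
    """
    threshold = THRESHOLDS[task]
    for sentence, score in scored_sentences:
        if score >= threshold:
            yield sentence


# Made-up sample data, for illustration only.
sample = [
    ("A clean, well-formed sentence.", 0.95),
    ("Mostly fine sentence", 0.85),
    ("click here >>> login", 0.40),
]

distillation_data = list(filter_by_score(sample, "distillation"))
back_translation_data = list(filter_by_score(sample, "back-translation"))
```

With the made-up sample, the 0.8 threshold keeps the first two sentences and the 0.9 threshold keeps only the first, which is the more-data versus cleaner-targets trade-off described above.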
The HPLT dataset includes a fluency score. We should look at filtering our own data by this fluency metric and see whether it improves the results.
https://hplt-project.org/datasets/v1.2
I assume this would be useful for synthesizing back-translations, and less useful for synthesizing distillation data.
https://aclanthology.org/2024.lrec-main.100.pdf