mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Investigate improving en-lt student distillation by adding more data #772

Closed: gregtatum closed this issue 3 days ago

gregtatum commented 3 months ago

An experiment for #231. Investigation for #756.

English to Lithuanian (en-lt) was one of our worst models and had a very large distillation gap: the teacher scored 0.8971 COMET and the student 0.8642, a gap of -0.0329.

Distillation used only newscrawl data, which is a single domain. This test retrains the student with more diverse data mixes:

| Metric | Value |
| --- | --- |
| Teacher COMET | 89.71 |

| Data | Sentences | Student COMET | Teacher Gap | vs newscrawl* |
| --- | --- | --- | --- | --- |
| newscrawl | 380,607,008 | 86.42 | -3.29 | |
| hplt, nllb | 290,608,310 | 86.48 | -3.23 | +0.06 |
| newscrawl, hplt, nllb | | 86.61 | -3.10 | +0.19 |
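
The derived columns follow directly from the COMET scores. A minimal Python sketch (not part of the pipeline, just illustrating the arithmetic): Teacher Gap is the student score minus the teacher's 89.71, and the last column is each mix's student score minus the newscrawl baseline.

```python
# Reproduce the table's derived columns from the COMET scores above.
TEACHER_COMET = 89.71

student_comet = {
    "newscrawl": 86.42,
    "hplt, nllb": 86.48,
    "newscrawl, hplt, nllb": 86.61,
}

baseline = student_comet["newscrawl"]
for mix, score in student_comet.items():
    teacher_gap = score - TEACHER_COMET   # e.g. 86.42 - 89.71 = -3.29
    vs_newscrawl = score - baseline       # e.g. 86.61 - 86.42 = +0.19
    print(f"{mix:24s} gap={teacher_gap:+.2f} vs newscrawl={vs_newscrawl:+.2f}")
```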

Hypothesis:

- More diverse data for distillation will lower the gap.
- `hplt, nllb` will produce a better score than `newscrawl`.
- `newscrawl, hplt, nllb` will have the highest score, at the cost of a longer `translate-mono-src` and a longer `train-student`.

Links:

Results:

The newscrawl, hplt, nllb mix did have the best score, but not by much more than the standard deviation of ±0.12 COMET. Having all of the data in the mix may help, but it's hard to say it affected much. This experiment doesn't tell us much on its own, since we've determined the decoder is too small for Balto-Slavic languages like en-lt, and I'm not sure it's worth running again; it's more of a null result. Including the hplt+nllb data doesn't hurt and adds data diversity, so it's fine to keep in the mix, but according to this experiment at least, it's not the key to a better model.
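
As a reading aid, here is a minimal sketch of the comparison above: treating a delta over the newscrawl baseline as meaningful only when it exceeds the observed ±0.12 COMET evaluation noise. This is an informal rule of thumb, not a formal significance test, and the numbers are copied from the table above.

```python
# Compare each mix's improvement over the newscrawl-only student against the
# observed evaluation noise of +/-0.12 COMET (informal check, not a test).
EVAL_STDDEV = 0.12  # COMET standard deviation mentioned in the results
BASELINE = 86.42    # newscrawl-only student

deltas = {
    "hplt, nllb": 86.48 - BASELINE,             # +0.06
    "newscrawl, hplt, nllb": 86.61 - BASELINE,  # +0.19
}

for mix, delta in deltas.items():
    verdict = "above noise" if abs(delta) > EVAL_STDDEV else "within noise"
    print(f"{mix:24s} delta={delta:+.2f} -> {verdict}")
```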

gregtatum commented 2 months ago

Running off of: #801

Training dashboard

```sh
task train -- --config configs/experiments-2024-H2/en-lt-experiments-2024-H2-hplt-nllb.yml
task train -- --config configs/experiments-2024-H2/en-lt-experiments-2024-H2.yml
```