English to Lithuanian (en-lt) was one of our worst models and had a very large distillation gap: the teacher scored 0.8971 COMET, while the student scored 0.8642, a gap of -0.0329.
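As a quick sanity check, here is a minimal sketch of the gap arithmetic used throughout this report. The scores are the ones quoted above; the variable names are just for illustration and nothing here is pipeline code.

```python
# Minimal sketch of the distillation gap arithmetic (not pipeline code).
teacher_comet = 0.8971
student_comet = 0.8642

# Gap is student minus teacher; negative means the student is worse.
gap = student_comet - teacher_comet
print(f"distillation gap: {gap:+.4f}")        # -0.0329
print(f"in COMET points:  {gap * 100:+.2f}")  # -3.29, matching the table below
```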
Distillation used only newscrawl data, which is a single domain. This test retrains the student with the following data mixes:
| Metric | Value |
| --- | --- |
| Teacher COMET | 89.71 |

| Data | Sentences | Student COMET | Teacher Gap | vs newscrawl* |
| --- | --- | --- | --- | --- |
| newscrawl | 380,607,008 | 86.42 | -3.29 | |
| hplt, nllb | 290,608,310 | 86.48 | -3.23 | +0.06 |
| newscrawl, hplt, nllb | | 86.61 | -3.10 | +0.19 |
Hypothesis:
More diverse data for distillation will lower the gap. `hplt, nllb` will produce a better score than `newscrawl`. `newscrawl, hplt, nllb` will have the highest score at the cost of a longer `translate-mono-src` and a longer `train-student`.
Results:

The `newscrawl, hplt, nllb` mix clearly had the best score, but the gain is not much above the standard deviation of ±0.12 COMET. Having everything in the mix may help a little, but it's hard to say it had much effect. This experiment doesn't tell us much on its own, since we've already determined the decoder is too small for Balto-Slavic languages like en-lt, and it's essentially a null result, so I'm not sure it's worth running again. The `hplt+nllb` data doesn't harm anything and adds data diversity, so it's fine to have in the mix, but according to this experiment at least it is not the key to a better model.
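For context, a hedged sketch of the comparison behind that conclusion. The scores and the ±0.12 COMET standard deviation come from the table above; treating "more than one standard deviation" as the cut-off is my own rough heuristic, not a formal significance test.

```python
# Rough comparison of each data mix against the newscrawl-only baseline.
# Scores and the ±0.12 COMET standard deviation are from the report above;
# the one-standard-deviation cut-off is an assumption, not a formal test.
baseline = 86.42  # newscrawl-only student COMET
std_dev = 0.12

candidates = {
    "hplt, nllb": 86.48,
    "newscrawl, hplt, nllb": 86.61,
}

for mix, score in candidates.items():
    delta = score - baseline
    verdict = "above" if abs(delta) > std_dev else "within"
    print(f"{mix}: {delta:+.2f} COMET ({verdict} the ±{std_dev} noise band)")
```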
An experiment for #231. Investigation for #756.