mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Investigate improving en-lt student distillation by adding more data #772

Closed: gregtatum closed this issue 3 days ago

gregtatum commented 3 months ago

An experiment for #231. Investigation for #756.

English to Lithuanian (en-lt) was one of our worst models and had a very large distillation gap: the teacher scored 0.8971 COMET and the student 0.8642, a gap of -0.0329.

Distillation used only newscrawl data, which is a single domain. This test retrains the student with more diverse data mixes:

| Metric | Value |
| --- | --- |
| Teacher COMET | 89.71 |

| Data | Sentences | Student COMET | Teacher Gap | vs newscrawl* |
| --- | --- | --- | --- | --- |
| newscrawl | 380,607,008 | 86.42 | -3.29 | |
| hplt, nllb | 290,608,310 | 86.48 | -3.23 | +0.06 |
| newscrawl, hplt, nllb | | 86.61 | -3.10 | +0.19 |
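
The derived columns follow directly from the COMET scores. A minimal Python sketch (not part of the pipeline, just illustrating the arithmetic): Teacher Gap is the student score minus the teacher's 89.71, and the last column is each mix's student score minus the newscrawl baseline.

```python
# Reproduce the table's derived columns from the COMET scores above.
TEACHER_COMET = 89.71

student_comet = {
    "newscrawl": 86.42,
    "hplt, nllb": 86.48,
    "newscrawl, hplt, nllb": 86.61,
}

baseline = student_comet["newscrawl"]
for mix, score in student_comet.items():
    teacher_gap = score - TEACHER_COMET   # e.g. 86.42 - 89.71 = -3.29
    vs_newscrawl = score - baseline       # e.g. 86.61 - 86.42 = +0.19
    print(f"{mix:24s} gap={teacher_gap:+.2f} vs newscrawl={vs_newscrawl:+.2f}")
```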

Hypothesis:

- More diverse data for distillation will lower the gap.
- `hplt, nllb` will produce a better score than `newscrawl`.
- `newscrawl, hplt, nllb` will have the highest score, at the cost of a longer `translate-mono-src` and a longer `train-student`.

Links:

Results:

The newscrawl, hplt, nllb mix did have the best score, but not by much more than the standard deviation of ±0.12 COMET. Having all of the data in the mix may help, but it's hard to say it affected much. This experiment doesn't tell us much on its own, since we've determined the decoder is too small for Balto-Slavic languages like en-lt, and I'm not sure it's worth running again; it's more of a null result. Including the hplt+nllb data doesn't hurt and adds data diversity, so it's fine to keep in the mix, but according to this experiment at least, it's not the key to a better model.
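
As a reading aid, here is a minimal sketch of the comparison above: treating a delta over the newscrawl baseline as meaningful only when it exceeds the observed ±0.12 COMET evaluation noise. This is an informal rule of thumb, not a formal significance test, and the numbers are copied from the table above.

```python
# Compare each mix's improvement over the newscrawl-only student against the
# observed evaluation noise of +/-0.12 COMET (informal check, not a test).
EVAL_STDDEV = 0.12  # COMET standard deviation mentioned in the results
BASELINE = 86.42    # newscrawl-only student

deltas = {
    "hplt, nllb": 86.48 - BASELINE,             # +0.06
    "newscrawl, hplt, nllb": 86.61 - BASELINE,  # +0.19
}

for mix, delta in deltas.items():
    verdict = "above noise" if abs(delta) > EVAL_STDDEV else "within noise"
    print(f"{mix:24s} delta={delta:+.2f} -> {verdict}")
```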

gregtatum commented 2 months ago

Running off of: #801

Training dashboard

```sh
task train -- --config configs/experiments-2024-H2/en-lt-experiments-2024-H2-hplt-nllb.yml
task train -- --config configs/experiments-2024-H2/en-lt-experiments-2024-H2.yml
```