Limit the amount of data used for distillation

gregtatum commented 3 weeks ago

In #771 I ran an experiment to see how the size of the distillation corpus affects the COMET score of the students. Adding more data to this step did not change the COMET score beyond the standard deviation (±0.12 COMET) of training student models.
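To make the "beyond the standard deviation" comparison concrete, here is a minimal sketch. The function name and the example scores are hypothetical; only the ±0.12 figure comes from the experiment above. It treats a score difference no larger than one standard deviation as noise, which is a simplification rather than a proper significance test.

```python
def within_noise(baseline_comet: float, candidate_comet: float,
                 std_dev: float = 0.12) -> bool:
    """Return True if the two COMET scores differ by no more than std_dev.

    std_dev defaults to the run-to-run standard deviation reported for
    student training (0.12 COMET). The scores passed in are up to the caller.
    """
    return abs(candidate_comet - baseline_comet) <= std_dev


# Hypothetical scores, for illustration only (not results from #771).
print(within_noise(80.0, 80.1))  # difference is inside the noise band
print(within_noise(80.0, 80.5))  # difference is outside the noise band
```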

Synthesizing the training pairs from the monolingual data is one of the more expensive parts of the pipeline, so we should limit the amount of data we throw at it.

For this work we need to:

1. Determine the cut-off threshold.
2. Determine how we mix the source side of the parallel corpus with the source monolingual data.

1. Threshold cut-off

In our 1:1, @eu9ene proposed 50 million, which feels like a reasonable initial threshold to me. He mentioned that we shouldn't rely 100% on the evaluation metrics, since more data diversity could produce a better general model for translating the web. There is a risk that our evaluation data is not diverse enough to capture this, so we should be conservative in how much we cut off.

I think we can probably go even lower if we want to, as the results were the same for 30M in da-en. I still have an experiment running with 1M and 10,000 to further test the limits here.

We should verify that these results still hold for a language pair with a Balto-Slavic language, like en-lt.

2. How to mix

I'm not sure how we want to mix our data, or whether @eu9ene has thoughts here. We could collect all of our source parallel data and all of the available monolingual data, and then mix and truncate it. This is what I was doing in my experiment.

It's likely that we'll have more parallel source data than the 50 million cut-off for many languages.
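A minimal sketch of the "mix and truncate" idea, assuming the corpora fit in memory as lists of source sentences. The function and parameter names are hypothetical and this is not the pipeline's actual implementation; a real corpus of this size would likely need a streaming or on-disk shuffle instead.

```python
import random
from typing import List, Sequence


def mix_and_truncate(parallel_source: Sequence[str],
                     mono_source: Sequence[str],
                     max_sentences: int,
                     seed: int = 0) -> List[str]:
    """Pool both sources, shuffle them together, and keep max_sentences.

    parallel_source: source side of the parallel corpus.
    mono_source: source-language monolingual sentences.
    max_sentences: the cut-off threshold (for example 50_000_000).
    """
    pooled = list(parallel_source) + list(mono_source)
    # A seeded RNG keeps the mix reproducible between runs.
    random.Random(seed).shuffle(pooled)
    # If the pool is smaller than the threshold, everything is kept.
    return pooled[:max_sentences]


# Toy usage with placeholder sentences.
parallel = [f"parallel sentence {i}" for i in range(6)]
mono = [f"mono sentence {i}" for i in range(4)]
mixed = mix_and_truncate(parallel, mono, max_sentences=5, seed=1)
print(len(mixed))
```

Because the pool is shuffled before truncation, each source ends up represented roughly in proportion to its size. If parallel data alone exceeds the threshold, a plain mix like this still drops part of it, so whether to prioritize one source over the other remains an open choice.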

ZJaume commented 3 weeks ago

It may be that the effect of adding synthetically translated monolingual data is more noticeable if the language pair is low/mid-resource. Backtranslation usually has a big impact in low-resource settings.

eu9ene commented 3 weeks ago

> It may be that the effect of adding synthetically translated monolingual data is more noticeable if the language pair is low/mid-resource. Backtranslation usually has a big impact in low-resource settings.

We currently don't use back-translations for distillation.

eu9ene commented 3 weeks ago

I think when we train a "tiny" student model it has limited complexity and adding more data doesn't help at some point. So the model kind of underfits. When we increase the model size to the "base" architecture it will be a completely different picture. Also, I'm pretty sure it's different for each language.

Just to clarify, if I understand correctly @gregtatum is talking about limiting the whole mix of original corpus + mono data, not only the mono part.

We can't run such an experiment for each language and config. With all that said, I'd rather oversupply the data, because undersupplying it risks losing quality without knowing it, both on evaluation benchmarks and in the wild. I'd rather show the model more diverse data than train it in a loop of multiple epochs on the same data. I'd say 50M sounds too low to me. Maybe 200M or so would be safer. Just a guess.

From a cost-efficiency perspective, I recommend focusing on #453 first. There's a lot of GPU underutilization, etc. there.

mozilla / translations

Limit the amount of data used for distillation #905