mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Figure out the behavior of OpusTrainer augmentation on student distillation gap #773

Closed — gregtatum closed this 3 weeks ago

gregtatum commented 3 months ago

An experiment for #231

We use OpusTrainer to augment the data during training, both for teacher training and for student training. There is a quality gap in student training (the distillation gap), and it would be good to understand the effects of data augmentation on it. It's particularly important to compare results between the augmented and clean flores devtest sets.
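For context, the kind of noise that augmentation introduces can be illustrated with a minimal Python sketch. This is not the pipeline's actual implementation (the real modifiers and rates live in the OpusTrainer config); the function name and probabilities below are hypothetical:

```python
import random

def augment_pair(src: str, trg: str, p_upper: float = 0.05, p_title: float = 0.05) -> tuple[str, str]:
    """Illustrative sketch of case-noise augmentation on a parallel sentence pair.

    OpusTrainer applies modifiers of this flavor (uppercasing, title-casing, typos, ...)
    to a configurable fraction of training lines; the probabilities here are made up.
    """
    r = random.random()
    if r < p_upper:
        # Upper-case both sides, as in the flores-aug-upper style of evaluation.
        return src.upper(), trg.upper()
    if r < p_upper + p_title:
        # Title-case both sides.
        return src.title(), trg.title()
    # Leave the pair unchanged most of the time.
    return src, trg
```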

Language pair: TODO

Experiment Splits

| Strategies | flores-devtest | flores-aug-devtest | Training Time |
|---|---|---|---|
| Run augmentation | | | |
| Disable augmentation | | | |
| Two stage: aug, no aug | | | |
| Two stage: no aug, aug | | | |

Hypothesis

Augmentation increases student training time. Augmentation behavior may not be learned without augmented data.

gregtatum commented 3 months ago

We'll do #771 and #772 first.

eu9ene commented 4 weeks ago

I'm going to run a quick experiment for en-ru from the same branch to see the side-by-side effect of disabling augmentation. We don't have to wait for the full training; a couple of days will be enough to see the difference in the validation curves.

eu9ene commented 3 weeks ago

I completed training for the en-ru student with no augmentation. Its evals on the non-augmented dataset are similar to the augmented student's, and worse on the augmented ones, so we can conclude that data augmentation doesn't affect the distillation quality gap.

No augmentation student and evals: https://firefox-ci-tc.services.mozilla.com/tasks/groups/FQ0mxIvFSMiLakX3uxk0uA
Student with augmentation: https://firefox-ci-tc.services.mozilla.com/tasks/groups/CbqKRgg6QKuoGWa8n634Eg

| Strategies | flores-devtest | flores-aug-mix | flores-aug-upper | Training Time |
|---|---|---|---|---|
| Run augmentation | 0.8562 | 0.8493 | 0.7942 | 15 days |
| Disable augmentation | 0.855 | 0.7885 | 0.3944 | 5 days |
*(Screenshot attached: 2024-10-09 at 12:58 PM)*
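To make the gap concrete, here is a small sketch (not part of the original comment) that just recomputes the score deltas between the two strategies from the table above:

```python
# Metric values copied from the table above; this snippet is only illustrative arithmetic.
scores = {
    "Run augmentation":     {"flores-devtest": 0.8562, "flores-aug-mix": 0.8493, "flores-aug-upper": 0.7942},
    "Disable augmentation": {"flores-devtest": 0.8550, "flores-aug-mix": 0.7885, "flores-aug-upper": 0.3944},
}

for dataset in ("flores-devtest", "flores-aug-mix", "flores-aug-upper"):
    delta = scores["Run augmentation"][dataset] - scores["Disable augmentation"][dataset]
    print(f"{dataset}: {delta:+.4f}")

# flores-devtest:   +0.0012  -> clean quality is essentially unchanged
# flores-aug-mix:   +0.0608  -> the non-augmented student degrades on mixed noise
# flores-aug-upper: +0.3998  -> and collapses on all-uppercase input
```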