Closed by gregtatum 3 weeks ago
We'll do #771 and #772 first.
I'm going to run a quick experiment for en-ru from the same branch to see the side-by-side effect of disabling augmentation. We don't have to wait for the full training; a couple of days will be enough to see the difference in the validation curves.
I completed training of the en-ru student with no augmentation. Evals on the non-augmented dataset are similar, but worse on the augmented ones, so we can conclude that data augmentation doesn't affect the distillation quality gap.
Student with no augmentation (and evals): https://firefox-ci-tc.services.mozilla.com/tasks/groups/FQ0mxIvFSMiLakX3uxk0uA

Student with augmentation: https://firefox-ci-tc.services.mozilla.com/tasks/groups/CbqKRgg6QKuoGWa8n634Eg
| Strategy | flores-devtest | flores-aug-mix | flores-aug-upper | Training Time |
|---|---|---|---|---|
| Run augmentation | 0.8562 | 0.8493 | 0.7942 | 15 days |
| Disable augmentation | 0.8550 | 0.7885 | 0.3944 | 5 days |
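To make the comparison concrete, here is a small sketch that computes the per-dataset score deltas from the table above (values copied verbatim; the metric is whatever eval score the pipeline reports):

```python
# Eval scores from the table above, keyed by dataset.
with_aug = {"flores-devtest": 0.8562, "flores-aug-mix": 0.8493, "flores-aug-upper": 0.7942}
no_aug = {"flores-devtest": 0.8550, "flores-aug-mix": 0.7885, "flores-aug-upper": 0.3944}

# Positive delta means the augmented student scored higher.
deltas = {name: with_aug[name] - no_aug[name] for name in with_aug}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```

The clean flores-devtest delta is negligible (+0.0012), while the augmented evals drop sharply without augmentation (flores-aug-upper by ~0.40), which is what the conclusion above rests on.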
An experiment for #231
We use OpusTrainer to augment the data during training, both for teacher training and student training. There is a quality gap in student training, and it would be good to understand the effects of data augmentation on it. It's particularly important to compare results between the augmented and clean flores datasets.
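For context, OpusTrainer applies augmentation via modifiers declared in its YAML training config. The sketch below shows the general shape; the modifier rates and dataset names are illustrative assumptions, not the project's actual values:

```yaml
# Illustrative OpusTrainer config fragment (rates and paths are placeholders).
datasets:
  clean: data/corpus.tsv.gz

stages:
  - train

train:
  - clean 1.0
  - until clean 10   # train for 10 epochs over the clean dataset

# Each modifier is applied to a line with the given probability.
modifiers:
  - UpperCase: 0.07  # produces lines like the flores-aug-upper eval set
  - TitleCase: 0.05
  - Typos: 0.05

seed: 1111
```

Disabling augmentation for the experiment amounts to dropping (or zeroing) the `modifiers` section while keeping the rest of the config identical.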
Language pair: TODO
## Experiment Splits

## Hypothesis
Augmentation increases student training time. Augmentation behavior may not be learned without augmented data.