mozilla / translations

The code, training pipeline, and models that power Firefox Translations
https://mozilla.github.io/translations/
Mozilla Public License 2.0

Investigate removing teacher ensemble training #778

Open gregtatum opened 3 months ago

gregtatum commented 3 months ago

Training a second teacher improves performance only slightly. It may be more cost-efficient to accept the quality hit and remove the second teacher.

| COMET Change | Average Type |
| ------------ | ------------ |
| +0.15        | Mean         |
| +0.14        | Median       |

Spreadsheet

For instance, if we spent 1000 GPU hours synthesizing student data, removing the second teacher could drop that to 500 GPU hours. Likewise, if we spent 100 GPU hours training teachers, that would drop to 50 GPU hours. We also wouldn't have a gap in training time where we train one teacher first, determine its quality, and then have to train a second teacher before going to the student step.
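As a rough sketch of that arithmetic: the snippet below assumes that both synthesis and teacher-training cost scale linearly with the number of teachers, which is the simplification the example figures imply. The `PipelineCost` class and `scale_to_teachers` helper are hypothetical names for illustration and are not part of the training pipeline.

```python
from dataclasses import dataclass


@dataclass
class PipelineCost:
    """Illustrative GPU-hour budget for the teacher-dependent steps."""
    synthesis_gpu_hours: float
    teacher_training_gpu_hours: float

    @property
    def total(self) -> float:
        return self.synthesis_gpu_hours + self.teacher_training_gpu_hours


def scale_to_teachers(cost: PipelineCost, current_teachers: int,
                      target_teachers: int) -> PipelineCost:
    """Rescale costs, ASSUMING they are linear in the number of teachers."""
    factor = target_teachers / current_teachers
    return PipelineCost(
        synthesis_gpu_hours=cost.synthesis_gpu_hours * factor,
        teacher_training_gpu_hours=cost.teacher_training_gpu_hours * factor,
    )


# Example figures from the issue text: a two-teacher ensemble.
ensemble = PipelineCost(synthesis_gpu_hours=1000, teacher_training_gpu_hours=100)
single = scale_to_teachers(ensemble, current_teachers=2, target_teachers=1)
saved = ensemble.total - single.total

print(single.synthesis_gpu_hours, single.teacher_training_gpu_hours, saved)
# prints: 500.0 50.0 550.0
```

Under these assumptions, the example budget drops from 1100 to 550 GPU hours. The figures do not capture the wall-clock saving from no longer training the teachers one after the other.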

It would be worth testing this on student training to see if we get an unexpected hit in the distillation quality gap.