Investigate removing teacher ensemble training

Training a second teacher improves performance only slightly. It may be more cost-efficient to accept the small quality hit and drop the second teacher.

| COMET change | Average type |
| --- | --- |
| +0.15 | Mean |
| +0.14 | Median |

Spreadsheet

For instance, if we spent 1000 GPU hours synthesizing student data, dropping the second teacher could cut that to 500 GPU hours. Likewise, if we spent 100 GPU hours training teachers, that would drop to 50 GPU hours. We would also avoid the gap in training time where we train one teacher first, evaluate its quality, and then have to train a second teacher before moving on to the student step.
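A rough sketch of that arithmetic, assuming GPU cost scales linearly with the number of teachers (an assumption implied by the example figures, not something measured). The function name and parameters are hypothetical and only for illustration:

```python
def estimate_gpu_hours(synthesis_hours, teacher_training_hours,
                       teachers_before=2, teachers_after=1):
    """Estimate total GPU hours before and after reducing the teacher count.

    Assumes both data synthesis and teacher training cost scale linearly
    with the number of teachers (a simplifying assumption).
    """
    scale = teachers_after / teachers_before
    before = synthesis_hours + teacher_training_hours
    after = before * scale
    return {"before": before, "after": after, "saved": before - after}


# Illustrative figures from the example above: 1000 GPU hours of
# synthesis and 100 GPU hours of teacher training with two teachers.
estimate = estimate_gpu_hours(synthesis_hours=1000, teacher_training_hours=100)
print(estimate)
```

With the example figures, this comes to 1100 GPU hours with two teachers and 550 with one. It does not capture the wall-clock time saved by not waiting on the second teacher.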

It would be worth testing this on student training to see if we get an unexpected hit in the distillation quality gap.
