For instance, if we spent 1000 gpu hours synthesizing student data, it could drop it to 500 gpu hours. Then if we spent 100 gpu hours training teachers, this would drop it to 50 gpu hours. We also wouldn't have a gap of training time where we train 1 teacher first, determine the quality, and then have to train a second teacher before going to the student step.
It would be worth testing this on student training to see if we get an unexpected hit in the distillation quality gap.
Training a second teacher improves performance only slightly. It may be more cost efficient to take the quality hit and remove it.
Spreadsheet
For instance, if we spent 1000 gpu hours synthesizing student data, it could drop it to 500 gpu hours. Then if we spent 100 gpu hours training teachers, this would drop it to 50 gpu hours. We also wouldn't have a gap of training time where we train 1 teacher first, determine the quality, and then have to train a second teacher before going to the student step.
It would be worth testing this on student training to see if we get an unexpected hit in the distillation quality gap.