Closed guynich closed 2 months ago
Ideally you would use the same model for both. Since the KL loss is computed from the sequence of generated ids, we want the reference model in the KL loss (the teacher during training) to be the same model used to generate the sequence of pseudo-labels (the model during pseudo-labelling), to ensure we get the correct KL loss values.
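Roughly, the KL term described above can be sketched as follows. This is a simplified NumPy illustration, not the actual distil-whisper training code: the function names are ours, and in practice the logits come from the teacher and student decoders at each step of the pseudo-labelled token sequence.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_distillation_loss(student_logits, teacher_logits):
    """Mean per-token KL(teacher || student) over (seq_len, vocab) logits.

    The teacher distribution acts as the reference, which is why the
    teacher should be the same model that generated the pseudo-labels.
    """
    teacher_log_probs = log_softmax(teacher_logits)
    student_log_probs = log_softmax(student_logits)
    teacher_probs = np.exp(teacher_log_probs)
    kl_per_token = (teacher_probs * (teacher_log_probs - student_log_probs)).sum(axis=-1)
    return kl_per_token.mean()

# Toy example: 4 decoded tokens over a vocabulary of 8.
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 8))
student_logits = teacher_logits + 0.1 * rng.normal(size=(4, 8))
loss = kl_distillation_loss(student_logits, teacher_logits)
```

If a different model generated the pseudo-labels than the one supplying `teacher_logits` here, the KL term would be measured against a distribution that never produced those ids, which is the mismatch the comment above warns about.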
Broadly speaking, it's always best to use the most performant model as the teacher, in order to maximise the performance of your student model. That means you should use large-v3 for both pseudo-labelling and distillation, to ensure you get the highest accuracy pseudo-labels, and thus maximise the accuracy of your student model.
Thank you for the helpful comments. Makes sense. Closing.
If I want to use the medium.en model as the teacher, would using another model such as large-v3 for pseudo-labelling be suitable for the distil-whisper training methodology? Or should the same model always be used for both?