huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.

Question: should the pseudo-labelling model and teacher model be the same? #92

Closed · guynich closed this 2 months ago

guynich commented 3 months ago

If I want to use the medium.en model as the teacher, would using a different model such as large-v3 for pseudo-labelling be suitable for the distil-whisper training methodology? Or should the same model always be used for both?

sanchit-gandhi commented 3 months ago

Ideally you would use the same model for both. Since the KL loss is computed over the sequence of generated ids, we want the reference model in the KL loss (the teacher during training) to be the same model that generated the sequence of pseudo-labels (the model used during pseudo-labelling), to ensure we get correct KL loss values.
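For illustration, here is a minimal sketch (not the actual distil-whisper training code) of how a KL distillation term is typically computed: both student and teacher are run on the same pseudo-labelled target ids, and the student's distribution is pulled towards the teacher's. If the pseudo-labels came from a different model than the teacher used here, the teacher's distribution may not match the labels, which is the mismatch being discussed.

```python
import torch
import torch.nn.functional as F

def kl_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between teacher and student next-token distributions.

    student_logits, teacher_logits: (batch, seq_len, vocab_size), both computed
    on the same pseudo-labelled target ids.
    """
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean + T^2 scaling follows the standard Hinton-style distillation convention
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2
```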

Broadly speaking, it's always best to use the most performant model as the teacher, in order to maximise the performance of your student model. That means you should use large-v3 for both pseudo-labelling and distillation, to ensure you get the highest accuracy pseudo-labels, and thus maximise the accuracy of your student model.
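As a rough sketch of the pseudo-labelling step with large-v3, using the standard transformers API (this is not the repo's `run_pseudo_labelling.py` script, and the checkpoint name and helper function are just for illustration):

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v3", torch_dtype=torch.float16
).to("cuda")

def pseudo_label(audio_array, sampling_rate=16_000):
    # Convert raw audio to log-mel features, then decode a transcript to use as a pseudo-label.
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
    input_features = inputs.input_features.to("cuda", dtype=torch.float16)
    generated_ids = model.generate(input_features, language="en", task="transcribe")
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

The same large-v3 checkpoint would then serve as the teacher in the KL loss during distillation.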

guynich commented 3 months ago

Thank you for the helpful comments. Makes sense. Closing.