huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License

[Question] Can we distill for multiple languages for distil-small-whisper #107

Open Killshot667 opened 2 months ago

Killshot667 commented 2 months ago

I have seen several distillations of distil-whisper for different single languages (like en, de, etc.), but I have yet to come across a distil-whisper that has been trained to be multilingual. For my use case, I need to distil it on multiple languages, but I couldn't find any results related to this in the paper. I wanted to know if such an experiment has been conducted before, at least for two languages, and whether any results are available from such a training. Does it give good results for both languages, or does it fail to learn in that case (maybe because of having only two decoder layers)? If it fails, could there be some possible reason other than the model being too small to accommodate multiple languages?

bil-ash commented 2 months ago

I too have the same question. @sanchit-gandhi Please try distilling whisper-small on kathbath dataset and share the results.

sanchit-gandhi commented 1 month ago

Hey @Killshot667 - that's a great question, and super sorry for the late reply here! I'll defer to @eustlb, who has been running some preliminary experiments on distilling Whisper jointly for French and Spanish. You can read about the initial results and how to reproduce them on the README here: https://github.com/huggingface/distil-whisper/tree/main/training#3-language-mixing

eustlb commented 1 month ago

Hey @Killshot667! Thanks for raising this interesting point. Indeed, distillation has, for the moment, been targeted at single languages.

The approach to distillation has so far been to shrink the model as much as possible while preserving performance, by training a smaller decoder on a single target language. The idea is to trade the multilingual capacity of the 32 decoder layers for the size and speed gains of a smaller decoder (which consequently has less learning capacity). In this context, two decoder layers appeared to be Pareto optimal. If we were to train on a multilingual dataset, more decoder layers might be needed to increase learning capacity. Adapting the number of decoder layers in the student model is easy: just change --decoder_layers when initializing it.
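For illustration, a minimal sketch of such an initialization, assuming the create_student_model.py script and flag names from the training README; the teacher checkpoint, layer count, and save path below are just examples:

```bash
# Initialize a student with 4 decoder layers instead of the default 2
# (script and flag names as in the training README; paths are illustrative)
python create_student_model.py \
  --teacher_checkpoint "openai/whisper-large-v3" \
  --encoder_layers 32 \
  --decoder_layers 4 \
  --save_dir "./distil-large-v3-init-4-dec"
```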

Secondly, note that nothing prevents a distilled model from having multilingual transcription capabilities. First, the encoder is identical to Whisper's, so its robustness in building representations of speech across languages is unchanged. Second, when initializing the student model, we keep Whisper's vocabulary and start from Whisper's input embeddings, which already include the multilingual tokens. In that sense, the only thing preventing distil-large-v3 from being multilingual is the dataset it was distilled on. You could perfectly well train, for example, a 4-decoder-layer distilled model on European languages (easily done by pseudo-labelling each set with the correct --language flag, as explained in language-mixing). In fact, the language-mixing experiments showed that mixing closely related languages can improve the model's performance.
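For concreteness, a rough sketch of per-language pseudo-labelling, assuming run_pseudo_labelling.py and the flag names from the training README; the dataset name, output paths, and omitted options are illustrative only:

```bash
# Pseudo-label the French subset with the French language token...
python run_pseudo_labelling.py \
  --model_name_or_path "openai/whisper-large-v3" \
  --dataset_name "mozilla-foundation/common_voice_16_1" \
  --dataset_config_name "fr" \
  --language "fr" \
  --task "transcribe" \
  --output_dir "./common_voice_fr_pseudo_labelled"
  # (batch size, dtype and other flags omitted; see the training README)

# ...and the Spanish subset with the Spanish token. The two pseudo-labelled
# sets are then combined for joint distillation of a single student model.
python run_pseudo_labelling.py \
  --model_name_or_path "openai/whisper-large-v3" \
  --dataset_name "mozilla-foundation/common_voice_16_1" \
  --dataset_config_name "es" \
  --language "es" \
  --task "transcribe" \
  --output_dir "./common_voice_es_pseudo_labelled"
```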