huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License

How to set the target language for examples in README? #130

Open clstaudt opened 1 month ago

clstaudt commented 1 month ago

The code examples in the README do not make it obvious how to set the language of the audio to transcribe.

The default settings produce garbled English text if the audio language is different.
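
For reference, this is roughly how I would expect to select the language, following the usage of the multilingual openai/whisper-large-v3 checkpoint in transformers (a sketch; the audio path is just a placeholder):

```python
# Sketch: selecting the transcription language with a multilingual Whisper
# checkpoint via the transformers ASR pipeline. "sample_de.wav" is a placeholder.
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",  # the multilingual teacher, not distil-whisper
    torch_dtype=dtype,
    device=device,
)

# Without generate_kwargs the model auto-detects the language; here it is
# forced explicitly, together with the transcription task.
result = pipe(
    "sample_de.wav",
    generate_kwargs={"language": "german", "task": "transcribe"},
)
print(result["text"])
```

It is this generate_kwargs part that I could not find in the README examples.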

CheshireCC commented 1 month ago

It seems that this model only outputs English subtitles.

clstaudt commented 1 month ago

@CheshireCC If that is the case, would it really be a distilled version of Whisper?

"Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. "

https://openai.com/index/whisper

CheshireCC commented 1 month ago

@clstaudt

Maybe the distilled version requires re-training of the model, just like fine-tuning a model: https://github.com/huggingface/distil-whisper#:~:text=Note%3A%20Distil,checkpoints%20when%20ready!

"Note: Distil-Whisper is currently only available for English speech recognition. We are working with the community to distill 
Whisper on other languages. If you are interested in distilling Whisper in your language, check out the provided training code. 
We will soon update the repository with multilingual checkpoints when ready!"

sanchit-gandhi commented 1 month ago

Indeed - as @CheshireCC has mentioned, you can train your own multilingual distil-whisper checkpoint according to the training readme. This has been done successfully in a number of languages, such as for French and German.

Also cc'ing @eustlb, who has done extensive experimentation on French distillation.
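
Once you have such a checkpoint, it should be a drop-in replacement in the usual pipeline. A rough sketch, where the model id and audio path are placeholders rather than published artifacts:

```python
# Sketch: using a hypothetical language-specific distilled checkpoint.
# "your-org/distil-whisper-large-v3-fr" and "sample_fr.wav" are placeholders.
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="your-org/distil-whisper-large-v3-fr",  # placeholder checkpoint id
    torch_dtype=torch.float16,
    device="cuda:0",
)

print(pipe("sample_fr.wav")["text"])
```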

eustlb commented 1 month ago

Hey @clstaudt @CheshireCC, indeed distil-large-v3 has been trained to do English-only transcription. More details about the motivation here.

clstaudt commented 1 month ago

Thanks for clarifying, @eustlb. I'm about to give a presentation praising the potential of distillation, with distil-whisper as the prime example. While the speedup is impressive, I think it's important to add that it covers just one language while the teacher model was multilingual. What do you think the speedup and size reduction would be for a multilingual distil-whisper?

eustlb commented 1 month ago

Thanks for promoting distil-whisper, @clstaudt!

Actually, you can find the info about this here on the README and here on the model card, but thanks for mentioning it! It may not be clear enough.

Concerning a multilingual distilled Whisper, it is a very difficult question to answer without proper experimentation, and I prefer not to give false insights. There are a lot of factors to take into account (e.g., number of languages, dataset sizes, etc.). Still, I would say that if you had large enough datasets for a few languages and managed to get good results with a 4-layer decoder, the size reduction would be exactly 48% (compared to 51% for a 2-layer decoder), and the speed-up should be around 5.5x, a rough estimate to be taken with a big pinch of salt (compared to 6.3x for a 2-layer decoder).
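
If it helps, the size figures can be sanity-checked with a quick back-of-the-envelope count; the per-decoder-layer value below is an approximation derived from the large-v3 dimensions (d_model=1280, ffn=5120), not measured from the checkpoints:

```python
# Back-of-the-envelope check of the size-reduction figures.
LARGE_V3 = 1550e6          # approx. params of whisper-large-v3 (32 decoder layers)
DISTIL_2_LAYER = 756e6     # approx. params of distil-large-v3 (2 decoder layers)
PER_DECODER_LAYER = 26e6   # rough estimate for one large-v3 decoder layer

for n_layers in (2, 4):
    params = DISTIL_2_LAYER + (n_layers - 2) * PER_DECODER_LAYER
    print(f"{n_layers}-layer decoder: ~{params / 1e6:.0f}M params, "
          f"~{1 - params / LARGE_V3:.0%} smaller than large-v3")
# -> roughly 51% smaller with a 2-layer decoder and 48% with a 4-layer one
```

The speed-up cannot be derived this simply, since decoding cost also depends on sequence length, batching, and hardware, hence the big pinch of salt.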