huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License

english only for large-v2? #6

Open ostegm opened 8 months ago

ostegm commented 8 months ago

Wondering if the statement in the README is correct: "drop-in replacement for Whisper on English speech recognition". Does this mean even the large-v2 model is English-only? Thanks!

vvvm23 commented 8 months ago

Looking at the paper, the distilled version was only trained on English data. I am interested in evaluating the model on Mandarin Chinese data once it is released, to see how well it performs compared to the full model.

khmyznikov commented 8 months ago

Same. Interested in Serbian.

sanchit-gandhi commented 8 months ago

The first release of Distil-Whisper will be English-only. We'll be releasing training code next week so that anyone in the community can distill Whisper on their language of choice. In the meantime, you can still run speculative decoding with the openai/whisper-tiny assistant to get a significant speed-up at inference (https://github.com/huggingface/distil-whisper#speculative-decoding). Just swap out the assistant model ID for the desired assistant model.
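
Roughly, the setup looks like the following minimal sketch (loosely based on the pattern in the linked README section; the exact model IDs, loading flags, and whether your installed transformers version supports an assistant with a different encoder are assumptions to verify):

```python
# Sketch: speculative decoding with openai/whisper-tiny assisting whisper-large-v2.
# Assumes a recent transformers version with assisted generation for Whisper.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Main model: the full multilingual checkpoint.
model_id = "openai/whisper-large-v2"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Assistant model: a smaller pre-trained Whisper checkpoint.
assistant_model_id = "openai/whisper-tiny"
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)

# The assistant is passed through to generate() via the pipeline's generate_kwargs.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("audio.mp3")  # placeholder path: any local audio file or dataset sample
print(result["text"])
```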

vvvm23 commented 8 months ago

Fantastic~ Should we expect the speedup to be less for non-English audio on the English distilled model? Not familiar with the ins and outs of speculative decoding.

patrickvonplaten commented 8 months ago

> Fantastic~ Should we expect the speedup to be less for non-English audio on the English distilled model? Not familiar with the ins and outs of speculative decoding.

That really depends on how many decoder layers you will distill the model to. If you can get away with just two decoder layers in other languages, then the speed-up will be the same!

vvvm23 commented 8 months ago

Hi, I meant: is there any advantage to using your pretrained distilled model as an assistant model to the original large model on non-English inputs?

vvvm23 commented 8 months ago

Just tested this and there seems to be no speedup, but that is expected given the difference in training distributions between the base and distilled models. I might try my hand at distilling my own model, but I'm not sure where to get good data from :sweat_smile:

Nice work all!

sanchit-gandhi commented 8 months ago

You can already try using one of the smaller pre-trained Whisper checkpoints as the assistant model to large-v2. The pre-trained multilingual Whisper models cover the same languages as large-v2, so they can be used as assistants. To do this, just swap out the `assistant_model_id` for the ID of the model on the Hub, e.g. try using `openai/whisper-tiny` as the assistant model in this code snippet. We got a 2x speed-up doing this for English ASR.
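
At a lower level, this is just the `assistant_model` argument to `generate()`. A hedged sketch, reusing the `model`, `assistant_model`, and `processor` objects from the pipeline sketch earlier in the thread, with `sample` standing in for any 16 kHz audio array:

```python
# Sketch: assisted (speculative) decoding via a direct generate() call.
# `model`, `assistant_model`, `processor`, `device`, and `torch_dtype` come from
# the pipeline sketch above; `sample` is a 16 kHz numpy audio array.
inputs = processor(sample, sampling_rate=16_000, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch_dtype)

generated_ids = model.generate(
    input_features,
    assistant_model=assistant_model,  # e.g. openai/whisper-tiny loaded above
    max_new_tokens=128,
)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```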

The Common Voice dataset is always a good starting point for finding multilingual ASR data!
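
For example, something along these lines pulls a non-English split with the datasets library (a rough sketch; the Common Voice version and language code are assumptions, and the dataset requires accepting its terms on the Hub first):

```python
# Sketch: streaming a Common Voice split as multilingual ASR training data.
# The dataset version and language code ("sr" for Serbian) are assumptions;
# some datasets versions may additionally require trust_remote_code=True.
from datasets import Audio, load_dataset

common_voice = load_dataset(
    "mozilla-foundation/common_voice_13_0",
    "sr",            # swap for your target language code
    split="train",
    streaming=True,  # avoids downloading the full split up front
)

# Common Voice audio is 48 kHz; Whisper's feature extractor expects 16 kHz.
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))

sample = next(iter(common_voice))
print(sample["sentence"])                # reference transcription
print(sample["audio"]["sampling_rate"])  # 16000 after resampling
```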

vvvm23 commented 7 months ago

@sanchit-gandhi That's a good idea (to both points) actually. Thanks for the suggestions.

zhhao1 commented 6 months ago

> The first release of Distil-Whisper will be English-only. We'll be releasing training code next week so that anyone in the community can distill Whisper on their language of choice. In the meantime, you can still run speculative decoding with the openai/whisper-tiny assistant to get a significant speed-up at inference (https://github.com/huggingface/distil-whisper#speculative-decoding). Just swap out the assistant model ID for the desired assistant model.

Hi, I replaced the assistant model ID with openai/whisper-tiny and loaded it as follows:

```python
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    args.assistant_model, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=False
)
assistant_model.to(device)
```

However, I get the following error:

```
RuntimeError: Given groups=1, weight of size [384, 80, 3], expected input[1, 1, 1500] to have 80 channels, but got 1 channels instead
```