ostegm opened 8 months ago
Looking at the paper, the distilled version was only trained on English data. I am interested in evaluating the model on Mandarin Chinese data once it is released, to see how well it performs compared to the full model.
Same. Interested in Serbian.
The first release of Distil-Whisper will be for English. We'll be releasing training code next week so that anyone in the community can distill Whisper on their language of choice. In the meantime, you can still run speculative decoding with the openai/whisper-tiny assistant to get a significant speed-up to inference: https://github.com/huggingface/distil-whisper#speculative-decoding. Just swap out the assistant model id for the desired assistant model.
Fantastic~ Should we expect the speedup to be less for non-English audio on the English distilled model? Not familiar with the ins and outs of speculative decoding.
That really depends on how many decoder layers you will distill the model to. If you can get away with just two decoder layers in other languages, then the speed-up will be the same!
Hi, I meant: is there any advantage to using your pretrained distilled model as an assistant model to the original large model on non-English inputs?
Just tested this and saw no speedup, but that is expected given the difference in training distributions between the base and distilled models. Might try my hand at distilling my own model, but not sure where to get good data from :sweat_smile:
Nice work all!
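For what it's worth, the no-speedup result above has a simple back-of-envelope explanation (my own arithmetic, not from this thread): in speculative decoding, the expected number of tokens produced per target-model forward pass grows with the per-token acceptance probability alpha of the draft model (the analysis in Leviathan et al., 2023). An assistant trained on a different distribution, e.g. an English-only distilled model on Serbian audio, drives alpha toward zero and the speed-up toward none:

```python
def expected_tokens_per_target_pass(alpha, gamma):
    """Expected tokens emitted per verification step: the accepted draft
    tokens plus one corrected token, assuming i.i.d. per-token acceptance
    probability `alpha` and `gamma` drafted tokens per step."""
    if alpha >= 1.0:
        return gamma + 1  # every draft token accepted
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# A well-matched assistant vs. a distribution-mismatched one, gamma = 5:
print(round(expected_tokens_per_target_pass(0.9, 5), 2))  # 4.69
print(round(expected_tokens_per_target_pass(0.3, 5), 2))  # 1.43
print(round(expected_tokens_per_target_pass(0.0, 5), 2))  # 1.0, i.e. no speedup
```

Since the target model still verifies every token, the worst case is roughly the original decoding speed (minus draft overhead), which matches the "no speedup" observation above.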
You can already try using one of the smaller pre-trained Whisper checkpoints as the assistant model to large-v2. The pre-trained multilingual Whisper models cover the same languages as large-v2, so they can be used as assistant models. To do this, just swap out the assistant_model_id for the id of the model on the Hub, e.g. try using openai/whisper-tiny as the assistant model in this code snippet. We got a 2x speed-up doing this for English ASR.
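The swap described above can be sketched as follows. This is a minimal sketch, not the linked snippet itself: the model ids and the helper name are my assumptions, and the heavy imports sit inside the function because actually running it downloads both checkpoints:

```python
DRAFT_ID = "openai/whisper-tiny"       # small multilingual assistant (assumed id)
TARGET_ID = "openai/whisper-large-v2"  # multilingual target model (assumed id)


def transcribe_with_assistant(audio_array, sampling_rate=16_000,
                              target_id=TARGET_ID, draft_id=DRAFT_ID):
    """Transcribe with speculative decoding: the small multilingual
    checkpoint drafts tokens that the large model verifies in one pass."""
    # Local imports: torch/transformers plus model downloads only needed here.
    import torch
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = AutoProcessor.from_pretrained(target_id)
    model = AutoModelForSpeechSeq2Seq.from_pretrained(target_id).to(device)
    assistant = AutoModelForSpeechSeq2Seq.from_pretrained(draft_id).to(device)

    inputs = processor(audio_array, sampling_rate=sampling_rate,
                       return_tensors="pt").to(device)
    # Passing `assistant_model` switches `generate` into assisted
    # (speculative) decoding; the rest of the call is unchanged.
    ids = model.generate(inputs.input_features, assistant_model=assistant)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```

Because both checkpoints here are multilingual, the same call should work on non-English audio, unlike the English-only distilled assistant.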
The Common Voice dataset is always a good starting point for finding multilingual ASR data!
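A hedged sketch of pulling a Common Voice split with the datasets library: the dataset id and version are my assumptions (Common Voice on the Hub is gated, so you must accept its terms and log in with an HF token first), and the load sits inside a function because it needs network access:

```python
def load_common_voice(lang="sr", split="train"):
    """Stream a Common Voice split for the given language code
    (e.g. "sr" for Serbian) without downloading the whole dataset."""
    # Local import: `datasets` and network access are only needed here.
    from datasets import load_dataset

    return load_dataset(
        "mozilla-foundation/common_voice_13_0",  # assumed dataset id/version
        lang,
        split=split,
        streaming=True,  # iterate lazily instead of downloading everything
    )
```

Each example carries an audio dict (waveform plus sampling rate) and a sentence transcript, which is exactly the (input, label) pair distillation training needs.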
@sanchit-gandhi That's a good idea (to both points) actually. Thanks for the suggestions.
Hi, I replaced the assistant model id with openai/whisper-tiny and loaded it as follows:

```python
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    args.assistant_model,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=False,
)
assistant_model.to(device)
```
However, I get the following error:

```
RuntimeError: Given groups=1, weight of size [384, 80, 3], expected input[1, 1, 1500] to have 80 channels, but got 1 channels instead
```
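My reading of that error (an assumption, not a confirmed fix): the weight of size [384, 80, 3] is the encoder's first convolution, which expects 80 log-mel channels, but the model is being fed something of shape (1, 1, 1500) instead of processor output of shape (1, 80, 3000) for a 30-second window. A tiny guard makes the expected layout explicit:

```python
def check_whisper_input_shape(shape, n_mels=80):
    """Return True if `shape` matches the (batch, n_mels, n_frames) layout
    the Whisper encoder's first conv layer expects; 3000 frames is the
    typical value for a padded 30-second window."""
    return len(shape) == 3 and shape[1] == n_mels

print(check_whisper_input_shape((1, 80, 3000)))  # True: processor output
print(check_whisper_input_shape((1, 1, 1500)))   # False: the shape in the error
```

If the check fails, the audio likely skipped the feature-extraction step; running the raw waveform through the model's processor first should produce features of the right shape.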
Wondering if the statement in the README ("drop-in replacement for Whisper on English speech recognition") is accurate: does this mean even the large-v2 model is English-only? Thanks!