huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License

Compatibility with CTranslate2 / faster-whisper #3

Open entn-at opened 8 months ago

entn-at commented 8 months ago

Great work!

I was wondering whether the distilled version might still be compatible with CTranslate2 / faster-whisper? I understand the changes to the decoder might require some work there, not to mention speculative decoding.

Thanks, Ewald

sanchit-gandhi commented 8 months ago

The weights will be released in Transformers format on the Hugging Face Hub tomorrow. It should be pretty straightforward to export them to faster-whisper format following these instructions: https://github.com/guillaumekln/faster-whisper/#model-conversion

I'll add them to the model repos once converted!
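
For reference, a minimal conversion sketch using CTranslate2's Python converter (the checkpoint id distil-whisper/distil-large-v2 and the output directory are assumptions; the ct2-transformers-converter CLI from the linked instructions does the same job):

```python
# Sketch: convert a Transformers Whisper checkpoint to CTranslate2 format
# for use with faster-whisper. The model id below is an assumption.
from ctranslate2.converters import TransformersConverter

converter = TransformersConverter(
    "distil-whisper/distil-large-v2",   # Hugging Face Hub id (assumed)
    copy_files=["tokenizer.json"],      # keep the tokenizer next to the converted weights
    load_as_float16=True,
)
converter.convert("distil-large-v2-ct2", quantization="float16")
```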

alexey-mik commented 8 months ago

Unfortunately, conversion to CTranslate2 format throws an error

ValueError: Some required model attributes are not set:

decoder/layer_2/self_attention/layer_norm/gamma
decoder/layer_2/self_attention/layer_norm/beta
decoder/layer_2/self_attention/linear_0/weight
...

AnkushMalaker commented 8 months ago

FYI: Related issue on faster-whisper to track full support https://github.com/guillaumekln/faster-whisper/issues/533

chiiyeh commented 8 months ago

Hi! I have done a PR on CTranslate2 which will support the conversion for distil-whisper. Though for word timing alignment, it seems like OpenAI hardcoded the specific cross-attention heads that are highly correlated with word timing here. Not sure if there is a similar set for distil-whisper.
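
For context, Transformers records these hand-picked heads as [layer, head] pairs under alignment_heads in the model's generation config; a quick sketch for checking whether a checkpoint declares them (the distil-whisper id is an assumption, and the field may simply be missing there):

```python
# Sketch: inspect which cross-attention heads a Whisper checkpoint declares
# for word-level timestamp alignment. The field may be absent for some models.
from transformers import GenerationConfig

for model_id in ["openai/whisper-large-v2", "distil-whisper/distil-large-v2"]:  # second id assumed
    gen_cfg = GenerationConfig.from_pretrained(model_id)
    print(model_id, "->", getattr(gen_cfg, "alignment_heads", None))  # list of [layer, head] or None
```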

patrickvonplaten commented 7 months ago

The cross-attention head dimensions should be exactly the same as those of the corresponding teacher models (whisper-large-v2 for distil-whisper-32-2 and whisper-medium.en for distil-whisper-24-2).

chiiyeh commented 7 months ago

@patrickvonplaten unfortunately not all the cross-attention heads are highly correlated with word timing; different heads attend to different things. So what OpenAI did was work out specifically which heads are correlated and use only that subset for the timing alignment. Currently a heuristic is used (all the cross-attention heads of the last half of the layers, I think), but this should be less accurate than hand-picking the subset. So the PR can work, but expect more inaccuracy in the word-level timing. If anyone is interested, jongwook explained how he hand-picked the heads in this discussion here
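
A rough sketch of that kind of analysis, simplified from what the Whisper repo does (the helper below uses attention peak positions rather than full DTW, the model id is an assumption, and ref_token_times is a reference timing you must supply):

```python
# Sketch: score each decoder cross-attention head by how closely its attention peaks
# track reference word timings, averaged over a labelled set; keep the best heads.
import numpy as np
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "distil-whisper/distil-large-v2"  # assumed id
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).eval()

def head_timing_errors(audio, sampling_rate, decoder_input_ids, ref_token_times):
    """Return {(layer, head): mean |attention peak time - reference token time|}."""
    inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        out = model(
            input_features=inputs.input_features,
            decoder_input_ids=decoder_input_ids,
            output_attentions=True,
        )
    errors = {}
    # out.cross_attentions: one tensor per decoder layer, shape (batch, heads, tokens, frames)
    for layer, attn in enumerate(out.cross_attentions):
        for head in range(attn.shape[1]):
            peak_frames = attn[0, head].argmax(dim=-1).numpy()  # most-attended frame per token
            peak_times = peak_frames * 0.02                     # Whisper encoder frames are 20 ms
            errors[(layer, head)] = float(np.mean(np.abs(peak_times - ref_token_times)))
    return errors

# Average the errors over the labelled set, then pick the lowest-error heads, e.g.
# alignment_heads = sorted(avg_errors, key=avg_errors.get)[:6]
```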

sanchit-gandhi commented 7 months ago

Indeed, OpenAI hardcode these word-level timestamp alignment heads in their repo based on the cross-attention plots.

We haven't found the optimal alignment heads for word-level timestamps for Distil-Whisper, so these word-level timestamps aren't available yet.

Feel free to repeat the analysis from Jong Wook to see what the best configuration is here! We can then update the model's generation config accordingly to store this information. I'll also try and determine the best alignments from the validation sets in Distil-Whisper this week.
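
Once a good subset is found, storing it is a one-line generation-config change; a minimal sketch (the [layer, head] pairs shown are placeholders, not measured values):

```python
# Sketch: record hand-picked alignment heads in the generation config so that
# word-level timestamps can use them. The pairs below are placeholders only.
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("distil-whisper/distil-large-v2")  # assumed id
model.generation_config.alignment_heads = [[1, 0], [1, 7], [1, 12]]  # placeholder [layer, head] pairs
model.generation_config.save_pretrained("distil-large-v2-alignment")
```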

shuaijiang commented 7 months ago

Unfortunately, conversion to CTranslate2 format throws an error

ValueError: Some required model attributes are not set:

decoder/layer_2/self_attention/layer_norm/gamma
decoder/layer_2/self_attention/layer_norm/beta
decoder/layer_2/self_attention/linear_0/weight
...

Upgrading ctranslate2 to 3.21.0 fixes this.
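
With a recent ctranslate2 the conversion goes through; a quick usage sketch with faster-whisper against the directory produced by the conversion step above (paths are assumptions):

```python
# Sketch: transcribe with the converted distil-whisper model via faster-whisper.
# "distil-large-v2-ct2" is the local output directory produced by the converter.
from faster_whisper import WhisperModel

model = WhisperModel("distil-large-v2-ct2", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```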