entn-at opened this issue 8 months ago
The weights will be released in Transformers format on the Hugging Face Hub tomorrow. It should be pretty straightforward to export them to faster-whisper format following these instructions: https://github.com/guillaumekln/faster-whisper/#model-conversion
I'll add them to the model repos once converted!
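For reference, the same conversion the faster-whisper README describes via the `ct2-transformers-converter` CLI can also be driven from Python through CTranslate2's `TransformersConverter`. A minimal sketch, assuming the checkpoint lands at `distil-whisper/distil-large-v2` on the Hub (the repo id is my assumption, not confirmed above):

```python
# Minimal sketch of the faster-whisper conversion step, assuming the weights
# are published at distil-whisper/distil-large-v2 (repo id is an assumption).
import ctranslate2

converter = ctranslate2.converters.TransformersConverter(
    "distil-whisper/distil-large-v2",  # assumed Hub repo id
    copy_files=["tokenizer.json"],     # faster-whisper loads the tokenizer from the model dir
    load_as_float16=True,
)
converter.convert("distil-large-v2-ct2", quantization="float16")
```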
Unfortunately, conversion to CTranslate2 format throws an error:
ValueError: Some required model attributes are not set:
decoder/layer_2/self_attention/layer_norm/gamma
decoder/layer_2/self_attention/layer_norm/beta
decoder/layer_2/self_attention/linear_0/weight
...
FYI: Related issue on faster-whisper to track full support https://github.com/guillaumekln/faster-whisper/issues/533
The cross-attention head dimensions should be exactly the same as those of the corresponding teacher models (whisper-large-v2 for distil-whisper-32-2 and whisper-medium.en for distil-whisper-24-2).
@patrickvonplaten unfortunately, not all the cross-attention heads are highly correlated with word timing; different heads may attend to different things. So what OpenAI did was find out specifically which cross-attention heads are correlated with timing and use only that subset for the alignment. Currently a heuristic is used (all the cross-attention heads in the last half of the layers, I think), but this should be less accurate than hand-picking the subset. So the PR can work, but expect more inaccuracy in the word-level timing. If anyone is interested, Jong Wook explained how he hand-picked the heads in this discussion here.
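For anyone who wants to try the hand-picking themselves, here's a rough sketch of pulling per-head cross-attentions out of Transformers so you can see which heads track the audio timeline. The model id and the silent dummy audio are placeholders, not the official recipe:

```python
# Rough sketch: inspect per-head cross-attention to find heads that follow the
# audio timeline, in the spirit of Jong Wook's hand-picking. The model id and
# the silent dummy audio are placeholders, not the official recipe.
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "distil-whisper/distil-large-v2"  # assumed repo id
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

audio = np.zeros(16000, dtype=np.float32)  # placeholder: use real speech here
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

out = model.generate(
    inputs.input_features,
    output_attentions=True,
    return_dict_in_generate=True,
)

# out.cross_attentions holds, for every generated token, one tensor per decoder
# layer of shape (batch, num_heads, query_len, num_audio_frames). Heads whose
# attention argmax advances monotonically with the token index are the timing
# candidates worth hand-picking.
```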
Indeed, OpenAI hardcodes these word-level timestamp alignment heads in their repo, based on cross-attention plots.
We haven't found the optimal alignment heads for Distil-Whisper yet, so word-level timestamps aren't available for now.
Feel free to repeat the analysis from Jong Wook to see what the best configuration is here! We can then update the model's generation config accordingly to store this information. I'll also try to determine the best alignments from the validation sets in Distil-Whisper this week.
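As a concrete sketch of where that information would live: Transformers reads `(layer, head)` pairs from `generation_config.alignment_heads` when token-level timestamps are requested. The pairs below are hypothetical placeholders, not measured values for Distil-Whisper:

```python
# Sketch: store hand-picked heads on the generation config and request
# token-level timestamps. The (layer, head) pairs are hypothetical
# placeholders, not measured values for Distil-Whisper.
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "distil-whisper/distil-large-v2"  # assumed repo id
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
model.generation_config.alignment_heads = [[1, 3], [1, 7]]  # hypothetical picks

audio = np.zeros(16000, dtype=np.float32)  # placeholder: use real speech here
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

out = model.generate(
    inputs.input_features,
    return_token_timestamps=True,  # runs DTW over the configured alignment heads
    return_dict_in_generate=True,
)
print(out.token_timestamps)  # per-token start times in seconds
```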
Unfortunately, conversion to CTranslate2 format throws an error:
ValueError: Some required model attributes are not set:
decoder/layer_2/self_attention/layer_norm/gamma
decoder/layer_2/self_attention/layer_norm/beta
decoder/layer_2/self_attention/linear_0/weight
...
Upgrade ctranslate2 to 3.21.0.
Great work!
I was wondering whether the distilled version might still be compatible with CTranslate2 / faster-whisper? I understand the modified decoder might require some work there, not to mention speculative decoding.
Thanks, Ewald