huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License

Tiny model? #14

Open soupslurpr opened 11 months ago

soupslurpr commented 11 months ago

Hi, are there any plans to train a tiny distilled Whisper model? It would be very interesting to see how fast it would go, as I'd like to use it on phones.

sanchit-gandhi commented 11 months ago

Thanks for your interest! We'll start with the small model and work our way down!

soupslurpr commented 11 months ago

Cool thanks, looking forward to it and seeing the results!

sanchit-gandhi commented 11 months ago

Feel free to follow along with progress here! https://wandb.ai/sanchit-gandhi/distil-whisper?workspace=user-sanchit-gandhi

sanchit-gandhi commented 10 months ago

Still ongoing - we had some difficulties streaming data from the HF Hub over the past week. We're training a 2-layer and a 4-layer variant of the small model, and will then move on to the base model.

sanchit-gandhi commented 9 months ago

distil-small.en is released here: https://huggingface.co/distil-whisper/distil-small.en

It's quite hard to compress further than this without losing WER performance: https://huggingface.co/distil-whisper/distil-small.en#why-is-distil-smallen-slower-than-distil-large-v2
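
For anyone who wants to try it, here's a minimal sketch of running the checkpoint through the Transformers ASR pipeline (the audio path is a placeholder; `chunk_length_s=15` is only needed for long-form audio):

```python
# Minimal sketch: transcribing an audio file with distil-small.en via the
# Transformers ASR pipeline. Assumes transformers and torch are installed;
# "sample.wav" is a placeholder path to your own recording.
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device != "cpu" else torch.float32

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-small.en",
    torch_dtype=torch_dtype,
    device=device,
)

# chunk_length_s enables chunked long-form transcription; drop it for short clips.
result = asr("sample.wav", chunk_length_s=15)
print(result["text"])
```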

soupslurpr commented 9 months ago

That's great! It's still faster than the normal small.en, right? Also, will distilling base.en and/or tiny.en still be tried? Thanks.

mitchelldehaven commented 9 months ago

distil-small.en is released here: https://huggingface.co/distil-whisper/distil-small.en

It's quite hard to compress further than this without losing WER performance: https://huggingface.co/distil-whisper/distil-small.en#why-is-distil-smallen-slower-than-distil-large-v2

Is there any way we can access the small 2-layer decoder variant?

hidoba commented 9 months ago

Does it make sense to have a 2-layer decoder version for speculative decoding in combination with distil-large?

sanchit-gandhi commented 8 months ago

Yep - distil-small.en is about 2x faster than small.en on short-form evaluation. I personally won't try distilling base.en or tiny.en, since it's quite hard to retain performance for these smaller models, but would encourage you to try by leveraging the training code if this is of interest to you!
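
If it helps, here's a rough, unofficial timing sketch for checking the speed-up on your own hardware (the audio path is a placeholder, and a warm-up run keeps model loading out of the measurement):

```python
# Rough timing comparison of small.en vs distil-small.en on a single clip.
# Assumes transformers and torch are installed; "sample.wav" is a placeholder.
import time
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

def transcribe_and_time(model_id, audio_path="sample.wav"):
    asr = pipeline("automatic-speech-recognition", model=model_id, device=device)
    asr(audio_path)  # warm-up so model download/initialisation isn't timed
    start = time.perf_counter()
    text = asr(audio_path)["text"]
    return text, time.perf_counter() - start

for model_id in ("openai/whisper-small.en", "distil-whisper/distil-small.en"):
    _, elapsed = transcribe_and_time(model_id)
    print(f"{model_id}: {elapsed:.2f}s")
```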

sanchit-gandhi commented 8 months ago

Is there any way we can access the small 2-layer decoder variant?

Yes, see https://huggingface.co/distil-whisper/distil-small.en

sanchit-gandhi commented 8 months ago

Does it make sense to have a 2-layer decoder version for speculative decoding in combination with distil-large?

I would think not: we want our main model to be as accurate as possible for speculative decoding, so that we get the lowest possible WER results (i.e. use the teacher Whisper large-v2 model). It doesn't really matter how fast the main model is, since we only do validation forward passes with it. The auto-regressive bottleneck is handled by the assistant model, so there's little gain from using a faster main model. So we should pick the most accurate main model, and an assistant model that is much faster and predicts the correct token ids 70-80% of the time.
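
For concreteness, a minimal sketch of that setup with Transformers: whisper-large-v2 as the main (verifier) model and distil-large-v2 as the assistant, passed via the `assistant_model` argument of `generate`. The dummy LibriSpeech clip is purely for illustration:

```python
# Speculative decoding sketch: the accurate teacher verifies, the distilled
# assistant drafts tokens. Assumes a recent transformers (with assistant_model
# support in generate), torch and datasets are installed.
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device != "cpu" else torch.float32

processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v2", torch_dtype=dtype
).to(device)
assistant = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-large-v2", torch_dtype=dtype
).to(device)

# Dummy 16 kHz sample, just for illustration.
sample = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)[0]["audio"]
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features.to(device, dtype)

# The assistant proposes tokens; the main model only checks them, so main-model
# speed matters far less than its accuracy.
generated = model.generate(input_features, assistant_model=assistant)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```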

mitchelldehaven commented 8 months ago

Is there any way we can access the small 2-layer decoder variant?

Yes, see https://huggingface.co/distil-whisper/distil-small.en

@sanchit-gandhi From https://huggingface.co/distil-whisper/distil-small.en:

While distil-medium.en and distil-large-v2 use two decoder layers each, distil-small.en uses four.

It sounds like the version there is the 4-layer version. Am I missing a way to get the 2-layer version from that?
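
For reference, the decoder depth can be checked directly from each checkpoint's config (a quick sketch, assuming transformers is installed):

```python
# Print the number of decoder layers for each Distil-Whisper checkpoint.
from transformers import AutoConfig

for model_id in (
    "distil-whisper/distil-small.en",
    "distil-whisper/distil-medium.en",
    "distil-whisper/distil-large-v2",
):
    config = AutoConfig.from_pretrained(model_id)
    print(f"{model_id}: {config.decoder_layers} decoder layers")
```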