huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License

Could you add Required VRAM info in the README? #32

Closed · treya-lin closed this issue 7 months ago

treya-lin commented 7 months ago

Hi, it would be a great help if information about the required VRAM were added to the README, just as in Whisper's repo, where they list the VRAM needed for the different model sizes (https://github.com/openai/whisper#available-models-and-languages). Thanks!

sanchit-gandhi commented 7 months ago

Hey @treya-lin - we did indeed consider the VRAM for these models, but found it to be highly dependent on GPU hardware, CUDA version and even PyTorch version. For example, VRAM changed considerably going between PyTorch 1.13 and 2.0 for the same models on the same hardware. Therefore, we decided to quote the parameter count as a "proxy" for VRAM usage, in order to give a fair and reliable estimate for the expected memory.
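To make the parameter-count proxy concrete, here is a rough back-of-the-envelope sketch (not from the original thread; the parameter counts are approximate figures from the respective model cards, and real VRAM usage also includes activations, the KV cache, and framework overhead on top of the weights):

```python
# Estimate the weights-only memory footprint from a model's parameter count.
# This is a lower bound on VRAM: activations, KV cache, and framework
# overhead come on top, and vary with hardware and library versions.
def weight_memory_gb(num_params: int, bytes_per_param: int = 2) -> float:
    """Weight memory in GiB: fp16/bf16 = 2 bytes per param, fp32 = 4."""
    return num_params * bytes_per_param / 1024**3

# Approximate parameter counts (see the respective model cards).
models = {"whisper-large-v2": 1_550_000_000, "distil-large-v2": 756_000_000}
for name, n in models.items():
    print(f"{name}: ~{weight_memory_gb(n):.1f} GiB in fp16, "
          f"~{weight_memory_gb(n, 4):.1f} GiB in fp32")
```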

To give you some idea of VRAM, here are some very preliminary results I got benchmarking randomly initialised teacher/student models on a 16GB T4 GPU with PyTorch 1.13 and no Flash Attention. Here, I measured the time taken to generate 25 tokens with a batch size of 1, averaged over 100 examples. Feel free to use these as indicative numbers for VRAM, but I would highly advise that you measure the Whisper/Distil-Whisper models on your own hardware and library versions (see the measurement sketch after the chart below)!

[chart: preliminary VRAM benchmark results for the teacher/student models]
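For readers who want to follow this advice, here is a minimal sketch of how peak VRAM could be measured with PyTorch and transformers. Assumptions not from the thread: a CUDA GPU is available, the distil-whisper/distil-large-v2 checkpoint is used, and dummy log-mel features stand in for real audio.

```python
# A minimal sketch for measuring peak VRAM on your own hardware/library
# versions. Model ID, dtype, and input shapes are illustrative assumptions.
import torch
from transformers import AutoModelForSpeechSeq2Seq

model_id = "distil-whisper/distil-large-v2"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

# Dummy log-mel features: (batch, 80 mel bins, 3000 frames) for this model.
input_features = torch.randn(1, 80, 3000, dtype=torch.float16, device="cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model.generate(input_features, max_new_tokens=25)

print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```

Rerunning this sketch across batch sizes, dtypes, or PyTorch versions is the most reliable way to get numbers that match your own setup, which is exactly the variability described above.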

treya-lin commented 7 months ago

Hi, thanks for your reply, that's very helpful information! I will take a look at how it works in my environment.