huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License

Can we use Distil-Whisper for 50+ concurrent requests on one T4 machine without compromising latency for each request? #47

Open moksh-samespace opened 7 months ago

sanchit-gandhi commented 7 months ago

For high batch sizes, it is recommended to use newer hardware with more VRAM (e.g. an A100). The performance of T4 GPUs saturates quickly as you increase the batch size, giving lower throughput at higher batch sizes. For details, see Section D.5 of the Distil-Whisper paper (pages 29 and 30).
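For reference, a minimal sketch of batched inference with the `transformers` pipeline is below. The checkpoint name, batch size, and chunk length are illustrative; on a T4 you would want to benchmark smaller batch sizes given the saturation noted above.

```python
import torch
from transformers import pipeline

# Illustrative setup: distil-large-v2 checkpoint in fp16 on a single GPU.
pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    torch_dtype=torch.float16,
    device="cuda:0",
)

# batch_size controls how many audio chunks are decoded in parallel;
# chunk_length_s splits long audio into fixed-length segments for batching.
result = pipe("audio.mp3", batch_size=16, chunk_length_s=15)
print(result["text"])
```

Note that the pipeline batches chunks of a single request; serving 50+ concurrent requests would additionally require server-side batching of incoming requests, which is outside the scope of this sketch.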

[Screenshot: benchmark figure from Section D.5 of the Distil-Whisper paper]