huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.

High inference time when using chunk size 15 #63

Open shashikg opened 6 months ago

shashikg commented 6 months ago

Hi @sanchit-gandhi !

I'm in the process of integrating multiple whisper backends into a unified package that includes VAD-based chunking. During testing, I observed significantly higher inference times while using the HuggingFace pipeline with distil-whisper. You can find the details here: https://github.com/shashikg/WhisperS2T/releases/tag/v1.1.0 [A30 GPU]

Could you please review the benchmarking script I'm using? It's available at: https://github.com/shashikg/WhisperS2T/blob/main/scripts/benchmark_huggingface_distil.py
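For reference, the expensive part of that script is essentially the standard HF ASR pipeline call over a list of WAV files; a simplified sketch of that kind of setup is below (the model name, batch size, and file paths are illustrative, not the exact values from the script):

```python
import torch
from transformers import pipeline

# distil-whisper loaded through the standard HF ASR pipeline (fp16 on GPU).
asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    torch_dtype=torch.float16,
    device="cuda:0",
)

# Long-form transcription of a batch of files via the pipeline's chunked decoding.
audio_files = ["audio_0.wav", "audio_1.wav"]
results = asr(audio_files, chunk_length_s=15, batch_size=16)
for out in results:
    print(out["text"])
```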

Thanks for your assistance!

Shashi

sanchit-gandhi commented 6 months ago

Hey @shashikg! Thanks for sharing these benchmarks! I've had a look through the code; there are two variables we could maybe adjust:

  1. num_workers: is there any reason we pin this to a single data-loader worker here? We could pre-process the data faster if we left it at the default (8).
  2. chunk_length_s: it's worth setting this to 15 in all instances, e.g. here. A sketch of both adjustments follows this list.
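Concretely, something along these lines (the audio paths and batch size are just placeholders):

```python
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    torch_dtype=torch.float16,
    device="cuda:0",
)

results = asr(
    ["audio_0.wav", "audio_1.wav"],
    chunk_length_s=15,  # (2) fix the chunk length at 15s everywhere
    batch_size=16,
    num_workers=8,      # (1) leave data loading at the default 8 workers
)
```
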
shashikg commented 6 months ago
  1. Hey, I think the HF ChunkPipeline caps num_workers at 1 for any value greater than 0. See here. Though I will run the benchmark once more after setting this to a higher number (timing sketch below).
  2. That should not be an issue; for distil-whisper I only ran the benchmark on the KINCAID WAV files. See this.
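Roughly, I'll time the same call with both settings to see whether the higher value actually takes effect (paths and batch size are illustrative):

```python
import time
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    torch_dtype=torch.float16,
    device="cuda:0",
)

audio_files = ["audio_0.wav", "audio_1.wav"]
for workers in (1, 8):
    start = time.time()
    asr(audio_files, chunk_length_s=15, batch_size=16, num_workers=workers)
    print(f"num_workers={workers}: {time.time() - start:.2f}s")
```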