huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Improve batching speed for Whisper models using VAD based chunking #34463

Open Jiltseb opened 2 days ago

Jiltseb commented 2 days ago

Feature request

This feature request aims to improve the speed of Whisper's batched inference by adding a VAD model (such as pyannote, NeMo, or Silero) and merging speech segments into chunks of up to 30 seconds, instead of relying on the sliding-window technique, which takes more time.
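The merging step described above can be sketched as follows. This is a minimal illustration, not the faster-whisper implementation: it assumes the VAD model (Silero, pyannote, or NeMo) has already produced a list of `(start, end)` speech segments in seconds, and greedily packs consecutive segments into chunks no longer than 30 seconds.

```python
def merge_segments(segments, max_duration=30.0):
    """Greedily merge consecutive VAD speech segments into chunks whose
    total span does not exceed max_duration seconds, so each chunk can be
    fed to Whisper as one batch element instead of a sliding window.

    segments: list of (start, end) tuples in seconds, sorted by start time.
    Returns: list of merged (start, end) chunks.
    """
    chunks = []
    current = None  # (start, end) of the chunk being built
    for start, end in segments:
        if current is None:
            current = (start, end)
        elif end - current[0] <= max_duration:
            # Extending the current chunk still fits in the 30 s budget.
            current = (current[0], end)
        else:
            chunks.append(current)
            current = (start, end)
    if current is not None:
        chunks.append(current)
    return chunks


# Example with hypothetical VAD output from a 70 s file:
segments = [(0.5, 10.0), (12.0, 25.0), (27.0, 40.0), (45.0, 68.0)]
print(merge_segments(segments))  # [(0.5, 25.0), (27.0, 40.0), (45.0, 68.0)]
```

Because chunk boundaries fall in detected silence rather than at arbitrary 30-second offsets, no word is split across two batch elements.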

Motivation

In addition to the speed benefit stated above, semantically chosen chunks avoid transcription errors at the 30-second boundaries, since cuts fall in silence rather than mid-word.

Your contribution

A similar batching scheme has already been implemented in the faster-whisper project using Silero VAD. Both WER and speed on internal test data and on youtube-commons-asr-eval (for long-form transcription) are better with that implementation than with the current HF implementation.

Jiltseb commented 2 days ago

@ylacombe and @eustlb