Feature request
This feature request aims to improve the speed of Whisper's batched version by adding a VAD model (such as Silero, pyannote, or one from NeMo) and merging the detected speech chunks up to 30 seconds, instead of relying on the sliding-window technique, which takes more time.
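To illustrate the proposed merging step, here is a minimal sketch of how VAD speech segments could be greedily packed into chunks of up to 30 seconds. The `segments` input and the `merge_segments` helper are hypothetical (not part of any existing API); real VAD output from Silero or pyannote would take a similar (start, end) form in seconds.

```python
# Sketch: greedily merge VAD speech segments into chunks of at most 30 s.
# `segments` is assumed to be a list of (start, end) times in seconds,
# as a VAD model (e.g. Silero VAD) might produce; values here are illustrative.

def merge_segments(segments, max_chunk_s=30.0):
    """Merge consecutive speech segments so that each merged chunk
    spans at most `max_chunk_s` seconds of audio."""
    chunks = []
    cur_start, cur_end = None, None
    for start, end in segments:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start <= max_chunk_s:
            cur_end = end  # segment still fits; extend the current chunk
        else:
            chunks.append((cur_start, cur_end))  # close chunk, start a new one
            cur_start, cur_end = start, end
    if cur_start is not None:
        chunks.append((cur_start, cur_end))
    return chunks

# Hypothetical VAD output (seconds of speech activity).
segments = [(0.0, 5.0), (6.0, 20.0), (21.0, 33.0), (40.0, 45.0)]
print(merge_segments(segments))  # → [(0.0, 20.0), (21.0, 45.0)]
```

Because chunk boundaries fall in silence between speech segments, no word is ever cut in half at a chunk edge, unlike with a fixed 30-second sliding window.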
Motivation
In addition to the speed benefits stated above, the semantic (VAD-based) chunks avoid transcription errors at the 30-second boundaries, since chunk edges fall in detected silence rather than mid-word.
Your contribution
Similar batching has already been implemented in the faster-whisper project using Silero VAD. Both WER and speed on internal test data and on youtube-commons-asr-eval (for long-form transcription) are better with that implementation than with the current HF implementation.