m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Benchmarks for whisperx, faster-whisper, and whispers2t! #817

Open BBC-Esq opened 3 months ago

BBC-Esq commented 3 months ago

Hey all, after a nice conversation with @MahmoudAshraf97 on a different repo I wanted to share some of my benchmark data. This was created using an RTX 4090 on Windows, no flash attention, with 5 beams. I'd love to include data for whisper.cpp as well as Hugging Face's implementation, but unfortunately when the HF implementation uses any beam size above 1 the VRAM usage skyrockets...and I'm not aware of any Python bindings for whisper.cpp that can use CUDA acceleration. Hope y'all find it as interesting as it was for me to test!

[Image: benchmark results]
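A minimal sketch of the kind of timing harness used for a run like this, assuming faster-whisper with beam_size=5 (the model size and audio path are placeholders, and the segments generator has to be consumed for the transcription to actually execute):

```python
import time
from faster_whisper import WhisperModel

# Placeholder model size and audio path; adjust to match your own setup.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

start = time.perf_counter()
segments, info = model.transcribe("test_audio.mp3", beam_size=5)
text = " ".join(seg.text for seg in segments)  # the generator must be consumed
elapsed = time.perf_counter() - start

print(f"Detected language: {info.language}")
print(f"Transcription time: {elapsed:.1f} s")
```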

MahmoudAshraf97 commented 3 months ago

Interesting results indeed, thanks for sharing, but afaik WhisperS2T is just an interface for multiple backends, so which one are you using here?

BBC-Esq commented 3 months ago

Oh yeah, sorry, I'm using the CTranslate2 backend. It's important to note that it's CTranslate2 and not just faster-whisper. As far as I know, whisperX and WhisperS2T are the only repositories that have batch processing using CTranslate2. faster-whisper should hopefully be getting it soon, however. See here.

At any rate, out of respect for the hard work of all the repositories I'm benching, it's important to note that different libraries have different benefits/drawbacks...my benchmarks are only for speed purposes.
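For reference, batched inference on the CTranslate2 backend in whisperX looks roughly like this (model size, batch size, and file name are just example values):

```python
import whisperx

device = "cuda"
batch_size = 16          # example value; reduce if you run low on VRAM
compute_type = "float16"

# Loads the CTranslate2 (faster-whisper) backend under the hood
model = whisperx.load_model("large-v2", device, compute_type=compute_type)

audio = whisperx.load_audio("test_audio.mp3")   # placeholder file name
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"])
```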

Infinitay commented 2 months ago

Recently https://github.com/ictnlp/StreamSpeech was released and I'm curious how it stacks up. Although currently it doesn't support many languages unless you train it yourself, and it's more real-time focused. Any chance you could benchmark it alongside whisperX if possible? Thanks

BBC-Esq commented 2 months ago

Recently https://github.com/ictnlp/StreamSpeech was released and I'm curious how it stacks up. Although currently it doesn't support many languages unless you train it yourself, and it's more real-time focused. Any chance you could benchmark it alongside whisperX if possible? Thanks

Interesting...thanks for the link. I briefly checked it out, and the model names imply that they only handle translation. I didn't see a model that handles straight transcription from one language to the same language. With that being said, if you find out otherwise and provide me with a basic script that can perform inference, I'll adapt it to get VRAM measurements and timing, and process the same audio file that my other benchmarks used.
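The measurement part itself is library-agnostic; a rough sketch of a wrapper that could go around whatever inference call StreamSpeech exposes (run_inference here is a hypothetical placeholder for that call):

```python
import time
import torch

def benchmark(run_inference):
    """Time an inference callable and report peak CUDA memory."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    start = time.perf_counter()
    result = run_inference()          # hypothetical callable wrapping the model call
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    peak_mb = torch.cuda.max_memory_allocated() / 1024**2
    print(f"time: {elapsed:.1f} s, peak VRAM: {peak_mb:.0f} MB")
    return result
```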

stri8ed commented 2 months ago

It looks like WhisperS2T does not use the previous segment's transcription as context. The same is true of WhisperX. It would be interesting to see WER benchmarks alongside the performance numbers, especially for long audio, which may be more sensitive to the context, or lack thereof.

MahmoudAshraf97 commented 2 months ago

I guess the whisperX paper showed that using the previous segment's transcription in the prompt isn't useful.

stri8ed commented 2 months ago

I guess the whisperX paper showed that using the previous segment's transcription in the prompt isn't useful.

Indeed, I recall reading that. Anecdotally, that does not seem to be the case for me, but I'm interested to hear if anyone else has more data on that. Intuitively, I would expect additional context to be useful, given the model was trained to condition the result based on the prompt/context.
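For anyone who wants to test this on their own data, sequential faster-whisper exposes this behaviour through condition_on_previous_text, so both settings can be compared on the same file (the file name is a placeholder):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

for condition in (True, False):
    segments, _ = model.transcribe(
        "long_audio.mp3",                      # placeholder file name
        beam_size=5,
        condition_on_previous_text=condition,  # pass (or drop) the previous segment as prompt
    )
    text = " ".join(seg.text for seg in segments)
    print(f"condition_on_previous_text={condition}: {len(text)} chars")
```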

BBC-Esq commented 2 months ago

It looks like WhisperS2T does not use the previous segment's transcription as context. The same is true of WhisperX. It would be interesting to see WER benchmarks alongside the performance numbers, especially for long audio, which may be more sensitive to the context, or lack thereof.

If you go here you can see that the WER is actually better...lol. Still trying to figure that out, but the guy seems solid in his testing so far:

https://github.com/shashikg/WhisperS2T/releases

Jiltseb commented 2 months ago

Generally, very long context (>30 sec) is not needed for ASR (unlike paralinguistic tasks). By not passing in the previous context, we can prevent some repetitions/hallucinations from carrying over to the next segment, as we see in batched faster_whisper, and in turn get better WER.

stri8ed commented 1 month ago

It looks like WhisperS2T does not use the previous segment's transcription as context. The same is true of WhisperX. It would be interesting to see WER benchmarks alongside the performance numbers, especially for long audio, which may be more sensitive to the context, or lack thereof.

If you go here you can see that the WER is actually better...lol. Still trying to figure that out, but the guy seems solid in his testing so far:

https://github.com/shashikg/WhisperS2T/releases

Have there been any comparisons with faster-whisper, non-batched, with VAD, on long-form transcription? Looking at the benchmarks you linked to, it seems the only sequential implementation that was tested is the OpenAI one, which does not implement VAD preprocessing. It's well known that VAD results in improvements.
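For reference, non-batched faster-whisper with VAD preprocessing is just a flag away, so a sequential-plus-VAD data point could be added with something like this (file name and VAD parameters are illustrative):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# Sequential decoding with Silero-VAD preprocessing enabled
segments, info = model.transcribe(
    "long_audio.mp3",                                  # placeholder file name
    beam_size=5,
    vad_filter=True,
    vad_parameters={"min_silence_duration_ms": 500},   # illustrative setting
)
for seg in segments:
    print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text}")
```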

stri8ed commented 1 month ago

Generally, very long context (>30 sec) is not needed for ASR (unlike paralinguistic tasks). By not passing in the previous context, we can prevent some repetitions/hallucinations from carrying over to the next segment, as we see in batched faster_whisper, and in turn get better WER.

Is that the case with long-form audio? The tests in those benchmarks look tiny, and even in that case, faster-whisper non-batched shows lower WER.

I would like to see more benchmarks on long-form audio (multi-hour), since that is where I would expect to see the most gains/losses from batching vs. sequential.
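In case someone runs such a long-form comparison, the WER itself is straightforward to compute with jiwer once you have a reference transcript (the strings below are placeholders):

```python
import jiwer

# Placeholder strings; in practice, load the reference transcript and the
# hypothesis produced by each library/configuration being compared.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over the lazy dog"

wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.3f}")
```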