ufal / whisper_streaming

Whisper realtime streaming for long speech-to-text transcription and translation
MIT License

VAD and whisper-timestamped #30

Closed Jeronymous closed 12 months ago

Jeronymous commented 1 year ago

First, thank you. I am super happy to see whisper-timestamped used in such a good project. Having Whisper streamed in real time is a super feature!

I see here that VAD is not available when using the whisper-timestamped backend: https://github.com/ufal/whisper_streaming/blob/23c2d568d8262a910a83b01025faa12244255756/whisper_online.py#L79-L80

But VAD IS implemented in whisper-timestamped (it was there even before faster-whisper integrated it). It's currently based on SILERO (the same approach as in faster-whisper). Am I missing a sticking point? (Maybe the fact that the packages required for VAD are not in the default requirements?) I can contribute if help is needed on this.

(VAD is important to prevent some hallucinations of Whisper models and to make timestamps more accurate.)

Also, I want to mention: after being disappointed with weird results on some files, I opened a branch to replace SILERO with AUDITOK: https://github.com/linto-ai/whisper-timestamped/pull/78 (see the linked issue for an illustration of possible "hallucinations" of Silero). I have had a good experience with Auditok. I was hoping for some user feedback to confirm this before merging into master. But as it's not coming, maybe we just need to establish a benchmark to confirm the improvement.

Gldkslfmsd commented 1 year ago

Hi, thanks for the feedback. Yes, I know that VAD is in whisper_timestamped. I put NotImplemented there because I primarily use and focus on the faster-whisper backend. Feel free to implement it. It should be easy: just pass a parameter to a function, analogously to https://github.com/ufal/whisper_streaming/blob/23c2d568d8262a910a83b01025faa12244255756/whisper_online.py#L136
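A minimal sketch of what "passing a parameter" could look like, under stated assumptions: the class name `TimestampedBackendSketch` and the injected `transcribe_fn` are hypothetical stand-ins, not the actual code in `whisper_online.py`, and the sketch assumes the whisper-timestamped transcribe function accepts a `vad` keyword argument.

```python
class TimestampedBackendSketch:
    """Hypothetical stand-in for a whisper-timestamped backend wrapper."""

    def __init__(self, transcribe_fn):
        # transcribe_fn is assumed to behave like the whisper-timestamped
        # transcribe function with the model already bound, and to accept
        # a `vad` keyword argument (assumption, check the library docs).
        self.transcribe_fn = transcribe_fn
        self.transcribe_kargs = {}

    def use_vad(self):
        # Instead of raising NotImplemented, remember the option so that
        # every later transcribe() call forwards it to the backend.
        self.transcribe_kargs["vad"] = True

    def transcribe(self, audio, init_prompt=""):
        # Forward the stored keyword arguments on each call.
        return self.transcribe_fn(
            audio, initial_prompt=init_prompt, **self.transcribe_kargs
        )
```

With this shape, enabling VAD is a one-time call to `use_vad()` before streaming starts; the per-update `transcribe()` call does not need to change.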

SILERO vs AUDITOK is a topic for another issue. I don't have feedback.

Gldkslfmsd commented 1 year ago

But I realized that VAD is currently used inefficiently. On every update it is run over the whole buffer. It could instead be used to cut silence out of the buffer, so that the next update is faster. This could be improved.
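A rough sketch of the idea, assuming the audio buffer is a sequence of float samples: it drops leading silent frames once, so later updates process a shorter buffer. A simple RMS energy threshold is swapped in here for illustration; it is not Silero or Auditok, and `trim_leading_silence`, `frame_ms`, and `threshold` are hypothetical names and values, not part of whisper_streaming.

```python
def trim_leading_silence(buffer, sample_rate, frame_ms=30, threshold=0.01):
    """Drop leading frames whose RMS energy is below `threshold`.

    Returns (trimmed_buffer, removed_seconds). The removed duration has to
    be added back to any timestamps computed on the trimmed buffer.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame_ms is too small for this sample rate")

    start = 0
    # Walk over whole frames only; a trailing partial frame is kept as-is.
    while start + frame_len <= len(buffer):
        frame = buffer[start:start + frame_len]
        rms = (sum(x * x for x in frame) / frame_len) ** 0.5
        if rms >= threshold:
            break  # first frame that looks like speech
        start += frame_len

    return buffer[start:], start / sample_rate
```

A real implementation would also need to handle silence in the middle of the buffer and keep the time offset consistent with the words already committed.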

Gldkslfmsd commented 12 months ago

> SILERO vs AUDITOK is a topic for another issue. I don't have feedback.

@Jeronymous, please open an issue about this if you have test results to share.