ufal / whisper_streaming

Whisper realtime streaming for long speech-to-text transcription and translation
MIT License

New Fork: Web client + WebSocket + own VAD impl. #105

Open marcinmatys opened 1 week ago

marcinmatys commented 1 week ago

I have created a fork of whisper_streaming, so I took the liberty of writing about it here. We may close this issue soon, as it is informational only.

I encourage you to check it out if you are interested in topics such as a web browser-based client with WebSocket communication, voice activity detection, and silence processing.

If you have any comments, please write here or see the feedback section in my README.

vuduc153 commented 1 week ago

@marcinmatys Hi, thanks for the fork; it's really a godsend, since I was looking to put together something similar. :) One thing I noticed is that the VAD seems to reset the timestamp to 0 every time it starts again after a silence period. Is this the expected behavior?

marcinmatys commented 1 week ago

@vuduc153 Thanks for your feedback.

When silence is detected, the OnlineASRProcessor finish() and init() methods are called to read the uncommitted transcription and clear the buffer. We then lose the context and flush the uncommitted transcription, but in my opinion it does not have a significant impact on quality. However, I must say that this implementation is just my experiment; you have to run the tests yourself and decide whether it is appropriate or not.

You could remove the online.init() line from the code below and check the difference.

if not silence_started:
    o = online.finish()  # flush the uncommitted transcription
    online.init()        # clear the buffer; this is what resets the timestamps
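For context, here is a minimal sketch of how this branch could sit in the fork's receiving loop, using the insert_audio_chunk() and process_iter() methods of OnlineASRProcessor from whisper_online.py. The is_silence() helper, the threshold value, and receive_audio_chunk() are hypothetical stand-ins, not the fork's exact code:

import numpy as np

SILENCE_THRESHOLD = 0.01  # hypothetical value; the fork defines its own

def is_silence(chunk: np.ndarray) -> bool:
    # Sound-intensity check: root mean square of the float32 samples.
    return np.sqrt(np.mean(chunk ** 2)) < SILENCE_THRESHOLD

silence_started = False
while True:
    chunk = receive_audio_chunk()  # hypothetical WebSocket read
    if is_silence(chunk):
        if not silence_started:
            o = online.finish()  # online is an OnlineASRProcessor instance
            online.init()
            silence_started = True
    else:
        silence_started = False
        online.insert_audio_chunk(chunk)
        o = online.process_iter()  # (start, end, text) of newly committed words
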
vuduc153 commented 1 week ago

@marcinmatys Thanks for the reply; I just wanted to confirm that this is indeed the intended logic. There's also an issue with really long pauses (>10 s) in the current code. Since rms is calculated as the root mean square of the ongoing silence_candidate_chunk, when speech starts again after a long pause, rms will still be under SILENCE_THRESHOLD for a while, until the new data brings the mean back up above the threshold. In my experience it takes around 1/10 of the duration of the pause for the ASR to pick up again, which means the first sentence after a pause loses some words at the beginning.

Calculating rms per received audio chunk might be a better way to approach this. I have slightly modified the logic in this section in a PR. Let me know what you think.
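To see the effect numerically (illustrative numbers only; the SILENCE_THRESHOLD value and the speech level here are made up, but the arithmetic is real):

import numpy as np

SILENCE_THRESHOLD = 0.01  # hypothetical value

quiet = np.zeros(160000, dtype=np.float32)     # a 10 s pause at 16 kHz
loud = np.full(16000, 0.03, dtype=np.float32)  # 1 s at speech-level intensity

# Current logic: rms over the whole accumulated buffer. The pause history
# dominates the mean, so the result stays below the threshold even though
# a full second of speech has already arrived.
buffered = np.concatenate([quiet, loud])
print(np.sqrt(np.mean(buffered ** 2)))  # ~0.009, still < SILENCE_THRESHOLD

# The PR's idea: rms per received chunk reacts on the first loud chunk.
print(np.sqrt(np.mean(loud ** 2)))      # 0.03 > SILENCE_THRESHOLD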

Gldkslfmsd commented 1 week ago

Thanks for the nice work, @marcinmatys. I briefly looked at your README2 and found that you're using numpy sound-intensity detection as "VAD". I think that way you can detect silence vs. non-silence. What about noise vs. speech?

In the vad_streaming branch I'm using Silero VAD, a neural torch model that detects non-voice (such as noise, silence, music, etc.) vs. voice. It should be more robust than your numpy approach. Silero is used as the VAD in the default offline Whisper, and it was recommended to me in #39.
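For reference, a minimal sketch of streaming detection with Silero VAD through its published torch.hub interface (this is not the vad_streaming branch's code; the file name and chunk size below are placeholders):

import torch

# Load Silero VAD and its helper utilities from torch.hub.
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(get_speech_timestamps, _, read_audio, VADIterator, _) = utils

SAMPLING_RATE = 16000
wav = read_audio('example.wav', sampling_rate=SAMPLING_RATE)

# Streaming use: feed fixed-size chunks, get speech start/end events.
vad = VADIterator(model, sampling_rate=SAMPLING_RATE)
window = 512  # samples per chunk at 16 kHz
for i in range(0, len(wav) - window + 1, window):
    event = vad(wav[i:i + window], return_seconds=True)
    if event:
        print(event)  # e.g. {'start': 1.3} or {'end': 2.7}
vad.reset_states()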

marcinmatys commented 1 week ago

@vuduc153 Thanks for this information and the PR. You are right; there is probably an issue with long pauses. However, there is also a problem with your new logic, and we need to improve the fix. I will write the details in a PR comment.

marcinmatys commented 1 week ago

@Gldkslfmsd Thank you for your response and explanations. I need to look at and test the vad_streaming branch one more time and check your silence-removal logic. Do you have any plans to finally verify vad_streaming and merge it into the main branch?

Silero definitely has more capabilities, as you said, but in some cases I think numpy can also handle it. It depends on the environment: whether there is noise around us, what kind of noise it is, and what microphone we are using.

We have two types of microphones: a headset microphone, positioned near the mouth; and an omnidirectional microphone, used in conference settings, which captures sound from all directions.

I performed some tests using a headset microphone while playing some conversations (probably football match commentary) from another speaker on the desk next to me. The headset microphone did not pick up this noise even when the speaker was really close.

Do you think that numpy sound-intensity detection could work more efficiently than Silero? Maybe there should be an option to choose between them: if we need a more robust tool, we use Silero; if not, we use simple numpy.
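A minimal sketch of what such a selectable backend could look like (entirely hypothetical; neither this repo nor the fork implements it this way; the Silero call follows its published forward(chunk, sr) interface, which expects 512-sample float32 chunks at 16 kHz):

import numpy as np
import torch

SILENCE_THRESHOLD = 0.01  # hypothetical value

def make_voice_detector(backend: str):
    """Return a function chunk -> bool (True if the chunk contains voice)."""
    if backend == 'numpy':
        # Simple sound-intensity check: loud enough counts as voice.
        return lambda chunk: np.sqrt(np.mean(chunk ** 2)) >= SILENCE_THRESHOLD
    if backend == 'silero':
        model, _ = torch.hub.load('snakers4/silero-vad', 'silero_vad')
        # Silero returns a speech probability per 512-sample chunk at 16 kHz.
        return lambda chunk: model(torch.from_numpy(chunk), 16000).item() > 0.5
    raise ValueError(f'unknown VAD backend: {backend}')

is_voice = make_voice_detector('numpy')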

Gldkslfmsd commented 1 week ago

> @Gldkslfmsd Thank you for your response and explanations. I need to look at and test the vad_streaming branch one more time and check your silence-removal logic.
>
> Do you have any plans to finally verify vad_streaming and merge it into the main branch?

It's verified and it works very well, but the code is ugly. It needs to be cleaned up, made transparent and self-documented. Then it can be merged.

It's not in my schedule now.

> Silero definitely has more capabilities, as you said, but in some cases I think numpy can also handle it. It depends on the environment: whether there is noise around us, what kind of noise it is, and what microphone we are using.
>
> We have two types of microphones: a headset microphone, positioned near the mouth; and an omnidirectional microphone, used in conference settings, which captures sound from all directions.
>
> I performed some tests using a headset microphone while playing some conversations (probably football match commentary) from another speaker on the desk next to me. The headset microphone did not pick up this noise even when the speaker was really close.
>
> Do you think that numpy sound-intensity detection could work more efficiently than Silero? Maybe there should be an option to choose between them: if we need a more robust tool, we use Silero; if not, we use simple numpy.

I believe there are good reasons why Silero exists. Check their paper and other VAD papers; they may have tested it rigorously, and you can reproduce some of the tests.

Numpy may be faster, simpler to install, and good enough for many users. If you present evidence, we can integrate it as an option.
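For example, one way to gather such evidence would be a small comparison over labeled chunks (hypothetical sketch; it reuses make_voice_detector() from the backend-switch sketch above, and you would supply the labeled recordings yourself):

import numpy as np

# Hypothetical evaluation data: (chunk, is_voice) pairs cut from your own
# recordings, e.g. headset vs. omnidirectional mic, with and without noise.
labeled_chunks: list[tuple[np.ndarray, bool]] = []

def accuracy(detector, data):
    """Fraction of chunks where the detector agrees with the label."""
    return sum(detector(c) == label for c, label in data) / len(data)

# Compare both backends on the same data.
for name in ('numpy', 'silero'):
    print(name, accuracy(make_voice_detector(name), labeled_chunks))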