linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

VAD does not handle almost complete silence #74

Closed freddyertl closed 12 months ago

freddyertl commented 1 year ago

In the attached sample, there is almost perfect silence at the beginning. Still, there are hallucinated words.

whisper_timestamped jon.wav --model medium.en --language en --verbose True --accurate --output_dir . --output_format txt,json --vad True --detect_disfluencies True

jon.zip

Jeronymous commented 1 year ago

Unless I am missing something, there is not much we can do about it... Silero VAD is wrong: it returns speech segments in the first 5 minutes where there is indeed nothing. Namely these segments (in seconds):

[
        {'start': 63.33, 'end': 71.646},
        {'start': 72.738, 'end': 122.942},
        {'start': 124.258, 'end': 133.502},
        {'start': 136.194, 'end': 157.406},
        {'start': 158.402, 'end': 210.75},
        {'start': 211.81, 'end': 242.558},
        {'start': 244.866, 'end': 263.294},
        {'start': 264.706, 'end': 267.966}
]

I'll check if this can be improved by tuning some parameters of the VAD.

freddyertl commented 1 year ago

I have also played with the threshold parameter, but for whatever reason it didn't solve the problem. If it cannot be improved with other Silero VAD parameters, I would measure the energy in each segment that Silero returned as speech and remove those where the level is below a certain threshold. It's almost funny that in order to get rid of Whisper hallucinations we now have some Silero hallucinations.
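The segment-level energy check suggested above could look something like this (a minimal sketch, assuming mono float audio in a NumPy array; `filter_silent_segments` and the -50 dB floor are illustrative names/values, not part of whisper-timestamped):

```python
import numpy as np

def filter_silent_segments(audio, sr, segments, db_floor=-50.0):
    """Drop VAD segments whose RMS energy falls below a dB floor.

    `segments` is a list of {'start': s, 'end': e} dicts in seconds,
    like the ones Silero returns. `db_floor` is a starting point to
    tune per recording, not a universal constant.
    """
    kept = []
    for seg in segments:
        chunk = audio[int(seg['start'] * sr):int(seg['end'] * sr)]
        rms = np.sqrt(np.mean(chunk ** 2)) if len(chunk) else 0.0
        db = 20 * np.log10(max(rms, 1e-10))  # floor avoids log10(0)
        if db > db_floor:
            kept.append(seg)
    return kept
```

On the sample in this issue, a pass like this would discard the near-silent segments Silero returns in the first 5 minutes while keeping the real speech.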

Jeronymous commented 1 year ago

Ahah indeed, all neural nets are hallucinating.

I looked at the probabilities from the Silero neural net, and it turns out they are completely wrong in regions where the input audio is almost 0 (see figure below). There is no local normalization preprocessing, so it's as if the recurrent neural network is (internally) amplifying tiny variations of the audio signal.

A solution could be to zero out parts of the signal that are "almost zero for some time". I tested it, and it works. But it's awkward to make that a general solution.
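That "zero out the almost-zero stretches" workaround could be sketched as follows (a hypothetical helper, not code from this repo; the amplitude threshold and minimum duration are guesses to tune):

```python
import numpy as np

def zero_near_silence(audio, sr, amp_threshold=1e-3, min_dur=1.0):
    """Return a copy of `audio` where every stretch whose absolute
    amplitude stays below `amp_threshold` for at least `min_dur`
    seconds is set to exactly 0, so the VAD never sees the tiny
    variations it would otherwise amplify."""
    quiet = np.abs(audio) < amp_threshold
    out = audio.copy()
    min_len = int(min_dur * sr)
    start = None
    # Append a sentinel False so a trailing quiet run is also closed.
    for i, q in enumerate(np.append(quiet, False)):
        if q and start is None:
            start = i
        elif not q and start is not None:
            if i - start >= min_len:
                out[start:i] = 0.0
            start = None
    return out
```

As noted above, this works on the problematic sample, but picking thresholds that generalize across recordings is the awkward part.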

A better solution seems to be to use an "audio activity detector" like https://github.com/amsehili/auditok before running Silero VAD. Or just using auditok instead of Silero VAD... because I feel that Whisper is more robust to noise/music than to silence.

(figure: Silero speech probabilities plotted over the near-silent audio)

freddyertl commented 1 year ago

For me it makes perfect sense to have a first pass which detects near-silence. Then the noisy parts can be processed by a model. By the way, what does the no_speech_threshold/logprob_threshold stuff do? It sounds like it would also deal with silence.

Jeronymous commented 1 year ago

no_speech_threshold is used to remove segments that the Whisper model itself detects as silence (it has some learnt VAD capabilities), but it's tricky to use.

In my experience, logprob_threshold does not do much.
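For reference, the silence check in openai-whisper's transcribe() boils down to something like this (a simplified sketch of the decision, not the actual code; the defaults mirror whisper's, but the real implementation also retries decoding at higher temperatures before deciding):

```python
def should_skip_segment(no_speech_prob, avg_logprob,
                        no_speech_threshold=0.6, logprob_threshold=-1.0):
    """A segment is treated as non-speech only when the model's
    no-speech probability is high AND the average decoding log
    probability is low; a confident transcription overrides the
    no-speech signal."""
    if no_speech_threshold is not None and no_speech_prob > no_speech_threshold:
        if logprob_threshold is None or avg_logprob < logprob_threshold:
            return True
    return False
```

This interplay is why the option is tricky: a hallucination decoded with high confidence (high avg_logprob) survives even when no_speech_prob is high.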

freddyertl commented 1 year ago

You mentioned that you have something working. If you like, I can play with it to see if it works on other samples.

Jeronymous commented 1 year ago

You probably refer to:

> A solution can be to zero out parts of the signal that are "almost zero for sometime". I tested, it works. But it's awkward to make that a general solution.

I just meant that I manually zeroed out the first 5 minutes of audio, knowing it was an "almost zero" part. And it seemed to solve the issue. I could build a more general solution that zeros out the "almost zero" parts automatically, but I find it a bit awkward...

Jeronymous commented 1 year ago

First, something important I forgot to mention in this thread: you can use the option --plot to plot the results of the VAD (it will also plot alignment results segment by segment, so this is for debugging; you might want to run it and stop it at some point).

I created a branch with an attempt to integrate auditok instead of Silero VAD. The branch is called feature/auditok_vad. @freddyertl You can play with this if you want, and post your comments here or on the pull request: https://github.com/linto-ai/whisper-timestamped/tree/feature/auditok_vad

freddyertl commented 1 year ago

Thanks, great that you could do it so quickly. I did a first round of testing and it produces correct results where Silero VAD had problems. It seems that this energy-based approach is a better fit for Whisper, because we don't have a noise problem but a silence problem. I will feed in more samples.

traidn commented 1 year ago

I also came across a bad VAD prediction on a completely silent recording. Maybe this problem can be partially solved with pydub's silence detection? It could be a first pass before the VAD, and the time intervals from this module could then be combined with the VAD predictions.
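Combining the two detectors as suggested would amount to intersecting their interval lists, keeping only the time ranges both agree contain speech (a generic sketch; the interval lists are assumed to be sorted (start, end) pairs in seconds, e.g. pydub's detect_nonsilent output converted from milliseconds, and Silero's segments converted from dicts):

```python
def intersect_intervals(a, b):
    """Intersect two sorted lists of (start, end) intervals using a
    two-pointer sweep. Returns the overlapping portions only."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out
```

For example, intersecting [(0, 5), (10, 20)] with [(3, 12)] gives [(3, 5), (10, 12)].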

Jeronymous commented 1 year ago

Thanks @traidn for pointing out another VAD method. If you have a chance when you see VAD issues, you could also try the feature/auditok_vad branch of whisper-timestamped.

dgoryeo commented 1 year ago

Hi @Jeronymous and @freddyertl , I'm getting an error:

AttributeError: module 'auditok' has no attribute 'split'

Have you come across similar error by any chance?

Jeronymous commented 1 year ago

No, I have version 0.2.0 of auditok. What does pip show auditok say for you?

dgoryeo commented 1 year ago

That was it -- I upgraded to 0.2.0 and it went through. Thanks!

dgoryeo commented 1 year ago

Is there a way for me to verify which branch my whisper_timestamped is installed from? I believe I finished the installation from the auditok branch, but I just need to make sure. Thanks!

Jeronymous commented 1 year ago

You can call whisper_timestamped --version (or, in Python, whisper_timestamped.__version__). If it's 1.12.17, you're on the auditok branch.

IntendedConsequence commented 1 year ago

> ahah indeed, all neural nets hallucinating.
>
> I looked at the probabilities of Silero neural nets, it turns out that it's completely crap on region where the input audio is almost 0 (see figure below). There is no special local normalization preprocessing, so it's like if the recurrent neural network that is (internally) amplifying tiny variations of audio signal.
>
> A solution can be to zero out parts of the signal that are "almost zero for sometime". I tested, it works. But it's awkward to make that a general solution.
>
> A better solution seems to be to use an "audio activity detector" like https://github.com/amsehili/auditok before using silero VAD. Or just using auditok (not silero VAD)... Because I feel that Whisper is more robust to noise/music than to silence.
>
> (figure: Silero speech probabilities plotted over the near-silent audio)

This looks to me like exactly the issue I encountered when I upgraded from Silero VAD v3 to v4. I went back to v3 and have had no problems since. Context: I have used it to remove non-spoken parts from thousands of podcasts, stream VODs and YouTube audio over the last few years, to listen on the go.

dgoryeo commented 1 year ago

@IntendedConsequence How do you go back to Silero v3? Is it done by changing repo_or_dir in this call:

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True,
                              onnx=USE_ONNX)

Thanks.

Jeronymous commented 1 year ago

Thank you @IntendedConsequence for sharing your experience! I want to test v3, but then I have the same question as @dgoryeo... I don't know how to point to the v3.1 model using torch.hub.load.

Jeronymous commented 1 year ago

OK, I found a way, with torch.hub.load(repo_or_dir='snakers4/silero-vad:v3.1', ...), but there is a very inconvenient thing happening: https://github.com/linto-ai/whisper-timestamped/pull/142#discussion_r1398256438 @IntendedConsequence can you please have a look at that comment in the PR?

IntendedConsequence commented 1 year ago

@dgoryeo @Jeronymous I addressed your questions in the linked PR comment. Copying it here for context, so you don't have to chase the pointer:

I found a commit that addresses this issue in the Silero repository, but judging from the commit dates, it seems to have been merged after the default switched to v4.0. I don't know what the best option is here. I personally don't use the Silero repo anymore: because I wanted a near-instant inference start on demand (to skip non-speech in my local mpv player from any playback position), I switched to a self-contained minimal C program that calls onnxruntime's C API in a DLL. I just pipe the audio in from ffmpeg and it immediately returns the timestamps. Switching Silero versions for me was just a matter of renaming the model file and adjusting the onnxruntime calls (v4, IIRC, changed an output tensor dimension).

snakers4/silero-vad@df1d520

dgoryeo commented 1 year ago

Thanks @IntendedConsequence !

Jeronymous commented 12 months ago

Since version 1.14.1, several VAD methods can be used.

The same default method is used when vad is True, but one can also specify another method (for instance a particular Silero version, or auditok).

This is documented in the README.