Unless I am missing something, there is not much we can do about it... Silero VAD is wrong: it returns some speech segments in the first 5 minutes where there is indeed nothing. Namely these segments (in seconds):
[
{'start': 63.33, 'end': 71.646},
{'start': 72.738, 'end': 122.942},
{'start': 124.258, 'end': 133.502},
{'start': 136.194, 'end': 157.406},
{'start': 158.402, 'end': 210.75},
{'start': 211.81, 'end': 242.558},
{'start': 244.866, 'end': 263.294},
{'start': 264.706, 'end': 267.966}
]
I'll check if this can be improved by tuning some parameters of the VAD.
I have also played with the threshold parameter, but for whatever reason it didn't solve the problem. If it cannot be improved with other Silero VAD parameters, I would measure the energy in each segment that Silero returned as speech and remove those where the level is below a certain threshold. It's almost funny that in order to get rid of Whisper hallucinations we end up with Silero hallucinations.
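Something like this minimal sketch is what I have in mind (hypothetical code, not from the project; it assumes 16 kHz mono float audio in a numpy array, and the RMS threshold is illustrative):

import numpy as np

def filter_low_energy(audio, segments, sample_rate=16000, rms_threshold=1e-3):
    # Keep only VAD segments whose RMS energy is above the threshold
    kept = []
    for seg in segments:
        start = int(seg["start"] * sample_rate)
        end = int(seg["end"] * sample_rate)
        chunk = audio[start:end]
        # RMS of the segment; an empty slice counts as silent
        rms = float(np.sqrt(np.mean(chunk ** 2))) if len(chunk) else 0.0
        if rms >= rms_threshold:
            kept.append(seg)
    return kept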
ahah indeed, all neural nets hallucinating.
I looked at the probabilities of the Silero neural net, and it turns out it's complete crap on regions where the input audio is almost 0 (see figure below). There is no special local normalization preprocessing, so it's as if the recurrent neural network were (internally) amplifying tiny variations of the audio signal.
A solution could be to zero out parts of the signal that are "almost zero for some time". I tested it, and it works. But it's awkward to make that a general solution.
A better solution seems to be to use an "audio activity detector" like https://github.com/amsehili/auditok before using Silero VAD. Or just use auditok (not Silero VAD)... Because I feel that Whisper is more robust to noise/music than to silence.
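For reference, a minimal sketch of what that could look like with auditok (requires auditok >= 0.2; all tuning values below are illustrative):

import auditok

# Split the file into audio events based on signal energy
regions = auditok.split(
    "audio.wav",
    min_dur=0.5,          # minimum duration of a detected event, in seconds
    max_dur=30,           # maximum duration of a single event
    max_silence=0.3,      # maximum tolerated silence inside an event
    energy_threshold=50,  # detection threshold, to tune per recording
)
for region in regions:
    print(f"active audio from {region.meta.start:.3f}s to {region.meta.end:.3f}s")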
For me it makes perfect sense to have a first pass which detects near-silence. Then the noisy parts can be processed by a model. By the way, what do the no_speech_threshold/logprob_threshold options do? It sounds like they would also deal with silence.
no_speech_threshold is used to remove segments that the Whisper model detects as silence (it has some learnt VAD capabilities). But it's tricky to use.
In my experience logprob_threshold is not doing much.
You mentioned that you have something working. If you like I can play with it to see if it works in other samples.
You probably refer to:
A solution could be to zero out parts of the signal that are "almost zero for some time". I tested it, and it works. But it's awkward to make that a general solution.
I just meant that I manually zeroed out the first 5 minutes of audio, knowing that it was an "almost zero" part. And it seemed to solve the issue. I could build a more general solution that zeroes out the "almost zero" parts in general, but I find it a bit awkward...
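For anyone who wants to experiment, a general version could look like this minimal sketch (hypothetical, not the code used for the test above; eps and min_len would need tuning):

import numpy as np

def zero_out_quiet_spans(audio, sample_rate=16000, eps=1e-4, min_len=1.0):
    # Zero out spans where |audio| stays below eps for at least min_len seconds
    out = audio.copy()
    quiet = np.abs(audio) < eps
    min_samples = int(min_len * sample_rate)
    start = None
    # The appended False acts as a sentinel to flush a quiet run at the end
    for i, is_quiet in enumerate(np.append(quiet, False)):
        if is_quiet and start is None:
            start = i
        elif not is_quiet and start is not None:
            if i - start >= min_samples:
                out[start:i] = 0.0
            start = None
    return out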
First, something important I forgot to mention in this thread: you can use the option --plot to plot the results of the VAD (it will also plot alignment results segment by segment, so this is for debugging; you might want to run it and stop it at some point).
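For example (only --plot is the option in question here; the rest of the command line is illustrative):

whisper_timestamped audio.wav --model medium --vad True --plot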
I created a branch with an attempt to integrate auditok instead of Silero VAD. The branch is called feature/auditok_vad.
@freddyertl You can play with this if you want, and post your comment here, or on this pull request: https://github.com/linto-ai/whisper-timestamped/tree/feature/auditok_vad
Thanks, great that you could do it so quickly. I did a first round of testing and it produces correct results where Silero VAD had problems. It seems that this energy-based approach is a better fit for Whisper because we don't have a noise problem but a silence problem. I will feed in more samples.
I also came across a bad VAD prediction on a completely silent recording. Maybe this problem can be partially solved with pydub's silence detection? I think this could be the first operation before VAD, with the time intervals from this module eventually combined with the VAD predictions.
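Something like this minimal sketch of pydub's silence helpers (standalone illustration, not integrated in whisper-timestamped; both thresholds would need tuning):

from pydub import AudioSegment
from pydub.silence import detect_nonsilent

audio = AudioSegment.from_wav("audio.wav")
# Millisecond intervals that are NOT silent
nonsilent_ms = detect_nonsilent(audio, min_silence_len=500, silence_thresh=-45)
# Convert to seconds, e.g. to intersect with the VAD segments
nonsilent_sec = [(start / 1000, end / 1000) for start, end in nonsilent_ms]
print(nonsilent_sec)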
Thanks @traidn for spotting another VAD method.
If you have a chance, when you see VAD issues, you can maybe try the feature/auditok_vad branch of whisper-timestamped.
Hi @Jeronymous and @freddyertl , I'm getting an error:
AttributeError: module 'auditok' has no attribute 'split'
Have you come across similar error by any chance?
No, I have version 0.2.0 of auditok. What does pip show auditok say for you?
That was it -- I upgraded to 0.2.0 and it went through. Thanks!
Is there a way for me to verify which branch my whisper_timestamped is installed from? I believe I have finalised the installation from auditok branch but just need to make sure. Thanks!
You can call whisper_timestamped --version (or, in Python, whisper_timestamped.__version__).
If it's 1.12.17, you're on the auditok branch.
ahah indeed, all neural nets hallucinating.
I looked at the probabilities of the Silero neural net, and it turns out it's complete crap on regions where the input audio is almost 0 (see figure below). There is no special local normalization preprocessing, so it's as if the recurrent neural network were (internally) amplifying tiny variations of the audio signal.
A solution could be to zero out parts of the signal that are "almost zero for some time". I tested it, and it works. But it's awkward to make that a general solution.
A better solution seems to be to use an "audio activity detector" like https://github.com/amsehili/auditok before using Silero VAD. Or just use auditok (not Silero VAD)... Because I feel that Whisper is more robust to noise/music than to silence.
This looks to me like the exact same issue I encountered when I upgraded from silero-vad V3 to V4. I went back to V3 and have had no problems since. Context: I've used it to remove non-spoken parts from thousands of different podcasts, stream VODs and YouTube audio over the last few years, to listen on the go.
@IntendedConsequence how do you go back to Silero v3? Is it done by changing repo_or_dir in this call:
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',  # no tag given: torch.hub fetches the repo's default branch
                              model='silero_vad',
                              force_reload=True,
                              onnx=USE_ONNX)
Thanks.
Thank you @IntendedConsequence for sharing your experience!
I want to test v3, but then I have the same question as @dgoryeo... I don't know how to point to the v3.1 model using torch.hub.load.
OK, I found a way with torch.hub.load(repo_or_dir='snakers4/silero-vad:v3.1', ...),
but there is a very inconvenient thing happening: https://github.com/linto-ai/whisper-timestamped/pull/142#discussion_r1398256438
@IntendedConsequence can you please have a look at that comment in the PR?
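For reference, a minimal sketch of the call that worked (torch.hub accepts the 'owner/repo:ref' form for repo_or_dir; force_reload is only there to bypass the locally cached default branch):

import torch

model, utils = torch.hub.load(
    repo_or_dir="snakers4/silero-vad:v3.1",  # the git tag pins the VAD version
    model="silero_vad",
    force_reload=True,
)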
@dgoryeo @Jeronymous I addressed your questions in the PR comment link. Copying here for context and so you don't have to pointer-chase it
I found a commit that addresses this issue in the Silero repository. But judging from the commit dates, it seems to have been merged after the default switched to v4.0? I don't know what the best option is here. I personally don't use the Silero repo anymore. Because I wanted a near-instant inference start on demand (to skip non-speech in my local mpv player from any playback position), I switched to a self-contained minimal C program that calls onnxruntime's C API in a DLL. I just pipe the audio from ffmpeg and it immediately returns the timestamps. Switching Silero versions for me was just a matter of renaming the model file and adjusting the onnxruntime API calls (V4, IIRC, changed an output tensor dimension).
snakers4/silero-vad@df1d520
Thanks @IntendedConsequence !
Since version 1.14.1, several VAD methods can be used.
The same default method is used if vad is True, but one can specify --vad="auditok" or --vad="silero:3.1".
This is documented in the README.
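For example (the audio file name and model are illustrative):

whisper_timestamped audio.wav --model medium --vad="auditok"
whisper_timestamped audio.wav --model medium --vad="silero:3.1"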
In the attached sample, there is almost perfect silence at the beginning. Still there are hallucinated words.
whisper_timestamped jon.wav --model medium.en --language en --verbose True --accurate --output_dir . --output_format txt,json --vad True --detect_disfluencies True
jon.zip