snakers4 / silero-vad

Silero VAD: pre-trained enterprise-grade Voice Activity Detector

Does not recognize speech after upgrading to V5.1 #515

Closed · yaronwinter closed this issue 3 months ago

yaronwinter commented 3 months ago

Discussed in https://github.com/snakers4/silero-vad/discussions/514

Originally posted by **yaronwinter** August 7, 2024

I have been using Silero VAD for a few months now. After upgrading to V5.1 it suddenly fails to recognize very clear speech. I have tried both the torch.hub method and direct use of the package modules, and in both cases it did not recognize anything in a signal with very clear speech:

![torch_hub](https://github.com/user-attachments/assets/051b5bda-2b67-419c-9ad5-ee1ff0732323)
![torch_hub](https://github.com/user-attachments/assets/e934ce8c-9a7e-4c41-8424-e4642eea1b50)

And here is the audio file: https://github.com/user-attachments/assets/86ce0faa-778f-4f7b-89a1-a357ac0aa3ba

I would appreciate any advice!

Thanks,
Yaron
x86Gr commented 3 months ago

I wouldn't call that "very clear speech", in general. Have you tried lowering the threshold?

yaronwinter commented 3 months ago

Thanks for the response! You are referring to the `threshold` parameter, right?

(i.e. `speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=sr, threshold=0.15)`)

Well, at 0.45 and higher it does not recognize any speech. When I decrease it gradually it recognizes a few short segments of speech (not satisfactory), and then, at 0.15, it detects the whole signal as speech... It is strange, because before the upgrade to 5.1 it detected speech pretty well on this data, with no need for threshold tuning.
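
For reference, this is roughly the sweep I ran (a minimal sketch using the pip package helpers; the file name and threshold grid are illustrative):

```python
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

model = load_silero_vad()
wav = read_audio('clear_speech.wav', sampling_rate=16000)  # illustrative file name

# Sweep the threshold and see how much of the signal gets flagged as speech
for threshold in (0.45, 0.35, 0.25, 0.15):
    segments = get_speech_timestamps(wav, model, sampling_rate=16000, threshold=threshold)
    total_s = sum(seg['end'] - seg['start'] for seg in segments) / 16000
    print(f'threshold={threshold:.2f}: {len(segments)} segments, {total_s:.1f}s of speech')
```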

Compared to real-life applications (e.g. call centers, medical support lines, etc.) this example is not challenging at all, hence the "very clear speech"...

x86Gr commented 3 months ago

Can you plot the speech probability vs. time for a sample audio for both v4 and v5? Have you specified other parameters like min speech duration, min silence duration, ...?
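
Something along these lines should produce the plot (a rough sketch assuming 16 kHz audio and the 512-sample chunks the v5 models expect; `sample.wav` is a placeholder, and for v4 you would load the model via the pinned torch.hub tag instead):

```python
import matplotlib.pyplot as plt
from silero_vad import load_silero_vad, read_audio

SR = 16000
CHUNK = 512  # v5 models expect 512-sample chunks at 16 kHz

model = load_silero_vad()
wav = read_audio('sample.wav', sampling_rate=SR)  # placeholder path

model.reset_states()  # clear the internal state before streaming through the file
probs = [model(wav[i:i + CHUNK], SR).item()
         for i in range(0, len(wav) - CHUNK + 1, CHUNK)]

plt.plot([i * CHUNK / SR for i in range(len(probs))], probs)
plt.xlabel('time, s')
plt.ylabel('speech probability')
plt.show()
```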

yaronwinter commented 3 months ago

How can I run v4? Up to v4 the torch.hub route is the only way to get the model and modules, isn't it? In fact I had not installed Silero VAD at all, but rather used torch.hub to import the modules. Only after it stopped detecting speech did I discover that v5.1 had been released, and I haven't figured out how to go back to v4...

leminhnguyen commented 3 months ago

@yaronwinter same problem for me, have you rolled back to V4 successfully?

snakers4 commented 3 months ago

> How can I run v4?

https://github.com/snakers4/silero-vad/issues/474

snakers4 commented 3 months ago

It is always worth pinning the model version explicitly:

```python
import torch

USE_ONNX = False  # set to True to load the ONNX model instead

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad:v4.0',
                              model='silero_vad',
                              force_reload=True,
                              onnx=USE_ONNX)
```

For this particular case v4.0 gives this probability chart:

*(probability chart for v4.0)*

Which is nice, but almost the whole audio is speech except for the starting bit anyway.

For the latest version it is:

```python
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True,
                              onnx=USE_ONNX)
```

*(probability chart for the latest version)*

Here choosing the proper hyperparameters is next to impossible, because almost the whole audio is speech and there are no breaks.

To summarize, in this particular case the audio is mostly speech, and v4.0 is probably the better choice because it was not tuned on call-like data (the calls we typically see in call centers are much less noisy), so the newer model most likely treats this speech as background speech.

In any case you have three models to choose from: v3.1, v4.0, and latest.
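
For reference, each version can be pinned through its torch.hub tag, mirroring the snippets above (a sketch; it assumes the repo's release tags match the version names):

```python
import torch

# Pin a release by its git tag; omitting the tag loads the latest release
model_v31, utils_v31 = torch.hub.load('snakers4/silero-vad:v3.1', 'silero_vad')
model_v40, utils_v40 = torch.hub.load('snakers4/silero-vad:v4.0', 'silero_vad')
model_v5,  utils_v5  = torch.hub.load('snakers4/silero-vad',      'silero_vad')
```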

leminhnguyen commented 3 months ago

@snakers4 From my experiments, people should choose v3.1 or v4.0 for call-center audio to get stable results. Anyway, thank you very much!!!

snakers4 commented 3 months ago

Looks like it depends on the audio quality. In our case audio quality is typically higher, hence we were optimizing for the background speech objective as well.

I do not really know how to handle this better. If there are more edge cases, please open another issue. Maybe we will think of something, e.g. how to make the VAD run in several modes.

The same problem also applies to singing, music, murmuring, background TV noise, parrot speech, etc.

Simon-chai commented 2 months ago

Hey, you know what, I ran into the same issue with the V4 model, and I avoided it by switching to the V5 model. But it seems either one can have the same problem when processing specific data.

yaronwinter commented 2 months ago

Right, it's the generic ML problem: any model performs best on data similar to its training set, and performance degrades on less similar data. When I switched to V5 there was a massive decline in performance. But more comprehensive tests afterwards showed that V5 also has advantages in some areas :(

yuGAN6 commented 2 months ago

Tried the V5 model on my low-quality, noisy call records too. V4 definitely performs better: it gives lower probability to background voices and higher probability to people speaking directly into the mic, which is good for my domain.