I wouldn't call that "very clear speech", in general. Have you tried lowering the threshold?
Thanks for the response! You're referring to the 'threshold' parameter, right?
(i.e. speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=sr, threshold=0.15))
Well, at 0.45 and higher it does not recognize any speech. When decreasing it gradually it recognizes a few short segments of speech (not satisfactory), and then, at 0.15, it detects the whole signal as speech... It is strange, as before the upgrade to v5.1 it detected speech pretty well on this data, with no need for threshold tuning.
Compared to real-life applications (e.g. call centers, medical support lines, etc.) this example is not challenging at all, hence the "very clear speech"...
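For reference, a minimal sketch of the sweep described above, assuming wav, model, get_speech_timestamps and sr are already set up as in the call quoted earlier:

# Sketch: sweep the detection threshold to see where segments appear/disappear.
# Assumes `wav`, `model`, `get_speech_timestamps` and `sr` from the snippet above.
for threshold in (0.45, 0.35, 0.25, 0.15):
    segments = get_speech_timestamps(wav, model, sampling_rate=sr, threshold=threshold)
    total_speech = sum(s['end'] - s['start'] for s in segments) / sr
    print(f"threshold={threshold}: {len(segments)} segments, {total_speech:.1f}s detected as speech")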
Can you plot the speech probability vs time for a sample audio for both v4 and v5? Have you specified other parameters like min speech duration or min silence duration?
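For reference, a minimal sketch of such a plot; example.wav is a placeholder file name, and it assumes 16 kHz audio with 512-sample chunks (the chunk size the latest model expects; v4 accepts it as well):

import torch
import matplotlib.pyplot as plt

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True)
(get_speech_timestamps, _, read_audio, *_) = utils

sr = 16000
wav = read_audio('example.wav', sampling_rate=sr)  # placeholder file name

model.reset_states()  # clear internal state before streaming chunks
chunk = 512  # samples per window at 16 kHz
probs = [model(wav[i:i + chunk], sr).item()
         for i in range(0, len(wav) - chunk + 1, chunk)]

plt.plot([i * chunk / sr for i in range(len(probs))], probs)
plt.xlabel('time, s')
plt.ylabel('speech probability')
plt.show()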
How can I run v4? Up to version v4 there is only the torch hub option for getting the model and modules, isn't there? In fact, I had not installed Silero VAD at all, but rather used torch hub for importing the modules. Only after it stopped detecting speech did I find that v5.1 had been released, but I haven't figured out how to go back to v4...
@yaronwinter same problem for me, have you rolled back to v4 successfully?
How can I run v4?
It is always worth doing the following:

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad:v4.0',
                              model='silero_vad',
                              force_reload=True,
                              onnx=USE_ONNX)

For this particular case v4.0 gives this probability chart:

[speech probability vs. time chart for v4.0]

Which is nice, but almost the whole audio is speech anyway, except for the starting bit.
For the latest version it is:

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True,
                              onnx=USE_ONNX)

[speech probability vs. time chart for the latest version]

Here choosing the proper hyperparameters is next to impossible, because almost the whole audio is speech and there are no breaks.
To summarize, in this particular case the whole audio is mostly speech, and using v4.0 is probably the better option, because it was not tuned on a call-like data domain (the calls we typically see in the call center are much less noisy) and the model most likely treats this speech as background speech.
In any case you have three models to choose from: v3.1, v4.0 and latest.
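For completeness, each of these can be pinned via the torch.hub tag syntax; a sketch, assuming the onnx flag is accepted by each tagged hubconf as in the snippets above:

# Sketch: pin a specific silero-vad release by appending its git tag to the repo name.
model_v31, utils_v31 = torch.hub.load(repo_or_dir='snakers4/silero-vad:v3.1',
                                      model='silero_vad',
                                      force_reload=True,
                                      onnx=USE_ONNX)
model_v40, utils_v40 = torch.hub.load(repo_or_dir='snakers4/silero-vad:v4.0',
                                      model='silero_vad',
                                      force_reload=True,
                                      onnx=USE_ONNX)
model_latest, utils_latest = torch.hub.load(repo_or_dir='snakers4/silero-vad',  # no tag -> latest
                                            model='silero_vad',
                                            force_reload=True,
                                            onnx=USE_ONNX)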
@snakers4 From my experiments, people should choose v3.1 or v4.0 for call-center audio to get stable results. Anyway, thank you very much!!!
Looks like it depends on the audio quality. In our case audio quality is typically higher, hence we were optimizing the background speech objective as well.
I do not really know how to handle this better. If there are more edge cases, please open another issue. Maybe we will think of something, e.g. how to make the VAD run in several modes.
The same problem also applies to singing, music, murmuring, background TV noise, parrot speech, etc.
Hey, you know what, I ran into the same issue with the v4 model, and I avoided it by using the v5 model. But it seems it will have the same problem when processing specific data.
Right, it's the generic ML problem: any model performs best on data similar to its training set, and performance degrades for less similar data. When I switched to v5 there was a massive decline in performance, but more comprehensive tests afterwards showed that v5 also has advantages in some areas :(
Tried the v5 model on my low-quality, noisy call records too. v4 definitely performs better, as it gives lower probabilities for background voices and higher ones for speakers talking directly into the mic, which is good for my domain.
Discussed in https://github.com/snakers4/silero-vad/discussions/514