snakers4 / silero-vad

Silero VAD: pre-trained enterprise-grade Voice Activity Detector

⚠️Public pre-test of Silero-VAD v5 #448

Closed · snakers4 closed this issue 3 months ago

snakers4 commented 5 months ago

Dear members of the community,

Finally, we are nearing the release of the v5 version of the VAD.

Could you please share your audio edge cases in this ticket so that we can stress-test the new release of the VAD in advance?

Ideally, we need something like https://github.com/snakers4/silero-vad/issues/369 (which we incorporated into our validation when choosing the new models), but any systematic cases where the VAD underperforms would be welcome as well.

Many thanks!

rizwanishaq commented 5 months ago

"When is the release scheduled for v5?"

whaozl commented 5 months ago

I find that v4 does not perform well on the Chinese single word 【bye】, nor on the Cantonese single words 【喺啊】 and 【喺】.

asusdisciple commented 5 months ago

I do not have any edge cases, but it would be nice if you could change your benchmark methodology. There are a lot of models out there by now; adopting some new datasets like DIHARD III and comparing against other SOTA models like pyannote would be dope.

Purfview commented 5 months ago

Systematic cases would be:

- False positives on near-silence (introduced in v4).
- Inaccurate end of segments: the trailing part usually includes up to ~1000 ms of "padding" (introduced in v4); see the sketch below.
- Maybe not systematic, but the start of a segment is often ~100 ms too late.
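For reproducing the trailing-padding behaviour, something like the sketch below is what I have in mind (assuming the stock torch.hub entry point and the `get_speech_timestamps` utility from this repo; the file path and parameter values are just placeholders):

```python
import torch

# Load the published VAD via the standard hub entry point (assumption: unchanged API).
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(get_speech_timestamps, _, read_audio, *_) = utils

SAMPLE_RATE = 16000
wav = read_audio('example.wav', sampling_rate=SAMPLE_RATE)  # placeholder file

# speech_pad_ms and min_silence_duration_ms control how much trailing "padding"
# is kept after each detected segment, so sweeping them makes the issue visible.
segments = get_speech_timestamps(
    wav, model,
    sampling_rate=SAMPLE_RATE,
    threshold=0.5,
    speech_pad_ms=30,
    min_silence_duration_ms=100,
)
for seg in segments:
    print(seg['start'] / SAMPLE_RATE, seg['end'] / SAMPLE_RATE)  # boundaries in seconds
```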

cassiotbatista commented 4 months ago

Hi, it's me again 😄

We've done some experiments on what we called "model expectation" w.r.t. the LSTM states' reset frequency.

Recall from the previous issue that my interest is mainly in always-on scenarios: a VAD listening all the time to whatever is going on in the environment and triggering only when there is speech, which we assume to be a rare event. As such, the model would be expected to trigger only a few times (per day, say) relative to the effectively infinite audio stream it keeps receiving over time.

The experiment consists of feeding a long-ish stream of non-speech data to the model and checking how often it hallucinates, i.e., how often it sees speech where there is none. For that, we used the Cafe, Home and Car environments from the QUT-NOISE dataset, which contains 30-50 minute-long noise-only audio recordings.

In theory, one is presumably advised to reset the model states only after it has seen speech, but we took the liberty of resetting at regular time intervals irrespective of whether speech detection was triggered.

The following plots show the scikit-learn error rate (1 − accuracy, which goes up to 100% == 1.00), thereby framing the VAD as a frame-wise binary classification problem. The x-axis shows the frequency of model state resetting. Finally, the v3 and v4 models are shown in blue and red, respectively.

[Plots: frame-wise error rate (1 − accuracy) vs. state-reset frequency for the QUT Cafe, QUT Home and QUT Car environments; v3 in blue, v4 in red]
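The loop behind these numbers is roughly the following (a sketch only; the chunk size, reset interval and file path are placeholders, and I'm assuming the torch.hub loading interface and the model's `reset_states()` method from this repo):

```python
import torch
from sklearn.metrics import accuracy_score

# Load the published VAD (v3/v4 JIT model) via the standard hub entry point.
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(_, _, read_audio, *_) = utils

SAMPLE_RATE = 16000
CHUNK = 512                # samples per frame fed to the model (placeholder)
RESET_EVERY_S = 60         # reset LSTM states every N seconds (the value swept on the x-axis)

# Noise-only recording from QUT-NOISE (placeholder path).
wav = read_audio('qut_noise_cafe.wav', sampling_rate=SAMPLE_RATE)

preds = []
samples_since_reset = 0
for start in range(0, len(wav) - CHUNK + 1, CHUNK):
    chunk = wav[start:start + CHUNK]
    prob = model(chunk, SAMPLE_RATE).item()   # frame-wise speech probability
    preds.append(int(prob >= 0.5))            # binarize with a 0.5 threshold
    samples_since_reset += CHUNK
    if samples_since_reset >= RESET_EVERY_S * SAMPLE_RATE:
        model.reset_states()                  # periodic reset, regardless of detections
        samples_since_reset = 0

# Ground truth is all non-speech, so every positive frame is a "hallucination".
labels = [0] * len(preds)
error_rate = 1.0 - accuracy_score(labels, preds)  # the 1-acc metric plotted above
print(f'error rate: {error_rate:.4f}')
```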

I'll write up my conclusions later when I have time; I just wanted to provide a heads-up ASAP since it's been a while since this issue was opened.


EDIT: conclusions!

First of all, note that the graphs are not on the same scale, so the models make far fewer mistakes in the car environment (~4% vs. ~20% elsewhere), for example.

A possible takeaway is that this whole speech-expectation behaviour reflects the training scheme, since the model has probably not seen (or has, but only very rarely) instances of non-speech-only data after the LSTM states have been initialized. In other words, if the datasets used to train the VAD are the same ones used to train ASR systems, all the data contains speech, and that is what the model expects to see at the end of the day.

Any feedback on these results would be welcome @snakers4 😄

snakers4 commented 3 months ago

> A possible takeaway is that this whole speech-expectation behaviour reflects the training scheme, since the model has probably not seen (or has, but only very rarely) instances of non-speech-only data after the LSTM states have been initialized. In other words, if the datasets used to train the VAD are the same ones used to train ASR systems, all the data contains speech, and that is what the model expects to see at the end of the day.

We focused on this scenario when training the new VAD, since we had some relevant datasets and ran into issues ourselves when running noise-only / "speechless" audio through the VAD.

The new VAD version was released just now - https://github.com/snakers4/silero-vad/issues/2#issuecomment-2195433115.

We changed the way it handles context: we now pass a part of the previous chunk along with the current chunk, and we made the LSTM component 2x smaller while improving the feature pyramid pooling (we had an improper pooling layer).
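To illustrate the idea only (this is not the actual forward pass, and the context length here is a made-up placeholder), the chunk handling is conceptually similar to:

```python
import torch

CONTEXT_SAMPLES = 64  # placeholder value, not the real context length

class ChunkWithContext:
    """Keep the tail of the previous chunk and prepend it to the current one."""

    def __init__(self, context_samples: int = CONTEXT_SAMPLES):
        self.context = torch.zeros(context_samples)

    def __call__(self, chunk: torch.Tensor) -> torch.Tensor:
        # Model input = [tail of the previous chunk | current chunk]
        model_input = torch.cat([self.context, chunk])
        self.context = chunk[-len(self.context):]
        return model_input
```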

So, in theory and in our practice, the new VAD should handle this edge case better.

Can you please re-run some of your tests, and if the issue persists - please open a new issue referencing this one as context.
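Loading the new model should work the same way as before (a sketch; `force_reload=True` just refreshes the cached hub repo so you pick up the freshly released weights):

```python
import torch

# Re-pull the repo so the cached v4 model is replaced with the new release.
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad', force_reload=True)
```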

Many thanks!