snakers4 / silero-vad

Silero VAD: pre-trained enterprise-grade Voice Activity Detector
MIT License

The processed file sounds correct after the VAD, but the model still seems to know where the big silence gap was and stops at this place. #311

Closed: ghost closed this issue 1 year ago

ghost commented 1 year ago

❓Help - The processed file sounds correct after the VAD, but the model still seems to know where the big silence gap was and stops at this place.

Hi, maybe someone has faced a similar situation: I'm trying to apply silero-VAD before the wav2vec model. The processed file sounds correct after the VAD, but the model still seems to know where the big silence gap was and stops at that place. Any guesses as to why this might be?

snakers4 commented 1 year ago

Not sure I can understand. Can you maybe post a minimal example?

ghost commented 1 year ago

> Not sure I can understand. Can you maybe post a minimal example?

The idea is as follows: apply VAD to the original file audio.wav and use the collect_chunks() function to get a processed file without the non-speech fragments, only_speech.wav. Then pass this only_speech.wav file as the input to the wav2vec model. The problem is that if the original audio file contains longer pauses (about 2 sec), the VAD removes them successfully and the result sounds correct, but the wav2vec model still seems to know where the pause was and stops at that place, as if identifying it as the end of the file. And I can't figure out how the model detects this pause in the only_speech.wav file. As if there is some marker...
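
A minimal sketch of the preprocessing described above, using the standard silero-vad hub utilities; the file names audio.wav and only_speech.wav follow the description, and the 16 kHz sampling rate is an assumption:

```python
import torch

# Load the pre-trained Silero VAD model and its helper utilities via torch.hub
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad')
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

SAMPLING_RATE = 16000  # assumption: 16 kHz audio, as expected by most wav2vec checkpoints

# Read the original file and detect speech segments
wav = read_audio('audio.wav', sampling_rate=SAMPLING_RATE)
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE)

# Concatenate only the speech chunks (pauses are dropped) and save the result
save_audio('only_speech.wav',
           collect_chunks(speech_timestamps, wav),
           sampling_rate=SAMPLING_RATE)
```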

ghost commented 1 year ago

audio_files.zip contains files for an example. The wav2vec model transcribed only this part (using the only_speech.wav file): IN OUR WORK WE ARE OFTEN SURPRISED BY THE FACT THAT MOST PEOPLE KNOW ABOUT AUTOMATIC SPEECH RECOGNITION BUT NOW VERY LITTLE ABOUT VOICE ACTIVITY DETECTION
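
For context, a hedged sketch of how such a transcription could be produced; the thread does not say which wav2vec implementation or checkpoint was used, so the HuggingFace transformers API and the facebook/wav2vec2-base-960h checkpoint below are assumptions:

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Assumed checkpoint; the exact wav2vec model used in the thread is not stated
processor = Wav2Vec2Processor.from_pretrained('facebook/wav2vec2-base-960h')
asr_model = Wav2Vec2ForCTC.from_pretrained('facebook/wav2vec2-base-960h')

# Load the VAD-processed file (assumed to already be 16 kHz mono)
speech, sr = torchaudio.load('only_speech.wav')
inputs = processor(speech.squeeze(0).numpy(), sampling_rate=sr, return_tensors='pt')

# Greedy CTC decoding of the whole file
with torch.no_grad():
    logits = asr_model(inputs.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```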