snakers4 / silero-vad

Silero VAD: pre-trained enterprise-grade Voice Activity Detector
MIT License

non-ONNX version of new VAD model doesn't work with 8 kHz audio #250

Closed: khusainovaidar closed this issue 2 years ago

khusainovaidar commented 2 years ago

The new version of the non-ONNX VAD model doesn't work with 8 kHz audio. Code from the example:

import torch

SAMPLING_RATE = 8000
USE_ONNX = False
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad', force_reload=True, onnx=USE_ONNX)

(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

wav = read_audio('any_8k_audio_file.wav', sampling_rate=SAMPLING_RATE)
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE)

This fails on line 252 of utils_vad:

speech_prob = model(chunk, sampling_rate).item()

ValueError: only one element tensors can be converted to Python scalars

khusainovaidar commented 2 years ago

For my example, the model returns a dim 3 tensor, and the item() call fails on it.

khusainovaidar commented 2 years ago

Also, this may not be quite the same issue, but we found a (subjective) quality degradation in the new version of the VAD (ONNX version). We tested it on clean samples and it now skips many voiced segments, whereas the previous version works excellently.

snakers4 commented 2 years ago

> Also, this may not be quite the same issue, but we found a (subjective) quality degradation in the new version of the VAD (ONNX version). We tested it on clean samples and it now skips many voiced segments, whereas the previous version works excellently.

Please create a separate ticket with the audio files and the hyper-parameters you are using, and please plot the probability charts.

snakers4 commented 2 years ago

> 'any_8k_audio_file.wav',

Please provide your audio file.

khusainovaidar commented 2 years ago

> Please provide your audio file.

It really doesn't matter. It fails with any tensor I tried, e.g. wav = torch.Tensor(1, 100000). It also fails with the first file I found on the Internet:

wav, sr = torchaudio.load('http://mauvecloud.net/sounds/pcm1608m.wav')
speech_chunks = get_speech_timestamps(wav, model, sampling_rate=8000)

adamnsandle commented 2 years ago

@khusainovaidar Hotfixed. Thanks for reporting! The wrong models were uploaded accidentally. The latest models are now in the repo; please check the quality on them. P.S. For the 8k model, it's better to use the kwarg window_size_samples=256.
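For anyone reading later, here is a rough sketch of what the window_size_samples=256 suggestion means in terms of time. It is plain Python and does not call silero-vad at all. The helper names window_duration_ms and split_into_windows are hypothetical, and the zero-padding of the last window is an assumption made for illustration, not a statement about how utils_vad is implemented.

```python
def window_duration_ms(window_size_samples: int, sampling_rate: int) -> float:
    """Return how many milliseconds of audio one window covers."""
    return 1000.0 * window_size_samples / sampling_rate


def split_into_windows(samples, window_size_samples: int):
    """Split a flat sequence of samples into consecutive fixed-size windows.

    Assumption for this sketch: the final, shorter window is padded
    with zeros so that every window has the same length.
    """
    windows = []
    for start in range(0, len(samples), window_size_samples):
        window = list(samples[start:start + window_size_samples])
        missing = window_size_samples - len(window)
        if missing > 0:
            window.extend([0.0] * missing)
        windows.append(window)
    return windows


if __name__ == "__main__":
    # 256 samples at 8 kHz cover the same time span as 512 samples at 16 kHz.
    print(window_duration_ms(256, 8000))
    print(window_duration_ms(512, 16000))

    # A dummy 100000-sample signal, like the torch.Tensor(1, 100000) test above.
    dummy = [0.0] * 100000
    print(len(split_into_windows(dummy, 256)))
```

With the real library, the kwarg would be passed along the lines of get_speech_timestamps(wav, model, sampling_rate=8000, window_size_samples=256), as suggested in the comment above.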