snakers4 / silero-vad

Silero VAD: pre-trained enterprise-grade Voice Activity Detector
MIT License

Not able to perform real time VAD detection #411

Closed devashish-gopalani-cognoai closed 10 months ago

devashish-gopalani-cognoai commented 11 months ago

I have a system where I am receiving audio in real time. I want to run VAD on it to determine whether the incoming audio is speech or not. To do that, I have written the code below:

import base64  # needed for b64decode below

import torch

torch.set_num_threads(1)
USE_ONNX = False # change this to True if you want to test onnx model
self.model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                            model='silero_vad',
                            force_reload=True,
                            onnx=USE_ONNX)

(get_speech_timestamps,
save_audio,
read_audio,
VADIterator,
collect_chunks) = utils

self.vad_iterator = VADIterator(self.model)

speech_decoded = base64.b64decode(text_data_json.get("media").get("payload"))
print(self.vad_iterator(speech_decoded, return_seconds=True))
print(self.model(speech_decoded, 8000).item())

When I run the first print, I get the following error: "Audio cannot be casted to tensor. Cast it manually". Can someone help me fix this error? speech_decoded is the raw bytes obtained by base64-decoding the incoming payload.

UPDATE:

I changed the implementation to the below code:

import base64  # needed for b64decode below
import numpy as np  # needed for frombuffer below

import torch

torch.set_num_threads(1)
USE_ONNX = False # change this to True if you want to test onnx model
self.model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                            model='silero_vad',
                            force_reload=True,
                            onnx=USE_ONNX)

(get_speech_timestamps,
save_audio,
read_audio,
VADIterator,
collect_chunks) = utils

self.vad_iterator = VADIterator(self.model)

speech_decoded = base64.b64decode(text_data_json.get("media").get("payload"))
audio_int16 = np.frombuffer(speech_decoded, np.int16)
audio_float32 = self.int2float(audio_int16)

print(self.vad_iterator(audio_float32, return_seconds=True))
print(self.model(audio_float32, 8000).item())
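The int2float helper referenced above isn't shown in the snippet; a minimal sketch, consistent with the conversion used in the silero-vad usage examples (not necessarily the exact implementation), would be:

```python
import numpy as np

def int2float(sound: np.ndarray) -> np.ndarray:
    # Convert int16 PCM samples to float32 in roughly [-1, 1]
    # by dividing by 32768 (the int16 full-scale magnitude).
    sound = sound.astype(np.float32)
    sound *= 1.0 / 32768.0
    return sound.squeeze()
```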

On running the above code, I get the following error:

The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/vad/model/vad_annotator.py", line 98, in forward
    _16 = torch.gt(torch.div(sr1, (torch.size(x2))[1]), 31.25)
    if _16:
      ops.prim.RaiseException("Input audio chunk is too short", "builtins.ValueError")
      ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    else:
      pass

Traceback of TorchScript, original code (most recent call last):
  File "/home/keras/notebook/nvme_raid/adamnsandle/silero-models-research/vad/model/vad_annotator.py", line 364, in forward

        if sr / x.shape[1] > 31.25:
            raise ValueError("Input audio chunk is too short")
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE

        return x, sr
builtins.ValueError: Input audio chunk is too short

The audio chunks I am sending are 20 ms long. Is this not supported? If not, are there any workarounds? I tried concatenating multiple audio packets so that the duration increases, but then I get the error below:

Error while processing frame
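For context, the `sr / x.shape[1] > 31.25` check in the traceback implies a minimum chunk length, which explains why 20 ms chunks fail. A quick sanity check (assuming 8 kHz audio, as in the code above):

```python
SAMPLE_RATE = 8000

# The model rejects chunks where sr / n_samples > 31.25,
# i.e. chunks shorter than 8000 / 31.25 = 256 samples (32 ms at 8 kHz).
min_samples = SAMPLE_RATE / 31.25
print(min_samples)  # 256.0

# A 20 ms chunk is only 160 samples: 8000 / 160 = 50.0 > 31.25, so it is rejected.
samples_20ms = int(SAMPLE_RATE * 0.020)
print(samples_20ms, SAMPLE_RATE / samples_20ms)  # 160 50.0
```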

devashish-gopalani-cognoai commented 11 months ago

@snakers4 any sort of help for the above issue would be appreciated

Simon-chai commented 11 months ago

The error message is clear enough. In your case, if you want to use self.vad_iterator(audio_float32, return_seconds=True), you should change it to self.vad_iterator(audio_float32, return_seconds=True, sampling_rate=8000) and make sure that len(audio_float32) == 256; everything should be fine then. A little advice: concatenate your payloads first and iterate over them 256 samples at a time.
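The buffering approach above could be sketched roughly like this (assuming 8 kHz int16 payloads; ChunkBuffer is a hypothetical helper, not part of silero-vad):

```python
import numpy as np

WINDOW_SAMPLES = 256  # minimum chunk size at 8 kHz sampling rate

class ChunkBuffer:
    """Hypothetical helper: accumulate incoming int16 payloads and yield
    fixed 256-sample float32 windows suitable for the VAD iterator."""

    def __init__(self):
        self._buf = np.empty(0, dtype=np.int16)

    def feed(self, payload: bytes):
        # Append the new samples, then emit as many full windows as possible;
        # any remainder stays buffered until the next payload arrives.
        self._buf = np.concatenate([self._buf, np.frombuffer(payload, np.int16)])
        while len(self._buf) >= WINDOW_SAMPLES:
            window, self._buf = self._buf[:WINDOW_SAMPLES], self._buf[WINDOW_SAMPLES:]
            yield window.astype(np.float32) / 32768.0
```

Each yielded window could then be passed to self.vad_iterator(window, return_seconds=True, sampling_rate=8000).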

devashish-gopalani-cognoai commented 10 months ago

@Simon-chai, thanks for the help. I will try to implement this and get back to you if I face any difficulties. Till then, I am closing this issue.