snakers4 / silero-vad

Silero VAD: pre-trained enterprise-grade Voice Activity Detector
MIT License

would the c++ example still work after the new silero_vad.onnx release ? #472

Closed 1121170088 closed 1 week ago

1121170088 commented 1 week ago

I couldn't make it work. Maybe I made some mistakes that I don't realize.

yujinqiu commented 1 week ago

I have the same issue. It looks like the network structure changed?

[image: 5.0 version]

[image: 4.0 version]

csukuangfj commented 1 week ago

No. The current examples in https://github.com/snakers4/silero-vad/tree/master/examples won't work with Silero VAD v5 as of today (2024.06.29).

I suggest that you have a look at https://github.com/k2-fsa/sherpa-onnx/pull/1064

It supports both silero vad version 4 and 5.

It provides APIs for 10 different programming languages, e.g.,

It also supports running silero VAD with Android, iOS, Flutter, NodeJS, etc.

filtercodes commented 1 week ago

When I attempt to run inference with the old model, it's running fine like this:

output, h, c = session.run(['output', 'hn', 'cn'], {input_name: input_tensor, sr_name: np.array([sample_rate], dtype=np.int64), h_name: h, c_name: c})

With the new model, I would assume it works this way:

output, s_n = session.run(['output', 'stateN'], {input_name: input_tensor, sr_name: np.array([sample_rate], dtype=np.float32), state_n: stateN})

But I get an error: input: state Got: 1 Expected: 3 Please fix either the inputs/outputs or the model.

I do send 3 inputs (input_name, sr_name and state_n), and I hard-coded the output names from the model.

I also tried reshaping the state with stateN = s_n.reshape((2, 1, -1)), but the result is the same.

What am I missing here?
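The error above reports a rank mismatch on the input named state: an array with 1 dimension arrived where 3 were expected. One possible cause, not confirmed in this thread, is that a name variable chosen by position for the old model now points at a different input, so a 1-D array ends up under state. A minimal sketch of a pre-flight check follows. The helper find_rank_mismatches is hypothetical, and the example names and ranks are assumptions; in real use the expected ranks would be read from session.get_inputs().

```python
def find_rank_mismatches(expected_ranks, feed_shapes):
    """Compare the rank of each fed array with the rank the model declares.

    expected_ranks: dict of input name -> number of dimensions expected
    feed_shapes:    dict of input name -> shape tuple of the array being fed
    Returns a list of problem descriptions (empty when everything matches).
    """
    problems = []
    for name, rank in expected_ranks.items():
        if name not in feed_shapes:
            problems.append(f"missing input: {name}")
        elif len(feed_shapes[name]) != rank:
            got = len(feed_shapes[name])
            problems.append(f"input: {name} got rank {got}, expected {rank}")
    for name in feed_shapes:
        if name not in expected_ranks:
            problems.append(f"unexpected input: {name}")
    return problems


# Illustrative values only (assumed, not read from the real model).
# With onnxruntime, the expected ranks could be collected as:
#   expected = {i.name: len(i.shape) for i in session.get_inputs()}
expected = {"input": 2, "state": 3}
fed = {"input": (1, 1024), "state": (1,)}  # a 1-D array landed under "state"
for problem in find_rank_mismatches(expected, fed):
    print(problem)
```

Printing the names returned by session.get_inputs() next to the keys of the feed dictionary is often enough to spot which array went to which input.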

csukuangfj commented 1 week ago

What are the shapes of input_tensor and stateN? @filtercodes

filtercodes commented 1 week ago

Thanks for the reply,

I created input_tensor from an audio buffer that was previously converted to float32 using int2float() from the cpp example.

input_tensor = np.expand_dims(audio_float32, axis=0)

It's an audio buffer of 1024 samples.

and

stateN = np.zeros((2, 1, 128), dtype=np.float32)
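Putting the shapes from this thread together, a feed dictionary for the new model might be assembled as in the sketch below. This is not the project's official example: build_v5_feed is a hypothetical helper, and the default input names "input", "state" and "sr" are assumptions that should be confirmed with session.get_inputs(). The sample rate is kept as int64, as in the v4 call that works, rather than float32.

```python
import numpy as np


def build_v5_feed(audio_float32, state, sample_rate,
                  input_name="input", state_name="state", sr_name="sr"):
    """Assemble the feed dict for one audio chunk, keyed by input name.

    The default names are assumptions; check them against the real model.
    """
    chunk = np.asarray(audio_float32, dtype=np.float32)
    if chunk.ndim == 1:
        chunk = np.expand_dims(chunk, axis=0)  # (samples,) -> (1, samples)
    state = np.asarray(state, dtype=np.float32)
    if state.ndim != 3:
        raise ValueError(
            f"state must be 3-D, e.g. (2, 1, 128); got shape {state.shape}"
        )
    return {
        input_name: chunk,
        state_name: state,
        # int64, the same dtype as in the working v4 call
        sr_name: np.array([sample_rate], dtype=np.int64),
    }


feed = build_v5_feed(
    np.zeros(1024, dtype=np.float32),          # stand-in for the audio buffer
    np.zeros((2, 1, 128), dtype=np.float32),   # initial state from the thread
    16000,
)
print({name: (arr.shape, str(arr.dtype)) for name, arr in feed.items()})
```

With an onnxruntime session, the call would then be session.run(['output', 'stateN'], feed), using the output names quoted earlier in the thread, and the returned state would be passed back in as the state for the next chunk.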