NathanJHLee opened this issue 2 months ago
Hi,
I checked both ONNX model checksums. They are the same.
The next logical step would be to compare the raw probabilities output by the python code and c++ code.
If they are the same - then it's post-processing. If not - it's onnx_runtime.
You see, the c++ example is community contributed, we did not debug it.
Also, a standard suggestion: plot the probabilities for both implementations side by side with the audio envelope, and probably with some markers for the speech segments; that would help debugging.
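For example, a minimal plotting sketch along those lines, assuming the per-chunk probabilities from both implementations have already been collected; probs_python, probs_cpp, audio and segments_samples are placeholders, not names from this repo:

import numpy as np
import matplotlib.pyplot as plt

# placeholders: probs_python / probs_cpp hold one probability per 512-sample chunk,
# audio is the 16 kHz waveform as a 1-D numpy array,
# segments_samples is a list of (start, end) pairs in samples
window_size_samples = 512
t_probs = np.arange(len(probs_python)) * window_size_samples / 16000.0
t_audio = np.arange(len(audio)) / 16000.0

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(t_audio, np.abs(audio) / max(np.abs(audio).max(), 1e-9), color='0.8', label='audio envelope')
ax.plot(t_probs, probs_python, label='python probs')
ax.plot(t_probs, probs_cpp, label='c++ probs')
ax.axhline(0.5, linestyle='--', linewidth=0.8, label='threshold')
for start, end in segments_samples:
    ax.axvspan(start / 16000.0, end / 16000.0, alpha=0.1)  # mark detected speech segments
ax.set_xlabel('time, s')
ax.set_ylabel('speech probability')
ax.legend()
plt.show()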
Oh, I see. I thought your open C++ code was guaranteed. Do you have plans to release official C++ code in the future?
Oh, I see. I thought your open C++ code was guaranteed.
All examples are community-generated. PRs to fix bugs are appreciated.
Do you have plans to release official C++ code in the future?
Not yet.
I have found that, using the same input (for example all zeros) after reset_states(), the ONNX model output is different from the pytorch model: the ONNX model's speech prob is 0.044 while the pytorch model's speech prob is 0.012.
The jit and onnx models have slightly different input formats; most likely this is the reason.
I use the same parameters, but the VAD results can differ a lot. There may be a few more segments.
@smallsheep666
I will also check the probs from both pytorch and onnx and let you know later.
Thank you
I have found why there is a huge difference in probs between C++ and pytorch. When using pytorch's ONNX version, there is a context of 64 samples for 16k, therefore the input data is 512 + 64. But the C++ code only uses the current window (512 samples). After adding the context samples before calling onnxruntime, the results are the same. It is a bug in the C++ code. @NathanJHLee
When using pytorch's ONNX version, there is a context of 64 samples for 16k, therefore the input data is 512 + 64. But the C++ code only uses the current window (512 samples). After adding the context samples before calling onnxruntime, the results are the same. It is a bug in the C++ code. @NathanJHLee
Looks like a C++ wrapper for a previous version of the model (v4 or v3.1). I believe there was a PR to fix that.
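For reference, a rough Python sketch of the 64-sample context handling described above, which the C++ example was missing. It is a simplification of what the OnnxWrapper in utils_vad.py does, not a copy of it; build_model_input is just an illustrative name:

import numpy as np

CONTEXT_SIZE = 64        # context samples for 16 kHz, as discussed above
WINDOW_SIZE = 512        # window samples for 16 kHz

context = np.zeros(CONTEXT_SIZE, dtype=np.float32)   # reset to zeros at the start of every new audio

def build_model_input(chunk):
    # chunk: the current 512-sample window; the model actually receives 64 + 512 = 576 samples
    global context
    x = np.concatenate([context, chunk])
    context = x[-CONTEXT_SIZE:]          # the last 64 samples become the context for the next window
    return x[np.newaxis, :]              # add the batch dimension -> shape (1, 576)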
I compared output probs from three setups: torch (python), Onnx (python) and Onnxruntime (c++). I got the same results as @smallsheep666. Onnx (c++) has a problem getting the probs; torch (python) and Onnx (python) both show the same probs.
My test is below (silero-vad 5.1, torch 2.4.0).
Location: silero-vad-master/src/silero_vad

1. torch (python)

import sys
import torch
sys.path.append('~/workspace/silero-vad-master/src/silero_vad')
from utils_vad import read_audio
from utils_vad import get_speech_timestamps

model = torch.jit.load('data/silero_vad.jit')
audio = read_audio('/DB/SD/wespeaker/voxconverse_data/dev/audio/afjiv.wav')

audio_length_samples = len(audio)
window_size_samples = 512

speech_probs = []
for current_start_sample in range(0, audio_length_samples, window_size_samples):
    chunk = audio[current_start_sample: current_start_sample + window_size_samples]
    if len(chunk) < window_size_samples:
        chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
    speech_prob = model(chunk, 16000).item()
    print(speech_prob)

2. Onnx (python)

import sys
import torch  # needed for torch.nn.functional.pad below
sys.path.append('~/workspace/silero-vad-master/src/silero_vad')
from utils_vad import OnnxWrapper
from utils_vad import read_audio
from utils_vad import get_speech_timestamps

model = OnnxWrapper('data/silero_vad.onnx', force_onnx_cpu=True)
audio = read_audio('/DB/SD/wespeaker/voxconverse_data/dev/audio/afjiv.wav')

audio_length_samples = len(audio)
window_size_samples = 512

speech_probs = []
for current_start_sample in range(0, audio_length_samples, window_size_samples):
    chunk = audio[current_start_sample: current_start_sample + window_size_samples]
    if len(chunk) < window_size_samples:
        chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
    speech_prob = model(chunk, 16000).item()
    print(speech_prob)
3. Onnx(c++)
I followed your instructions (silero-vad-master/examples/cpp/README.md)
and added a std::cout of speech_prob around these lines:
159: float speech_prob = ort_outputs[0].GetTensorMutableData<float>()[0];
161: float *stateN = ort_outputs[1].GetTensorMutableData<float>();
Probs from setups 1 and 2 are identical and are shown in the first column; setup 3 is the second column.
torch(python) / Onnx(python)    Onnx(c++)
0.01201203465461731             0.0442627
0.007816523313522339            0.0336125
0.005424141883850098            0.0221236
0.032478004693984985            0.0149333
0.023117244243621826            0.0122732
0.030117541551589966            0.00846022
0.05572396516799927             0.00648624
0.06487590074539185             0.0289339
0.046058326959609985            0.03056
0.039179474115371704            0.0349256
0.030434370040893555            0.0270224
0.027803152799606323            0.0505134
0.01884964108467102             0.0558349
0.012964963912963867            0.0883535
0.014463871717453003            0.0743203
0.0173836350440979              0.0492192
0.014908134937286377            0.0641958
0.010565102100372314            0.238843
0.00588575005531311             0.160749
0.00439077615737915             0.20331624
...
To get the right probs, I wrote C++ code based on libtorch (torchscript) and finally got the right results. Now I have to add some code to turn the probs into speech start and end segments. If I succeed, I will let you know. Thank you.
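In case it helps, a very simplified sketch of turning per-chunk probabilities into start/end segments. This is not the full get_speech_timestamps logic (which also handles speech_pad_ms and min_speech_duration_ms); the function name and defaults are placeholders:

def probs_to_segments(speech_probs, threshold=0.5, neg_threshold=0.35,
                      window_size_samples=512, min_silence_samples=1600):
    # basic hysteresis: start a segment when prob >= threshold,
    # end it once prob stays below neg_threshold for min_silence_samples
    segments, start, silence_start = [], None, None
    for i, prob in enumerate(speech_probs):
        pos = i * window_size_samples
        if prob >= threshold:
            silence_start = None
            if start is None:
                start = pos
        elif start is not None and prob < neg_threshold:
            if silence_start is None:
                silence_start = pos
            elif pos - silence_start >= min_silence_samples:
                segments.append({'start': start, 'end': silence_start})
                start, silence_start = None, None
    if start is not None:
        segments.append({'start': start, 'end': len(speech_probs) * window_size_samples})
    return segments

With window_size_samples = 512 and min_silence_samples = 1600, this roughly corresponds to min_silence_duration_ms = 100 at 16 kHz, one of the parameters used later in this thread.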
I found an issue: I couldn't use reset_states with the jit model, so the model shows different probs.
For example:
import sys
import torch
sys.path.append('/home/silero-vad-master/src/silero_vad')
from utils_vad import read_audio
from utils_vad import get_speech_timestamps
from utils_vad import init_jit_model
model = torch.jit.load('data/silero_vad.jit')
audio = read_audio('DB/SD/wespeaker/voxconverse_data/dev/audio/afjiv.wav')
audio_length_samples = len(audio)
window_size_samples = 512
speech_probs = []
for current_start_sample in range(0, audio_length_samples, window_size_samples):
    chunk = audio[current_start_sample: current_start_sample + window_size_samples]
    if len(chunk) < window_size_samples:
        chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
    speech_prob = model(chunk, 16000).item()
    print(speech_prob)
# Try to get the next input wav. I use the same audio one more time for the test.
speech_probs = []
for current_start_sample in range(0, audio_length_samples, window_size_samples):
    chunk = audio[current_start_sample: current_start_sample + window_size_samples]
    if len(chunk) < window_size_samples:
        chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
    speech_prob = model(chunk, 16000).item()
    print(speech_prob)
The first and second probs show different results, even though I use the same input data. So I added 'model.reset_states()' before retrying, and then it works fine.
So I also want to use 'model.reset_states()' for silero_vad.jit, but only 'forward()' can be used.
According to 'silero-vad-master/src/silero_vad/utils_vad.py':
def reset_states(self, batch_size=1):
    self._state = torch.zeros((2, batch_size, 128)).float()
    self._context = torch.zeros(0)
    self._last_sr = 0
    self._last_batch_size = 0
I don't know how to call 'reset_states' from C++. So when the first wav file is finished, I tried feeding a zero chunk to reset the model, as below.
torch::Tensor chunk = torch::zeros({batch_size, 512}, torch::kFloat32); //zero-pad
std::vector<torch::jit::IValue> inputs;
inputs.push_back(chunk);
inputs.push_back(16000);
torch::Tensor output = model.forward(inputs).toTensor();
But the above code couldn't solve the problem. I found one thing that I don't understand: when I set sample_rate and window_size_samples to 8000 and 256 respectively, it works as a trick, but the side effect is latency: it needs almost 100 ms for a simple inference in my environment. I think it's not a good idea. Nevertheless, the model is reset correctly.
torch::Tensor chunk = torch::zeros({batch_size, 256}, torch::kFloat32); //zero-pad
std::vector<torch::jit::IValue> inputs;
inputs.push_back(chunk);
inputs.push_back(8000);
torch::Tensor output = model.forward(inputs).toTensor();
So, I want to know the right way to call 'reset_states' correctly. I searched for related issues and found one asking the same question: the function 'reset_states' of the jit model can't be used from C++ code; can you provide a method name so we can call it like 'model.get_method("name")()' or 'model.run_method("name")'?
Thank you.
def reset_states(self, batch_size=1):
    self._state = torch.zeros((2, batch_size, 128)).float()
    self._context = torch.zeros(0)
    self._last_sr = 0
    self._last_batch_size = 0
This is a method from the ONNX wrapper, where states are reset manually.
It is better to stick to the ONNX implementation if you cannot run full torchscript in Python.
In Python with jit, everything is handled inside of the model. With ONNX, states are to be reset manually, as shown in the ONNX wrapper.
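For completeness, a minimal sketch of driving the ONNX model directly with onnxruntime in Python and resetting the state manually before each new audio. The tensor names ('input', 'state', 'sr', 'output', 'stateN') and shapes follow the v5 wrapper discussed in this thread, but verify them against utils_vad.py or the model metadata for your version; wav_files is a placeholder:

import numpy as np
import onnxruntime as ort
from utils_vad import read_audio   # returns a torch tensor

session = ort.InferenceSession('silero_vad.onnx', providers=['CPUExecutionProvider'])

for path in wav_files:                                   # placeholder list of wav paths
    # manual reset before each new audio, mirroring reset_states()
    state = np.zeros((2, 1, 128), dtype=np.float32)
    context = np.zeros((1, 64), dtype=np.float32)
    audio = read_audio(path).numpy()
    for start in range(0, len(audio), 512):
        chunk = np.zeros((1, 512), dtype=np.float32)
        piece = audio[start:start + 512]
        chunk[0, :len(piece)] = piece                    # zero-pad the last chunk
        x = np.concatenate([context, chunk], axis=1)     # 64 context samples + 512 new samples
        out, state = session.run(
            ['output', 'stateN'],                        # assumed output names
            {'input': x, 'state': state,
             'sr': np.array(16000, dtype=np.int64)})
        context = x[:, -64:]                             # carry the last 64 samples forward
        print(float(out.squeeze()))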
it works as a trick, but the side effect is latency: it needs almost 100 ms for a simple inference in my environment.
The first inference does not require 100 ms. It requires a zero state, zero padding and the audio chunk itself.
Oh sorry, I missed your jit model's method.
model.run_method("reset_states"); works fine for me. The model is reset correctly.
In order to be consistent with Python, I added these lines at the beginning of the predict function in C++:
std::vector<float> new_data(data.size() + 64, 0.0f);
std::copy(data.begin(), data.end(), new_data.begin() + 64);
input.assign(new_data.begin(), new_data.end());
But the speech_prob and the timestamps are still different from Python's.
I debugged carefully and found three detailed differences between the C++ code and the Python code:
1. The C++ code does not prepend the 64 context elements to the input data in the predict() function.
2. The second dimension of input_node_dims should be enlarged by an additional 64 elements when it is constructed, otherwise the last 64 elements of the data are not taken into account when initializing input_ort.
3. The 64 prepended elements are the last 64 elements of the previous data array (for the first chunk, since there is no previous one, 64 zeros are prepended).
Based on the above three points, I designed two vectors.
If you understand the above points, you can modify the C++ code so that its detection results are consistent with those of Python.
Finally, I still have a confusing point:
the function "reset_states()" in the Python code may perform different operations on the last loop prediction. This does not seem to be considered in the C++ code, which may lead to the value of speech_prob being different when predicting the last data.
As mentioned before: the C++ code does not prepend the 64 context elements to the input data in the predict() function.
The second dimension of input_node_dims should be enlarged by an additional 64 elements when it is constructed, otherwise the last 64 elements of the data are not taken into account when initializing input_ort.
The 64 prepended elements are the last 64 elements of the previous data array (for the first chunk, since there is no previous one, 64 zeros are prepended).
This should be done for a v5 model. I wonder whether the C++ wrapper that was supposedly adapted for v5 (in this PR https://github.com/snakers4/silero-vad/pull/482) has this change, or whether some earlier version is being discussed. In any case, a v5 model simply would not work without these features.
the function "reset_states()" in the Python code may perform different operations on the last loop prediction. This does not seem to be considered in the C++ code, which may lead to the value of speech_prob being different when predicting the last data.
This function resets states in two places - "inside" of the model (since we cannot do it directly in ONNX inside of the model, we drag this state along in the interface, self._state) and it zeroes out the last 64 elements (self._context).
This function or its counterpart in the C++ code should be invoked before processing each new audio file or stream.
Hi @snakers4, do you have a plan to release a half-precision jit model? That means a quantized model, right?
Hi @snakers4, do you have a plan to release a half-precision jit model? That means a quantized model, right?
We used to have quantized models a long time ago, but there were many complaints that they did not run on some platforms. So we decided not to bother anymore, since the models are small.
Thank you for your answer.
Now I have tested batch inference, but I got different probs from the silero model.
I thought the model keeps some cache (state) that it reuses for the next upcoming chunk.
Even though the probs still need to be checked, batch inference shows much shorter latency compared to single inference. I think it's necessary.
I found in your documentation that batch inference is possible with the silero V3 version.
Does V5 also support batch inference? If yes, I would like to take a closer look.
Hi snakers4! Please check my PR as below. https://github.com/snakers4/silero-vad/pull/578
Thank you.
❓ Questions and Help
Hi silero team! When I use silero-vad from Python, it works well. But when I use silero-vad from C++, I get quite different results between Python and C++.
I prepared silero-vad 5.1 (pip) and a C++ build (silero-vad-master downloaded on 2024-08-26) respectively.
# Test sample file: Voxconverse data
[asr1@k-atc12 cpp]$ sox --i voxconverse_data/dev/audio/afjiv.wav
Input File     : 'voxconverse_data/dev/audio/afjiv.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:02:31.25 = 2419968 samples ~ 11343.6 CDDA sectors
File Size      : 4.84M
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM
sha256sum ~/miniconda3/envs/wespeaker/lib/python3.9/site-packages/silero_vad/data/silero_vad.onnx
2623a2953f6ff3d2c1e61740c6cdb7168133479b267dfef114a4a3cc5bdd788f  miniconda3/envs/wespeaker/lib/python3.9/site-packages/silero_vad/data/silero_vad.onnx
# In Python.
# In C++ (built from the silero-vad source; I downloaded 'silero-vad-master' on 2024-08-26).
I changed some parameters in 'silero-vad-master/examples/cpp/silero-vad-onnx.cpp':
float Threshold = 0.5, int min_silence_duration_ms = 100, int speech_pad_ms = 30, int min_speech_duration_ms = 250,
These values are taken from '~/miniconda3/envs/wespeaker/lib/python3.9/site-packages/silero_vad/utils_vad.py'.
sha256sum "../../src/silero_vad/data/silero_vad.onnx" 2623a2953f6ff3d2c1e61740c6cdb7168133479b267dfef114a4a3cc5bdd788f
./test [asr1@k-atc12 cpp]$ ./test numchannel :1 samplerate :16000 bits_persample:16 num_samples :2419968 num_data_size :4839936 {start:00019456,end:00200192} {start:00202752,end:00258048} {start:00261120,end:00400384} {start:00403456,end:00473600} {start:00477184,end:00506880} {start:00510976,end:00548864} {start:00555520,end:00637952} {start:00642560,end:00686592} {start:00689152,end:00727552} {start:00729600,end:00787456} {start:00790016,end:00826880} {start:00829952,end:00846848} {start:00849920,end:00858112} {start:00863232,end:01068032} {start:01071616,end:01083904} {start:01088000,end:01289216} {start:01295360,end:01311744} {start:01314816,end:01324032} {start:01326592,end:01340928} {start:01357824,end:01378816} {start:01394688,end:01408512} {start:01420288,end:01427968} {start:01432576,end:01484800} {start:01491456,end:01510912} {start:01521152,end:01569280} {start:01578496,end:01609216} {start:01619456,end:01625088} {start:01627648,end:01650176} {start:01655296,end:01676288} {start:01687040,end:01710080} {start:01716224,end:01724928} {start:01731072,end:01750528} {start:01754112,end:01762304} {start:01765888,end:01772544} {start:01777664,end:01790976} {start:01796608,end:01813504} {start:01821184,end:01859072} {start:01873408,end:01906176} {start:01910272,end:01923072} {start:01926144,end:01959936} {start:01967616,end:01989120} {start:02003968,end:02050048} {start:02058752,end:02076160} {start:02094592,end:02114048} {start:02116608,end:02131968} {start:02170880,end:02191872} {start:02195456,end:02211840} {start:02223104,end:02244096} {start:02250240,end:02267648} {start:02272256,end:02303488} {start:02314752,end:02327552}
I checked both ONNX model checksums. They are the same. Any clues? Thank you.