snakers4 / silero-vad

Silero VAD: pre-trained enterprise-grade Voice Activity Detector
MIT License

[C++] Question: Why are Python and C++ timestamps different? #533

Open NathanJHLee opened 2 months ago

NathanJHLee commented 2 months ago

❓ Questions and Help

Hi Silero team! When I use silero-vad from Python, the results look good. But when I use silero-vad from C++, I get quite different results between Python and C++.

I prepared silero-vad 5.1 (pip) and a C++ build (silero-vad-master, downloaded on 2024-08-26).

# Test sample file: VoxConverse data

[asr1@k-atc12 cpp]$ sox --i voxconverse_data/dev/audio/afjiv.wav

Input File     : 'voxconverse_data/dev/audio/afjiv.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:02:31.25 = 2419968 samples ~ 11343.6 CDDA sectors
File Size      : 4.84M
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM

$ sha256sum ~/miniconda3/envs/wespeaker/lib/python3.9/site-packages/silero_vad/data/silero_vad.onnx
2623a2953f6ff3d2c1e61740c6cdb7168133479b267dfef114a4a3cc5bdd788f  miniconda3/envs/wespeaker/lib/python3.9/site-packages/silero_vad/data/silero_vad.onnx

# In Python:

from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

model = load_silero_vad(True)  # changed to True to use the ONNX model
wav = read_audio('/ws/stt/DB/SD/wespeaker/voxconverse_data/dev/audio/afjiv.wav')
speech_timestamps = get_speech_timestamps(wav, model)
for timestamp in speech_timestamps:
    print(timestamp)

{'start': 84512, 'end': 474592}
{'start': 476192, 'end': 506848}
{'start': 509984, 'end': 548320}
{'start': 554528, 'end': 686048}
{'start': 688672, 'end': 787936}
{'start': 789536, 'end': 826848}
{'start': 829472, 'end': 847328}
{'start': 848928, 'end': 859616}
{'start': 862240, 'end': 1046496}
{'start': 1048096, 'end': 1068000}
{'start': 1071136, 'end': 1341408}
{'start': 1357344, 'end': 1379296}
{'start': 1392160, 'end': 1408992}
{'start': 1418784, 'end': 1427936}
{'start': 1431584, 'end': 1485280}
{'start': 1488928, 'end': 1511904}
{'start': 1520672, 'end': 1569248}
{'start': 1578016, 'end': 1610208}
{'start': 1617440, 'end': 1651168}
{'start': 1653280, 'end': 1675744}
{'start': 1686048, 'end': 1710048}
{'start': 1715232, 'end': 1726432}
{'start': 1730080, 'end': 1751008}
{'start': 1753120, 'end': 1773536}
{'start': 1776160, 'end': 1791968}
{'start': 1795104, 'end': 1813984}
{'start': 1820192, 'end': 1860576}
{'start': 1869344, 'end': 1907680}
{'start': 1909280, 'end': 1959392}
{'start': 1966624, 'end': 1989088}
{'start': 2002976, 'end': 2050016}
{'start': 2055712, 'end': 2077152}
{'start': 2093600, 'end': 2132448}
{'start': 2138656, 'end': 2147808}
{'start': 2169888, 'end': 2211296}
{'start': 2222112, 'end': 2244064}
{'start': 2249760, 'end': 2267616}
{'start': 2271264, 'end': 2302944}
{'start': 2313760, 'end': 2327520}

# In C++ (built from the silero-vad source; 'silero-vad-master' downloaded on 2024-08-26)

I changed some parameters in 'silero-vad-master/examples/cpp/silero-vad-onnx.cpp':

float Threshold = 0.5,
int min_silence_duration_ms = 100,
int speech_pad_ms = 30,
int min_speech_duration_ms = 250,

They are taken from '~/miniconda3/envs/wespeaker/lib/python3.9/site-packages/silero_vad/utils_vad.py'.

sha256sum "../../src/silero_vad/data/silero_vad.onnx" 2623a2953f6ff3d2c1e61740c6cdb7168133479b267dfef114a4a3cc5bdd788f

[asr1@k-atc12 cpp]$ ./test
numchannel     : 1
samplerate     : 16000
bits_persample : 16
num_samples    : 2419968
num_data_size  : 4839936
{start:00019456,end:00200192}
{start:00202752,end:00258048}
{start:00261120,end:00400384}
{start:00403456,end:00473600}
{start:00477184,end:00506880}
{start:00510976,end:00548864}
{start:00555520,end:00637952}
{start:00642560,end:00686592}
{start:00689152,end:00727552}
{start:00729600,end:00787456}
{start:00790016,end:00826880}
{start:00829952,end:00846848}
{start:00849920,end:00858112}
{start:00863232,end:01068032}
{start:01071616,end:01083904}
{start:01088000,end:01289216}
{start:01295360,end:01311744}
{start:01314816,end:01324032}
{start:01326592,end:01340928}
{start:01357824,end:01378816}
{start:01394688,end:01408512}
{start:01420288,end:01427968}
{start:01432576,end:01484800}
{start:01491456,end:01510912}
{start:01521152,end:01569280}
{start:01578496,end:01609216}
{start:01619456,end:01625088}
{start:01627648,end:01650176}
{start:01655296,end:01676288}
{start:01687040,end:01710080}
{start:01716224,end:01724928}
{start:01731072,end:01750528}
{start:01754112,end:01762304}
{start:01765888,end:01772544}
{start:01777664,end:01790976}
{start:01796608,end:01813504}
{start:01821184,end:01859072}
{start:01873408,end:01906176}
{start:01910272,end:01923072}
{start:01926144,end:01959936}
{start:01967616,end:01989120}
{start:02003968,end:02050048}
{start:02058752,end:02076160}
{start:02094592,end:02114048}
{start:02116608,end:02131968}
{start:02170880,end:02191872}
{start:02195456,end:02211840}
{start:02223104,end:02244096}
{start:02250240,end:02267648}
{start:02272256,end:02303488}
{start:02314752,end:02327552}

I checked both ONNX model checksums; they are the same. Any clues? Thank you.

snakers4 commented 2 months ago

Hi,

I checked both ONNX model checksums; they are the same.

The next logical step would be to compare the raw probabilities output by the Python code and the C++ code.

If they are the same, the difference is in the post-processing. If not, it's in onnxruntime.

You see, the C++ example is community-contributed; we did not debug it.

snakers4 commented 2 months ago

Also, a standard suggestion: plot the probabilities for both implementations side by side with the audio envelope, and preferably with markers for the speech segments. That would help debugging.

NathanJHLee commented 2 months ago

Oh, I see. I thought your published C++ code was guaranteed. Do you have plans to release official C++ code in the future?

snakers4 commented 2 months ago

Oh, I see. I thought your published C++ code was guaranteed.

All examples are community-generated. PRs that fix bugs are appreciated.

Do you have plans to release official C++ code in the future?

Not yet.

smallsheep666 commented 2 months ago

I have found that, for the same input (for example, all zeros) right after reset_states(), the ONNX model output differs from the PyTorch model: the ONNX model's speech prob is 0.044, while the PyTorch model's speech prob is 0.012.

snakers4 commented 2 months ago

The jit and onnx models have slightly different input formats; most likely this is the reason.

smallsheep666 commented 2 months ago

I use the same parameters, but the VAD results can differ a lot; the output may contain a few more segments.

NathanJHLee commented 2 months ago

@smallsheep666
I will also check the probs between PyTorch and ONNX and let you know later. Thank you.

smallsheep666 commented 2 months ago

I have found why there is a huge probs difference between C++ and PyTorch. The PyTorch version of the ONNX wrapper keeps a 64-sample context (for 16 kHz), so the input data is 512 + 64 samples. But the C++ code only uses the current window (512 samples). After adding the context samples before calling onnxruntime, the results are the same. It is a bug in the C++ example. @NathanJHLee

snakers4 commented 2 months ago

The PyTorch version of the ONNX wrapper keeps a 64-sample context (for 16 kHz), so the input data is 512 + 64 samples. But the C++ code only uses the current window (512 samples). After adding the context samples before calling onnxruntime, the results are the same. It is a bug in the C++ example. @NathanJHLee

Looks like a C++ wrapper for a previous version of the model (v4 or v3.1). I believe there was a PR to fix that.

NathanJHLee commented 2 months ago

I compared the output probs from three setups: torch (Python), ONNX (Python), and onnxruntime (C++). I got the same results as @smallsheep666: the ONNX (C++) code has a problem producing probs, while torch (Python) and ONNX (Python) both show the same probs.

My test setup is below: silero-vad 5.1, torch 2.4.0.

Working directory: silero-vad-master/src/silero_vad

1. torch (Python)

import sys
import os
import torch
sys.path.append(os.path.expanduser('~/workspace/silero-vad-master/src/silero_vad'))

from utils_vad import read_audio
from utils_vad import get_speech_timestamps

model = torch.jit.load('data/silero_vad.jit')

audio = read_audio('/DB/SD/wespeaker/voxconverse_data/dev/audio/afjiv.wav')

audio_length_samples = len(audio)
window_size_samples = 512

speech_probs = []
for current_start_sample in range(0, audio_length_samples, window_size_samples):
    chunk = audio[current_start_sample: current_start_sample + window_size_samples]
    if len(chunk) < window_size_samples:
        chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
    speech_prob = model(chunk, 16000).item()
    print(speech_prob)

2. ONNX (Python)

import sys
import os
import torch
sys.path.append(os.path.expanduser('~/workspace/silero-vad-master/src/silero_vad'))

from utils_vad import OnnxWrapper
from utils_vad import read_audio
from utils_vad import get_speech_timestamps

model = OnnxWrapper('data/silero_vad.onnx', force_onnx_cpu=True)
audio = read_audio('/DB/SD/wespeaker/voxconverse_data/dev/audio/afjiv.wav')

audio_length_samples = len(audio)
window_size_samples = 512

speech_probs = []
for current_start_sample in range(0, audio_length_samples, window_size_samples):
    chunk = audio[current_start_sample: current_start_sample + window_size_samples]
    if len(chunk) < window_size_samples:
        chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
    speech_prob = model(chunk, 16000).item()
    print(speech_prob)

3. onnxruntime (C++)

I followed your instructions (silero-vad-master/examples/cpp/README.md) and added a std::cout as below:

159     float speech_prob = ort_outputs[0].GetTensorMutableData<float>()[0];
160     std::cout << "prob : " << speech_prob << std::endl;
161     float *stateN = ort_outputs[1].GetTensorMutableData<float>();

Probs from setups 1 and 2 are identical and shown in the first column; probs from setup 3 are in the second column.

torch/ONNX (Python)      onnxruntime (C++)
0.01201203465461731      0.0442627
0.007816523313522339     0.0336125
0.005424141883850098     0.0221236
0.032478004693984985     0.0149333
0.023117244243621826     0.0122732
0.030117541551589966     0.00846022
0.05572396516799927      0.00648624
0.06487590074539185      0.0289339
0.046058326959609985     0.03056
0.039179474115371704     0.0349256
0.030434370040893555     0.0270224
0.027803152799606323     0.0505134
0.01884964108467102      0.0558349
0.012964963912963867     0.0883535
0.014463871717453003     0.0743203
0.0173836350440979       0.0492192
0.014908134937286377     0.0641958
0.010565102100372314     0.238843
0.00588575005531311      0.160749
0.00439077615737915      0.20331624
...

To get the right probs, I wrote C++ code based on libtorch (TorchScript) and finally got the right result. Now I have to add some code to turn the probs into start and end segments. If I succeed, I will let you know. Thank you.
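For reference, here is a minimal sketch of such a libtorch-based probability loop. It mirrors the Python loop above rather than the exact code I wrote; read_wav() is a placeholder for any 16 kHz mono float WAV reader:

#include <torch/script.h>
#include <algorithm>
#include <cstring>
#include <iostream>
#include <vector>

int main() {
    // Load the TorchScript model (same relative path as in the Python scripts above).
    torch::jit::script::Module model = torch::jit::load("data/silero_vad.jit");
    model.eval();

    // Placeholder: fill `samples` with 16 kHz mono float audio, e.g. from a WAV reader.
    std::vector<float> samples /* = read_wav("afjiv.wav") */;

    const int64_t window = 512;  // 32 ms at 16 kHz
    for (size_t start = 0; start < samples.size(); start += window) {
        const int64_t n = std::min<int64_t>(window, samples.size() - start);
        // Copy the chunk and zero-pad the tail, like the Python loop.
        torch::Tensor chunk = torch::zeros({1, window}, torch::kFloat32);
        std::memcpy(chunk.data_ptr<float>(), samples.data() + start, n * sizeof(float));

        std::vector<torch::jit::IValue> inputs{chunk, static_cast<int64_t>(16000)};
        const float prob = model.forward(inputs).toTensor().item<float>();
        std::cout << prob << "\n";
    }
    return 0;
}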

NathanJHLee commented 2 months ago

I found an issue: I couldn't call reset_states on the jit model, and the model then shows different probs.

for example.

import sys
import torch
sys.path.append('/home/silero-vad-master/src/silero_vad')

from utils_vad import read_audio
from utils_vad import get_speech_timestamps
from utils_vad import init_jit_model

model = torch.jit.load('data/silero_vad.jit')

audio = read_audio('DB/SD/wespeaker/voxconverse_data/dev/audio/afjiv.wav')

audio_length_samples = len(audio)
window_size_samples = 512

speech_probs = []
for current_start_sample in range(0, audio_length_samples, window_size_samples):
    chunk = audio[current_start_sample: current_start_sample + window_size_samples]
    if len(chunk) < window_size_samples:
        chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
    speech_prob = model(chunk, 16000).item()
    print(speech_prob)

# Try the next input wav. I use the same audio one more time for this test.

speech_probs = []
for current_start_sample in range(0, audio_length_samples, window_size_samples):
    chunk = audio[current_start_sample: current_start_sample + window_size_samples]
    if len(chunk) < window_size_samples:
        chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
    speech_prob = model(chunk, 16000).item()
    print(speech_prob)

The first and second runs show different probs even though I use the same input data. So I added 'model.reset_states()' (from your Python code) before retrying; then it works fine.

So I also want to use 'model.reset_states()' with silero_vad.jit from C++, but only 'forward()' can be called.

According to 'silero-vad-master/src/silero_vad/utils_vad.py':

    def reset_states(self, batch_size=1):
        self._state = torch.zeros((2, batch_size, 128)).float()
        self._context = torch.zeros(0)
        self._last_sr = 0
        self._last_batch_size = 0

I don't know how to call 'reset_states' from C++. So, when the first wav file is finished, I tried feeding a zero chunk to reset the model, as below:

torch::Tensor chunk = torch::zeros({batch_size, 512}, torch::kFloat32);     // zero chunk
std::vector<torch::jit::IValue> inputs;
inputs.push_back(chunk);
inputs.push_back(16000);
torch::Tensor output = model.forward(inputs).toTensor();

But the code above couldn't solve the problem. And I found one thing I don't understand: when I set the sample rate and window size to 8000 and 256 respectively, it works as a trick. The side effect is latency: it needs almost 100 ms for a single inference in my environment. I don't think it's a good idea. Nevertheless, the model is reset correctly:

torch::Tensor chunk = torch::zeros({batch_size, 256}, torch::kFloat32);     // zero chunk
std::vector<torch::jit::IValue> inputs;
inputs.push_back(chunk);
inputs.push_back(8000);
torch::Tensor output = model.forward(inputs).toTensor();

So, I want to know the right way to call 'reset_states'. I searched for a related issue and found one asking the same question: the function 'reset_states' of the jit model can't be used from C++ code. Can you provide a method name so we can call it like 'model.get_method('name')()' or 'model.run_method('name')'?

Thank you.

snakers4 commented 2 months ago
    def reset_states(self, batch_size=1):
        self._state = torch.zeros((2, batch_size, 128)).float()
        self._context = torch.zeros(0)
        self._last_sr = 0
        self._last_batch_size = 0

This is a method from the ONNX wrapper, where states are reset manually. It is better to stick to the ONNX implementation if you cannot run the full TorchScript model in Python. In Python with jit, everything is handled inside the model; with ONNX, states have to be reset manually, as shown in the ONNX wrapper.
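For illustration, a minimal sketch of what the manual reset could look like in a C++ onnxruntime wrapper. The member names here are assumptions modeled on the Python OnnxWrapper; the state shape (2, batch, 128) and the 64-sample context come from utils_vad.py:

#include <vector>

// Hypothetical members mirroring the Python OnnxWrapper for a v5 model.
struct VadStateHolder {
    std::vector<float> state;    // recurrent state, shape (2, 1, 128), flattened
    std::vector<float> context;  // last 64 samples of the previous chunk (16 kHz)

    // Counterpart of Python's reset_states(): call it before each new stream/file.
    void reset_states() {
        state.assign(2 * 1 * 128, 0.0f);
        context.assign(64, 0.0f);
    }
};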

it works as a trick. The side effect is latency: it needs almost 100 ms for a single inference in my environment.

The first inference does not require 100 ms. It only requires a zero state, zero padding, and the audio chunk itself.

https://github.com/snakers4/silero-vad/blob/46f94b7d6029e19b482eebdfff0c18012fa84675/src/silero_vad/utils_vad.py#L63-L80

NathanJHLee commented 2 months ago

Oh, sorry, I missed that method of your jit model.

model.run_method("reset_states"); works fine for me. The model is reset correctly.
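For anyone else hitting this, a minimal sketch of the call between two files (the surrounding loops are placeholders):

#include <torch/script.h>

void process_two_files(torch::jit::script::Module& model) {
    // ... feed all chunks of the first wav through model.forward(...) ...
    model.run_method("reset_states");  // zero the internal state/context between streams
    // ... then feed the chunks of the second wav ...
}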

huxiaoyuqn commented 2 months ago

In order to be consistent with Python, I added this at the beginning of the predict function in C++:

std::vector<float> new_data(data.size() + 64, 0.0f);
std::copy(data.begin(), data.end(), new_data.begin() + 64);
input.assign(new_data.begin(), new_data.end());

But the speech_prob and the timestamps are still different from Python's.

huxiaoyuqn commented 2 months ago

I debugged carefully and found three detailed differences between the C++ code and the Python code:

  1. As mentioned before: the C++ code does not prepend 64 elements to the input data in predict().
  2. The second dimension of input_node_dims should be extended by an additional 64 elements when constructing the input tensor; otherwise the last 64 elements of the data will not be considered when initializing input_ort.
  3. The 64 prepended elements are the last 64 elements of the previous data array (for the first chunk, since there is no previous one, 64 zeros are used).

Based on the above three points, I designed two vectors:

  1. Create a new member variable temp_data that keeps the last 64 elements of the previous data;
  2. Create a new vector new_data in predict(): its first 64 elements are copied from temp_data, and the remaining elements come from data. Use it instead of input as the source for input_ort.

If you apply the above points, you can modify the C++ code so that its detection results are consistent with Python's (a sketch follows below). Finally, one point still confuses me: the reset_states() function in the Python code may perform different operations around the last loop prediction. This does not seem to be considered in the C++ code, which may lead to a different speech_prob when predicting the last chunk. My English is poor; please forgive the translation.
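Here is a minimal sketch of the two-vector design described above. predict(), input, and input_ort belong to the existing C++ example and are assumed to exist elsewhere; only the context handling is shown:

#include <algorithm>
#include <vector>

class ContextBuffer {
    static constexpr size_t kContext = 64;  // context samples for 16 kHz
    // temp_data keeps the last 64 samples of the previous window (zeros at first).
    std::vector<float> temp_data = std::vector<float>(kContext, 0.0f);

public:
    // Build the 64 + 512 sample input the v5 ONNX model expects.
    // Assumes data.size() >= kContext (the 512-sample window).
    std::vector<float> with_context(const std::vector<float>& data) {
        std::vector<float> new_data(kContext + data.size(), 0.0f);
        std::copy(temp_data.begin(), temp_data.end(), new_data.begin());
        std::copy(data.begin(), data.end(), new_data.begin() + kContext);
        temp_data.assign(data.end() - kContext, data.end());  // remember the tail
        return new_data;  // use this (length 576) when filling input_ort
    }

    // Drop the carried-over context, as reset_states() does in Python.
    void reset() { std::fill(temp_data.begin(), temp_data.end(), 0.0f); }
};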

snakers4 commented 2 months ago

As mentioned before: the C++ code does not prepend 64 elements to the input data in predict().

The second dimension of input_node_dims should be extended by an additional 64 elements when constructing the input tensor; otherwise the last 64 elements of the data will not be considered when initializing input_ort.

The 64 prepended elements are the last 64 elements of the previous data array (for the first chunk, since there is no previous one, 64 zeros are used).

This should be done for a v5 model. I wonder whether the C++ wrapper that was supposedly adapted for v5 (in this PR: https://github.com/snakers4/silero-vad/pull/482) has this change, or whether an earlier version is being discussed.

In any case, a v5 model simply would not work without these features.

the reset_states() function in the Python code may perform different operations around the last loop prediction. This does not seem to be considered in the C++ code, which may lead to a different speech_prob when predicting the last chunk.

This function resets state in two places: it resets the state "inside" the model (since we cannot keep it inside the ONNX model directly, we drag it along in the interface as self._state), and it zeroes out the last 64 context samples (self._context).

https://github.com/snakers4/silero-vad/blob/46f94b7d6029e19b482eebdfff0c18012fa84675/src/silero_vad/utils_vad.py#L46-L50

This function, or its counterpart in the C++ code, should be invoked.

NathanJHLee commented 2 months ago

Hi @snakers4, do you have plans to release a half-precision jit model? That means a quantized model, right?

snakers4 commented 2 months ago

Hi @snakers4, do you have plans to release a half-precision jit model? That means a quantized model, right?

We used to have quantized models a long time ago, but there were many complaints that they did not run on some platforms. So we decided not to bother anymore, since the models are small.

NathanJHLee commented 2 months ago

Thank you for your answer. I have now tested batch inference, but I get different probs from the Silero model. I think the model keeps some internal state that carries over to the next incoming chunk.
Even though the probs still need checking, batch inference shows much shorter latency than single-chunk inference, so I think it's necessary. I found in your documentation that batch inference is possible with Silero v3. Does v5 also support batch inference? If yes, I would like to take a closer look.

NathanJHLee commented 2 days ago

Hi @snakers4! Please check my PR below: https://github.com/snakers4/silero-vad/pull/578

Thank you.