7r3nzy closed this issue 2 years ago
Hey, thanks for the report. I had to make a few fixes, but I got it to work (I think). Use the linked branch if you want to try it; I will merge it ASAP.
All-in-one Python inference (plus the wav tensor exported for reuse in tract):
import torch
import torchaudio
import onnxruntime
import numpy

session = onnxruntime.InferenceSession("files/silero_vad.onnx")
wav, sr = torchaudio.load("en.wav")
assert sr == 16000
samples = wav.shape[1]
window_size_samples = 1536

# Initial LSTM state, carried from one window to the next.
h = numpy.zeros((2, 1, 64)).astype('float32')
c = numpy.zeros((2, 1, 64)).astype('float32')

# Export the waveform so the Rust program can read the same input.
io = dict()
io["wav"] = wav.numpy().astype('float32')
numpy.savez("io.npz", **io)

output = []
for (chunk, win) in enumerate(range(0, samples, window_size_samples)):
    x = wav[:, win:win + window_size_samples]
    if x.shape[1] < window_size_samples:
        # Zero-pad the last, shorter window on the right of the sample axis.
        x = torch.nn.functional.pad(x, (0, window_size_samples - x.shape[1]))
    ort_inputs = {'input': x.numpy(), 'h0': h, 'c0': c}
    y, h, c = session.run(None, ort_inputs)
    output.append(y[0, 1, 0])

min_silence_duration_ms = 100
min_speech_duration_ms = 250
threshold = 0.5
neg_threshold = 0.35
triggered = False
current_speech = 0
temp_end = 0
min_silence_samples = min_silence_duration_ms * 16000 / 1000
min_speech_samples = min_speech_duration_ms * 16000 / 1000

for i, speech_prob in enumerate(output):
    if (speech_prob >= threshold) and temp_end:
        temp_end = 0

    if (speech_prob >= threshold) and not triggered:
        triggered = True
        current_speech = window_size_samples * i
        continue

    if (speech_prob < neg_threshold) and triggered:
        if not temp_end:
            temp_end = window_size_samples * i
        if (window_size_samples * i) - temp_end < min_silence_samples:
            continue
        else:
            if temp_end - current_speech > min_speech_samples:
                print(current_speech / 16000, temp_end / 16000)
            temp_end = 0
            triggered = False
            continue
And the same in Rust with tract (except that we read the wav from the io.npz generated by the Python script):
use ndarray_npy::NpzReader;
use tract_ndarray::Array2;
use tract_onnx::{prelude::*, tract_hir::internal::DimLike};

fn main() -> TractResult<()> {
    let window_size_samples = 1536;
    let model = onnx()
        .model_for_path("../silero-vad-3.1/files/silero_vad.onnx")?
        .with_input_names(["input", "h0", "c0"])?
        .with_output_names(["output", "hn", "cn"])?
        .with_input_fact(
            0,
            InferenceFact::dt_shape(f32::datum_type(), tvec!(1, window_size_samples)),
        )?
        .with_input_fact(1, InferenceFact::dt_shape(f32::datum_type(), tvec!(2, 1, 64)))?
        .with_input_fact(2, InferenceFact::dt_shape(f32::datum_type(), tvec!(2, 1, 64)))?
        .into_optimized()?
        .into_runnable()?;

    let mut npz = NpzReader::new(std::fs::File::open("../silero-vad-3.1/io.npz")?)?;
    let wav: Array2<f32> = npz.by_name("wav.npy")?;
    let wav = wav.into_arc_tensor();
    let samples = wav.shape()[1];

    let mut h = Tensor::zero::<f32>(&[2, 1, 64])?;
    let mut c = Tensor::zero::<f32>(&[2, 1, 64])?;
    let mut output: Vec<f32> = vec![];
    for ix in 0..samples.divceil(window_size_samples) {
        let offset = ix * window_size_samples;
        let mut x = Tensor::zero::<f32>(&[1, window_size_samples])?;
        let chunk_len = (samples - offset).min(window_size_samples);
        x.assign_slice(0..chunk_len, &wav, offset..offset + chunk_len, 1)?;
        let mut outputs = model.run(tvec!(x, h, c))?;
        c = outputs.remove(2).into_tensor();
        h = outputs.remove(1).into_tensor();
        output.push(outputs[0].as_slice::<f32>()?[1]);
    }

    let min_silence_duration_ms = 100;
    let min_speech_duration_ms = 250;
    let threshold = 0.5;
    let neg_threshold = 0.35;
    let min_silence_samples = min_silence_duration_ms * 16000 / 1000;
    let min_speech_samples = min_speech_duration_ms * 16000 / 1000;
    let mut triggered = false;
    let mut current_speech = 0;
    let mut temp_end = 0;
    for (ix, speech_prob) in output.into_iter().enumerate() {
        if speech_prob >= threshold && temp_end != 0 {
            temp_end = 0;
        }
        if speech_prob >= threshold && !triggered {
            triggered = true;
            current_speech = window_size_samples * ix;
        } else if speech_prob < neg_threshold && triggered {
            if temp_end == 0 {
                temp_end = window_size_samples * ix;
            }
            if (window_size_samples * ix) - temp_end >= min_silence_samples {
                if temp_end - current_speech > min_speech_samples {
                    println!("{} {}", current_speech as f32 / 16000., temp_end as f32 / 16000.);
                }
                temp_end = 0;
                triggered = false;
            }
        }
    }
    Ok(())
}
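Both scripts end with the same post-processing loop: a segment opens when the probability reaches threshold, and closes once it has stayed below neg_threshold for at least min_silence_duration_ms. To reuse or unit-test that logic without a model, it can be wrapped in a function. This is only a sketch under the same assumptions as above (1536-sample windows at 16 kHz); the name speech_segments and its parameters are hypothetical, not part of silero or tract. Like the loops above, it does not emit a segment that is still open when the audio ends.

```python
def speech_segments(probs, window=1536, rate=16000,
                    threshold=0.5, neg_threshold=0.35,
                    min_silence_ms=100, min_speech_ms=250):
    """Return (start_s, end_s) pairs from per-window speech probabilities.

    Hypothetical helper mirroring the post-processing loops in this thread.
    """
    min_silence = min_silence_ms * rate / 1000
    min_speech = min_speech_ms * rate / 1000
    triggered = False
    start = 0
    temp_end = 0
    segments = []
    for i, p in enumerate(probs):
        pos = window * i
        # Speech resumed before the silence lasted long enough: cancel the end.
        if p >= threshold and temp_end:
            temp_end = 0
        # Start of a new speech segment.
        if p >= threshold and not triggered:
            triggered = True
            start = pos
            continue
        if p < neg_threshold and triggered:
            # Remember where the silence began.
            if not temp_end:
                temp_end = pos
            # Keep waiting until the silence is long enough.
            if pos - temp_end < min_silence:
                continue
            # Close the segment, dropping it if the speech was too short.
            if temp_end - start > min_speech:
                segments.append((start / rate, temp_end / rate))
            temp_end = 0
            triggered = False
    return segments


# Four speech windows followed by silence yield a single segment.
probs = [0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
segments = speech_segments(probs)
```

With the defaults, a single speech window followed by silence lasts 1536 samples, which is under the 4000-sample minimum, so it produces no segment.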
Hey, this is really great, thank you so much for going the extra mile. I am definitely going to try it today and will let you know how it goes :).
Works perfectly! Thanks a lot. Closing the issue, since I see you have merged the changes as well. I should mention that the incorrect values issue was my own fault; your code helped there as well, though :).
Also, I am interested in turning this into an example. Would you accept a PR for it? I think that would resolve https://github.com/sonos/tract/issues/114
Please do! Indeed, it would be one answer to 114. There are other ways to do streaming (through the pulse system) but this is a perfectly valid approach.
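The streaming approach used in this thread comes down to feeding fixed-size windows to the model and carrying the recurrent state (h and c) from one call to the next. A minimal Python sketch of that loop, independent of any runtime: run_windows and the step callable are hypothetical stand-ins (step would wrap session.run in onnxruntime or model.run in tract), and toy_step is a fake model used only to make the example runnable.

```python
def run_windows(samples, step, state, window=1536):
    """Feed fixed-size windows to `step`, threading `state` through the calls.

    `step(chunk, state)` must return `(prob, new_state)`. The last chunk is
    zero-padded on the right so every call sees exactly `window` samples.
    """
    probs = []
    for offset in range(0, len(samples), window):
        chunk = list(samples[offset:offset + window])
        chunk += [0.0] * (window - len(chunk))
        prob, state = step(chunk, state)
        probs.append(prob)
    return probs, state


def toy_step(chunk, state):
    # Fake model: the "probability" is the mean absolute amplitude, and the
    # state simply counts how many windows have been processed.
    return sum(abs(s) for s in chunk) / len(chunk), state + 1


# Six samples with a window of four: the second window is half padding.
probs, calls = run_windows([1.0] * 6, toy_step, state=0, window=4)
```

Keeping the state outside the loop body is what lets the same function serve live audio: the caller can stop after any window and resume later with the returned state.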
+1, @kali thank you for the Rust example! I use tract for various models, and it's really great! In this case I needed GPU support, so I implemented it with Ort, but your example made that translation a lot easier.
I am trying to load a silero vad model from here
Code:
Error:
I am able to load the model successfully if I skip the into_optimized()? call.
On a separate note, with a successfully loaded model, I am not getting the same output values as I get with utils_vad.py from silero. I will share a repro for that in a separate issue as soon as I get the chance.