k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker recognition, and VAD using next-gen Kaldi with onnxruntime without Internet connection. Support embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust
https://k2-fsa.github.io/sherpa/onnx/index.html
Apache License 2.0

VAD segment length cap at around 20s #1136

Open chiiyeh opened 1 month ago

chiiyeh commented 1 month ago

Hi, I was playing around with the VAD model and noticed that the maximum speech segment duration is capped at around 20 s regardless of the buffer size. I took a look at the code and saw that the limit is hardcoded in this line:

https://github.com/k2-fsa/sherpa-onnx/blob/de04b3b9bfc6d48a8ac340e00083d9fd5411b81e/sherpa-onnx/csrc/voice-activity-detector.cc#L156C7-L156C29

Would be nice if this can be a parameter that can be modified. My intuition was that the buffer size controls the maximum duration, but that turns out to be wrong. Not sure if this is the default behaviour for the original silero vad as well.

csukuangfj commented 1 month ago

> Not sure if this is the default behaviour for the original silero vad as well.

It is not the default behavior of silero vad.

We added this constraint because many users complained that the VAD gave them very long segments.

Typically, you won't get a segment longer than 20 seconds if your audio contains long enough pauses.


> Would be nice if this can be a parameter that can be modified.

We accept PRs to change that. Would you like to contribute?
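
For anyone picking this up, here is a rough sketch of what a configurable cap could look like. It is not the sherpa-onnx implementation (the linked C++ line is not reproduced in this thread), and `SegmenterConfig`, `max_speech_duration`, and `split_segments` are hypothetical names chosen for illustration. It assumes the VAD model has already produced one speech/non-speech decision per window and shows one possible policy: force-close a segment once it reaches the configured limit.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SegmenterConfig:
    """Hypothetical config; none of these names come from sherpa-onnx."""
    sample_rate: int = 16000
    window_size: int = 512             # samples per VAD decision
    max_speech_duration: float = 20.0  # seconds; the cap discussed above


def split_segments(is_speech: List[bool],
                   cfg: SegmenterConfig) -> List[Tuple[int, int]]:
    """Turn per-window speech flags into (start_sample, end_sample) pairs,
    force-closing any segment that reaches cfg.max_speech_duration."""
    # Longest allowed segment, measured in windows (at least one window).
    max_windows = max(
        1, int(cfg.max_speech_duration * cfg.sample_rate / cfg.window_size))
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None  # first window of the currently open segment
    for i, flag in enumerate(is_speech):
        if flag and start is None:
            start = i
        if start is not None and (not flag or i - start + 1 >= max_windows):
            # Silence ends the segment before window i; hitting the cap
            # ends it after window i.
            end = i if not flag else i + 1
            segments.append((start * cfg.window_size, end * cfg.window_size))
            start = None
    if start is not None:
        # Input ended while a segment was still open.
        segments.append(
            (start * cfg.window_size, len(is_speech) * cfg.window_size))
    return segments
```

With the cap exposed as a field, a user who wants longer segments only has to raise `max_speech_duration`; a hard split is just one choice here, and a real patch might instead relax the silence threshold once the limit is reached.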