snakers4 / silero-vad

Silero VAD: pre-trained enterprise-grade Voice Activity Detector
MIT License
3.38k stars, 353 forks

Feature request - 10 or 20ms audio support #429

CerbonXD closed this issue 3 months ago

CerbonXD commented 3 months ago

🚀 Feature

I would like to request support for voice detection on chunks shorter than 32 ms.

Motivation

In my current project, I want to use a Voice Activity Detector to identify when a person is speaking. However, the audio I receive arrives in chunks of 321 samples at 16 kHz, which is roughly 20 ms of audio. Because that is shorter than the 32 ms chunk the VAD expects, the VAD does not work.

Pitch

Make the VAD compatible with audio chunks shorter than 32 ms, if possible.

Alternatives

No alternatives I can think of.

Additional context

I'm using Java.

snakers4 commented 3 months ago

Hi,

Generating annotations with a 20 ms window is very hard, and 30 ms is most likely good enough for the majority of applications.

You can try windowing the VAD, i.e. applying it to overlapping windows with a 20 ms hop.
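
A minimal sketch of that windowing idea in Java, for illustration only. It assumes a 32 ms analysis window at 16 kHz (512 samples) and keeps the most recent 512 samples in a buffer, so each incoming ~20 ms chunk produces one VAD call on a window that overlaps the previous one. The class name `SlidingVadWindow` and the `vad` callback are hypothetical stand-ins, not part of any Silero VAD API; the callback should be replaced by whatever model wrapper you actually use. Whether the model's internal state behaves well on overlapping input is not covered in this thread and would need to be checked.

```java
import java.util.function.ToDoubleFunction;

/** Keeps the latest 512 samples and runs a VAD callback once per incoming chunk. */
class SlidingVadWindow {
    static final int SAMPLE_RATE = 16_000;
    static final int WINDOW = 512; // 32 ms at 16 kHz

    private final float[] window = new float[WINDOW];
    private int filled = 0; // number of real samples currently in the window
    // Hypothetical stand-in for the real model call: takes 512 samples, returns a speech score.
    private final ToDoubleFunction<float[]> vad;

    SlidingVadWindow(ToDoubleFunction<float[]> vad) {
        this.vad = vad;
    }

    /**
     * Pushes one chunk (for example 320 or 321 samples, about 20 ms).
     * Returns NaN until 512 samples have been received, then the callback's score.
     */
    double push(float[] chunk) {
        int n = chunk.length;
        if (n > WINDOW) {
            throw new IllegalArgumentException("chunk longer than window");
        }
        // Drop the oldest n samples by shifting the rest to the front.
        System.arraycopy(window, n, window, 0, WINDOW - n);
        // Append the new chunk at the end of the window.
        System.arraycopy(chunk, 0, window, WINDOW - n, n);
        filled = Math.min(WINDOW, filled + n);
        if (filled < WINDOW) {
            return Double.NaN; // not enough audio yet
        }
        // Pass a copy so the callback cannot modify the internal buffer.
        return vad.applyAsDouble(window.clone());
    }

    public static void main(String[] args) {
        // Placeholder "VAD": mean absolute amplitude. This is NOT Silero VAD.
        SlidingVadWindow w = new SlidingVadWindow(win -> {
            double sum = 0;
            for (float s : win) sum += Math.abs(s);
            return sum / win.length;
        });
        float[] chunk = new float[320]; // 20 ms of silence at 16 kHz
        for (int i = 0; i < 3; i++) {
            System.out.println(w.push(chunk));
        }
    }
}
```

With 320-sample chunks, consecutive windows share 192 samples (12 ms), and the first result becomes available after the second chunk. A ring buffer would avoid the per-chunk shift, but the copy is cheap at this window size.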