planning: Ichigo VAD - GitHub Issues

dan-homebrew commented 1 week ago

Goal

Remove the need to press the button by detecting the voice automatically
- Medium-term
- Enables ambient voice detection
- Enables interruptibility
Small model with a binary classifier for voice activity detection (VAD)
- Need to determine the activity parameter (e.g. how many seconds count as activity)
Long-term: advance VAD to a listening model
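
To make the "activity parameter" question concrete, here is a minimal sketch, assuming the classifier emits one speech probability per fixed-length frame. The function name `segment_speech`, its parameters (`frame_ms`, `threshold`, `min_speech_ms`, `min_silence_ms`) and their default values are hypothetical and are not taken from SileroVAD or Ichigo. It only illustrates how duration thresholds turn a per-frame binary decision into utterance boundaries.

```python
from typing import List, Tuple


def segment_speech(
    probs: List[float],
    frame_ms: int = 32,         # assumed frame length, not a library default
    threshold: float = 0.5,     # probability at or above this counts as speech
    min_speech_ms: int = 96,    # shorter bursts are discarded as noise
    min_silence_ms: int = 320,  # silence needed before a segment is closed
) -> List[Tuple[int, int]]:
    """Turn per-frame speech probabilities into (start_ms, end_ms) segments."""
    min_silence_frames = max(1, min_silence_ms // frame_ms)
    min_speech_frames = max(1, min_speech_ms // frame_ms)

    segments: List[Tuple[int, int]] = []
    start = None   # frame index where the open segment began
    silence = 0    # consecutive non-speech frames inside the open segment

    for i, p in enumerate(probs):
        is_speech = p >= threshold
        if start is None:
            if is_speech:
                start, silence = i, 0
        elif is_speech:
            silence = 0  # a short pause ended; stay in the same segment
        else:
            silence += 1
            if silence >= min_silence_frames:
                end = i - silence + 1  # exclusive end: first silent frame
                if end - start >= min_speech_frames:
                    segments.append((start * frame_ms, end * frame_ms))
                start, silence = None, 0

    # Close a segment that is still open when the audio ends.
    if start is not None:
        end = len(probs) - silence
        if end - start >= min_speech_frames:
            segments.append((start * frame_ms, end * frame_ms))
    return segments
```

A larger `min_silence_ms` tolerates longer pauses within one utterance but delays the end-of-speech signal. That trade-off matters for interruptibility.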

Tasklist

[ ] Menlo Realtime API
[ ] Cortex Realtime API

Resources

Use SileroVAD (https://github.com/snakers4/silero-vad), which is used by both VITA and the Hugging Face speech-to-speech pipeline
The Hugging Face pipeline might be a good reference point: https://github.com/huggingface/speech-to-speech

hahuyhoang411 commented 6 days ago

End-to-end VAD: https://github.com/modelscope/FunASR

PodsAreAllYouNeed commented 4 days ago

Hugging Face uses FunASR to support the Paraformer STT model, but uses SileroVAD for VAD. The FSMN-VAD provided by FunASR could be worth looking into as well. The FunASR pipeline also bundles VAD and diarization together with STT, which could be very useful.

The VAD handler written by Hugging Face, which uses some of the SileroVAD code, is quite nice: https://github.com/huggingface/speech-to-speech/blob/93d74ba3bc3ad1a948cc167d7cdb95699e49d867/VAD/vad_handler.py

It also includes audio enhancement, which is very useful. We could potentially adapt the handler to support other VADs, which would also cater to #93.

Current pipeline: Audio -> Ichigo -> TTS

Pipeline using the hf/s2s handler: Audio -> (VAD -> Enhancement) -> Ichigo -> TTS
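
A rough sketch of how the second pipeline could be wired so that the VAD is swappable. This is not the hf/s2s handler's actual code. The `VAD` protocol, `EnergyVAD`, `normalize`, `vad_stage` and `run_pipeline` are hypothetical names, the energy threshold stands in for a real model such as SileroVAD or FSMN-VAD, and the `model` and `tts` callables are placeholders for Ichigo and the TTS stage. For simplicity it gates chunk by chunk rather than accumulating whole utterances.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional, Protocol

Chunk = List[float]  # one block of mono audio samples


class VAD(Protocol):
    """Anything with an is_speech(chunk) method can be plugged in."""

    def is_speech(self, chunk: Chunk) -> bool: ...


@dataclass
class EnergyVAD:
    """Placeholder VAD: mean absolute amplitude at or above a threshold."""

    threshold: float = 0.1

    def is_speech(self, chunk: Chunk) -> bool:
        if not chunk:
            return False
        return sum(abs(s) for s in chunk) / len(chunk) >= self.threshold


def normalize(chunk: Chunk) -> Chunk:
    """Stand-in enhancement step: scale the chunk to peak amplitude 1.0."""
    peak = max((abs(s) for s in chunk), default=0.0)
    return chunk if peak == 0 else [s / peak for s in chunk]


def vad_stage(
    chunks: Iterable[Chunk],
    vad: VAD,
    enhance: Optional[Callable[[Chunk], Chunk]] = None,
) -> Iterator[Chunk]:
    """Audio -> (VAD -> Enhancement): drop non-speech, enhance the rest."""
    for chunk in chunks:
        if vad.is_speech(chunk):
            yield enhance(chunk) if enhance else chunk


def run_pipeline(chunks, vad, model, tts, enhance=None):
    """Audio -> (VAD -> Enhancement) -> model -> TTS, one output per speech chunk."""
    return [tts(model(c)) for c in vad_stage(chunks, vad, enhance)]
```

Because `vad_stage` depends only on the `is_speech` method, a different VAD could be swapped in by wrapping it in a class with that method, leaving the rest of the pipeline unchanged.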

homebrewltd / ichigo

planning: Ichigo VAD #91