homebrewltd / ichigo

Llama3.1 learns to Listen
Apache License 2.0
1.16k stars, 42 forks

planning: Ichigo VAD #91

Open dan-homebrew opened 1 week ago

dan-homebrew commented 1 week ago

Goal

Tasklist

Resources

hahuyhoang411 commented 6 days ago

End-to-end (e2e) VAD: https://github.com/modelscope/FunASR

PodsAreAllYouNeed commented 4 days ago

Hugging Face uses FunASR to support the Paraformer STT model, but uses SileroVAD for voice activity detection. The FSMN-VAD provided by FunASR could be worth looking into as well. The FunASR pipeline also bundles VAD and diarization together with STT, which could be very useful.

The VAD handler that Hugging Face wrote using some of the SileroVAD code is quite nice: https://github.com/huggingface/speech-to-speech/blob/93d74ba3bc3ad1a948cc167d7cdb95699e49d867/VAD/vad_handler.py

It also includes enhancement, which is very useful. We could adapt the handler to support other VADs too, which would also cover #93.
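
To make the "support other VADs" idea concrete, here is a rough sketch of what a backend-agnostic handler could look like. Everything in it is hypothetical: the `VAD` protocol, `EnergyVAD`, and `VADHandler` are illustrative names, not the Hugging Face handler's API, and the energy threshold is only a stand-in for a real backend such as SileroVAD or FSMN-VAD.

```python
from typing import List, Optional, Protocol, Sequence


class VAD(Protocol):
    """Hypothetical interface any VAD backend would implement."""

    def is_speech(self, frame: Sequence[float]) -> bool: ...


class EnergyVAD:
    """Stand-in backend (assumption): speech if mean frame energy exceeds a threshold."""

    def __init__(self, threshold: float = 0.01) -> None:
        self.threshold = threshold

    def is_speech(self, frame: Sequence[float]) -> bool:
        if not frame:
            return False
        energy = sum(s * s for s in frame) / len(frame)
        return energy > self.threshold


class VADHandler:
    """Hypothetical handler: buffers speech frames, emits a segment after enough silence."""

    def __init__(self, vad: VAD, min_silence_frames: int = 2) -> None:
        self.vad = vad
        self.min_silence_frames = min_silence_frames
        self._buffer: List[float] = []
        self._silence = 0

    def process(self, frame: Sequence[float]) -> Optional[List[float]]:
        """Feed one frame; return a finished speech segment, or None if not ready."""
        if self.vad.is_speech(frame):
            self._buffer.extend(frame)
            self._silence = 0
            return None
        if not self._buffer:
            return None  # silence before any speech: nothing to emit
        self._silence += 1
        if self._silence >= self.min_silence_frames:
            segment, self._buffer = self._buffer, []
            self._silence = 0
            return segment
        return None
```

Swapping in another VAD would then only mean writing a small class with an `is_speech` method and passing it to `VADHandler`; the buffering logic stays the same.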

Current pipeline: Audio -> Ichigo -> TTS

Pipeline using the hf/s2s handler: Audio -> (VAD -> Enhancement) -> Ichigo -> TTS
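
A minimal sketch of the difference between the two pipelines, assuming each stage is a plain function and that a stage returning `None` stops the chain. All stage functions here (`vad_gate`, `enhance`, `ichigo_stub`, `tts_stub`) are placeholders invented for illustration; they do not call Ichigo, a TTS engine, or any real VAD.

```python
from typing import Callable, List, Optional, Sequence

Stage = Callable[[object], Optional[object]]


def run_pipeline(stages: Sequence[Stage], item: object) -> Optional[object]:
    """Pass item through each stage; stop early if a stage drops it (returns None)."""
    for stage in stages:
        if item is None:
            return None
        item = stage(item)
    return item


def vad_gate(audio: List[float]) -> Optional[List[float]]:
    """Placeholder VAD: drop audio whose mean energy is below a small threshold."""
    energy = sum(s * s for s in audio) / len(audio) if audio else 0.0
    return audio if energy > 1e-4 else None


def enhance(audio: List[float]) -> List[float]:
    """Placeholder enhancement: peak-normalize (a real enhancer would denoise)."""
    peak = max(abs(s) for s in audio)
    return [s / peak for s in audio] if peak > 0 else audio


def ichigo_stub(audio: List[float]) -> str:
    """Placeholder for the Ichigo model call."""
    return f"reply to {len(audio)} samples"


def tts_stub(text: str) -> str:
    """Placeholder for the TTS call."""
    return f"audio({text})"


current = [ichigo_stub, tts_stub]
proposed = [vad_gate, enhance, ichigo_stub, tts_stub]

silence = [0.0] * 8
speech = [0.1, -0.2, 0.1, 0.2]

print(run_pipeline(current, silence))   # silence still reaches the model
print(run_pipeline(proposed, silence))  # None: dropped by the VAD gate
print(run_pipeline(proposed, speech))
```

The point of the sketch is that with the VAD stage in front, silent input never reaches Ichigo or TTS, whereas the current pipeline processes everything it is given.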