janhq / ichigo

Local realtime voice AI

Apache License 2.0


planning: Ichigo Decoder #111

Open PodsAreAllYouNeed opened 4 days ago

PodsAreAllYouNeed commented 4 days ago

Goal

Replace the existing TTS cascade with a speech decoder that generates speech directly. The current cascade adds latency to Ichigo's response time.

Potential Solutions

Llama-Omni uses an upsampling-decoder-vocoder pipeline to generate speech
MaskGCT (slow)
Find a way to reuse WhisperSpeech, which is already a TTS model. It would need to be fine-tuned to our desired single speaker, with high-quality generation.

tikikun commented 1 day ago

In light of recent developments on f5-tts, we can consider slowing down on this temporarily, since f5-tts can pretty much help us accomplish what we need.