Replace the existing TTS cascade with a speech decoder that generates speech directly. The current cascade adds latency to Ichigo's response time.
Potential Solutions
Llama-Omni uses an upsampling-decoder-vocoder pipeline to generate speech
MaskGCT (slow)
Find a way to reuse WhisperSpeech, which is already a TTS model. It would need to be fine-tuned to our desired single speaker, with high-quality generation.
In light of recent developments on f5-tts, we can consider slowing down on this front temporarily, since f5-tts can do most of what we need.
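To make the upsampling-decoder-vocoder option listed above more concrete, here is a toy sketch of how the three stages hand data to one another. It is not Llama-Omni's implementation: every function name (`upsample`, `decode_units`, `vocode`, `speech_pipeline`), the argmax "decoder", and the sine-burst "vocoder" are hypothetical stand-ins chosen only to show the data flow from hidden states to discrete units to waveform samples.

```python
import math


def upsample(hidden_states, factor):
    """Stage 1 (assumed): repeat each hidden-state vector `factor` times in time."""
    return [list(h) for h in hidden_states for _ in range(factor)]


def decode_units(frames):
    """Stage 2 (toy stand-in for the speech decoder).

    Maps each frame to a discrete unit id, here simply the index of the
    largest value in the frame. A real decoder would be a trained network.
    """
    return [max(range(len(f)), key=f.__getitem__) for f in frames]


def vocode(units, samples_per_unit=4, sample_rate=16000):
    """Stage 3 (toy stand-in for the vocoder): one short sine burst per unit."""
    wave = []
    for u in units:
        freq = 200.0 * (u + 1)  # arbitrary unit-to-pitch mapping, illustration only
        for n in range(samples_per_unit):
            wave.append(math.sin(2 * math.pi * freq * n / sample_rate))
    return wave


def speech_pipeline(hidden_states, factor=2):
    """Chain the three stages: hidden states -> frames -> units -> samples."""
    frames = upsample(hidden_states, factor)
    units = decode_units(frames)
    return units, vocode(units)


if __name__ == "__main__":
    # Two fake LLM hidden states with three dimensions each.
    hidden = [[0.1, 0.9, 0.0], [0.7, 0.2, 0.1]]
    units, wave = speech_pipeline(hidden)
    print(units, len(wave))
```

The point of the sketch is that speech is produced from the model's hidden states without a separate text-to-speech pass, which is where the latency saving over the cascade would come from.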