An open-source multimodal large language model that can hear and talk while thinking, with real-time end-to-end speech input and streaming audio output for conversation.
Do these rates correspond? The Whisper Adapter outputs 50 tokens/s, while SNAC_24Hz produces codes at 12 tokens/s. For comparison, the Moshi Mimi encoder/decoder operates at 12.5 tokens/s.
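To make the mismatch concrete, here is a small sketch that compares the three quoted rates and counts how many tokens each stream would carry for the same stretch of audio. The dictionary keys and helper names are made up for illustration and are not taken from any of the projects. The numbers are only the ones quoted in the question.

```python
from fractions import Fraction

# Token rates as quoted in the question (tokens per second of audio).
# Keys are illustrative labels, not identifiers from any codebase.
RATES = {
    "whisper_adapter": Fraction(50),
    "snac": Fraction(12),
    "moshi_codec": Fraction(25, 2),  # 12.5 tokens/s
}


def tokens_for(name, duration_s):
    """Number of tokens a stream emits for duration_s seconds of audio."""
    return RATES[name] * duration_s


def rate_ratio(a, b):
    """How many tokens of stream a correspond to one token of stream b."""
    return RATES[a] / RATES[b]


if __name__ == "__main__":
    for name in RATES:
        print(name, "tokens in 10 s:", tokens_for(name, 10))
    print("whisper_adapter : snac =", rate_ratio("whisper_adapter", "snac"))
    print("whisper_adapter : moshi_codec =", rate_ratio("whisper_adapter", "moshi_codec"))
```

Using exact fractions shows that 50 to 12.5 is a whole-number ratio, while 50 to 12 is not, so the first two rates cannot be aligned one-to-one by a fixed integer stride.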