gpt-omni / mini-omni

An open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output conversational capabilities.
https://arxiv.org/abs/2408.16725
MIT License

Discrepancy in Token Processing Rates Between Whisper and SNAC_24Hz #119


itsliupeng commented 3 weeks ago

Do these token rates need to correspond? The Whisper adapter outputs features at 50 tokens/s, while SNAC_24Hz encodes at 12 tokens/s. For comparison, Moshi's Mimi encoder/decoder operates at 12.5 tokens/s.
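
For reference, a rough back-of-the-envelope check of where the two rates come from (the hop sizes below are my own assumptions, not values read from the mini-omni code):

```python
# Approximate derivation of the two frame rates (assumed hop sizes).

WHISPER_MEL_FRAME_RATE = 100      # Whisper log-mel frames per second (10 ms hop)
WHISPER_CONV_DOWNSAMPLE = 2       # encoder conv stack downsamples by 2x

whisper_rate = WHISPER_MEL_FRAME_RATE / WHISPER_CONV_DOWNSAMPLE   # 50 features/s

SNAC_SAMPLE_RATE = 24_000         # snac_24khz sample rate
SNAC_COARSE_HOP = 2048            # assumed effective hop of the coarsest codebook

snac_coarse_rate = SNAC_SAMPLE_RATE / SNAC_COARSE_HOP             # ~11.7 codes/s

print(f"Whisper adapter features: {whisper_rate:.1f} per second")
print(f"SNAC coarse codes:        {snac_coarse_rate:.1f} per second")
```

So roughly 50 input features per second of audio versus on the order of 12 output codes per second, if these assumptions hold.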

mini-omni commented 2 weeks ago

Hi, I think this is just a design difference between the modules from the modeling process; the input and output rates do not need to match. SNAC uses 12.5 tokens/s mainly to achieve higher compression.
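
If it helps, one way to check the SNAC rate empirically is to encode one second of audio with the standalone snac package and count the codes at each level. A minimal sketch, assuming the `from_pretrained`/`encode` usage from the snac README; the printed counts are approximate:

```python
import torch
from snac import SNAC

# Load the 24 kHz SNAC codec (CPU is fine for a quick check).
model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

# One second of dummy audio: shape (batch, channels, samples) at 24 kHz.
audio = torch.randn(1, 1, 24_000)

with torch.inference_mode():
    codes = model.encode(audio)  # list of token tensors, one per codebook level

# Roughly 12 / 23 / 47 tokens per second are expected for the coarse/mid/fine levels.
for level, c in enumerate(codes):
    print(f"level {level}: {c.shape[-1]} tokens per second")
```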