gpt-omni / mini-omni

Open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output for conversation.
https://arxiv.org/abs/2408.16725
MIT License
2.47k stars, 241 forks

Architecture difference from technical report? #67

Closed (seaplus296 closed this 4 days ago)

seaplus296 commented 5 days ago

In the technical report, the TTS Adapter consists of 6 additional transformer blocks:

[image]

But in the linked config, I cannot see such an architecture. Is it maybe post_adapter?

But there is no post_adapter in the config or the checkpoint.

mini-omni commented 5 days ago

Hi @seaplus296, you are right, post_adapter is the tts_adapter. But the open-source version does not include the tts_adapter.

seaplus296 commented 5 days ago

@mini-omni Then, does the paper version have a 30-layer transformer for TTS output? Does that mean the lm head is not shared between the code vocab and the lm vocab?

seaplus296 commented 5 days ago

By the way, I think it would be good to note that the model here is not the one reported in the technical report, if it's different.

mini-omni commented 5 days ago

> @mini-omni Then, does the paper version have a 30-layer transformer for TTS output? Does that mean the lm head is not shared between the code vocab and the lm vocab?

Yes, we add 6 extra layers for the tts-adapter, and they do not share the lm_head (even in the version without the tts-adapter).
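
For readers skimming the thread, here is a minimal sketch that summarizes the two variants as described so far. It is only a bookkeeping aid, not code from the repository: the `VariantSketch` class and its field names are hypothetical, and the base depth of 24 blocks is simply inferred as 30 minus 6, assuming the "yes" above confirms the 30-layer total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantSketch:
    """Hypothetical summary of one model variant discussed in this thread."""
    name: str
    base_blocks: int          # transformer blocks in the language model
    tts_adapter_blocks: int   # extra blocks used only for audio output
    shared_lm_head: bool      # True if text and audio codes shared one head

    @property
    def audio_path_blocks(self) -> int:
        # Depth seen by audio-code predictions: base plus adapter blocks.
        return self.base_blocks + self.tts_adapter_blocks


# Assumption: 30 total blocks (from the question) minus the 6 adapter blocks.
BASE_BLOCKS = 30 - 6

paper = VariantSketch("technical report", BASE_BLOCKS, 6, shared_lm_head=False)
released = VariantSketch("open source", BASE_BLOCKS, 0, shared_lm_head=False)

for v in (paper, released):
    print(f"{v.name}: audio path {v.audio_path_blocks} blocks, "
          f"shared lm_head: {v.shared_lm_head}")
```

In both rows `shared_lm_head` is `False`, matching the statement that the text vocab and the audio code vocab use separate heads with or without the adapter.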

mini-omni commented 5 days ago

> By the way, I think it would be good to note that the model here is not the one reported in the technical report, if it's different.

Thanks for your suggestion, I'll add it to the README.

seaplus296 commented 4 days ago

@mini-omni Thanks. Finally, is there any performance difference between the report's architecture and the open-source version? Why was the architecture changed?

mini-omni commented 4 days ago

> @mini-omni Thanks. Finally, is there any performance difference between the report's architecture and the open-source version? Why was the architecture changed?

The open-source version is the one used for synchronous development and experimental comparison. In our subjective evaluation, the version with the TTS-adapter has less impact on text processing capabilities.

seaplus296 commented 4 days ago

@mini-omni Thank you for your kind replies.

seaplus296 commented 15 hours ago

@mini-omni Sorry, I have a follow-up question. If the open-source version's TTS adapter is just a vocab expansion, how is it trained in stage 1? Are only the expanded parameters trained?
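
To make the question concrete, here is a toy sketch of what "training only the expanded parameters" could mean: an embedding table whose original text-vocab rows stay frozen while only the newly added audio-code rows receive updates. This only illustrates the question and is not a description of how mini-omni was actually trained. The function `masked_sgd_step`, the vocab sizes, and the gradient values are all hypothetical.

```python
def masked_sgd_step(table, grads, trainable_rows, lr=0.5):
    """Return a new table where only rows in trainable_rows are updated."""
    updated = []
    for i, (row, grad_row) in enumerate(zip(table, grads)):
        if i in trainable_rows:
            # Plain SGD update for rows that are allowed to train.
            updated.append([w - lr * g for w, g in zip(row, grad_row)])
        else:
            # Frozen rows are copied through unchanged.
            updated.append(list(row))
    return updated


# Toy sizes, chosen only for illustration.
TEXT_VOCAB = 3    # rows that existed before the vocab expansion
AUDIO_VOCAB = 2   # rows added for audio codes
DIM = 2

table = [[1.0] * DIM for _ in range(TEXT_VOCAB + AUDIO_VOCAB)]
grads = [[0.5] * DIM for _ in table]

# Only the rows appended by the expansion are trainable.
expanded_rows = set(range(TEXT_VOCAB, TEXT_VOCAB + AUDIO_VOCAB))
updated = masked_sgd_step(table, grads, expanded_rows)

for i, row in enumerate(updated):
    kind = "expanded" if i in expanded_rows else "frozen"
    print(i, kind, row)
```

With a deep-learning framework the same effect is usually obtained by freezing parameters or masking gradients, but the row-level idea is the same.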