gpt-omni / mini-omni

Open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output for conversation.
https://arxiv.org/abs/2408.16725
MIT License
3.06k stars, 273 forks

Architecture difference from technical report? #67

Closed: sphmel closed this issue 1 month ago

sphmel commented 1 month ago

In the technical report, the TTS adapter is described as 6 additional transformer blocks:

[image]

But I cannot see such an architecture in the linked config. Is it maybe post_adapter?

However, there is no post_adapter in the config or in the checkpoint.
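
A check like this can be done by scanning the checkpoint's parameter names. Below is a minimal sketch, assuming the checkpoint has already been loaded into a dict keyed by parameter name. The helper `find_params` and the key names in `example_keys` are hypothetical and only illustrate the idea; they are not taken from the actual mini-omni checkpoint.

```python
from typing import Iterable, List


def find_params(param_names: Iterable[str], needle: str) -> List[str]:
    """Return the parameter names that contain `needle` as a substring."""
    return [name for name in param_names if needle in name]


# Hypothetical parameter names, for illustration only.
example_keys = [
    "transformer.wte.weight",
    "transformer.h.0.attn.attn.weight",
    "transformer.h.23.mlp.fc_1.weight",
    "lm_head.weight",
]

# An empty result means no parameter belongs to a module named post_adapter.
post_adapter_params = find_params(example_keys, "post_adapter")
print(post_adapter_params or "no post_adapter parameters found")
```

With a real checkpoint, the keys of the loaded state dict would be passed in place of `example_keys`.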

mini-omni commented 1 month ago

Hi @seaplus296, you are right: post_adapter is the tts_adapter. However, the open-source version does not include the tts_adapter.

sphmel commented 1 month ago

@mini-omni Then does the paper version use a 30-layer transformer for the TTS output? Does that mean the lm head is not shared between the code vocab and the lm vocab?

sphmel commented 1 month ago

By the way, I think it would be good to note that the model here is not the one reported in the technical report, if they differ.

mini-omni commented 1 month ago

> @mini-omni Then does the paper version use a 30-layer transformer for the TTS output? Does that mean the lm head is not shared between the code vocab and the lm vocab?

Yes, we add 6 extra layers for the tts-adapter, and the code vocab and the lm vocab do not share an lm_head (even in the version without the tts-adapter).
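
To make the layer counts in this exchange concrete, here is a rough sketch of the two layouts. It assumes a 24-block base model (30 minus the 6 adapter layers), and it assumes that text logits are read after the base blocks while audio-code logits are read after the adapter blocks. The thread does not confirm where each head attaches, and the names `StackLayout`, `text_lm_head`, and `audio_lm_head` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class StackLayout:
    n_base_layers: int     # transformer blocks in the base LM
    n_adapter_layers: int  # extra blocks for the TTS adapter (0 = no adapter)

    @property
    def total_layers(self) -> int:
        return self.n_base_layers + self.n_adapter_layers

    def head_inputs(self) -> Dict[str, int]:
        """Number of blocks applied before each separate output head.

        Assumption: the text head reads the base stack's output and the
        audio head reads the output of the full stack including the adapter.
        """
        return {
            "text_lm_head": self.n_base_layers,
            "audio_lm_head": self.total_layers,
        }


# Paper version as described in the thread: base stack plus 6 adapter layers.
paper = StackLayout(n_base_layers=24, n_adapter_layers=6)
# Open-source version: no adapter, but still two separate heads.
open_source = StackLayout(n_base_layers=24, n_adapter_layers=0)

print(paper.total_layers, paper.head_inputs())
print(open_source.total_layers, open_source.head_inputs())
```

In this sketch, the only difference between the two versions is how many blocks sit in front of the audio head; the heads stay separate in both.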

mini-omni commented 1 month ago

> By the way, I think it would be good to note that the model here is not the one reported in the technical report, if they differ.

Thanks for your suggestion, I'll add it to the README.

sphmel commented 1 month ago

@mini-omni Thanks. Finally, is there any performance difference between the report's architecture and the open-source version? Why was the architecture changed?

mini-omni commented 1 month ago

> @mini-omni Thanks. Finally, is there any performance difference between the report's architecture and the open-source version? Why was the architecture changed?

The open-source version is the one used for synchronous development and experimental comparison. In our subjective evaluation, the version with the TTS-adapter has less impact on text processing capabilities.

sphmel commented 1 month ago

@mini-omni Thank you for the kind replies.

sphmel commented 1 month ago

@mini-omni Sorry, I have a follow-up question. If the open-source version's TTS adapter is just a vocab expansion, how is it trained in stage 1? Are only the expanded parameters trained?