The audio is shorter than the generated text and doesn't say the whole thing.

gpt-omni / mini-omni

Open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output for conversation.

https://arxiv.org/abs/2408.16725

MIT License

3.06k stars, 273 forks

The audio is shorter than the generated text and doesn't say the whole thing. #60

Closed Enchante503 closed 1 month ago

Enchante503 commented 1 month ago

The audio is shorter than the generated text and doesn't say the whole thing.

mini-omni is great. Is it possible to improve fluency and adjust the speaking speed? Is it also possible to display the generated text on the demo screen along with the audio?

superFilicos commented 1 month ago

Yes. The audio data used for training is typically shorter, while the text output tends to be longer, so the speech can easily come out incomplete when the text is particularly long.
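A rough way to check whether a given reply was cut off is to compare the audio duration with the time the text would take to speak. The sketch below is only illustrative: the 150 words-per-minute speaking rate, the 24 kHz sample rate, the 0.7 tolerance, and all function names are assumptions, not values or APIs taken from the mini-omni code.

```python
def estimated_speech_seconds(text: str, words_per_minute: float = 150.0) -> float:
    """Rough time needed to speak `text` at an assumed speaking rate."""
    return len(text.split()) / words_per_minute * 60.0


def audio_seconds(num_samples: int, sample_rate: int = 24000) -> float:
    """Duration of a mono waveform with `num_samples` samples."""
    return num_samples / sample_rate


def looks_truncated(
    text: str,
    num_samples: int,
    sample_rate: int = 24000,         # assumed output sample rate
    words_per_minute: float = 150.0,  # assumed speaking rate
    tolerance: float = 0.7,           # assumed slack before flagging
) -> bool:
    """Return True when the audio is much shorter than the text would need."""
    needed = estimated_speech_seconds(text, words_per_minute)
    actual = audio_seconds(num_samples, sample_rate)
    return actual < tolerance * needed
```

For example, a 150-word reply would need about 60 seconds at the assumed rate, so 20 seconds of audio would be flagged as likely incomplete.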

mini-omni commented 1 month ago

The audio is shorter than the generated text and doesn't say the whole thing.

mini-omni is great. Is it possible to improve fluency and adjust the speaking speed? Is it also possible to display the generated text on the demo screen along with the audio?

For the released model, it is not. However, fluency and speaking speed could be improved by training on more natural dialogue data.
Displaying the text is possible: we already print the output text in the server log. But we don't have time to add it to the demo for now.