An open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output for conversation.
I'm not very familiar with LLM training in multiple modalities, such as images or audio. Are there any instructions for beginners, or somewhere I can find tutorial material? Thanks.