gpt-omni / mini-omni

An open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output conversational capabilities.
https://arxiv.org/abs/2408.16725
MIT License

Request for training code #78

Open · CrazyBoyM opened this issue 2 months ago

CrazyBoyM commented 2 months ago

Hi, is there any training script or doc for reproducing this great repo? Thanks for your reply.

yukiarimo commented 2 months ago

+1

Alone749-i commented 1 month ago

+1

mini-omni commented 1 month ago

Hi, currently, due to some limitations, we may not release the training code. But you can refer to https://github.com/Lightning-AI/litgpt/blob/main/litgpt/pretrain.py; we modified our training code based on it.
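
As a concrete starting point, here is a minimal sketch (not from the authors) of driving that litgpt entry point from Python. The config path appears later in this thread; whether it is accepted depends on your litgpt version:

```python
# Minimal sketch (untested): launch `litgpt pretrain` on mini-omni's released
# model config via subprocess. Assumes litgpt is installed and the mini-omni
# checkpoint has been downloaded; newer litgpt versions may reject this
# config (see the error reported below).
import subprocess

subprocess.run(
    ["litgpt", "pretrain", "--config", "mini-omni/checkpoint/model_config.yaml"],
    check=True,
)
```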

vra commented 1 month ago

> Hi, currently, due to some limitations, we may not release the training code. But you can refer to https://github.com/Lightning-AI/litgpt/blob/main/litgpt/pretrain.py; we modified our training code based on it.

Hi @mini-omni, thanks for the information; it's really important for reimplementing the training code. May I kindly ask for the commit hash of litgpt that your implementation is based on, since that repo has been updated recently?

vra commented 1 month ago

For example, when I use litgpt==0.4.12 and run `litgpt pretrain --config mini-omni/checkpoint/model_config.yaml`, it raises an error:

```
usage: litgpt [-h] [--config CONFIG] [--print_config[=flags]]
              {download,chat,finetune,finetune_lora,finetune_full,finetune_adapter,finetune_adapter_v2,pretrain,generate,generate_full,generate_adapter,generate_adapter_v2,generate_sequentially,generate_tp,convert_to_litgpt,convert_from_litgpt,convert_pretrained_checkpoint,merge_lora,evaluate,serve}
              ...
error: Validation failed: Key "pretrain.model_name" is required but not included in config object or its value is None
```

I guess this is caused by updates to the litgpt source code.
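
If the schema mismatch is indeed the cause, one hedged workaround is to patch the released config before launching. Both assumptions here are guesses: that a top-level `model_name` key satisfies the validator, and that `"Qwen2-0.5B"` (mini-omni's backbone) is a name your litgpt version recognizes:

```python
# Hedged workaround, assuming newer litgpt simply requires a top-level
# `model_name` key in the pretrain config. "Qwen2-0.5B" is a guess based on
# mini-omni's backbone; verify both key and value against the model registry
# (litgpt/config.py) of your installed litgpt version.
import yaml

src = "mini-omni/checkpoint/model_config.yaml"
with open(src) as f:
    cfg = yaml.safe_load(f)

cfg.setdefault("model_name", "Qwen2-0.5B")  # assumed key and value

with open("model_config_patched.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```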

miyashita-code commented 1 month ago

+1

mini-omni commented 4 weeks ago

@vra, hi, we started from commit d367a1199a and modified from there. The model_config may not be compatible.
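
One way to reproduce against exactly that snapshot is to pin litgpt to the commit before installing; a short sketch:

```python
# Sketch: pin litgpt to the commit the authors started from (d367a1199a)
# before reproducing. Plain git + pip driven from Python; paths are
# illustrative.
import subprocess

subprocess.run(["git", "clone", "https://github.com/Lightning-AI/litgpt.git"], check=True)
subprocess.run(["git", "-C", "litgpt", "checkout", "d367a1199a"], check=True)
subprocess.run(["pip", "install", "-e", "./litgpt"], check=True)
```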

vra commented 4 weeks ago

@mini-omni Thanks for sharing. I've run into more questions while reproducing recently; could I trouble you to answer them:

  1. According to Table 1 of the paper, Stage 1 training uses only ASR data, but according to Figure 3, both the ASR adapter and the TTS adapter are trained in Stage 1. So on the data side, did you run the ASR data through an LLM to produce answer text, and then run that answer text through a TTS model to produce answer audio? If so, which LLM and which TTS model were used for the answer text and audio, respectively?
  2. Stage 2 trains only the text modality, so what is the audio in the Stage 2 part of Figure 3? The text dialogue dataset has no audio.

satheeshkola-532 commented 3 weeks ago

@mini-omni it would be better to release the actual training code for mini-omni; it would be very helpful for extending it to other languages as well.

mini-omni commented 3 weeks ago

@vra hi,

  1. Train tts_adapter: tts_text_in --> tts_audio_out, and asr_adapter: asr_audio_in --> asr_text_out. You can also skip training tts_adapter and instead use question_audio_in --> question_audio_out in the final stage to train the voice output capability.
  2. Stage 2 essentially trains question_audio_in --> answer_text_out; here, question_audio is synthesized from question_text (the full stage-to-data mapping is sketched after this list).
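
Collecting the answers in this thread, a hedged summary of the stage-to-data mapping; the names below are descriptive only, not identifiers from the mini-omni codebase:

```python
# Illustrative summary of the stage-to-data mapping described in this thread.
# Keys and names are descriptive only, not mini-omni code identifiers.
STAGE_TASKS = {
    "stage1_asr_adapter": ("asr_audio_in", "asr_text_out"),
    "stage1_tts_adapter": ("tts_text_in", "tts_audio_out"),  # optional, per the authors
    "stage2_text_answer": ("question_audio_in", "answer_text_out"),
    "final_stage_speech": ("question_audio_in", "question_audio_out"),
}

for stage, (inp, out) in STAGE_TASKS.items():
    print(f"{stage}: {inp} --> {out}")
```
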
satheeshkola-532 commented 3 weeks ago

> @vra hi,
>
>   1. Train tts_adapter: tts_text_in --> tts_audio_out, and asr_adapter: asr_audio_in --> asr_text_out. You can also skip training tts_adapter and instead use question_audio_in --> question_audio_out in the final stage to train the voice output capability.
>   2. Stage 2 essentially trains question_audio_in --> answer_text_out; here, question_audio is synthesized from question_text.

@mini-omni could you please provide the actual training scripts using litgpt?

mini-omni commented 1 week ago

@miyashita-code hi, due to some limitations, we may not release the exact training code.