modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

Multimodal dataset: clarification on mix-data, multiple images #2354

Open VietDunghacker opened 3 weeks ago

VietDunghacker commented 3 weeks ago

Describe the feature
I have noticed that not all of the multimodal models available in ms-swift support multiple images, and even when a model does, the training code might not support it. The same applies to mixed text-image datasets: the code sometimes raises an error when I provide a mixed dataset. Can you please clarify which models support training with multiple images or with a mixed text-image dataset?
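
To be concrete, the snippet below is roughly what I mean by a mixed dataset: some rows use one image, some use several, and one is text-only. (I am assuming the `query`/`response`/`images` JSONL custom-dataset format here; paths and responses are placeholders.)

```jsonl
{"query": "<image>Describe this chart.", "response": "The chart shows ...", "images": ["charts/img_0001.png"]}
{"query": "<image><image>What changed between these two frames?", "response": "The second frame ...", "images": ["frames/img_0002.png", "frames/img_0003.png"]}
{"query": "Summarize the previous answer in one sentence.", "response": "In short, ...", "images": []}
```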


Jintao-Huang commented 3 weeks ago

This depends on whether the original model supports it. You can refer to the example code of the original model on HF/MS for more information.
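
For example, for multi-image input with Qwen2-VL, the model card on HF shows usage roughly like the sketch below (paths and the prompt are placeholders; see the model card for the full example):

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the original model and its processor from HF.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# A single user turn containing two images and a text query.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/image1.jpg"},
        {"type": "image", "image": "file:///path/to/image2.jpg"},
        {"type": "text", "text": "What are the differences between these images?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```

If the original model provides no such multi-image example, ms-swift generally cannot add that capability on top of it.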

VietDunghacker commented 3 weeks ago

Still, this does not answer all of my questions. All multimodal models support text-only inference, yet not all of the models here support mixed-data training. And even for some models that do support multiple images, such as Pixtral, fine-tuning throws an error.
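
In case it helps, this is roughly the command I ran when Pixtral failed (the dataset path is a placeholder, and the flags reflect my current setup, so adjust them if your CLI version differs):

```bash
CUDA_VISIBLE_DEVICES=0 swift sft \
  --model_type pixtral-12b \
  --sft_type lora \
  --dataset /path/to/mixed_multi_image.jsonl
```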