modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

Multimodal dataset: clarification on mix-data, multiple images #2354

Open VietDunghacker opened 3 weeks ago

VietDunghacker commented 3 weeks ago

Describe the feature
I have noticed that not all of the multimodal models available in ms-swift support multiple images, and even when a model does, the training code might not support it. The same applies to mixed text-image datasets: the code sometimes raises an error when I provide a mixed dataset. Can you please clarify which models support training with multiple images or with a mixed text-image dataset?
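
To be concrete, the snippet below is roughly what I mean by a mixed dataset: some rows use one image, some use several, and one is text-only. (I am assuming the `query`/`response`/`images` JSONL custom-dataset format here; paths and responses are placeholders.)

```jsonl
{"query": "<image>Describe this chart.", "response": "The chart shows ...", "images": ["charts/img_0001.png"]}
{"query": "<image><image>What changed between these two frames?", "response": "The second frame ...", "images": ["frames/img_0002.png", "frames/img_0003.png"]}
{"query": "Summarize the previous answer in one sentence.", "response": "In short, ...", "images": []}
```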


Jintao-Huang commented 3 weeks ago

This depends on whether the original model supports it. You can refer to the example code of the original model on HF/MS for more information.
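
For example, for multi-image input with Qwen2-VL, the model card on HF shows usage roughly like the sketch below (paths and the prompt are placeholders; see the model card for the full example):

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the original model and its processor from HF.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# A single user turn containing two images and a text query.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/image1.jpg"},
        {"type": "image", "image": "file:///path/to/image2.jpg"},
        {"type": "text", "text": "What are the differences between these images?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```

If the original model provides no such multi-image example, ms-swift generally cannot add that capability on top of it.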

VietDunghacker commented 3 weeks ago

Still, this does not answer all of my questions. All multimodal models support text-only inference, yet not all of the models here support mixed-data training. And even for some models that do support multiple images, such as Pixtral, fine-tuning throws an error.
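
In case it helps, this is roughly the command I ran when Pixtral failed (the dataset path is a placeholder, and the flags reflect my current setup, so adjust them if your CLI version differs):

```bash
CUDA_VISIBLE_DEVICES=0 swift sft \
  --model_type pixtral-12b \
  --sft_type lora \
  --dataset /path/to/mixed_multi_image.jsonl
```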