Open VietDunghacker opened 2 weeks ago
This depends on whether the original model supports it. You can refer to the example code of the original model on HF/MS for more information.
Still, it does not answer all of my questions. All multimodal models support text only inference, yet not all of your model here supports mix-data training. And even some model support multi-image such as pixtral, it throws error when finetuning.
Describe the feature I have noticed that not all multimodal available here in ms-swift support multi-image, and if they do, the training code might not support it. It is also the case with mix text-image dataset, sometimes the code will cause an error if I provide mix dataset. Can you please clarify which model supports training with multiple images or mix text-image dataset?
Paste any useful information Paste any useful information, including papers, github links, etc.(请在这里描述其他有用的信息,比如相关的论文地址,github链接等)
Additional context Add any other context or information here(其他信息可以写在这里)