modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

Error with Qwen2VL When Images Are Missing #2452

Open LukeForeverYoung opened 5 days ago

LukeForeverYoung commented 5 days ago

Describe the bug (what the bug is and how to reproduce it, ideally with screenshots):

I encountered an issue while training Qwen2VL with Flash Attention enabled. When the training data contains samples without images, the following error occurs:

RuntimeError: cu_seqlens_q must be on CUDA

It appears that media_inputs['image_grid_thw'] needs to be moved to the appropriate device when there are no images in the sample.

https://github.com/modelscope/ms-swift/blob/78b6f781550aa19281c26ba5b41120156032f349/swift/llm/utils/template.py#L1655C17-L1655C99

Adding a call to media_inputs['image_grid_thw'].to(device) resolves the issue.
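A minimal sketch of the suggested adjustment. This is not the actual ms-swift code; the helper name `move_grid_to_device` and the shape of `media_inputs` are assumptions for illustration. The idea is simply that `image_grid_thw` must be placed on the same device as the other model inputs before Flash Attention builds `cu_seqlens_q`:

```python
import torch

def move_grid_to_device(media_inputs: dict, device: torch.device) -> dict:
    # Hypothetical helper, not from the ms-swift repository.
    # image_grid_thw holds one (t, h, w) grid per image; when the sample
    # has no images it may be left on the CPU, triggering
    # "RuntimeError: cu_seqlens_q must be on CUDA" under Flash Attention.
    grid = media_inputs.get('image_grid_thw')
    if grid is not None:
        media_inputs['image_grid_thw'] = grid.to(device)
    return media_inputs
```

In the training loop this would be called with the model's device (e.g. `model.device`) before the inputs are passed to the forward pass.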

Your hardware and system info (CUDA version, OS, GPU model, torch version):


Jintao-Huang commented 3 days ago

What versions of accelerate and transformers are you using?

LukeForeverYoung commented 3 days ago

> What versions of accelerate and transformers are you using?

accelerate 0.34.2, transformers 4.45.0.dev0