Training Qwen2-VL on mixed data hangs in a multi-node, multi-card NPU deployment

lizhishan1997 commented 1 month ago

Reminder

[X] I have read the README and searched the existing issues.

System Info

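While fine-tuning Qwen2-VL, I ran into the following:

1. Single-node multi-card training on image-text pairs or text-only data, with either LoRA or full fine-tuning: succeeds.
2. Multi-node multi-card training on image-text pairs or text-only data, with either LoRA or full fine-tuning: succeeds.
3. Single-node multi-card training on mixed data, LoRA on 7B: succeeds.
4. Single-node multi-card training on mixed data, full fine-tuning of 7B with ZeRO-3 + offload: fails.
5. Multi-node multi-card training on mixed data, LoRA: fails.
6. Multi-node multi-card training on mixed data, full fine-tuning with ZeRO-3 + offload: fails.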

In the failing cases, training hangs right at the start.

Also, since each card has only 32 GB of memory, ZeRO-2 cannot run, so ZeRO-3 is the only option.

Reproduction

...

Expected behavior

No response

Others

No response

hiyouga commented 1 month ago

Mixed data is currently not supported with ZeRO-3.

lijiah33 commented 1 month ago

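A follow-up question: what can be done when a 72B model does not fit in a single card's memory? ZeRO-2 would presumably OOM, since it cannot load a full copy of the model. @hiyouga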
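For context on the memory concern above, here is a rough back-of-the-envelope sketch of per-device model-state memory under each ZeRO stage. It assumes the usual mixed-precision Adam accounting from the ZeRO paper (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states). The function name `model_state_gb`, the nominal parameter counts (7e9, 72e9), and the device counts are illustrative assumptions, not values taken from this thread. Activations, buffers, and CPU offload are not modeled.

```python
# Rough per-device memory for model states under mixed-precision Adam.
# Assumed accounting (ZeRO paper): 2 B/param fp16 weights, 2 B/param fp16
# gradients, 12 B/param fp32 optimizer states. Activations, temporary
# buffers, fragmentation, and CPU offload are NOT included.

GB = 1e9


def model_state_gb(num_params: float, num_devices: int, stage: int) -> float:
    """Estimate model-state memory per device in GB for ZeRO stage 0-3."""
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= num_devices    # stage 1 shards optimizer states
    if stage >= 2:
        grads /= num_devices    # stage 2 also shards gradients
    if stage >= 3:
        weights /= num_devices  # stage 3 also shards the weights
    return num_params * (weights + grads + optim) / GB


if __name__ == "__main__":
    # Hypothetical setups: nominal 7B and 72B models on 8 and 64 devices.
    for params, devices in [(7e9, 8), (72e9, 8), (72e9, 64)]:
        for stage in (2, 3):
            gb = model_state_gb(params, devices, stage)
            print(f"{params / 1e9:.0f}B, {devices} devices, ZeRO-{stage}: {gb:.2f} GB")
```

Under these assumptions, ZeRO-2 always keeps the full fp16 weights on every device, so a nominal 72B model needs about 144 GB for weights alone, well beyond a 32 GB card regardless of device count. A nominal 7B model on 8 devices comes to roughly 26 GB of model states under ZeRO-2 before activations, which is consistent with it not fitting comfortably in 32 GB.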

hiyouga / LLaMA-Factory

Training Qwen2-VL on mixed data hangs in a multi-node, multi-card NPU deployment #5714
