[Closed] ciroimmobile closed this issue 1 month ago
I thought the problem was dirty data in my custom dataset, but after switching the data to mllm_demo, identity, and alpaca_en_demo, and changing deepspeed to deepspeed: examples/deepspeed/ds_z2_config.json, the same error still occurs. My transformers version is 4.49.0:

UserWarning: cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at ../aten/src/ATen/native/cudnn/MHA.cpp:674.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
/home/wzp/.conda/envs/vlm-r1-2.5/lib/python3.10/site-packages/torch/autograd/graph.py:825: UserWarning: cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at ../aten/src/ATen/native/cudnn/MHA.cpp:674.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
{'loss': 99.595, 'grad_norm': nan, 'learning_rate': 3.5714285714285718e-06, 'epoch': 0.22}
  7%|██▉ | 10/135 [04:46<59:15, 28.45s/it]
Don't use SDPA attention; use eager attention instead.
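As a minimal sketch of where to change this, assuming the option names used in the LLaMA-Factory example YAML configs (that "disabled" falls back to eager attention is my assumption, please verify against your version):

  # assumed: "disabled" makes the trainer fall back to eager attention instead of SDPA
  flash_attn: disabled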
Where do I change that, please? Thanks!
It turned out to be a PyTorch 2.5.0 issue; upgrading to PyTorch 2.5.1 fixed it.
I'm also on pytorch==2.5.1, deepspeed==0.14.5, flash-attn==2.7.4, and I still get loss: 0 and grad_norm: NaN at the first step.
I'm on pytorch==2.5.1, deepspeed==0.15.4, flash_attn==2.5.0, transformers==4.49.0, cuda==11.8, and my data format is:
{ "messages": [ { "content": "<image>xxx", "role": "user" }, { "content": "xxx", "role": "assistant" } ], "images": [ "xxx" ] },
Thanks for the reply. I found the versions weren't really the problem: with ZeRO-2 I get grad_norm: NaN, but ZeRO-3 works fine. Fixed on my side.
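For reference, a minimal sketch of that change in the training YAML, assuming the standard LLaMA-Factory example configs (the z3 path mirrors the z2 path mentioned above and is an assumption):

  # switch the DeepSpeed stage from ZeRO-2 to ZeRO-3
  deepspeed: examples/deepspeed/ds_z3_config.json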
Why do I also hit this problem when running inside the llama factory docker image?
Same problem here. I'm training with qwen2vl-lora-sft.yaml and only changed the model path.
I fixed this bug. I hadn't installed flash-attention before; after compiling flash-attention inside the docker image, the same command runs fine. It's probably an SDPA bug, and it goes away once flash-attention is used.
Has this been solved yet?
It is not solved yet.
This fixed it for me: flash_attn: fa2 in the YAML.
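A minimal sketch of that line in the training YAML (assuming flash-attention is installed, as noted in the comment above):

  # use FlashAttention-2 instead of the SDPA kernel
  flash_attn: fa2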
Changing flash_attn to fa2 does not work for me. I am getting this while fine-tuning Kimi-VL via LoRA.
I have the same problem. After switching deepspeed to ZeRO-3, grad_norm is back to normal, but now the loss drops straight to 0 on the second round:
17%|███████████████████▋ | 5/30 [02:38<11:45, 28.21s/it][INFO|2025-05-06 09:11:33] llamafactory.train.callbacks:143 >> {'loss': 2316.1631, 'learning_rate': 4.7839e-05, 'epoch': 0.48, 'throughput': 690.44}
{'loss': 2316.1631, 'grad_norm': 1.0, 'learning_rate': 4.783863644106502e-05, 'epoch': 0.48, 'num_input_tokens_seen': 110848}
33%|███████████████████████████████████████ | 10/30 [04:37<07:48, 23.40s/it][INFO|2025-05-06 09:13:32] llamafactory.train.callbacks:143 >> {'loss': 0.0000, 'learning_rate': 3.9695e-05, 'epoch': 0.95, 'throughput': 775.03}
{'loss': 0.0, 'grad_norm': 1.0, 'learning_rate': 3.969463130731183e-05, 'epoch': 0.95, 'num_input_tokens_seen': 216672}
Reminder
System Info
UserWarning: cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at ../aten/src/ATen/native/cudnn/MHA.cpp:674.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  0%|▏ | 1/249 [00:26<1:47:45, 26.07s/it]
{'loss': 60.5735, 'grad_norm': nan, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.12}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.6000000000000003e-05, 'epoch': 0.24}
  8%|███ | 21/249 [07:23<1:17:48, 20.48s/it]
Reproduction
Others
No response