hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
https://huggingface.co/papers/2403.13372
Apache License 2.0

qwen2.5-3b fine-tuning: loss is 0 and grad_norm is NaN #7388

Closed. ciroimmobile closed this issue 1 month ago.

ciroimmobile commented 1 month ago

Reminder

System Info

UserWarning: cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at ../aten/src/ATen/native/cudnn/MHA.cpp:674.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
0%|▏ | 1/249 [00:26<1:47:45, 26.07s/it]
{'loss': 60.5735, 'grad_norm': nan, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.12}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.6000000000000003e-05, 'epoch': 0.24}
8%|███ | 21/249 [07:23<1:17:48, 20.48s/it]

Reproduction

Here are the parameters I used:
### model
model_name_or_path: /home/wzp/disk1/GEO_VOT/My_Model_Directory/models--Qwen--Qwen2.5-VL-3B-Instruct/snapshots/1b989f2c63999d7344135894d3cfa8f494116743
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
freeze_vision_tower: true  # choices: [true, false]
freeze_multi_modal_projector: true  # choices: [true, false]
freeze_language_model: false  # choices: [true, false]
deepspeed: examples/deepspeed/ds_z3_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset: city_street_view_extended_grpo_city_only_option_2k
template: qwen2_vl
cutoff_len: 1024
max_samples: 50000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: saves/qwen2_5_vl-3b/full/sft
logging_steps: 10
save_steps: 100
plot_loss: true
overwrite_output_dir: true
save_only_model: false

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
learning_rate: 2.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

Others

No response

ciroimmobile commented 1 month ago

I thought the problem was dirty data in my custom dataset, but even after switching the data to mllm_demo, identity, and alpaca_en_demo and changing deepspeed to examples/deepspeed/ds_z2_config.json, the same error still appears. My transformers version is 4.49.0:
UserWarning: cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at ../aten/src/ATen/native/cudnn/MHA.cpp:674.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
/home/wzp/.conda/envs/vlm-r1-2.5/lib/python3.10/site-packages/torch/autograd/graph.py:825: UserWarning: cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at ../aten/src/ATen/native/cudnn/MHA.cpp:674.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
{'loss': 99.595, 'grad_norm': nan, 'learning_rate': 3.5714285714285718e-06, 'epoch': 0.22}
7%|██▉ | 10/135 [04:46<59:15, 28.45s/it]

nioxinjiang3 commented 1 month ago

> I thought the problem was dirty data in my custom dataset, but even after switching the data to mllm_demo, identity, and alpaca_en_demo and changing deepspeed to examples/deepspeed/ds_z2_config.json, the same error still appears. My transformers version is 4.49.0. [...]

Don't use SDPA attention; switching to eager attention works.
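
If I'm reading the LLaMA-Factory options correctly, the attention backend is selected through the flash_attn key in the training YAML rather than in code; setting it to disabled should fall back to eager attention. This is a minimal sketch, the option names are taken from the repo's example configs and may differ across versions:

### model
# flash_attn selects the attention backend; "disabled" is assumed here to fall back to
# eager attention, while "sdpa" is the PyTorch SDPA kernel that triggers the NaN above
flash_attn: disabled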

ciroimmobile commented 1 month ago

> I thought the problem was dirty data in my custom dataset, but even after switching the data to mllm_demo, identity, and alpaca_en_demo and changing deepspeed to examples/deepspeed/ds_z2_config.json, the same error still appears. My transformers version is 4.49.0. [...]
>
> Don't use SDPA attention; switching to eager attention works.

Where should I change that? Thanks!

ciroimmobile commented 1 month ago

It turned out to be a PyTorch 2.5.0 issue; using PyTorch 2.5.1 fixes it.

Wang-Xiaodong1899 commented 1 month ago

I'm also using pytorch==2.5.1, deepspeed==0.14.5, flash-attn==2.7.4, and I still hit the same issue from the first step: loss: 0, grad_norm: NaN.

ciroimmobile commented 1 month ago

> I'm also using pytorch==2.5.1, deepspeed==0.14.5, flash-attn==2.7.4, and I still hit the same issue from the first step: loss: 0, grad_norm: NaN.

I'm on pytorch==2.5.1, deepspeed==0.15.4, flash_attn==2.5.0, transformers==4.49.0, cuda==11.8, and my data format is:
{
  "messages": [
    { "content": "<image>xxx", "role": "user" },
    { "content": "xxx", "role": "assistant" }
  ],
  "images": [ "xxx" ]
}

Wang-Xiaodong1899 commented 1 month ago

Thanks for the reply. I found that the versions don't seem to matter much: with ZeRO-2 I get grad_norm: NaN, but ZeRO-3 works fine, so it's fixed on my end.
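
For anyone following along, the change amounts to pointing the deepspeed option at the ZeRO-3 config that ships with the repo; a minimal sketch reusing the paths already shown in the config at the top of this issue:

### method
# deepspeed: examples/deepspeed/ds_z2_config.json  # ZeRO-2: produced grad_norm: nan in this thread
deepspeed: examples/deepspeed/ds_z3_config.json    # ZeRO-3: reported to train normally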

zkx06111 commented 1 month ago

Why do I also get this problem when running inside the LLaMA-Factory docker container?

weiaicunzai commented 1 month ago

> Why do I also get this problem when running inside the LLaMA-Factory docker container?

Same problem here. I'm training with qwen2vl-lora-sft.yaml and only changed the model path.

weiaicunzai commented 1 month ago

> Why do I also get this problem when running inside the LLaMA-Factory docker container?

I fixed this bug. I hadn't installed flash-attention before; after building flash-attention inside the docker image, the same command runs fine. It's probably an SDPA bug, and with flash-attention installed it goes away.

StackChan commented 1 month ago

> Why do I also get this problem when running inside the LLaMA-Factory docker container?

Has this been solved yet?

Wintoplay commented 3 weeks ago

> Why do I also get this problem when running inside the LLaMA-Factory docker container?
>
> Has this been solved yet?

It is not solved yet.

Wintoplay commented 3 weeks ago

This fixed it for me: flash_attn: fa2 in the YAML.
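
For context, a sketch of where this option would sit in the config from the first post; the placement under ### model follows the repo's example YAMLs and should be double-checked against your LLaMA-Factory version:

### model
model_name_or_path: /path/to/Qwen2.5-VL-3B-Instruct  # placeholder path
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true
# switch from the default SDPA backend (which produced the NaN grad_norm above)
# to FlashAttention-2; this requires flash-attn to be installed in the environment
flash_attn: fa2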

Himanshunitrr commented 2 weeks ago

Changing flash_attn to fa2 does not work for me. I am getting this while fine-tuning Kimi-VL with LoRA.

fengzengfly commented 5 days ago

> Thanks for the reply. I found that the versions don't seem to matter much: with ZeRO-2 I get grad_norm: NaN, but ZeRO-3 works fine, so it's fixed on my end.

I have the same problem. After switching DeepSpeed to ZeRO-3 the grad_norm is back to normal, but now the loss drops straight to 0 at the second logging step:
17%|███████████████████▋ | 5/30 [02:38<11:45, 28.21s/it][INFO|2025-05-06 09:11:33] llamafactory.train.callbacks:143 >> {'loss': 2316.1631, 'learning_rate': 4.7839e-05, 'epoch': 0.48, 'throughput': 690.44}
{'loss': 2316.1631, 'grad_norm': 1.0, 'learning_rate': 4.783863644106502e-05, 'epoch': 0.48, 'num_input_tokens_seen': 110848}
33%|███████████████████████████████████████ | 10/30 [04:37<07:48, 23.40s/it][INFO|2025-05-06 09:13:32] llamafactory.train.callbacks:143 >> {'loss': 0.0000, 'learning_rate': 3.9695e-05, 'epoch': 0.95, 'throughput': 775.03}
{'loss': 0.0, 'grad_norm': 1.0, 'learning_rate': 3.969463130731183e-05, 'epoch': 0.95, 'num_input_tokens_seen': 216672}