hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

When LoRA fine-tuning Qwen2-VL-2B, the loss stays at 0 and grad_norm is nan #6092

Open Tian-ye1214 opened 4 days ago

Tian-ye1214 commented 4 days ago

Reminder

System Info

Reproduction

[INFO|2024-11-20 17:36:10] modeling_utils.py:3934 >> loading weights file C:\Users\PC\.cache\huggingface\hub\models--Qwen--Qwen2-VL-2B-Instruct\snapshots\aca78372505e6cb469c4fa6a35c60265b00ff5a4\model.safetensors.index.json

[INFO|2024-11-20 17:36:10] modeling_utils.py:1670 >> Instantiating Qwen2VLForConditionalGeneration model under default dtype torch.bfloat16.

[INFO|2024-11-20 17:36:10] configuration_utils.py:1096 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151645 }

[INFO|2024-11-20 17:36:10] modeling_utils.py:1670 >> Instantiating Qwen2VisionTransformerPretrainedModel model under default dtype torch.bfloat16.

[WARNING|2024-11-20 17:36:10] logging.py:168 >> Qwen2VLRotaryEmbedding can now be fully parameterized by passing the model config through the config argument. All other arguments will be removed in v4.46

[INFO|2024-11-20 17:36:14] modeling_utils.py:4800 >> All model checkpoint weights were used when initializing Qwen2VLForConditionalGeneration.

[INFO|2024-11-20 17:36:14] modeling_utils.py:4808 >> All the weights of Qwen2VLForConditionalGeneration were initialized from the model checkpoint at C:\Users\PC\.cache\huggingface\hub\models--Qwen--Qwen2-VL-2B-Instruct\snapshots\aca78372505e6cb469c4fa6a35c60265b00ff5a4. If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2VLForConditionalGeneration for predictions without further training.

[INFO|2024-11-20 17:36:14] configuration_utils.py:1049 >> loading configuration file C:\Users\PC\.cache\huggingface\hub\models--Qwen--Qwen2-VL-2B-Instruct\snapshots\aca78372505e6cb469c4fa6a35c60265b00ff5a4\generation_config.json

[INFO|2024-11-20 17:36:14] configuration_utils.py:1096 >> Generate config GenerationConfig { "bos_token_id": 151643, "do_sample": true, "eos_token_id": [ 151645, 151643 ], "pad_token_id": 151643, "temperature": 0.01, "top_k": 1, "top_p": 0.001 }

[INFO|2024-11-20 17:36:14] logging.py:157 >> Gradient checkpointing enabled.

[INFO|2024-11-20 17:36:14] logging.py:157 >> Using FlashAttention-2 for faster training and inference.

[INFO|2024-11-20 17:36:14] logging.py:157 >> Upcasting trainable params to float32.

[INFO|2024-11-20 17:36:14] logging.py:157 >> Fine-tuning method: LoRA

[INFO|2024-11-20 17:36:14] logging.py:157 >> Found linear modules: v_proj,k_proj,q_proj,o_proj,gate_proj,up_proj,down_proj

[INFO|2024-11-20 17:36:14] logging.py:157 >> trainable params: 9,232,384 || all params: 2,218,217,984 || trainable%: 0.4162

[INFO|2024-11-20 17:36:14] trainer.py:698 >> Using auto half precision backend

[INFO|2024-11-20 17:36:14] trainer.py:2313 >> Running training

[INFO|2024-11-20 17:36:14] trainer.py:2314 >> Num examples = 15

[INFO|2024-11-20 17:36:14] trainer.py:2315 >> Num Epochs = 100

[INFO|2024-11-20 17:36:14] trainer.py:2316 >> Instantaneous batch size per device = 2

[INFO|2024-11-20 17:36:14] trainer.py:2319 >> Total train batch size (w. parallel, distributed & accumulation) = 16

[INFO|2024-11-20 17:36:14] trainer.py:2320 >> Gradient Accumulation steps = 8

[INFO|2024-11-20 17:36:14] trainer.py:2321 >> Total optimization steps = 100

[INFO|2024-11-20 17:36:14] trainer.py:2322 >> Number of trainable parameters = 9,232,384

[INFO|2024-11-20 17:36:38] logging.py:157 >> {'loss': 0.0000, 'learning_rate': 4.9692e-05, 'epoch': 5.00}

[INFO|2024-11-20 17:37:02] logging.py:157 >> {'loss': 0.0000, 'learning_rate': 4.8776e-05, 'epoch': 10.00}

[INFO|2024-11-20 17:37:27] logging.py:157 >> {'loss': 0.0000, 'learning_rate': 4.7275e-05, 'epoch': 15.00}
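
As a sanity check, the step and parameter counts in the log are internally consistent. The short Python sketch below only reproduces that arithmetic from the values reported above (the clamp to one optimizer step per epoch roughly mirrors how the Hugging Face Trainer derives its step count); it is not part of the original run.

```python
# Sanity check of the figures in the training log above; every constant is
# copied from the log, nothing here is newly measured.
import math

trainable, total = 9_232_384, 2_218_217_984
print(f"trainable%: {100 * trainable / total:.4f}")             # 0.4162

num_examples, per_device_bs, grad_accum, epochs = 15, 2, 8, 100
effective_bs = per_device_bs * grad_accum                       # 16 (single device)
batches_per_epoch = math.ceil(num_examples / per_device_bs)     # 8 mini-batches
steps_per_epoch = max(batches_per_epoch // grad_accum, 1)       # 1 optimizer step per epoch
print(f"total optimization steps: {epochs * steps_per_epoch}")  # 100
```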

Expected behavior

No response

Others

The training command was: llamafactory-cli train --stage sft --do_train True --model_name_or_path C:\Users\PC\.cache\huggingface\hub\models--Qwen--Qwen2-VL-2B-Instruct\snapshots\aca78372505e6cb469c4fa6a35c60265b00ff5a4 --preprocessing_num_workers 16 --finetuning_type lora --template qwen2_vl --flash_attn fa2 --dataset_dir data --dataset mllm_demo --cutoff_len 2048 --learning_rate 5e-05 --num_train_epochs 100.0 --max_samples 100000 --per_device_train_batch_size 2 --gradient_accumulation_steps 8 --lr_scheduler_type cosine --max_grad_norm 1.0 --logging_steps 5 --save_steps 100 --warmup_steps 0 --packing False --report_to none --output_dir saves\Qwen2-VL-2B-Instruct\lora\train_2024-11-20-17-41-13 --bf16 True --plot_loss True --ddp_timeout 180000000 --optim adamw_torch --lora_rank 8 --lora_alpha 16 --lora_dropout 0 --lora_target all
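
For readability, the same hyperparameters can also be expressed as a YAML config, assuming the installed LLaMA-Factory version accepts `llamafactory-cli train <config>.yaml` (recent releases do). The sketch below only writes that file; the keys mirror the CLI flags above, and the file name is arbitrary.

```python
# Write the CLI flags above into a YAML config file for llamafactory-cli.
# Assumption: the installed LLaMA-Factory accepts a YAML argument file.
import yaml  # pip install pyyaml

config = {
    "stage": "sft",
    "do_train": True,
    "model_name_or_path": r"C:\Users\PC\.cache\huggingface\hub\models--Qwen--Qwen2-VL-2B-Instruct\snapshots\aca78372505e6cb469c4fa6a35c60265b00ff5a4",
    "preprocessing_num_workers": 16,
    "finetuning_type": "lora",
    "template": "qwen2_vl",
    "flash_attn": "fa2",
    "dataset_dir": "data",
    "dataset": "mllm_demo",
    "cutoff_len": 2048,
    "learning_rate": 5.0e-5,
    "num_train_epochs": 100.0,
    "max_samples": 100000,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "lr_scheduler_type": "cosine",
    "max_grad_norm": 1.0,
    "logging_steps": 5,
    "save_steps": 100,
    "warmup_steps": 0,
    "packing": False,
    "report_to": "none",
    "output_dir": r"saves\Qwen2-VL-2B-Instruct\lora\train_2024-11-20-17-41-13",
    "bf16": True,
    "plot_loss": True,
    "ddp_timeout": 180000000,
    "optim": "adamw_torch",
    "lora_rank": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.0,
    "lora_target": "all",
}

with open("qwen2vl_lora_sft.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
# Then: llamafactory-cli train qwen2vl_lora_sft.yaml
```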

Tian-ye1214 commented 4 days ago

Some additional information: I ran the same experiment in a Linux environment with identical software dependencies and configuration. There it ran successfully and the loss decreased steadily.
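
Since the only reported difference is Windows versus Linux with otherwise identical dependencies, a quick environment probe on both machines may help narrow things down. The sketch below only checks the two backends the log says are active (bf16 and FlashAttention-2) and prints version info; it is a diagnostic aid, not a fix.

```python
# Minimal environment probe to compare the Windows and Linux machines.
# Checks only the backends the training log reports as in use.
import importlib.util
import torch
import transformers

print("torch:", torch.__version__, "| transformers:", transformers.__version__)
print("CUDA build:", torch.version.cuda, "| device:", torch.cuda.get_device_name(0))
print("bf16 supported:", torch.cuda.is_bf16_supported())
print("flash_attn installed:", importlib.util.find_spec("flash_attn") is not None)
```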