modelscope / ms-swift

Use PEFT or Full-parameter to finetune 300+ LLMs or 80+ MLLMs. (Qwen2, GLM4v, Internlm2.5, Yi, Llama3.1, Llava-Video, Internvl2, MiniCPM-V-2.6, Deepseek, Baichuan2, Gemma2, Phi3-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

Data error during DPO training with internlm-xcomposer2_5-7b-chat #1831

Closed RBBB2010 closed 2 weeks ago

RBBB2010 commented 2 weeks ago

First of all, thank you very much for the convenience SWIFT provides!

I am running DPO training on a model that I LoRA SFT-ed and merged myself. I built my own dataset strictly following the RLHF data format, but collate_fn raises the following error:

Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 86/86 [00:20<00:00, 4.28 examples/s]
Train:   0%|          | 0/6 [00:00<?, ?it/s]

[rank1]: Original Traceback (most recent call last):
[rank1]:   File "/home/star/miniconda3/envs/dpo/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank1]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank1]:   File "/home/star/miniconda3/envs/dpo/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
[rank1]:     return self.collate_fn(data)
[rank1]:   File "/home/star/swift/swift/trainers/utils.py", line 208, in new_call
[rank1]:     to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
[rank1]:   File "/home/star/swift/swift/trainers/utils.py", line 208, in <listcomp>
[rank1]:     to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
[rank1]: TypeError: an integer is required (got type NoneType)

I checked my own data for problems, but the error persisted. I then switched to a public dataset (--dataset rlaif-v#1000), and collate_fn still raises the same error:

[rank1]:     return self.collate_fn(data)
[rank1]:   File "/home/star/swift/swift/trainers/utils.py", line 208, in new_call
[rank1]:     to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
[rank1]:   File "/home/star/swift/swift/trainers/utils.py", line 208, in <listcomp>
[rank1]:     to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
[rank1]: TypeError: an integer is required (got type NoneType)

Looking more closely, when k = prompt_labels, ex[k] turns out to be NoneType. What could be causing this?
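For reference, the failure can be reproduced in isolation: torch.tensor cannot build a tensor from None, so any example whose prompt_labels value is missing breaks this padding step. Below is a minimal sketch; the feature dicts are made up for illustration, and this is not ms-swift's actual collator or the fix that was later merged.

import torch

# One example carries prompt_labels, the other has None for that key
# (hypothetical data, mimicking what the collate_fn receives).
features = [{"prompt_labels": [1, 2, 3]}, {"prompt_labels": None}]

try:
    # Same pattern as swift/trainers/utils.py line 208 in the traceback above.
    to_pad = [torch.tensor(ex["prompt_labels"], dtype=torch.int64) for ex in features]
except (TypeError, RuntimeError) as err:
    # On the reporter's PyTorch this surfaces as
    # "TypeError: an integer is required (got type NoneType)";
    # the exact exception and message depend on the PyTorch version.
    print(type(err).__name__, err)

# One defensive pattern is to skip examples whose value is missing before padding:
to_pad = [
    torch.tensor(ex["prompt_labels"], dtype=torch.int64)
    for ex in features
    if ex.get("prompt_labels") is not None
]
print([t.shape for t in to_pad])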

My shell script is as follows:

swift rlhf \
    --rlhf_type dpo \
    --model_type internlm-xcomposer2_5-7b-chat \
    --model_id_or_path /InternLM_Xcomposer2d5/output/internlm-xcomposer2_5-7b-chat/v1/checkpoint-638-merged \
    --ref_model_id_or_path /InternLM_Xcomposer2d5/output/internlm-xcomposer2_5-7b-chat/v1/checkpoint-638-merged \
    --dataset rlaif-v#90 \
    --dtype bf16 \
    --beta 0.1 \
    --sft_type lora \
    --init_lora_weights 'pissa' \
    --use_flash_attn true \
    --num_train_epochs 4 \
    --gradient_checkpointing true \
    --batch_size 2 \

hjh0119 commented 2 weeks ago

fixed in https://github.com/modelscope/ms-swift/pull/1838

RBBB2010 commented 2 weeks ago

Hi, thank you for the help with the data error. I have now run into a new problem during DPO fine-tuning. Using the script below on 2 * A100 80GB, training OOMs as soon as fine-tuning starts. Following the official docs I tried the device map approach (removing NPROC_PER_NODE) and also tried DeepSpeed, but both run out of GPU memory. Is there any way to solve this?

CUDA_VISIBLE_DEVICES=0,1 \
MASTER_PORT=29500 \
swift rlhf \
    --rlhf_type dpo \
    --model_type internlm-xcomposer2_5-7b-chat \
    --model_id_or_path /InternLM_Xcomposer2d5/output/swift_finetune_lora_2/internlm-xcomposer2_5-7b-chat/v8/checkpoint-638-merged \
    --ref_model_id_or_path /InternLM_Xcomposer2d5/output/swift_finetune_lora_2/internlm-xcomposer2_5-7b-chat/v8/checkpoint-638-merged \
    --output_dir /InternLM_Xcomposer2d5/output/swift_finetune_lora_2/dpo \
    --dataset /swift/data/dpo_demo.json \
    --dtype bf16 \
    --beta 0.1 \
    --sft_beta 0.1 \
    --sft_type lora \
    --init_lora_weights 'pissa' \
    --lora_rank 128 \
    --lora_alpha 256 \
    --lora_dropout_p 0.1 \
    --lora_target_modules DEFAULT \
    --use_flash_attn true \
    --num_train_epochs 3 \
    --gradient_checkpointing true \
    --batch_size 1 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 16 \
    --warmup_ratio 0.01 \
    --save_total_limit 20 \
    --max_length 10240 \
    --save_steps 20 \
    --eval_steps 20 \
    --model_kwargs '{"hd_num": 16}' \

tastelikefeet commented 2 weeks ago

Try lowering --model_kwargs '{"hd_num": 16}'.

RBBB2010 commented 2 weeks ago

Yes, lowering hd_num does solve the problem, but my image data is fairly high-resolution, so lowering hd_num may hurt training quality. Is there any other option? The actual GPU memory usage is indeed somewhat higher than what I had calculated...
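For a rough sense of the baseline footprint (a back-of-the-envelope sketch with assumed round numbers, not ms-swift's actual accounting): DPO keeps both the policy and a frozen reference model resident, so two ~7B-parameter models in bf16 already cost roughly 26 GiB in weights alone, before activations, LoRA optimizer states, and the extra image tokens implied by hd_num 16 and --max_length 10240.

# Back-of-the-envelope GPU memory estimate for DPO with a ~7B model in bf16.
# All numbers are rough assumptions for illustration, not measurements.

def gib(n_bytes: float) -> float:
    return n_bytes / 1024**3

params = 7e9           # ~7B parameters (internlm-xcomposer2_5-7b-chat)
bytes_per_param = 2    # bf16

policy_weights = params * bytes_per_param
reference_weights = params * bytes_per_param  # DPO also loads a frozen ref model

print(f"policy weights:    {gib(policy_weights):.1f} GiB")
print(f"reference weights: {gib(reference_weights):.1f} GiB")
print(f"weights total:     {gib(policy_weights + reference_weights):.1f} GiB")

# Activations come on top and grow with sequence length; with --max_length 10240
# and hd_num 16 (many image tokens per sample) they can dominate, even with
# gradient checkpointing enabled.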

RBBB2010 commented 2 weeks ago

Hi, after pulling the latest branch with the update, DPO training errors out once it reaches a certain point. What could be causing this?

Train:   6%|█████████▋ | 40/618 [44:27<10:42:31, 66.70s/it]
  File "/home/star/miniconda3/envs/zailiu_dpo/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
    size mismatch for base_model.model.model.layers.0.attention.wqkv.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
    size mismatch for base_model.model.model.layers.0.attention.wqkv.lora_B.initial_model.weight: copying a param with shape torch.Size([6144, 128]) from checkpoint, the shape in current model is torch.Size([6144, 256]).
    size mismatch for base_model.model.model.layers.0.attention.wo.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
    size mismatch for base_model.model.model.layers.0.attention.wo.lora_B.initial_model.weight: copying a param with shape torch.Size([4096, 128]) from checkpoint, the shape in current model is torch.Size([4096, 256]).
    size mismatch for base_model.model.model.layers.0.feed_forward.w1.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
    size mismatch for base_model.model.model.layers.0.feed_forward.w1.lora_B.initial_model.weight: copying a param with shape torch.Size([14336, 128]) from checkpoint, the shape in current model is torch.Size([14336, 256]).
    size mismatch for base_model.model.model.layers.0.feed_forward.w3.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
    size mismatch for base_model.model.model.layers.0.feed_forward.w3.lora_B.initial_model.weight: copying a param with shape torch.Size([14336, 128]) from checkpoint, the shape in current model is torch.Size([14336, 256]).
    size mismatch for base_model.model.model.layers.0.feed_forward.w2.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 14336]) from checkpoint, the shape in current model is torch.Size([256, 14336]).
    size mismatch for base_model.model.model.layers.0.feed_forward.w2.lora_B.initial_model.weight: copying a param with shape torch.Size([4096, 128]) from checkpoint, the shape in current model is torch.Size([4096, 256]).
    size mismatch for base_model.model.model.layers.1.attention.wqkv.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
    size mismatch for base_model.model.model.layers.1.attention.wqkv.lora_B.initial_model.weight: copying a param with shape torch.Size([6144, 128]) from checkpoint, the shape in current model is torch.Size([6144, 256]).

The rest continues in the same pattern for a long time, all the way to:

    size mismatch for base_model.model.model.layers.31.attention.wqkv.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
    size mismatch for base_model.model.model.layers.31.attention.wqkv.lora_B.initial_model.weight: copying a param with shape torch.Size([6144, 128]) from checkpoint, the shape in current model is torch.Size([6144, 256]).
    size mismatch for base_model.model.model.layers.31.attention.wo.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
    size mismatch for base_model.model.model.layers.31.attention.wo.lora_B.initial_model.weight: copying a param with shape torch.Size([4096, 128]) from checkpoint, the shape in current model is torch.Size([4096, 256]).
    size mismatch for base_model.model.model.layers.31.feed_forward.w1.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
    size mismatch for base_model.model.model.layers.31.feed_forward.w1.lora_B.initial_model.weight: copying a param with shape torch.Size([14336, 128]) from checkpoint, the shape in current model is torch.Size([14336, 256]).
    size mismatch for base_model.model.model.layers.31.feed_forward.w3.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
    size mismatch for base_model.model.model.layers.31.feed_forward.w3.lora_B.initial_model.weight: copying a param with shape torch.Size([14336, 128]) from checkpoint, the shape in current model is torch.Size([14336, 256]).
    size mismatch for base_model.model.model.layers.31.feed_forward.w2.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 14336]) from checkpoint, the shape in current model is torch.Size([256, 14336]).
    size mismatch for base_model.model.model.layers.31.feed_forward.w2.lora_B.initial_model.weight: copying a param with shape torch.Size([4096, 128]) from checkpoint, the shape in current model is torch.Size([4096, 256]).
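For context, a minimal illustration (generic LoRA shape arithmetic, not ms-swift or PEFT internals) of what these shapes imply: a LoRA adapter's A matrix has shape (rank, in_features) and its B matrix has shape (out_features, rank), so torch.Size([128, 4096]) in the checkpoint versus torch.Size([256, 4096]) in the current model is consistent with the saved adapter having rank 128 while the freshly built model expects rank 256.

import torch

def lora_shapes(in_features: int, out_features: int, rank: int):
    # LoRA factorizes the weight update as B @ A:
    # A: (rank, in_features), B: (out_features, rank)
    lora_A = torch.zeros(rank, in_features)
    lora_B = torch.zeros(out_features, rank)
    return lora_A.shape, lora_B.shape

# wqkv layer (in_features 4096, out_features 6144), as in the error above.
print(lora_shapes(4096, 6144, 128))  # checkpoint:    (128, 4096) / (6144, 128)
print(lora_shapes(4096, 6144, 256))  # current model: (256, 4096) / (6144, 256)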

RBBB2010 commented 2 weeks ago

After rolling back to the previous version and manually applying the change from #1838, training works normally.