modelscope / ms-swift

Use PEFT or Full-parameter to finetune 300+ LLMs or 80+ MLLMs. (Qwen2, GLM4v, Internlm2.5, Yi, Llama3.1, Llava-Video, Internvl2, MiniCPM-V-2.6, Deepseek, Baichuan2, Gemma2, Phi3-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

Data error during DPO training with internlm-xcomposer2_5-7b-chat #1831

Closed RBBB2010 closed 2 weeks ago

RBBB2010 commented 2 weeks ago

First of all, thank you very much for the convenience SWIFT provides!

I am running DPO training on a model that I LoRA SFT-ed and merged myself. I built my own dataset strictly following the RLHF data format, but collate_fn raises the following error:

Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 86/86 [00:20<00:00, 4.28 examples/s]
Train:   0%|          | 0/6 [00:00<?, ?it/s]

[rank1]: Original Traceback (most recent call last):
[rank1]:   File "/home/star/miniconda3/envs/dpo/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank1]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank1]:   File "/home/star/miniconda3/envs/dpo/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
[rank1]:     return self.collate_fn(data)
[rank1]:   File "/home/star/swift/swift/trainers/utils.py", line 208, in new_call
[rank1]:     to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
[rank1]:   File "/home/star/swift/swift/trainers/utils.py", line 208, in <listcomp>
[rank1]:     to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
[rank1]: TypeError: an integer is required (got type NoneType)

I checked my own data for problems, but the error persisted. I then switched to a public dataset (--dataset rlaif-v#1000), and collate_fn still raises the same error:

[rank1]:     return self.collate_fn(data)
[rank1]:   File "/home/star/swift/swift/trainers/utils.py", line 208, in new_call
[rank1]:     to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
[rank1]:   File "/home/star/swift/swift/trainers/utils.py", line 208, in <listcomp>
[rank1]:     to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
[rank1]: TypeError: an integer is required (got type NoneType)

Looking more closely, when k = prompt_labels, ex[k] turns out to be NoneType. What could be causing this?
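For reference, the failure can be reproduced in isolation: torch.tensor cannot build a tensor from None, so any example whose prompt_labels value is missing breaks this padding step. Below is a minimal sketch; the feature dicts are made up for illustration, and this is not ms-swift's actual collator or the fix that was later merged.

import torch

# One example carries prompt_labels, the other has None for that key
# (hypothetical data, mimicking what the collate_fn receives).
features = [{"prompt_labels": [1, 2, 3]}, {"prompt_labels": None}]

try:
    # Same pattern as swift/trainers/utils.py line 208 in the traceback above.
    to_pad = [torch.tensor(ex["prompt_labels"], dtype=torch.int64) for ex in features]
except (TypeError, RuntimeError) as err:
    # On the reporter's PyTorch this surfaces as
    # "TypeError: an integer is required (got type NoneType)";
    # the exact exception and message depend on the PyTorch version.
    print(type(err).__name__, err)

# One defensive pattern is to skip examples whose value is missing before padding:
to_pad = [
    torch.tensor(ex["prompt_labels"], dtype=torch.int64)
    for ex in features
    if ex.get("prompt_labels") is not None
]
print([t.shape for t in to_pad])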

My shell script is as follows:

swift rlhf \
    --rlhf_type dpo \
    --model_type internlm-xcomposer2_5-7b-chat \
    --model_id_or_path /InternLM_Xcomposer2d5/output/internlm-xcomposer2_5-7b-chat/v1/checkpoint-638-merged \
    --ref_model_id_or_path /InternLM_Xcomposer2d5/output/internlm-xcomposer2_5-7b-chat/v1/checkpoint-638-merged \
    --dataset rlaif-v#90 \
    --dtype bf16 \
    --beta 0.1 \
    --sft_type lora \
    --init_lora_weights 'pissa' \
    --use_flash_attn true \
    --num_train_epochs 4 \
    --gradient_checkpointing true \
    --batch_size 2 \

hjh0119 commented 2 weeks ago

fixed in https://github.com/modelscope/ms-swift/pull/1838

RBBB2010 commented 2 weeks ago

Hi, thank you for the help with the data error. I have now run into a new problem during DPO fine-tuning. Using the script below on 2 * A100 80GB, training OOMs as soon as fine-tuning starts. Following the official docs I tried the device map approach (removing NPROC_PER_NODE) and also tried DeepSpeed, but both run out of GPU memory. Is there any way to solve this?

CUDA_VISIBLE_DEVICES=0,1 \
MASTER_PORT=29500 \
swift rlhf \
    --rlhf_type dpo \
    --model_type internlm-xcomposer2_5-7b-chat \
    --model_id_or_path /InternLM_Xcomposer2d5/output/swift_finetune_lora_2/internlm-xcomposer2_5-7b-chat/v8/checkpoint-638-merged \
    --ref_model_id_or_path /InternLM_Xcomposer2d5/output/swift_finetune_lora_2/internlm-xcomposer2_5-7b-chat/v8/checkpoint-638-merged \
    --output_dir /InternLM_Xcomposer2d5/output/swift_finetune_lora_2/dpo \
    --dataset /swift/data/dpo_demo.json \
    --dtype bf16 \
    --beta 0.1 \
    --sft_beta 0.1 \
    --sft_type lora \
    --init_lora_weights 'pissa' \
    --lora_rank 128 \
    --lora_alpha 256 \
    --lora_dropout_p 0.1 \
    --lora_target_modules DEFAULT \
    --use_flash_attn true \
    --num_train_epochs 3 \
    --gradient_checkpointing true \
    --batch_size 1 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 16 \
    --warmup_ratio 0.01 \
    --save_total_limit 20 \
    --max_length 10240 \
    --save_steps 20 \
    --eval_steps 20 \
    --model_kwargs '{"hd_num": 16}' \

tastelikefeet commented 2 weeks ago

Try lowering --model_kwargs '{"hd_num": 16}'.

RBBB2010 commented 2 weeks ago

Yes, lowering hd_num does solve the problem, but my image data is fairly high-resolution, so lowering hd_num may hurt training quality. Is there any other option? The actual GPU memory usage is indeed somewhat higher than what I had calculated...
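For a rough sense of the baseline footprint (a back-of-the-envelope sketch with assumed round numbers, not ms-swift's actual accounting): DPO keeps both the policy and a frozen reference model resident, so two ~7B-parameter models in bf16 already cost roughly 26 GiB in weights alone, before activations, LoRA optimizer states, and the extra image tokens implied by hd_num 16 and --max_length 10240.

# Back-of-the-envelope GPU memory estimate for DPO with a ~7B model in bf16.
# All numbers are rough assumptions for illustration, not measurements.

def gib(n_bytes: float) -> float:
    return n_bytes / 1024**3

params = 7e9           # ~7B parameters (internlm-xcomposer2_5-7b-chat)
bytes_per_param = 2    # bf16

policy_weights = params * bytes_per_param
reference_weights = params * bytes_per_param  # DPO also loads a frozen ref model

print(f"policy weights:    {gib(policy_weights):.1f} GiB")
print(f"reference weights: {gib(reference_weights):.1f} GiB")
print(f"weights total:     {gib(policy_weights + reference_weights):.1f} GiB")

# Activations come on top and grow with sequence length; with --max_length 10240
# and hd_num 16 (many image tokens per sample) they can dominate, even with
# gradient checkpointing enabled.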

RBBB2010 commented 2 weeks ago

Hi, after pulling the latest branch with the update, DPO training errors out once it reaches a certain point. What could be causing this?

Train:   6%|█████████▋ | 40/618 [44:27<10:42:31, 66.70s/it]
  File "/home/star/miniconda3/envs/zailiu_dpo/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
    size mismatch for base_model.model.model.layers.0.attention.wqkv.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
    size mismatch for base_model.model.model.layers.0.attention.wqkv.lora_B.initial_model.weight: copying a param with shape torch.Size([6144, 128]) from checkpoint, the shape in current model is torch.Size([6144, 256]).
    size mismatch for base_model.model.model.layers.0.attention.wo.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
    size mismatch for base_model.model.model.layers.0.attention.wo.lora_B.initial_model.weight: copying a param with shape torch.Size([4096, 128]) from checkpoint, the shape in current model is torch.Size([4096, 256]).
    size mismatch for base_model.model.model.layers.0.feed_forward.w1.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
    size mismatch for base_model.model.model.layers.0.feed_forward.w1.lora_B.initial_model.weight: copying a param with shape torch.Size([14336, 128]) from checkpoint, the shape in current model is torch.Size([14336, 256]).
    size mismatch for base_model.model.model.layers.0.feed_forward.w3.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
    size mismatch for base_model.model.model.layers.0.feed_forward.w3.lora_B.initial_model.weight: copying a param with shape torch.Size([14336, 128]) from checkpoint, the shape in current model is torch.Size([14336, 256]).
    size mismatch for base_model.model.model.layers.0.feed_forward.w2.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 14336]) from checkpoint, the shape in current model is torch.Size([256, 14336]).
    size mismatch for base_model.model.model.layers.0.feed_forward.w2.lora_B.initial_model.weight: copying a param with shape torch.Size([4096, 128]) from checkpoint, the shape in current model is torch.Size([4096, 256]).
    size mismatch for base_model.model.model.layers.1.attention.wqkv.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
    size mismatch for base_model.model.model.layers.1.attention.wqkv.lora_B.initial_model.weight: copying a param with shape torch.Size([6144, 128]) from checkpoint, the shape in current model is torch.Size([6144, 256]).

The rest continues in the same pattern for a long time, all the way to:

    size mismatch for base_model.model.model.layers.31.attention.wqkv.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
    size mismatch for base_model.model.model.layers.31.attention.wqkv.lora_B.initial_model.weight: copying a param with shape torch.Size([6144, 128]) from checkpoint, the shape in current model is torch.Size([6144, 256]).
    size mismatch for base_model.model.model.layers.31.attention.wo.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
    size mismatch for base_model.model.model.layers.31.attention.wo.lora_B.initial_model.weight: copying a param with shape torch.Size([4096, 128]) from checkpoint, the shape in current model is torch.Size([4096, 256]).
    size mismatch for base_model.model.model.layers.31.feed_forward.w1.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
    size mismatch for base_model.model.model.layers.31.feed_forward.w1.lora_B.initial_model.weight: copying a param with shape torch.Size([14336, 128]) from checkpoint, the shape in current model is torch.Size([14336, 256]).
    size mismatch for base_model.model.model.layers.31.feed_forward.w3.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
    size mismatch for base_model.model.model.layers.31.feed_forward.w3.lora_B.initial_model.weight: copying a param with shape torch.Size([14336, 128]) from checkpoint, the shape in current model is torch.Size([14336, 256]).
    size mismatch for base_model.model.model.layers.31.feed_forward.w2.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 14336]) from checkpoint, the shape in current model is torch.Size([256, 14336]).
    size mismatch for base_model.model.model.layers.31.feed_forward.w2.lora_B.initial_model.weight: copying a param with shape torch.Size([4096, 128]) from checkpoint, the shape in current model is torch.Size([4096, 256]).
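For context, a minimal illustration (generic LoRA shape arithmetic, not ms-swift or PEFT internals) of what these shapes imply: a LoRA adapter's A matrix has shape (rank, in_features) and its B matrix has shape (out_features, rank), so torch.Size([128, 4096]) in the checkpoint versus torch.Size([256, 4096]) in the current model is consistent with the saved adapter having rank 128 while the freshly built model expects rank 256.

import torch

def lora_shapes(in_features: int, out_features: int, rank: int):
    # LoRA factorizes the weight update as B @ A:
    # A: (rank, in_features), B: (out_features, rank)
    lora_A = torch.zeros(rank, in_features)
    lora_B = torch.zeros(out_features, rank)
    return lora_A.shape, lora_B.shape

# wqkv layer (in_features 4096, out_features 6144), as in the error above.
print(lora_shapes(4096, 6144, 128))  # checkpoint:    (128, 4096) / (6144, 128)
print(lora_shapes(4096, 6144, 256))  # current model: (256, 4096) / (6144, 256)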

RBBB2010 commented 2 weeks ago

After rolling back to the previous version and manually applying the change from #1838, training works normally.