MiniCPM-V: inference results of a model trained on multiple GPUs differ from single-GPU training

Uooga commented 1 week ago

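Describe the bug: I fine-tuned a version of MiniCPM-V with LoRA using multi-GPU data parallelism. When testing it, the outputs are almost identical to those of the original model without fine-tuning. The loss decreased normally during training, yet even on the training set the outputs look as if they came from the un-fine-tuned model. Could something be wrong with my infer command? I would appreciate it if someone could take a look. P.S. A model trained on a single GPU produces the expected outputs.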
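To put a number on "almost identical", one option is to run the same prompts through the base model and the fine-tuned model and measure how many responses match exactly. The sketch below assumes the two lists of responses have already been collected in the same order; the function name `identical_fraction` and the toy strings are hypothetical and not part of swift.

```python
from typing import Sequence


def identical_fraction(base_outputs: Sequence[str],
                       tuned_outputs: Sequence[str]) -> float:
    """Fraction of prompts where both models returned the same text.

    Leading and trailing whitespace is ignored. Both sequences must be
    aligned, i.e. index i refers to the same prompt in each.
    """
    if len(base_outputs) != len(tuned_outputs):
        raise ValueError("both output lists must have the same length")
    if not base_outputs:
        return 0.0
    same = sum(a.strip() == b.strip()
               for a, b in zip(base_outputs, tuned_outputs))
    return same / len(base_outputs)


# Toy data (made up for illustration): two of three responses match.
base = ["a cat", "a dog", "a car"]
tuned = ["a cat", "a dog ", "a red car"]
fraction = identical_fraction(base, tuned)
```

A value close to 1.0 on the training set would support the suspicion that the fine-tuned weights are not taking effect at inference time.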

Training command:
nproc_per_node=8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=$nproc_per_node \
MASTER_PORT=29500 \
swift sft \
    --model_type minicpm-v-v2-chat \
    --dataset train_minicpm_v_2_0619.jsonl \
    --lora_target_modules ALL \
    --train_dataset_sample -1 \
    --num_train_epochs 8 \
    --ddp_find_unused_parameters True \

Single-GPU training command:
CUDA_VISIBLE_DEVICES=1 swift sft \
    --model_type minicpm-v-v2-chat \
    --dataset train_minicpm_v_2_0619.jsonl \
    --lora_target_modules ALL

Infer commands:
CUDA_VISIBLE_DEVICES=1 swift export \
    --ckpt_dir output/minicpm-v-v2-chat/v3-20240619-204718/checkpoint-6200/ \
    --merge_lora true
CUDA_VISIBLE_DEVICES=1 swift infer \
    --ckpt_dir output/minicpm-v-v2-chat/v3-20240619-204718/checkpoint-6200-merged \
    --load_dataset_config true \
    --val_dataset val_minicpm_v_2_0619.jsonl \
    --show_dataset_sample -1
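One way to tell whether the problem lies in the infer command or in the checkpoint itself is to inspect the saved adapter weights. In standard LoRA the B matrices start at zero, so an adapter whose `lora_B` tensors are all still zero would merge into a model identical to the base. The sketch below is only an illustration: it assumes the adapter weights have been loaded into a mapping from parameter name to a flat list of floats, and that the names contain `lora_B` (the PEFT naming convention). The helper names and the toy values are hypothetical.

```python
import math
from typing import Dict, Iterable, List


def lora_b_norms(weights: Dict[str, Iterable[float]]) -> Dict[str, float]:
    """Return the L2 norm of every tensor whose name contains 'lora_B'."""
    return {
        name: math.sqrt(sum(v * v for v in values))
        for name, values in weights.items()
        if "lora_B" in name
    }


def untrained_lora_layers(weights: Dict[str, Iterable[float]],
                          tol: float = 1e-8) -> List[str]:
    """Names of lora_B tensors whose norm is (numerically) zero."""
    norms = lora_b_norms(weights)
    return sorted(name for name, norm in norms.items() if norm <= tol)


# Toy adapter (made-up names and values): one updated layer, one not.
toy_adapter = {
    "layers.0.q_proj.lora_A.weight": [0.1, -0.2, 0.3],
    "layers.0.q_proj.lora_B.weight": [0.05, 0.0, -0.02],
    "layers.1.q_proj.lora_A.weight": [0.2, 0.1, -0.1],
    "layers.1.q_proj.lora_B.weight": [0.0, 0.0, 0.0],
}
suspicious = untrained_lora_layers(toy_adapter)
```

With a real checkpoint, the mapping could be built from the adapter file, for example by loading it with `safetensors.torch.load_file` and calling `tensor.flatten().tolist()` on each entry; the exact file name inside the checkpoint directory is not given in this thread. If every `lora_B` norm is non-zero, the adapter itself was updated and the export or infer step becomes the more likely suspect.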

tastelikefeet commented 1 week ago

fixed #1197

tastelikefeet commented 1 week ago

You will need to retrain the model.

Uooga commented 5 days ago

Hello, I retrained with the latest commit and ran into a new error.

The versions I am currently using are as follows:

Jintao-Huang commented 23 hours ago

I suspect this problem occurs whenever the vision encoder part is trained.

modelscope / swift

MiniCPM-V: inference results of a model trained on multiple GPUs differ from single-GPU training #1191