youjiaSHTU opened this issue 1 month ago
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
--model_type minicpm-v-v2_5-chat \
--dataset coco-en-2-mini \
--sft_type full \
--deepspeed default-zero2
It runs normally on my side; the log is below.
{'loss': 3.8767333, 'acc': 0.38549057, 'grad_norm': 3773.15991211, 'learning_rate': 0.0, 'memory(GiB)': 71.72, 'train_speed(iter/s)': 0.013345, 'epoch': 0.0, 'global_step': 1}
{'loss': 3.14160919, 'acc': 0.40862948, 'grad_norm': 410.01010132, 'learning_rate': 3.33e-06, 'memory(GiB)': 73.88, 'train_speed(iter/s)': 0.052893, 'epoch': 0.0, 'global_step': 5}
{'loss': 2.67884235, 'acc': 0.47376723, 'grad_norm': 166.06022644, 'learning_rate': 4.76e-06, 'memory(GiB)': 73.88, 'train_speed(iter/s)': 0.085535, 'epoch': 0.0, 'global_step': 10}
{'loss': 2.53729782, 'acc': 0.49073067, 'grad_norm': 75.64689636, 'learning_rate': 5.6e-06, 'memory(GiB)': 73.88, 'train_speed(iter/s)': 0.106337, 'epoch': 0.01, 'global_step': 15}
{'loss': 2.36419754, 'acc': 0.51325231, 'grad_norm': 53.22253799, 'learning_rate': 6.19e-06, 'memory(GiB)': 73.51, 'train_speed(iter/s)': 0.12176, 'epoch': 0.01, 'global_step': 20}
{'loss': 2.36386452, 'acc': 0.51682658, 'grad_norm': 50.49488068, 'learning_rate': 6.66e-06, 'memory(GiB)': 73.51, 'train_speed(iter/s)': 0.130595, 'epoch': 0.01, 'global_step': 25}
{'loss': 2.44698772, 'acc': 0.49451327, 'grad_norm': 51.19428253, 'learning_rate': 7.03e-06, 'memory(GiB)': 73.91, 'train_speed(iter/s)': 0.139245, 'epoch': 0.01, 'global_step': 30}
{'loss': 2.36808701, 'acc': 0.49122458, 'grad_norm': 44.71856689, 'learning_rate': 7.35e-06, 'memory(GiB)': 73.91, 'train_speed(iter/s)': 0.146782, 'epoch': 0.01, 'global_step': 35}
{'loss': 2.35102615, 'acc': 0.51686664, 'grad_norm': 42.64341354, 'learning_rate': 7.63e-06, 'memory(GiB)': 73.88, 'train_speed(iter/s)': 0.152795, 'epoch': 0.02, 'global_step': 40}
{'loss': 2.42719765, 'acc': 0.49990711, 'grad_norm': 43.45627213, 'learning_rate': 7.87e-06, 'memory(GiB)': 73.91, 'train_speed(iter/s)': 0.157488, 'epoch': 0.02, 'global_step': 45}
{'loss': 2.41096077, 'acc': 0.49881673, 'grad_norm': 42.83742523, 'learning_rate': 8.09e-06, 'memory(GiB)': 73.91, 'train_speed(iter/s)': 0.161702, 'epoch': 0.02, 'global_step': 50}
Train: 2%|█▌ | 50/2506 [03:58<3:12:31, 4.70s/it]
{'eval_loss': 2.40315771, 'eval_acc': 0.49941928, 'eval_runtime': 35.7365, 'eval_samples_per_second': 11.333, 'eval_steps_per_second': 2.854, 'epoch': 0.02, 'global_step': 50}
Val: 100%|████████████████████████████████████████████████████████████████████████████████████| 102/102 [00:35<00:00, 2.90it/s]
[INFO:swift] Saving model checkpoint to /xxx/output/minicpm-v-v2_5-chat/v22-20240614-013240/checkpoint-50
{'loss': 2.27495384, 'acc': 0.51329708, 'grad_norm': 46.60443497, 'learning_rate': 8.29e-06, 'memory(GiB)': 76.02, 'train_speed(iter/s)': 0.132763, 'epoch': 0.02, 'global_step': 55}
Please try to upgrade ms-swift.
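(For reference: assuming a standard pip-based install, the upgrade would typically be

pip install -U ms-swift

ms-swift is the package name on PyPI.)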
@Jintao-Huang Same issue for me: using zero3 triggers this error, even with the latest swift version.
Still not working for me. The error with zero2:
Time to load fused_adam op: 2.4175093173980713 seconds
Time to load fused_adam op: 2.4176480770111084 seconds
Time to load fused_adam op: 2.4178555011749268 seconds
Time to load fused_adam op: 2.417736053466797 seconds
Time to load fused_adam op: 2.4175987243652344 seconds
Time to load fused_adam op: 2.418416738510132 seconds
Time to load fused_adam op: 2.419487476348877 seconds
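(Note: the "Time to load fused_adam op" lines are DeepSpeed JIT-compiling its fused Adam CUDA extension on first use; they are informational and not the failure itself. If the JIT build is a concern, DeepSpeed can be reinstalled with the op prebuilt; a sketch, assuming a pip install and a working CUDA toolchain:

DS_BUILD_FUSED_ADAM=1 pip install deepspeed --no-cache-dir --force-reinstall
)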
W0615 22:28:39.542000 140230454556288 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 44259 closing signal SIGTERM
W0615 22:28:40.155000 140230454556288 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 44261 closing signal SIGTERM
W0615 22:28:40.157000 140230454556288 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 44262 closing signal SIGTERM
W0615 22:28:40.158000 140230454556288 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 44263 closing signal SIGTERM
W0615 22:28:40.159000 140230454556288 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 44264 closing signal SIGTERM
W0615 22:28:40.161000 140230454556288 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 44265 closing signal SIGTERM
W0615 22:28:40.162000 140230454556288 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 44266 closing signal SIGTERM
E0615 22:28:44.440000 140230454556288 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -9) local_rank: 1 (pid: 44260) of binary: /public/home/xxx/miniconda3/envs/swift/bin/python
Traceback (most recent call last):
File "/public/home/xxx/miniconda3/envs/swift/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/public/home/xxx/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/public/home/xxx/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/public/home/xxx/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/public/home/xxx/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/public/home/xxx/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/storage/group/xxx/xxx/xxx/swift/swift/cli/sft.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-06-15_22:28:39
host : ai_hgx_01
rank : 1 (local_rank: 1)
exitcode : -9 (pid: 44260)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 44260
============================================================
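(Note on the exit code: -9 means the worker received SIGKILL from outside the training process, which on Linux is most often the kernel OOM killer reclaiming exhausted host RAM; DeepSpeed initialization and dataloader workers can spike CPU memory even when GPU memory looks fine. A sketch for confirming this, assuming dmesg access on the node:

dmesg -T | grep -iE 'out of memory|killed process'
free -h   # watch host RAM while the run starts

If host memory is the cause, a custom DeepSpeed JSON may help; swift's --deepspeed flag should accept a path to a config file in addition to the built-in presets, but verify against your swift version's docs. A minimal ZeRO-2 config without CPU offload, written from the shell; the "auto" values assume the HuggingFace Trainer integration that swift builds on resolves them, so replace them with concrete numbers if needed:

cat > ds_zero2.json <<'EOF'
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
EOF

NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
  --model_type minicpm-v-v2_5-chat \
  --dataset coco-en-2-mini \
  --sft_type full \
  --deepspeed ds_zero2.json
)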
@Jintao-Huang https://github.com/modelscope/swift/issues/938 seems to be a similar issue. Could you please investigate it further?
Same issue
Describe the bug
The training command is:
Also tried --deepspeed default-zero2. Error:
Your hardware and system info
Training on A100; CUDA version is 12.1.
Additional context
I've tried MiniCPM deployment and inference, and both work fine:
CUDA_VISIBLE_DEVICES=0,1,2,3 swift deploy --model_type minicpm-v-v2_5-chat --dtype bf16 --model_id_or_path /storage/group/xxxx/xxxxx/models/MiniCPM/MiniCPM-Llama3-V-2_5/