youjiaSHTU opened this issue 1 month ago
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
--model_type minicpm-v-v2_5-chat \
--dataset coco-en-2-mini \
--sft_type full \
--deepspeed default-zero2
It runs normally on my side; the log is below.
{'loss': 3.8767333, 'acc': 0.38549057, 'grad_norm': 3773.15991211, 'learning_rate': 0.0, 'memory(GiB)': 71.72, 'train_speed(iter/s)': 0.013345, 'epoch': 0.0, 'global_step': 1}
{'loss': 3.14160919, 'acc': 0.40862948, 'grad_norm': 410.01010132, 'learning_rate': 3.33e-06, 'memory(GiB)': 73.88, 'train_speed(iter/s)': 0.052893, 'epoch': 0.0, 'global_step': 5}
{'loss': 2.67884235, 'acc': 0.47376723, 'grad_norm': 166.06022644, 'learning_rate': 4.76e-06, 'memory(GiB)': 73.88, 'train_speed(iter/s)': 0.085535, 'epoch': 0.0, 'global_step': 10}
{'loss': 2.53729782, 'acc': 0.49073067, 'grad_norm': 75.64689636, 'learning_rate': 5.6e-06, 'memory(GiB)': 73.88, 'train_speed(iter/s)': 0.106337, 'epoch': 0.01, 'global_step': 15}
{'loss': 2.36419754, 'acc': 0.51325231, 'grad_norm': 53.22253799, 'learning_rate': 6.19e-06, 'memory(GiB)': 73.51, 'train_speed(iter/s)': 0.12176, 'epoch': 0.01, 'global_step': 20}
{'loss': 2.36386452, 'acc': 0.51682658, 'grad_norm': 50.49488068, 'learning_rate': 6.66e-06, 'memory(GiB)': 73.51, 'train_speed(iter/s)': 0.130595, 'epoch': 0.01, 'global_step': 25}
{'loss': 2.44698772, 'acc': 0.49451327, 'grad_norm': 51.19428253, 'learning_rate': 7.03e-06, 'memory(GiB)': 73.91, 'train_speed(iter/s)': 0.139245, 'epoch': 0.01, 'global_step': 30}
{'loss': 2.36808701, 'acc': 0.49122458, 'grad_norm': 44.71856689, 'learning_rate': 7.35e-06, 'memory(GiB)': 73.91, 'train_speed(iter/s)': 0.146782, 'epoch': 0.01, 'global_step': 35}
{'loss': 2.35102615, 'acc': 0.51686664, 'grad_norm': 42.64341354, 'learning_rate': 7.63e-06, 'memory(GiB)': 73.88, 'train_speed(iter/s)': 0.152795, 'epoch': 0.02, 'global_step': 40}
{'loss': 2.42719765, 'acc': 0.49990711, 'grad_norm': 43.45627213, 'learning_rate': 7.87e-06, 'memory(GiB)': 73.91, 'train_speed(iter/s)': 0.157488, 'epoch': 0.02, 'global_step': 45}
{'loss': 2.41096077, 'acc': 0.49881673, 'grad_norm': 42.83742523, 'learning_rate': 8.09e-06, 'memory(GiB)': 73.91, 'train_speed(iter/s)': 0.161702, 'epoch': 0.02, 'global_step': 50}
Train: 2%|█▌ | 50/2506 [03:58<3:12:31, 4.70s/it]
{'eval_loss': 2.40315771, 'eval_acc': 0.49941928, 'eval_runtime': 35.7365, 'eval_samples_per_second': 11.333, 'eval_steps_per_second': 2.854, 'epoch': 0.02, 'global_step': 50}
Val: 100%|████████████████████████████████████████████████████████████████████████████████████| 102/102 [00:35<00:00, 2.90it/s]
[INFO:swift] Saving model checkpoint to /xxx/output/minicpm-v-v2_5-chat/v22-20240614-013240/checkpoint-50
{'loss': 2.27495384, 'acc': 0.51329708, 'grad_norm': 46.60443497, 'learning_rate': 8.29e-06, 'memory(GiB)': 76.02, 'train_speed(iter/s)': 0.132763, 'epoch': 0.02, 'global_step': 55}
Please try to upgrade ms-swift.
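(For reference: assuming a standard pip-based install, the upgrade would typically be

pip install -U ms-swift

ms-swift is the package name on PyPI.)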
@Jintao-Huang Same issue for me: using zero3 triggers this error, even with the latest swift version.
Still not working for me. The error with zero2:
Time to load fused_adam op: 2.4175093173980713 seconds
Time to load fused_adam op: 2.4176480770111084 seconds
Time to load fused_adam op: 2.4178555011749268 seconds
Time to load fused_adam op: 2.417736053466797 seconds
Time to load fused_adam op: 2.4175987243652344 seconds
Time to load fused_adam op: 2.418416738510132 seconds
Time to load fused_adam op: 2.419487476348877 seconds
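(Note: the "Time to load fused_adam op" lines are DeepSpeed JIT-compiling its fused Adam CUDA extension on first use; they are informational and not the failure itself. If the JIT build is a concern, DeepSpeed can be reinstalled with the op prebuilt; a sketch, assuming a pip install and a working CUDA toolchain:

DS_BUILD_FUSED_ADAM=1 pip install deepspeed --no-cache-dir --force-reinstall
)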
W0615 22:28:39.542000 140230454556288 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 44259 closing signal SIGTERM
W0615 22:28:40.155000 140230454556288 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 44261 closing signal SIGTERM
W0615 22:28:40.157000 140230454556288 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 44262 closing signal SIGTERM
W0615 22:28:40.158000 140230454556288 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 44263 closing signal SIGTERM
W0615 22:28:40.159000 140230454556288 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 44264 closing signal SIGTERM
W0615 22:28:40.161000 140230454556288 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 44265 closing signal SIGTERM
W0615 22:28:40.162000 140230454556288 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 44266 closing signal SIGTERM
E0615 22:28:44.440000 140230454556288 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -9) local_rank: 1 (pid: 44260) of binary: /public/home/xxx/miniconda3/envs/swift/bin/python
Traceback (most recent call last):
File "/public/home/xxx/miniconda3/envs/swift/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/public/home/xxx/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/public/home/xxx/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/public/home/xxx/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/public/home/xxx/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/public/home/xxx/miniconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/storage/group/xxx/xxx/xxx/swift/swift/cli/sft.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-06-15_22:28:39
host : ai_hgx_01
rank : 1 (local_rank: 1)
exitcode : -9 (pid: 44260)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 44260
============================================================
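(Note on the exit code: -9 means the worker received SIGKILL from outside the training process, which on Linux is most often the kernel OOM killer reclaiming exhausted host RAM; DeepSpeed initialization and dataloader workers can spike CPU memory even when GPU memory looks fine. A sketch for confirming this, assuming dmesg access on the node:

dmesg -T | grep -iE 'out of memory|killed process'
free -h   # watch host RAM while the run starts

If host memory is the cause, a custom DeepSpeed JSON may help; swift's --deepspeed flag should accept a path to a config file in addition to the built-in presets, but verify against your swift version's docs. A minimal ZeRO-2 config without CPU offload, written from the shell; the "auto" values assume the HuggingFace Trainer integration that swift builds on resolves them, so replace them with concrete numbers if needed:

cat > ds_zero2.json <<'EOF'
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
EOF

NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
  --model_type minicpm-v-v2_5-chat \
  --dataset coco-en-2-mini \
  --sft_type full \
  --deepspeed ds_zero2.json
)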
@Jintao-Huang https://github.com/modelscope/swift/issues/938 seems to be a similar issue. Could you please investigate it further?
Same issue
Describe the bug
The training command is:
Also tried --deepspeed default-zero2. Error:
Your hardware and system info
Training on A100; CUDA version is 12.1.
Additional context
I've tried MiniCPM deployment and inference, and both work fine:
CUDA_VISIBLE_DEVICES=0,1,2,3 swift deploy --model_type minicpm-v-v2_5-chat --dtype bf16 --model_id_or_path /storage/group/xxxx/xxxxx/models/MiniCPM/MiniCPM-Llama3-V-2_5/