Reminder

System Info

llamafactory version: 0.8.4.dev0

Reproduction
Training hangs when training the reward model for PPO on a single machine with 8 GPUs. My prompt inputs are fairly long, so I set cutoff_len to 2600. The yaml file is configured as follows:

### model
model_name_or_path: /home/deepseekcoder/Deepseek_1

### method
stage: rm
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: reward_train_set  # my custom dataset; its prompts are fairly long
template: default
cutoff_len: 2600  # set to 2600 because the prompts are long
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /home/deepseekcoder/Deepseek-reward
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 4
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 0
fp16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
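For reference, the sizes reported in the trainer log below follow from this config. A quick sketch of the arithmetic, assuming 8 GPUs (single node, as described above) and the 487 training examples the trainer reports:

import math

# Values taken from the yaml above and the trainer log below.
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_gpus = 8                  # single node, 8 GPUs
num_train_examples = 487      # "Num examples = 487" in the log
num_train_epochs = 3

total_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = math.ceil(num_train_examples / total_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(total_batch_size)  # 32, matches "Total train batch size ... = 32"
print(total_steps)       # 48, matches "Total optimization steps = 48"

So the hang happens right after the first of 48 optimization steps.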
Symptom: training hangs when progress reaches 2% (it looks like all 8 GPUs have finished only one batch). After waiting for a while, the following error is reported:

[INFO|trainer.py:2128] 2024-07-20 08:01:42,459 >> Running training
[INFO|trainer.py:2129] 2024-07-20 08:01:42,459 >> Num examples = 487
[INFO|trainer.py:2130] 2024-07-20 08:01:42,459 >> Num Epochs = 3
[INFO|trainer.py:2131] 2024-07-20 08:01:42,459 >> Instantaneous batch size per device = 4
[INFO|trainer.py:2134] 2024-07-20 08:01:42,459 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:2135] 2024-07-20 08:01:42,459 >> Gradient Accumulation steps = 1
[INFO|trainer.py:2136] 2024-07-20 08:01:42,459 >> Total optimization steps = 48
[INFO|trainer.py:2137] 2024-07-20 08:01:42,567 >> Number of trainable parameters = 144,920,577
2%|███▊ | 1/48 [00:20<16:05, 20.55s/it][rank3]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 3] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank3]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 3] ProcessGroupNCCL preparing to dump debug info.
[rank3]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 3] [PG 0 Rank 3] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLEMONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList.size() = 1
W0720 08:20:54.177000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822593 closing signal SIGTERM
W0720 08:20:54.177000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822594 closing signal SIGTERM
W0720 08:20:54.177000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822595 closing signal SIGTERM
W0720 08:20:54.178000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822597 closing signal SIGTERM
W0720 08:20:54.179000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822598 closing signal SIGTERM
W0720 08:20:54.179000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822599 closing signal SIGTERM
W0720 08:20:54.179000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822600 closing signal SIGTERM
E0720 08:21:01.305000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 3 (pid: 2822596) of binary: /home/ybk1996/miniconda3/bin/python
Traceback (most recent call last):
File "/home/ybk1996/miniconda3/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/ybk1996/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/ybk1996/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/home/ybk1996/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/ybk1996/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ybk1996/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/ybk1996/LLaMA-Factory-newest/LLaMA-Factory/src/llamafactory/launcher.py FAILED
Failures:
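The fatal watchdog message above names two environment variables as possible mitigations. A minimal sketch of setting them before the NCCL process group is created (variable names as printed by the error; exact names and defaults should be verified against the installed PyTorch version, and this may only delay the abort rather than fix the underlying hang):

import os

# Sketch only: environment variables suggested by the watchdog error above.
# They must be in the environment before the process group is created, so
# exporting them in the shell before running torchrun works as well.
os.environ.setdefault("TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC", "1800")  # raise above the 600 s reported in the log
os.environ.setdefault("TORCH_NCCL_ENABLE_MONITORING", "0")         # or disable the heartbeat monitor entirely
os.environ.setdefault("NCCL_DEBUG", "INFO")                        # verbose NCCL logging to help locate the hang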