hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Reward model training for PPO hangs #4904

Open bingkunyao opened 3 months ago

bingkunyao commented 3 months ago

### Reminder

### System Info

### Reproduction

Training the reward model for PPO on a single machine with 8 GPUs hangs. My prompt inputs are fairly long, so cutoff_len is set to 2600. The yaml file is configured as follows:

```yaml
### model
model_name_or_path: /home/deepseekcoder/Deepseek_1

### method
stage: rm
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: reward_train_set  # my custom dataset; its prompts are long
template: default
cutoff_len: 2600  # set to 2600 because the prompts are long
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /home/deepseekcoder/Deepseek-reward
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 4
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 0
fp16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
```
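For context, a config like this is typically launched through the LLaMA-Factory CLI; a minimal sketch, assuming the yaml above is saved under a hypothetical filename:

```bash
# Hypothetical launch command for the yaml above (the filename is illustrative).
# FORCE_TORCHRUN=1 asks llamafactory-cli to dispatch the run through torchrun,
# which matches the torchrun traceback in the log below.
FORCE_TORCHRUN=1 llamafactory-cli train deepseek_rm_lora.yaml
```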

Symptom: training hangs at 2% progress (it looks like all 8 GPUs only get through a single batch). After waiting for a while, the following error is reported:

```
[INFO|trainer.py:2128] 2024-07-20 08:01:42,459 >> Running training
[INFO|trainer.py:2129] 2024-07-20 08:01:42,459 >> Num examples = 487
[INFO|trainer.py:2130] 2024-07-20 08:01:42,459 >> Num Epochs = 3
[INFO|trainer.py:2131] 2024-07-20 08:01:42,459 >> Instantaneous batch size per device = 4
[INFO|trainer.py:2134] 2024-07-20 08:01:42,459 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:2135] 2024-07-20 08:01:42,459 >> Gradient Accumulation steps = 1
[INFO|trainer.py:2136] 2024-07-20 08:01:42,459 >> Total optimization steps = 48
[INFO|trainer.py:2137] 2024-07-20 08:01:42,567 >> Number of trainable parameters = 144,920,577
  2%|███▊          | 1/48 [00:20<16:05, 20.55s/it]
[rank3]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 3] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank3]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 3] ProcessGroupNCCL preparing to dump debug info.
[rank3]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 3] [PG 0 Rank 3] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors. If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0). If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
W0720 08:20:54.177000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822593 closing signal SIGTERM
W0720 08:20:54.177000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822594 closing signal SIGTERM
W0720 08:20:54.177000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822595 closing signal SIGTERM
W0720 08:20:54.178000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822597 closing signal SIGTERM
W0720 08:20:54.179000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822598 closing signal SIGTERM
W0720 08:20:54.179000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822599 closing signal SIGTERM
W0720 08:20:54.179000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822600 closing signal SIGTERM
E0720 08:21:01.305000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 3 (pid: 2822596) of binary: /home/ybk1996/miniconda3/bin/python
Traceback (most recent call last):
  File "/home/ybk1996/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ybk1996/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/ybk1996/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/ybk1996/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/ybk1996/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ybk1996/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/ybk1996/LLaMA-Factory-newest/LLaMA-Factory/src/llamafactory/launcher.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-07-20_08:20:54
  host : a800-node12
  rank : 3 (local_rank: 3)
  exitcode : -6 (pid: 2822596)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 2822596
============================================================
```

However, when cutoff_len is lowered to 1024 or 2048, there is no hang and training completes normally. How can this problem be solved? Thanks!

### Expected behavior

_No response_

### Others

_No response_
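The watchdog message above itself names two knobs: raising the heartbeat timeout or disabling the heartbeat monitor. A minimal sketch of relaunching with a longer timeout and NCCL logging enabled; the variable names come from the error message, while the values and yaml filename are illustrative, not a confirmed fix:

```bash
# Variable names are taken from the NCCL watchdog message above; values are illustrative.
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=1800  # raise the 600 s watchdog limit
export NCCL_DEBUG=INFO                        # log NCCL activity to see which collective stalls
FORCE_TORCHRUN=1 llamafactory-cli train deepseek_rm_lora.yaml
```

If this only delays the abort instead of letting training progress past step 1, the hang is real, and the NCCL_DEBUG output should indicate which rank and collective are stuck.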
DarkJokers commented 1 month ago

Has this been resolved?

GoGoZeppeli-towa commented 3 weeks ago

I ran into the same problem. Has it been resolved?