hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Reward model training for PPO hangs #4904

Open bingkunyao opened 3 months ago

bingkunyao commented 3 months ago

### Reminder

### System Info

### Reproduction

Training the reward model for PPO on a single machine with 8 GPUs hangs. My prompt inputs are fairly long, so cutoff_len is set to 2600. The yaml file is configured as follows:

```yaml
### model
model_name_or_path: /home/deepseekcoder/Deepseek_1

### method
stage: rm
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: reward_train_set  # my custom dataset; its prompts are long
template: default
cutoff_len: 2600  # set to 2600 because the prompts are long
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /home/deepseekcoder/Deepseek-reward
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 4
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 0
fp16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
```
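For context, a config like this is typically launched through the LLaMA-Factory CLI; a minimal sketch, assuming the yaml above is saved under a hypothetical filename:

```bash
# Hypothetical launch command for the yaml above (the filename is illustrative).
# FORCE_TORCHRUN=1 asks llamafactory-cli to dispatch the run through torchrun,
# which matches the torchrun traceback in the log below.
FORCE_TORCHRUN=1 llamafactory-cli train deepseek_rm_lora.yaml
```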

Symptom: training hangs at 2% progress (it looks like all 8 GPUs only get through a single batch). After waiting for a while, the following error is reported:

```
[INFO|trainer.py:2128] 2024-07-20 08:01:42,459 >> Running training
[INFO|trainer.py:2129] 2024-07-20 08:01:42,459 >> Num examples = 487
[INFO|trainer.py:2130] 2024-07-20 08:01:42,459 >> Num Epochs = 3
[INFO|trainer.py:2131] 2024-07-20 08:01:42,459 >> Instantaneous batch size per device = 4
[INFO|trainer.py:2134] 2024-07-20 08:01:42,459 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:2135] 2024-07-20 08:01:42,459 >> Gradient Accumulation steps = 1
[INFO|trainer.py:2136] 2024-07-20 08:01:42,459 >> Total optimization steps = 48
[INFO|trainer.py:2137] 2024-07-20 08:01:42,567 >> Number of trainable parameters = 144,920,577
  2%|███▊          | 1/48 [00:20<16:05, 20.55s/it]
[rank3]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 3] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank3]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 3] ProcessGroupNCCL preparing to dump debug info.
[rank3]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 3] [PG 0 Rank 3] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors. If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0). If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
W0720 08:20:54.177000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822593 closing signal SIGTERM
W0720 08:20:54.177000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822594 closing signal SIGTERM
W0720 08:20:54.177000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822595 closing signal SIGTERM
W0720 08:20:54.178000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822597 closing signal SIGTERM
W0720 08:20:54.179000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822598 closing signal SIGTERM
W0720 08:20:54.179000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822599 closing signal SIGTERM
W0720 08:20:54.179000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2822600 closing signal SIGTERM
E0720 08:21:01.305000 139855166007104 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 3 (pid: 2822596) of binary: /home/ybk1996/miniconda3/bin/python
Traceback (most recent call last):
  File "/home/ybk1996/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ybk1996/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/ybk1996/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/ybk1996/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/ybk1996/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ybk1996/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/ybk1996/LLaMA-Factory-newest/LLaMA-Factory/src/llamafactory/launcher.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-07-20_08:20:54
  host : a800-node12
  rank : 3 (local_rank: 3)
  exitcode : -6 (pid: 2822596)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 2822596
============================================================
```

However, when cutoff_len is lowered to 1024 or 2048, there is no hang and training completes normally. How can this problem be solved? Thanks!

### Expected behavior

_No response_

### Others

_No response_
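The watchdog message above itself names two knobs: raising the heartbeat timeout or disabling the heartbeat monitor. A minimal sketch of relaunching with a longer timeout and NCCL logging enabled; the variable names come from the error message, while the values and yaml filename are illustrative, not a confirmed fix:

```bash
# Variable names are taken from the NCCL watchdog message above; values are illustrative.
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=1800  # raise the 600 s watchdog limit
export NCCL_DEBUG=INFO                        # log NCCL activity to see which collective stalls
FORCE_TORCHRUN=1 llamafactory-cli train deepseek_rm_lora.yaml
```

If this only delays the abort instead of letting training progress past step 1, the hang is real, and the NCCL_DEBUG output should indicate which rank and collective are stuck.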
DarkJokers commented 1 month ago

Has this been resolved?

GoGoZeppeli-towa commented 3 weeks ago

I ran into the same problem. Has it been resolved?