Qwen2-vl full sft Heartbeat monitor timed out!

Reminder

[X] I have read the README and searched the existing issues.
System Info

llamafactory version: 0.8.4.dev0
Platform: Linux-5.15.0-72-generic-x86_64-with-glibc2.35
Python version: 3.11.9
PyTorch version: 2.4.0+cu121 (GPU)
Transformers version: 4.45.0.dev0
Datasets version: 2.21.0
Accelerate version: 0.33.0
PEFT version: 0.12.0
TRL version: 0.9.6
GPU type: NVIDIA A100-SXM4-80GB
DeepSpeed version: 0.14.4
Reproduction

code snippets: set -e set -x
source /mnt/data/yuan/.zshrc conda activate qwen2vl
export NCCL_P2P_LEVEL=NVL export NCCL_DEBUG=INFO
cd /mnt/data/yuan/LLaMA-Factory
FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/qwen2vl_full_sft.yaml
error messages:
2024-09-04 19:06:57
cce-3jkcwj5c-ftb8r2no:298:810 [5] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
2024-09-04 19:06:57
cce-3jkcwj5c-ftb8r2no:298:810 [5] NCCL INFO comm 0x7efe74c4a3e0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId a3000 commId 0x67526bea581fe7a3 - Init COMPLETE
2024-09-04 19:06:57
cce-3jkcwj5c-ftb8r2no:300:811 [7] NCCL INFO comm 0x7fd090c4a350 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId a7000 commId 0x67526bea581fe7a3 - Init COMPLETE
2024-09-04 19:06:57
cce-3jkcwj5c-ftb8r2no:296:806 [3] NCCL INFO comm 0x7f5394c4a530 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 67000 commId 0x67526bea581fe7a3 - Init COMPLETE
2024-09-04 19:06:57
cce-3jkcwj5c-ftb8r2no:294:805 [1] NCCL INFO comm 0x7effecc4a330 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 63000 commId 0x67526bea581fe7a3 - Init COMPLETE
2024-09-04 19:06:57
cce-3jkcwj5c-ftb8r2no:299:807 [6] NCCL INFO comm 0x7f3cecc4a3e0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId a5000 commId 0x67526bea581fe7a3 - Init COMPLETE
2024-09-04 19:06:57
cce-3jkcwj5c-ftb8r2no:295:809 [2] NCCL INFO comm 0x7f2b64c4a9c0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 65000 commId 0x67526bea581fe7a3 - Init COMPLETE
2024-09-04 19:06:57
cce-3jkcwj5c-ftb8r2no:297:808 [4] NCCL INFO comm 0x7f4388c4a280 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId a1000 commId 0x67526bea581fe7a3 - Init COMPLETE
2024-09-04 19:06:57
cce-3jkcwj5c-ftb8r2no:293:804 [0] NCCL INFO comm 0x7fa84d1d1c70 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 61000 commId 0x67526bea581fe7a3 - Init COMPLETE
2024-09-04 19:06:57
0%| | 0/15 [00:00<?, ?it/s]/mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
2024-09-04 19:06:57
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
2024-09-04 19:06:57
/mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
2024-09-04 19:06:57
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
2024-09-04 19:06:57
/mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
2024-09-04 19:06:57
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
2024-09-04 19:06:57
/mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
2024-09-04 19:06:57
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
2024-09-04 19:06:57
/mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
2024-09-04 19:06:57
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
2024-09-04 19:06:57
/mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
2024-09-04 19:06:57
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
2024-09-04 19:06:57
/mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
2024-09-04 19:06:57
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
2024-09-04 19:06:57
/mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
2024-09-04 19:06:57
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
2024-09-04 19:26:32
7%|▋ | 1/15 [00:03<00:53, 3.84s/it] 13%|█▎ | 2/15 [00:05<00:30, 2.35s/it] 20%|██ | 3/15 [00:06<00:22, 1.88s/it][rank0]:[E904 19:26:32.709561627 ProcessGroupNCCL.cpp:1375] [PG 0 (default_pg) Rank 0] First PG on this rank that detected no heartbeat of its watchdog.
2024-09-04 19:26:32
[rank0]:[E904 19:26:32.709623717 ProcessGroupNCCL.cpp:1413] [PG 0 (default_pg) Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=9
2024-09-04 19:36:32
[rank0]:[F904 19:36:32.709946865 ProcessGroupNCCL.cpp:1224] [PG 0 (default_pg) Rank 0] [PG 0 (default_pg) Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 9
2024-09-04 19:36:32
W0904 19:36:32.461000 139709801150272 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 294 closing signal SIGTERM
2024-09-04 19:36:32
W0904 19:36:32.461000 139709801150272 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 295 closing signal SIGTERM
2024-09-04 19:36:32
W0904 19:36:32.461000 139709801150272 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 296 closing signal SIGTERM
2024-09-04 19:36:32
W0904 19:36:32.462000 139709801150272 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 297 closing signal SIGTERM
2024-09-04 19:36:32
W0904 19:36:32.462000 139709801150272 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 298 closing signal SIGTERM
2024-09-04 19:36:32
W0904 19:36:32.462000 139709801150272 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 299 closing signal SIGTERM
2024-09-04 19:36:32
W0904 19:36:32.463000 139709801150272 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 300 closing signal SIGTERM
2024-09-04 19:36:33
E0904 19:36:33.742000 139709801150272 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 293) of binary: /mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/bin/python
2024-09-04 19:36:33
Traceback (most recent call last):
2024-09-04 19:36:33
File "/mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/bin/torchrun", line 33, in <module>
2024-09-04 19:36:33
sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
2024-09-04 19:36:33
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-09-04 19:36:33
File "/mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
2024-09-04 19:36:33
return f(*args, **kwargs)
2024-09-04 19:36:33
^^^^^^^^^^^^^^^^^^
2024-09-04 19:36:33
File "/mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
2024-09-04 19:36:33
run(args)
2024-09-04 19:36:33
File "/mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
2024-09-04 19:36:33
elastic_launch(
2024-09-04 19:36:33
File "/mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
2024-09-04 19:36:33
return launch_agent(self._config, self._entrypoint, list(args))
2024-09-04 19:36:33
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-09-04 19:36:33
File "/mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
2024-09-04 19:36:33
raise ChildFailedError(
2024-09-04 19:36:33
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
2024-09-04 19:36:33
============================================================
2024-09-04 19:36:33
/mnt/data/fanpengyuan/LLaMA-Factory/src/llamafactory/launcher.py FAILED
2024-09-04 19:36:33
------------------------------------------------------------
2024-09-04 19:36:33
Failures:
2024-09-04 19:36:33
<NO_OTHER_FAILURES>
2024-09-04 19:36:33
------------------------------------------------------------
2024-09-04 19:36:33
Root Cause (first observed failure):
2024-09-04 19:36:33
[0]:
2024-09-04 19:36:33
time : 2024-09-04_19:36:32
2024-09-04 19:36:33
host : cce-3jkcwj5c-ftb8r2no
2024-09-04 19:36:33
rank : 0 (local_rank: 0)
2024-09-04 19:36:33
exitcode : -6 (pid: 293)
2024-09-04 19:36:33
error_file: <N/A>
2024-09-04 19:36:33
traceback : Signal 6 (SIGABRT) received by PID 293
2024-09-04 19:36:33
============================================================
Expected behavior

I run full sft and get an error, but lora runs fine
Others

No response
hiyouga / LLaMA-Factory

Qwen2-vl full sft Heartbeat monitor timed out! #5361

Reminder

System Info

Reproduction

Expected behavior

Others