Closed: yuan-QAQ closed this issue 1 week ago.

System Info
llamafactory version: 0.8.4.dev0
Reproduction
Code snippet:

set -e
set -x

source /mnt/data/yuan/.zshrc
conda activate qwen2vl

export NCCL_P2P_LEVEL=NVL
export NCCL_DEBUG=INFO
cd /mnt/data/yuan/LLaMA-Factory
FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/qwen2vl_full_sft.yaml
Error messages:

2024-09-04 19:06:57 cce-3jkcwj5c-ftb8r2no:298:810 [5] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
2024-09-04 19:06:57 cce-3jkcwj5c-ftb8r2no:298:810 [5] NCCL INFO comm 0x7efe74c4a3e0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId a3000 commId 0x67526bea581fe7a3 - Init COMPLETE
2024-09-04 19:06:57 cce-3jkcwj5c-ftb8r2no:300:811 [7] NCCL INFO comm 0x7fd090c4a350 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId a7000 commId 0x67526bea581fe7a3 - Init COMPLETE
2024-09-04 19:06:57 cce-3jkcwj5c-ftb8r2no:296:806 [3] NCCL INFO comm 0x7f5394c4a530 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 67000 commId 0x67526bea581fe7a3 - Init COMPLETE
2024-09-04 19:06:57 cce-3jkcwj5c-ftb8r2no:294:805 [1] NCCL INFO comm 0x7effecc4a330 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 63000 commId 0x67526bea581fe7a3 - Init COMPLETE
2024-09-04 19:06:57 cce-3jkcwj5c-ftb8r2no:299:807 [6] NCCL INFO comm 0x7f3cecc4a3e0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId a5000 commId 0x67526bea581fe7a3 - Init COMPLETE
2024-09-04 19:06:57 cce-3jkcwj5c-ftb8r2no:295:809 [2] NCCL INFO comm 0x7f2b64c4a9c0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 65000 commId 0x67526bea581fe7a3 - Init COMPLETE
2024-09-04 19:06:57 cce-3jkcwj5c-ftb8r2no:297:808 [4] NCCL INFO comm 0x7f4388c4a280 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId a1000 commId 0x67526bea581fe7a3 - Init COMPLETE
2024-09-04 19:06:57 cce-3jkcwj5c-ftb8r2no:293:804 [0] NCCL INFO comm 0x7fa84d1d1c70 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 61000 commId 0x67526bea581fe7a3 - Init COMPLETE
2024-09-04 19:06:57   0%| | 0/15 [00:00<?, ?it/s]
2024-09-04 19:06:57 /mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
2024-09-04 19:06:57   with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
[the FutureWarning pair above is printed once per rank, 8 ranks in total]
2024-09-04 19:26:32   7%|▋  | 1/15 [00:03<00:53, 3.84s/it]
 13%|█▎ | 2/15 [00:05<00:30, 2.35s/it]
 20%|██ | 3/15 [00:06<00:22, 1.88s/it]
[rank0]:[E904 19:26:32.709561627 ProcessGroupNCCL.cpp:1375] [PG 0 (default_pg) Rank 0] First PG on this rank that detected no heartbeat of its watchdog.
2024-09-04 19:26:32 [rank0]:[E904 19:26:32.709623717 ProcessGroupNCCL.cpp:1413] [PG 0 (default_pg) Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=9
2024-09-04 19:36:32 [rank0]:[F904 19:36:32.709946865 ProcessGroupNCCL.cpp:1224] [PG 0 (default_pg) Rank 0] [PG 0 (default_pg) Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors. If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0). If either of the aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 9
2024-09-04 19:36:32 W0904 19:36:32.461000 139709801150272 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 294 closing signal SIGTERM
2024-09-04 19:36:32 W0904 19:36:32.461000 139709801150272 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 295 closing signal SIGTERM
2024-09-04 19:36:32 W0904 19:36:32.461000 139709801150272 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 296 closing signal SIGTERM
2024-09-04 19:36:32 W0904 19:36:32.462000 139709801150272 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 297 closing signal SIGTERM
2024-09-04 19:36:32 W0904 19:36:32.462000 139709801150272 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 298 closing signal SIGTERM
2024-09-04 19:36:32 W0904 19:36:32.462000 139709801150272 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 299 closing signal SIGTERM
2024-09-04 19:36:32 W0904 19:36:32.463000 139709801150272 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 300 closing signal SIGTERM
2024-09-04 19:36:33 E0904 19:36:33.742000 139709801150272 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 293) of binary: /mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/bin/python
2024-09-04 19:36:33 Traceback (most recent call last):
  File "/mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
  File "/mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/data/fanpengyuan/anaconda3/envs/qwen2vl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/mnt/data/fanpengyuan/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-04_19:36:32
  host      : cce-3jkcwj5c-ftb8r2no
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 293)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 293
============================================================
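Note on the fatal watchdog message: the log itself names two PyTorch environment variables for ruling out a false-positive abort. A minimal sketch of setting them before the launch command, assuming the hang might just be a very slow step rather than a true deadlock (the 1800 s value is an illustrative choice, not from this report):

# Raise the NCCL heartbeat-watchdog timeout from its 600 s default.
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=1800
# Or disable the heartbeat monitor entirely while debugging:
# export TORCH_NCCL_ENABLE_MONITORING=0

In this report the hang turned out to be real, so the actual fix was the DeepSpeed change described in the resolution below.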
Expected behavior
Full SFT should train, but it aborts with the error above; the same setup with LoRA runs fine.

Others
No response
Resolution: I changed the deepspeed setting in the YAML from ZeRO-3 (z3) to ZeRO-2 (z2) and it worked. It seems something goes wrong when DeepSpeed ZeRO-3 shards the model.
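For anyone hitting the same hang, the change amounts to one line in the training YAML. A sketch assuming the stock LLaMA-Factory example paths; verify the key against your own examples/train_full/qwen2vl_full_sft.yaml:

### method
# before (hung at step 3 with the watchdog timeout above):
# deepspeed: examples/deepspeed/ds_z3_config.json
# after (trains normally):
deepspeed: examples/deepspeed/ds_z2_config.json

The trade-off: ZeRO-2 shards only optimizer states and gradients, so every rank keeps a full copy of the parameters and uses more memory than ZeRO-3, but it avoids the parameter gather/scatter collectives that ZeRO-3 adds, which is where this run appeared to hang.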