In addition to these two switches, --disable-custom-all-reduce and --enforce-eager, is there any other way to solve this problem? Thanks~
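For reference, a minimal sketch (not from this thread) of what those two switches correspond to when using the offline API; the engine arguments enforce_eager and disable_custom_all_reduce are assumed to mirror the CLI flags, and the model name is a placeholder:
# Hedged illustration of the two workarounds as engine arguments.
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",        # placeholder model, for illustration only
    tensor_parallel_size=2,           # assumed multi-GPU setup
    enforce_eager=True,               # skip CUDA graph capture
    disable_custom_all_reduce=True,   # use NCCL all-reduce instead of the custom kernel
)
print(llm.generate("Hello, my name is")[0].outputs[0].text)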
FYI i recently added a page of debugging tips for this problem: https://docs.vllm.ai/en/latest/getting_started/debugging.html
Thanks for your reply~
I tried the test code you provided below, and I haven't encountered any related problems with it so far. The original problem only occurs occasionally. Do you have any suggestions?
# save it as `test.py` , and run it with `NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py`
# adjust `--nproc-per-node` to the number of GPUs you want to use.
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl")
data = torch.FloatTensor([1,] * 128).to(f"cuda:{dist.get_rank()}")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
assert value == dist.get_world_size()
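The fuller sanity-check script in the debugging guide also exercises a CPU (gloo) process group, which is where the "PyTorch GLOO is successful!" line in the output further down comes from. A minimal sketch of what that extra check might look like, reusing the process group initialized above (an illustration, not the exact script):
# Repeat the all-reduce over a CPU tensor via a gloo group, to separate
# NCCL/GPU problems from generic host networking problems.
gloo_group = dist.new_group(backend="gloo")
cpu_data = torch.FloatTensor([1,] * 128)
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
assert cpu_data.mean().item() == dist.get_world_size()
print("PyTorch GLOO is successful!")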
@youkaichao Could you help me look into this issue? Thanks~
when it hangs, do you follow the documentation to find out where (which line) the code is executing?
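One generic way to answer that, independent of the docs page: dump the Python stacks of the hung process, e.g. with py-spy dump --pid <pid> from another shell. An in-process alternative using only the standard library is sketched below (this is an illustration you would add to your own script, not something vLLM does by default):
# Register early in the script; then `kill -USR1 <pid>` on the hung process
# prints every thread's Python stack trace to stderr.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)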
@wjj19950828 I'm facing the same issue. Has this problem been solved?
(RayWorkerWrapper pid=4009) [rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=158046, OpType=GATHER, NumelIn=2867200, NumelOut=0, Timeout(ms)=600000) ran for 600027 milliseconds before timing out.
(RayWorkerWrapper pid=4009) [rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 1] Timeout at NCCL work: 158046, last enqueued NCCL work: 158046, last completed NCCL work: 158045.
(RayWorkerWrapper pid=4009) [rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(RayWorkerWrapper pid=4009) [rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
(RayWorkerWrapper pid=4009) [rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=158046, OpType=GATHER, NumelIn=2867200, NumelOut=0, Timeout(ms)=600000) ran for 600027 milliseconds before timing out.
(RayWorkerWrapper pid=4009) Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
(RayWorkerWrapper pid=4009) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa2ea16b897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=4009) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9a81b4c5a2 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4009) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9a81b513c0 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4009) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9a81b5270c in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4009) frame #4:
I'm also facing the same issue. Do we have a solution to this? In a few other issues I read, people mentioned installing nvidia-fabricmanager.service, but even after installing it, the fabric manager service would not start: https://forums.developer.nvidia.com/t/nvidia-fabricmanager-running-error-with-nv-warn-nothing-to-do/272899 Following the above forum, I found that systems with NVSwitch need fabric manager for proper communication between the NVIDIA GPUs.
I have 8 NVIDIA GPUs on my system.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
vllm serve hangs for me:
$ vllm serve Mistral-Nemo-Instruct-2407-Q6_K_L.gguf --max-model-len=200000 --pipeline-parallel-size=2
...
INFO 11-10 17:05:20 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=49742) INFO 11-10 17:05:23 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 11-10 17:05:23 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=49742) INFO 11-10 17:05:23 utils.py:1008] Found nccl from library libnccl.so.2
INFO 11-10 17:05:23 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=49742) INFO 11-10 17:05:23 pynccl.py:63] vLLM is using nccl==2.20.5
<HANGS FOR 20+ min>
<------------------------------------------------------------------------------>
^C <NOTHING ON 1st Ctrl-C>
<------------------------------------------------------------------------------>
^C <On 2nd Ctrl-C, still hangs but dumps this>
[rank0]: Traceback (most recent call last):
[rank0]: File "project/.env/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
[rank0]: return loop.run_until_complete(wrapper())
[rank0]: File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
[rank0]: File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
[rank0]: File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
[rank0]: File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
[rank0]: File "uvloop/handles/poll.pyx", line 216, in uvloop.loop.__on_uvpoll_event
[rank0]: File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
[rank0]: File "uvloop/cbhandles.pyx", line 66, in uvloop.loop.Handle._run
[rank0]: File "uvloop/loop.pyx", line 399, in uvloop.loop.Loop._read_from_self
[rank0]: File "uvloop/loop.pyx", line 404, in uvloop.loop.Loop._invoke_signals
[rank0]: File "uvloop/loop.pyx", line 379, in uvloop.loop.Loop._ceval_process_signals
[rank0]: KeyboardInterrupt
[rank0]: During handling of the above exception, another exception occurred:
[rank0]: Traceback (most recent call last):
[rank0]: File "project/.env/bin/vllm", line 8, in <module>
[rank0]: sys.exit(main())
[rank0]: File "project/.env/lib/python3.10/site-packages/vllm/scripts.py", line 195, in main
[rank0]: args.dispatch_function(args)
[rank0]: File "project/.env/lib/python3.10/site-packages/vllm/scripts.py", line 41, in serve
[rank0]: uvloop.run(run_server(args))
[rank0]: File "project/.env/lib/python3.10/site-packages/uvloop/__init__.py", line 88, in run
[rank0]: loop.run_until_complete(
[rank0]: File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
[rank0]: File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
[rank0]: File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
[rank0]: File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
[rank0]: File "uvloop/handles/poll.pyx", line 216, in uvloop.loop.__on_uvpoll_event
[rank0]: File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
[rank0]: File "uvloop/cbhandles.pyx", line 66, in uvloop.loop.Handle._run
[rank0]: File "uvloop/loop.pyx", line 399, in uvloop.loop.Loop._read_from_self
[rank0]: File "uvloop/loop.pyx", line 404, in uvloop.loop.Loop._invoke_signals
[rank0]: File "uvloop/loop.pyx", line 379, in uvloop.loop.Loop._ceval_process_signals
[rank0]: KeyboardInterrupt
<STILL HANGING>
<------------------------------------------------------------------------------>
^C <On 3rd Ctrl-C, finally terminates with this>
Exception ignored in: <module 'threading' from '/usr/lib/python3.10/threading.py'>
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1537, in _shutdown
Exception in thread Thread-3 (_do_shutdown):
Traceback (most recent call last):
File "uvloop/loop.pyx", line 3254, in uvloop.loop.Loop._do_shutdown
atexit_call()
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 31, in _python_exit
File "uvloop/loop.pyx", line 1290, in uvloop.loop.Loop.call_soon_threadsafe
File "uvloop/loop.pyx", line 673, in uvloop.loop.Loop._append_ready_handle
t.join()
File "/usr/lib/python3.10/threading.py", line 1096, in join
File "uvloop/loop.pyx", line 705, in uvloop.loop.Loop._check_closed
self._wait_for_tstate_lock()
File "/usr/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
RuntimeError: Event loop is closed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
if lock.acquire(block, timeout):
KeyboardInterrupt:
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "uvloop/loop.pyx", line 3256, in uvloop.loop.Loop._do_shutdown
File "uvloop/loop.pyx", line 1290, in uvloop.loop.Loop.call_soon_threadsafe
File "uvloop/loop.pyx", line 673, in uvloop.loop.Loop._append_ready_handle
File "uvloop/loop.pyx", line 705, in uvloop.loop.Loop._check_closed
RuntimeError: Event loop is closed
INFO 11-10 17:09:01 multiproc_worker_utils.py:120] Killing local vLLM worker processes
@freckletonj did you follow https://docs.vllm.ai/en/latest/getting_started/debugging.html to figure out more information?
@youkaichao I did, thanks!
The hang happens with pipeline-parallel and 2 GPUs. If I do not use pipeline-parallel, it works just fine. I have also tried --enforce-eager and --disable-custom-all-reduce.
Testing 1 GPU:
$ NCCL_DEBUG=TRACE torchrun --nproc-per-node=1 t01_vllm_sandbox.py
omega:52926:52926 [0] NCCL INFO Bootstrap : Using wlp38s0:192.168.1.19<0>
omega:52926:52926 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
omega:52926:52926 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.20.5+cuda12.4
omega:52926:52973 [0] NCCL INFO NET/IB : No device found.
omega:52926:52973 [0] NCCL INFO NET/Socket : Using [0]wlp38s0:192.168.1.19<0> [1]virbr0:192.168.122.1<0>
omega:52926:52973 [0] NCCL INFO Using non-device net plugin version 0
omega:52926:52973 [0] NCCL INFO Using network Socket
omega:52926:52973 [0] NCCL INFO comm 0x634a5715f3e0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 41000 commId 0x7fc24b41a52890dc - Init START
omega:52926:52973 [0] NCCL INFO comm 0x634a5715f3e0 rank 0 nRanks 1 nNodes 1 localRanks 1 localRank 0 MNNVL 0
omega:52926:52973 [0] NCCL INFO Channel 00/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 01/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 02/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 03/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 04/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 05/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 06/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 07/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 08/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 09/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 10/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 11/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 12/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 13/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 14/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 15/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 16/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 17/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 18/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 19/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 20/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 21/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 22/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 23/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 24/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 25/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 26/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 27/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 28/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 29/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 30/32 : 0
omega:52926:52973 [0] NCCL INFO Channel 31/32 : 0
omega:52926:52973 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
omega:52926:52973 [0] NCCL INFO P2P Chunksize set to 131072
omega:52926:52973 [0] NCCL INFO Connected all rings
omega:52926:52973 [0] NCCL INFO Connected all trees
omega:52926:52973 [0] NCCL INFO 32 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
omega:52926:52973 [0] NCCL INFO comm 0x634a5715f3e0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 41000 commId 0x7fc24b41a52890dc - Init COMPLETE
PyTorch NCCL is successful!
PyTorch GLOO is successful!
[rank0]: Traceback (most recent call last):
[rank0]: File "project/t01_vllm_sandbox.py", line 129, in <module>
[rank0]: pynccl.all_reduce(data, stream=s)
[rank0]: File "project/.env/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 113, in all_reduce
[rank0]: assert tensor.device == self.device, (
[rank0]: AttributeError: 'PyNcclCommunicator' object has no attribute 'device'
E1110 17:24:38.480000 126613480374272 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 52926) of binary: project/.env/bin/python3
Traceback (most recent call last):
File "project/.env/bin/torchrun", line 8, in <module>
sys.exit(main())
File "project/.env/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "project/.env/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "project/.env/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "project/.env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "project/.env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
t01_vllm_sandbox.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-11-10_17:24:38
host : omega
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 52926)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Testing 2 GPUs on 1 node hangs after this:
$ NCCL_DEBUG=TRACE torchrun --nproc-per-node=2 t01_vllm_sandbox.py
W1110 17:26:36.056000 124020523139072 torch/distributed/run.py:779]
W1110 17:26:36.056000 124020523139072 torch/distributed/run.py:779] *****************************************
W1110 17:26:36.056000 124020523139072 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1110 17:26:36.056000 124020523139072 torch/distributed/run.py:779] *****************************************
omega:53277:53277 [0] NCCL INFO Bootstrap : Using wlp38s0:192.168.1.19<0>
omega:53277:53277 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
omega:53277:53277 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.20.5+cuda12.4
omega:53278:53278 [1] NCCL INFO cudaDriverVersion 12060
omega:53278:53278 [1] NCCL INFO Bootstrap : Using wlp38s0:192.168.1.19<0>
omega:53278:53278 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
omega:53277:53296 [0] NCCL INFO NET/IB : No device found.
omega:53277:53296 [0] NCCL INFO NET/Socket : Using [0]wlp38s0:192.168.1.19<0> [1]virbr0:192.168.122.1<0>
omega:53277:53296 [0] NCCL INFO Using non-device net plugin version 0
omega:53277:53296 [0] NCCL INFO Using network Socket
omega:53278:53297 [1] NCCL INFO NET/IB : No device found.
omega:53278:53297 [1] NCCL INFO NET/Socket : Using [0]wlp38s0:192.168.1.19<0> [1]virbr0:192.168.122.1<0>
omega:53278:53297 [1] NCCL INFO Using non-device net plugin version 0
omega:53278:53297 [1] NCCL INFO Using network Socket
omega:53278:53297 [1] NCCL INFO comm 0x64a2ffdb59c0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 61000 commId 0x80e29f72406c67e2 - Init START
omega:53277:53296 [0] NCCL INFO comm 0x557964d7cb80 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 41000 commId 0x80e29f72406c67e2 - Init START
omega:53278:53297 [1] NCCL INFO comm 0x64a2ffdb59c0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
omega:53277:53296 [0] NCCL INFO comm 0x557964d7cb80 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
omega:53277:53296 [0] NCCL INFO Channel 00/04 : 0 1
omega:53278:53297 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
omega:53277:53296 [0] NCCL INFO Channel 01/04 : 0 1
omega:53277:53296 [0] NCCL INFO Channel 02/04 : 0 1
omega:53278:53297 [1] NCCL INFO P2P Chunksize set to 131072
omega:53277:53296 [0] NCCL INFO Channel 03/04 : 0 1
omega:53277:53296 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
omega:53277:53296 [0] NCCL INFO P2P Chunksize set to 131072
omega:53278:53297 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
omega:53277:53296 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
omega:53278:53297 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
omega:53277:53296 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
omega:53278:53297 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
omega:53277:53296 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
omega:53278:53297 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
omega:53277:53296 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
omega:53278:53297 [1] NCCL INFO Connected all rings
omega:53278:53297 [1] NCCL INFO Connected all trees
omega:53278:53297 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
omega:53278:53297 [1] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
omega:53277:53296 [0] NCCL INFO Connected all rings
omega:53277:53296 [0] NCCL INFO Connected all trees
omega:53277:53296 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
omega:53277:53296 [0] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
omega:53277:53296 [0] NCCL INFO comm 0x557964d7cb80 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 41000 commId 0x80e29f72406c67e2 - Init COMPLETE
omega:53278:53297 [1] NCCL INFO comm 0x64a2ffdb59c0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 61000 commId 0x80e29f72406c67e2 - Init COMPLETE
System Details:
$ nvidia-smi
Sun Nov 10 17:29:31 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
$ python
>>> vllm.__version__
'0.6.3.post1'
>>> torch.__version__
'2.4.0+cu121'
I will say too, I did get --pipeline-parallel-size=2 to run successfully one time, and my tok/s nearly doubled, so I think it really was working that one time. Nothing changed between runs, so I'm lost!
NCCL_DEBUG=TRACE torchrun --nproc-per-node=1 t01_vllm_sandbox.py
Running the script with --nproc-per-node=1 is not meaningful.
NCCL_DEBUG=TRACE torchrun --nproc-per-node=2 t01_vllm_sandbox.py
It hangs when you run the 2-GPU test, which means your GPU communication may be broken. You need to contact your admin; it is possibly a hardware/driver issue.
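A quick local check worth running at this point (a sketch assuming a single node; it exercises the CUDA peer access that the P2P/CUMEM channels in the log above rely on):
# Print whether CUDA reports peer-to-peer access between every pair of GPUs.
import torch

n = torch.cuda.device_count()
for a in range(n):
    for b in range(n):
        if a != b:
            ok = torch.cuda.can_device_access_peer(a, b)
            print(f"GPU {a} -> GPU {b} peer access: {ok}")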
@youkaichao wow I appreciate the prompt responses!
I was following the debugging documentation, and just plugged in my # gpus, but I am indeed on 1 node:
NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py
I'm on 1 node with 2 GPUs, and it's my local machine, so, I'm the admin :sweat_smile:
Is this not the right way to run for 1 node + 2 gpus?
vllm serve model.gguf --pipeline-parallel-size=2
Sorry, I meant "it hangs when you run the 2-GPU test".
vllm serve model.gguf --pipeline-parallel-size=2
The command is correct, but your GPUs cannot talk to each other.
I'll continue to debug thank you! :pray:
Just keeping a bug log in case it helps someone in the future.
tl;dr: it still doesn't work reliably.
I upgraded NCCL, restarted the machine, and ran with pipeline-parallel-size=2, and it worked on the first try.
BUT, then I terminated and tried again, and it hung at the usual spot, pinning 3 CPU threads at 100% indefinitely.
I noticed it was not picking up the new version of nccl (2.23.4), just the original old version (2.20.5).
I don't know if vllm bakes in those dependencies at install time, so I tried:
pip uninstall vllm torch
pip install --no-cache-dir vllm torch
The dependency tree ended up in a weird state: pip tried installing decreasing versions of vllm and none were compatible, so I blew out the whole env and installed everything from scratch with --no-cache-dir.
Again, it can't find compatible versions. In a fresh env:
$ rm -r .env
$ python3 -m venv .env
$ source .env/bin/activate
$ pip install --no-cache-dir torch matplotlib numpy vllm transformers
<downloads a ton of versions of vllm, none work>
INFO: pip is looking at multiple versions of vllm to determine which version is compatible with other requirements. This could take a while.
Then I blew out the env again and tried just vllm on its own:
pip install --no-cache-dir vllm
That works, and then I can install successive packages successfully too.
But when I do vllm serve ... it still picks up the old nccl version. Torch picks up the wrong one too:
>>> torch.cuda.nccl.version()
(2, 20, 5)
If I do dpkg -L libnccl2 I see it's the new version, but torch still gets an old version, so I guess it ships with its own copy?
I notice pip install vllm forces an install of the older torch==2.4.
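A sketch of one way to see which libnccl the running process has actually mapped (Linux only; an assumed diagnostic, not something from this thread):
# Compare the NCCL version torch was built against with the .so files the
# process has actually loaded into memory.
import torch

print("torch bundled NCCL:", torch.cuda.nccl.version())
torch.zeros(1, device="cuda")  # force the CUDA libraries to load
with open("/proc/self/maps") as f:
    paths = sorted({line.split()[-1] for line in f if "libnccl" in line})
print("mapped libnccl files:", paths or "none yet (NCCL may load lazily, only when a communicator is created)")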
The vllm main branch's pyproject.toml says it wants torch==2.5, so then I tried git clone https://github.com/vllm-project/vllm and pip install -e vllm (takes ~10-15 min).
Now there is an nccl version upgrade (finally!):
>>> import torch
>>> torch.cuda.nccl.version()
(2, 21, 5)
When I run vllm serve ..., it picks up the new nccl version.
The first time running this, it successfully ran, using both cards.
The next 3 times trying this, it hung at the same step.
There are some nccl env vars you can try:
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
Notably, export NCCL_DEBUG=TRACE, and tuning NCCL_P2P_LEVEL. Setting export NCCL_P2P_DISABLE=1 might work, at the cost of communication efficiency.
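Since NCCL reads these variables from the environment when the communicator is initialized, they can also be set at the top of a Python script before the engine is created; a minimal sketch (for vllm serve, simply export them in the shell first):
import os

# Must be set before any NCCL communicator is created in this process.
os.environ["NCCL_DEBUG"] = "TRACE"
os.environ["NCCL_P2P_DISABLE"] = "1"  # disable the P2P transport, at some bandwidth cost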
Thank you!
With NCCL_DEBUG=TRACE, final logs before the hang:
(VllmWorkerProcess pid=18565) INFO 11-11 14:17:08 selector.py:135] Using Flash Attention backend.
(VllmWorkerProcess pid=18565) INFO 11-11 14:17:08 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 11-11 14:17:08 utils.py:960] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=18565) INFO 11-11 14:17:08 utils.py:960] Found nccl from library libnccl.so.2
INFO 11-11 14:17:08 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=18565) INFO 11-11 14:17:08 pynccl.py:69] vLLM is using nccl==2.21.5
omega:18302:18445 [0] NCCL INFO Bootstrap : Using wlp38s0:192.168.1.19<0>
omega:18302:18445 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
omega:18302:18445 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
omega:18302:18445 [0] NCCL INFO NET/Plugin: Using internal network plugin.
omega:18302:18445 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.21.5+cuda12.4
omega:18565:18565 [1] NCCL INFO cudaDriverVersion 12060
omega:18565:18565 [1] NCCL INFO Bootstrap : Using wlp38s0:192.168.1.19<0>
omega:18565:18565 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
omega:18565:18565 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
omega:18565:18565 [1] NCCL INFO NET/Plugin: Using internal network plugin.
omega:18302:18445 [0] NCCL INFO NET/IB : No device found.
omega:18302:18445 [0] NCCL INFO NET/Socket : Using [0]wlp38s0:192.168.1.19<0> [1]virbr0:192.168.122.1<0>
omega:18302:18445 [0] NCCL INFO Using non-device net plugin version 0
omega:18302:18445 [0] NCCL INFO Using network Socket
omega:18565:18565 [1] NCCL INFO NET/IB : No device found.
omega:18565:18565 [1] NCCL INFO NET/Socket : Using [0]wlp38s0:192.168.1.19<0> [1]virbr0:192.168.122.1<0>
omega:18565:18565 [1] NCCL INFO Using non-device net plugin version 0
omega:18565:18565 [1] NCCL INFO Using network Socket
omega:18302:18445 [0] NCCL INFO ncclCommInitRank comm 0x7a36ae54e760 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 41000 commId 0x45fa55651721f85 - Init START
omega:18565:18565 [1] NCCL INFO ncclCommInitRank comm 0x5d09b7b06a90 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 61000 commId 0x45fa55651721f85 - Init START
omega:18565:18565 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
omega:18302:18445 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
omega:18565:18565 [1] NCCL INFO comm 0x5d09b7b06a90 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
omega:18302:18445 [0] NCCL INFO comm 0x7a36ae54e760 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
omega:18302:18445 [0] NCCL INFO Channel 00/04 : 0 1
omega:18565:18565 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
omega:18302:18445 [0] NCCL INFO Channel 01/04 : 0 1
omega:18565:18565 [1] NCCL INFO P2P Chunksize set to 131072
omega:18302:18445 [0] NCCL INFO Channel 02/04 : 0 1
omega:18302:18445 [0] NCCL INFO Channel 03/04 : 0 1
omega:18302:18445 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
omega:18302:18445 [0] NCCL INFO P2P Chunksize set to 131072
omega:18565:18565 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC
omega:18302:18445 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC
omega:18565:18565 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC
omega:18302:18445 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC
omega:18565:18565 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC
omega:18302:18445 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC
omega:18565:18565 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC
omega:18302:18445 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC
omega:18565:18565 [1] NCCL INFO Connected all rings
omega:18302:18445 [0] NCCL INFO Connected all rings
omega:18565:18565 [1] NCCL INFO Connected all trees
omega:18302:18445 [0] NCCL INFO Connected all trees
omega:18565:18565 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
omega:18565:18565 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
omega:18302:18445 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
omega:18302:18445 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
omega:18565:18565 [1] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
omega:18302:18445 [0] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
omega:18565:18565 [1] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
omega:18302:18445 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
omega:18565:18565 [1] NCCL INFO ncclCommInitRank comm 0x5d09b7b06a90 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 61000 commId 0x45fa55651721f85 - Init COMPLETE
omega:18302:18445 [0] NCCL INFO ncclCommInitRank comm 0x7a36ae54e760 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 41000 commId 0x45fa55651721f85 - Init COMPLETE
With NCCL_P2P_DISABLE=1 it's worked a couple of times in a row, thanks!
So the final solution is NCCL_P2P_DISABLE=1?
I didn't try tuning NCCL_P2P_LEVEL, and I think, since I'm doing pipeline parallel, I don't actually need GPU communication (?), so if I did need it I'd still expect problems. Should I revisit tuning that? There's still an underlying bug worth solving, but hopefully my debug logs help!
With your recommended NCCL_P2P_DISABLE=1, it has launched successfully over multiple restarts, and my tok/s is ~2x, so for my current use case I'm back in business. Thank you!
Pipeline parallel also needs GPU communication: it sends a tensor from one GPU to another.
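For anyone following along, a minimal standalone sketch of that point-to-point pattern, analogous to the all-reduce script earlier in the thread (an illustration, not the actual vLLM pipeline-parallel code); run it with torchrun --nproc-per-node=2:
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
t = torch.full((8,), float(rank), device=f"cuda:{rank}")
if rank == 0:
    dist.send(t, dst=1)      # rank 0 sends its tensor of zeros
elif rank == 1:
    dist.recv(t, src=0)      # rank 1 overwrites its ones with rank 0's zeros
    torch.cuda.synchronize()
    assert t.sum().item() == 0.0
    print("point-to-point send/recv is successful!")
dist.destroy_process_group()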
I'm going to close this issue by https://github.com/vllm-project/vllm/pull/10236 .
Your current environment
🐛 Describe the bug
This is an occasional bug that causes the NCCL timeout problem shown below.
CUDA graph and custom all-reduce are enabled.
Related issues: https://github.com/vllm-project/vllm/issues/1726 https://github.com/vllm-project/vllm/issues/5360 https://github.com/vllm-project/vllm/issues/4653 https://github.com/vllm-project/vllm/issues/4430