vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: NCCL hangs and causes timeout #5484

Closed wjj19950828 closed 1 hour ago

wjj19950828 commented 5 months ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.24.4
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.112-005.ali5000.alios7.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.3.107
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 515.105.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          128
On-line CPU(s) list:             0-127
Vendor ID:                       GenuineIntel
BIOS Vendor ID:                  Intel(R) Corporation
Model name:                      Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz
BIOS Model name:                 Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz
CPU family:                      6
Model:                           106
Thread(s) per core:              2
Core(s) per socket:              32
Socket(s):                       2
Stepping:                        6
CPU max MHz:                     3500.0000
CPU min MHz:                     800.0000
BogoMIPS:                        5800.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       3 MiB (64 instances)
L1i cache:                       2 MiB (64 instances)
L2 cache:                        80 MiB (64 instances)
L3 cache:                        96 MiB (2 instances)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable: eIBRS with unprivileged eBPF
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] mypy==1.9.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.2
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] onnx==1.16.0
[pip3] onnx-graphsurgeon==0.3.27
[pip3] onnxruntime==1.16.3
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[pip3] tritonclient==2.44.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    CPU Affinity    NUMA Affinity
GPU0     X  NV12    NV12    NV12    NV12    NV12    NV12    NV12    PXB SYS SYS SYS 0-127       N/A
GPU1    NV12     X  NV12    NV12    NV12    NV12    NV12    NV12    PXB SYS SYS SYS 0-127       N/A
GPU2    NV12    NV12     X  NV12    NV12    NV12    NV12    NV12    SYS PXB SYS SYS 0-127       N/A
GPU3    NV12    NV12    NV12     X  NV12    NV12    NV12    NV12    SYS PXB SYS SYS 0-127       N/A
GPU4    NV12    NV12    NV12    NV12     X  NV12    NV12    NV12    SYS SYS PXB SYS 0-127       N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X  NV12    NV12    SYS SYS PXB SYS 0-127       N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X  NV12    SYS SYS SYS PXB 0-127       N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X  SYS SYS SYS PXB 0-127       N/A
NIC0    PXB PXB SYS SYS SYS SYS SYS SYS  X  SYS SYS SYS
NIC1    SYS SYS PXB PXB SYS SYS SYS SYS SYS  X  SYS SYS
NIC2    SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS  X  SYS
NIC3    SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS  X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3

🐛 Describe the bug

from vllm import LLM
import psutil
import random

llm = LLM(MODEL_PATH, trust_remote_code=True, tensor_parallel_size=4)  # MODEL_PATH: placeholder for the local model path

prompt_token_ids_list = [[random.randint(1, 40000) for _ in range(random.randint(1, 1000))] for _ in range(1000)]

for i in range(0, 10000):
    batch_size = random.randint(1, 64)
    print(f">>> Iteration: {i}, Batch Size: {batch_size}")
    output = llm.generate(prompt_token_ids=prompt_token_ids_list[:batch_size], use_tqdm=False)
    if i % 1000 == 0:
        cpu_percent = psutil.cpu_percent()
        memory_percent = psutil.virtual_memory().percent
        print(f"CPU utilization: {cpu_percent}%")
        print(f"Memory utilization: {memory_percent}%")
        print("=========================================")

print("Done!")

This is an intermittent bug, which causes the NCCL timeout shown in the image below:

CUDA graphs and custom all-reduce are both enabled.

Related issues: https://github.com/vllm-project/vllm/issues/1726, https://github.com/vllm-project/vllm/issues/5360, https://github.com/vllm-project/vllm/issues/4653, https://github.com/vllm-project/vllm/issues/4430

wjj19950828 commented 5 months ago

Aside from the two switches --disable-custom-all-reduce and --enforce-eager, is there any other way to solve this problem? Thanks~
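
For reference, these are the switches I mean, mapped onto the offline LLM API (a sketch; I'm assuming the constructor keywords enforce_eager and disable_custom_all_reduce correspond to the CLI flags, and MODEL_PATH is a placeholder):

# sketch: same setup as the repro above, but with CUDA graphs and the custom
# all-reduce kernel turned off (plain NCCL all-reduce is used instead)
from vllm import LLM

llm = LLM(
    MODEL_PATH,                      # placeholder for the local model path
    trust_remote_code=True,
    tensor_parallel_size=4,
    enforce_eager=True,              # skip CUDA graph capture
    disable_custom_all_reduce=True,  # fall back to NCCL for all-reduce
)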

youkaichao commented 5 months ago

FYI, I recently added a page of debugging tips for this problem: https://docs.vllm.ai/en/latest/getting_started/debugging.html

wjj19950828 commented 5 months ago

FYI, I recently added a page of debugging tips for this problem: https://docs.vllm.ai/en/latest/getting_started/debugging.html

Thanks for your reply~

I tried the test script you provided (below) and haven't run into the problem with it so far. The hang only occurs intermittently. Do you have any suggestions?

# save it as `test.py` , and run it with `NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py`
# adjust `--nproc-per-node` to the number of GPUs you want to use.
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl")
data = torch.FloatTensor([1,] * 128).to(f"cuda:{dist.get_rank()}")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
assert value == dist.get_world_size()
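
Since the hang is intermittent, a looped variant of the same check might have a better chance of reproducing it (my own sketch, not from the docs; run with the same torchrun command):

# sketch: repeat the collective many times to try to catch the intermittent hang
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
for i in range(10000):
    data = torch.ones(1 << 20, device=f"cuda:{rank}")
    dist.all_reduce(data, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()
    assert data[0].item() == dist.get_world_size()
    if rank == 0 and i % 1000 == 0:
        print(f"iteration {i} OK")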

wjj19950828 commented 5 months ago

@youkaichao Could you help me look into this issue? Thanks~

youkaichao commented 4 months ago

When it hangs, did you follow the documentation to find out where (which line) the code is executing?
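
If it is hard to attach a debugger when it hangs, one low-effort option (plain standard-library faulthandler, nothing vLLM-specific) is to make every process dump its Python stacks periodically, so the hanging line shows up in stderr; attaching something like py-spy dump to the stuck PIDs is another option.

# sketch: add near the top of the driver script (and worker entrypoints if possible)
# to print all thread stacks every 10 minutes; a hang then reveals the executing line
import faulthandler
import sys

faulthandler.dump_traceback_later(timeout=600, repeat=True, file=sys.stderr)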

oliver-li commented 4 months ago

@wjj19950828 I'm facing the same issue. Has this problem been solved?

oliver-li commented 4 months ago

(RayWorkerWrapper pid=4009) [rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=158046, OpType=GATHER, NumelIn=2867200, NumelOut=0, Timeout(ms)=600000) ran for 600027 milliseconds before timing out.
(RayWorkerWrapper pid=4009) [rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 1] Timeout at NCCL work: 158046, last enqueued NCCL work: 158046, last completed NCCL work: 158045.
(RayWorkerWrapper pid=4009) [rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(RayWorkerWrapper pid=4009) [rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
(RayWorkerWrapper pid=4009) [rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=158046, OpType=GATHER, NumelIn=2867200, NumelOut=0, Timeout(ms)=600000) ran for 600027 milliseconds before timing out.
(RayWorkerWrapper pid=4009) Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
(RayWorkerWrapper pid=4009) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa2ea16b897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=4009) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9a81b4c5a2 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4009) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9a81b513c0 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4009) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9a81b5270c in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4009) frame #4: + 0x1c220 (0x7fa2ea2b7220 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch.so)
(RayWorkerWrapper pid=4009) frame #5: + 0x7ea5 (0x7fa2fe195ea5 in /usr/lib64/libpthread.so.0)
(RayWorkerWrapper pid=4009) frame #6: clone + 0x6d (0x7fa2fd7b5b0d in /usr/lib64/libc.so.6)
(RayWorkerWrapper pid=4009)
(RayWorkerWrapper pid=4009) [2024-07-05 14:00:48,171 E 4009 4134] logging.cc:101: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=158046, OpType=GATHER, NumelIn=2867200, NumelOut=0, Timeout(ms)=600000) ran for 600027 milliseconds before timing out.
(RayWorkerWrapper pid=4009) Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
(RayWorkerWrapper pid=4009) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa2ea16b897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=4009) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9a81b4c5a2 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4009) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9a81b513c0 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4009) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9a81b5270c in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4009) frame #4: + 0x1c220 (0x7fa2ea2b7220 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch.so)
(RayWorkerWrapper pid=4009) frame #5: + 0x7ea5 (0x7fa2fe195ea5 in /usr/lib64/libpthread.so.0)
(RayWorkerWrapper pid=4009) frame #6: clone + 0x6d (0x7fa2fd7b5b0d in /usr/lib64/libc.so.6)
(RayWorkerWrapper pid=4009)
(RayWorkerWrapper pid=4009) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
(RayWorkerWrapper pid=4009) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa2ea16b897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=4009) frame #1: + 0xe083a9 (0x7f9a817d73a9 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=4009) frame #2: + 0x1c220 (0x7fa2ea2b7220 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch.so)
(RayWorkerWrapper pid=4009) frame #3: + 0x7ea5 (0x7fa2fe195ea5 in /usr/lib64/libpthread.so.0)
(RayWorkerWrapper pid=4009) frame #4: clone + 0x6d (0x7fa2fd7b5b0d in /usr/lib64/libc.so.6)
(RayWorkerWrapper pid=4009)
(RayWorkerWrapper pid=4009) [2024-07-05 14:00:48,182 E 4009 4134] logging.cc:108: Stack trace:
(RayWorkerWrapper pid=4009) /usr/local/lib/python3.9/site-packages/ray/_raylet.so(+0x1021b3a) [0x7fa2fa068b3a] ray::operator<<()
(RayWorkerWrapper pid=4009) /usr/local/lib/python3.9/site-packages/ray/_raylet.so(+0x10245f8) [0x7fa2fa06b5f8] ray::TerminateHandler()
(RayWorkerWrapper pid=4009) /usr/lib64/libstdc++.so.6(+0x5ea06) [0x7fa2fd000a06]
(RayWorkerWrapper pid=4009) /usr/lib64/libstdc++.so.6(+0x5ea33) [0x7fa2fd000a33]
(RayWorkerWrapper pid=4009) /usr/lib64/libstdc++.so.6(+0x5e9f4) [0x7fa2fd0009f4]
(RayWorkerWrapper pid=4009) /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so(+0xe0845a) [0x7f9a817d745a] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(RayWorkerWrapper pid=4009) /usr/local/lib/python3.9/site-packages/torch/lib/libtorch.so(+0x1c220) [0x7fa2ea2b7220] execute_native_thread_routine
(RayWorkerWrapper pid=4009) /usr/lib64/libpthread.so.0(+0x7ea5) [0x7fa2fe195ea5] startthread
(RayWorkerWrapper pid=4009) /usr/lib64/libc.so.6(clone+0x6d) [0x7fa2fd7b5b0d] clone
(RayWorkerWrapper pid=4009)
(RayWorkerWrapper pid=4009) SIGABRT received at time=1720159248 on cpu 5
(RayWorkerWrapper pid=4009) PC: @ 0x7fa2fd6ed387 (unknown) raise
(RayWorkerWrapper pid=4009) @ 0x7fa2fe19d630 3504 (unknown)
(RayWorkerWrapper pid=4009) @ 0x7fa2fd000a06 (unknown) (unknown)
(RayWorkerWrapper pid=4009) @ 0x7f9968118048 1728156448 (unknown)
(RayWorkerWrapper pid=4009) @ 0x7f9abe1a6580 (unknown) (unknown)
(RayWorkerWrapper pid=4009) @ 0xcde907894810c083 (unknown) (unknown)
(RayWorkerWrapper pid=4009) [2024-07-05 14:00:48,200 E 4009 4134] logging.cc:365: SIGABRT received at time=1720159248 on cpu 5
(RayWorkerWrapper pid=4009) [2024-07-05 14:00:48,200 E 4009 4134] logging.cc:365: PC: @ 0x7fa2fd6ed387 (unknown) raise
(RayWorkerWrapper pid=4009) [2024-07-05 14:00:48,201 E 4009 4134] logging.cc:365: @ 0x7fa2fe19d630 3504 (unknown)
(RayWorkerWrapper pid=4009) [2024-07-05 14:00:48,201 E 4009 4134] logging.cc:365: @ 0x7fa2fd000a06 (unknown) (unknown)
(RayWorkerWrapper pid=4009) [2024-07-05 14:00:48,202 E 4009 4134] logging.cc:365: @ 0x7f9968118048 1728156448 (unknown)
(RayWorkerWrapper pid=4009) [2024-07-05 14:00:48,214 E 4009 4134] logging.cc:365: @ 0x7f9abe1a6580 (unknown) (unknown)
(RayWorkerWrapper pid=4009) [2024-07-05 14:00:48,215 E 4009 4134] logging.cc:365: @ 0xcde907894810c083 (unknown) (unknown)
(RayWorkerWrapper pid=4009) Fatal Python error: Aborted
(RayWorkerWrapper pid=4009) [rank0]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList.size()=5
[rank0]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank0]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors. If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLEMONITORING=0). If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList.size() = 5
SIGABRT received at time=1720159428 on cpu 134
PC: @ 0x7fe6c1e3c387 (unknown) raise
@ 0x7fe6c28ec630 1656596208 (unknown)
@ ... and at least 1 more frames
[2024-07-05 14:03:48,574 E 2878 4115] logging.cc:365: SIGABRT received at time=1720159428 on cpu 134
[2024-07-05 14:03:48,574 E 2878 4115] logging.cc:365: PC: @ 0x7fe6c1e3c387 (unknown) raise
[2024-07-05 14:03:48,574 E 2878 4115] logging.cc:365: @ 0x7fe6c28ec630 1656596208 (unknown)
[2024-07-05 14:03:48,574 E 2878 4115] logging.cc:365: @ ... and at least 1 more frames
Fatal Python error: Aborted

ritesh-intel commented 3 months ago

I'm also facing the same issue. Do we have a solution for this?

A few other issues I read suggested installing nvidia-fabricmanager.service, but even after installing it the fabric manager service would not start: https://forums.developer.nvidia.com/t/nvidia-fabricmanager-running-error-with-nv-warn-nothing-to-do/272899. Following that forum thread, I found that systems with NVSwitch need Fabric Manager for proper communication between NVIDIA GPUs.

I have 8 NVIDIA GPUs on my system.

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

freckletonj commented 1 day ago

vllm serve hangs for me:

$ vllm serve Mistral-Nemo-Instruct-2407-Q6_K_L.gguf --max-model-len=200000 --pipeline-parallel-size=2

...
INFO 11-10 17:05:20 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=49742) INFO 11-10 17:05:23 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 11-10 17:05:23 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=49742) INFO 11-10 17:05:23 utils.py:1008] Found nccl from library libnccl.so.2
INFO 11-10 17:05:23 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=49742) INFO 11-10 17:05:23 pynccl.py:63] vLLM is using nccl==2.20.5

<HANGS FOR 20+ min>

<------------------------------------------------------------------------------>
^C <NOTHING ON 1st Ctrl-C>

<------------------------------------------------------------------------------>
^C  <On 2nd Ctrl-C, still hangs but dumps this>
[rank0]: Traceback (most recent call last):
[rank0]:   File "project/.env/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
[rank0]:     return loop.run_until_complete(wrapper())
[rank0]:   File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
[rank0]:   File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
[rank0]:   File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
[rank0]:   File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
[rank0]:   File "uvloop/handles/poll.pyx", line 216, in uvloop.loop.__on_uvpoll_event
[rank0]:   File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
[rank0]:   File "uvloop/cbhandles.pyx", line 66, in uvloop.loop.Handle._run
[rank0]:   File "uvloop/loop.pyx", line 399, in uvloop.loop.Loop._read_from_self
[rank0]:   File "uvloop/loop.pyx", line 404, in uvloop.loop.Loop._invoke_signals
[rank0]:   File "uvloop/loop.pyx", line 379, in uvloop.loop.Loop._ceval_process_signals
[rank0]: KeyboardInterrupt

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]:   File "project/.env/bin/vllm", line 8, in <module>
[rank0]:     sys.exit(main())
[rank0]:   File "project/.env/lib/python3.10/site-packages/vllm/scripts.py", line 195, in main
[rank0]:     args.dispatch_function(args)
[rank0]:   File "project/.env/lib/python3.10/site-packages/vllm/scripts.py", line 41, in serve
[rank0]:     uvloop.run(run_server(args))
[rank0]:   File "project/.env/lib/python3.10/site-packages/uvloop/__init__.py", line 88, in run
[rank0]:     loop.run_until_complete(
[rank0]:   File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
[rank0]:   File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
[rank0]:   File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
[rank0]:   File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
[rank0]:   File "uvloop/handles/poll.pyx", line 216, in uvloop.loop.__on_uvpoll_event
[rank0]:   File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
[rank0]:   File "uvloop/cbhandles.pyx", line 66, in uvloop.loop.Handle._run
[rank0]:   File "uvloop/loop.pyx", line 399, in uvloop.loop.Loop._read_from_self
[rank0]:   File "uvloop/loop.pyx", line 404, in uvloop.loop.Loop._invoke_signals
[rank0]:   File "uvloop/loop.pyx", line 379, in uvloop.loop.Loop._ceval_process_signals
[rank0]: KeyboardInterrupt

<STILL HANGING>

<------------------------------------------------------------------------------>
^C <On 3rd Ctrl-C, finally terminates with this>

Exception ignored in: <module 'threading' from '/usr/lib/python3.10/threading.py'>
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1537, in _shutdown
Exception in thread Thread-3 (_do_shutdown):
Traceback (most recent call last):
  File "uvloop/loop.pyx", line 3254, in uvloop.loop.Loop._do_shutdown
    atexit_call()
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 31, in _python_exit
  File "uvloop/loop.pyx", line 1290, in uvloop.loop.Loop.call_soon_threadsafe
  File "uvloop/loop.pyx", line 673, in uvloop.loop.Loop._append_ready_handle
    t.join()
  File "/usr/lib/python3.10/threading.py", line 1096, in join
  File "uvloop/loop.pyx", line 705, in uvloop.loop.Loop._check_closed
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
RuntimeError: Event loop is closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    if lock.acquire(block, timeout):
KeyboardInterrupt: 
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "uvloop/loop.pyx", line 3256, in uvloop.loop.Loop._do_shutdown
  File "uvloop/loop.pyx", line 1290, in uvloop.loop.Loop.call_soon_threadsafe
  File "uvloop/loop.pyx", line 673, in uvloop.loop.Loop._append_ready_handle
  File "uvloop/loop.pyx", line 705, in uvloop.loop.Loop._check_closed
RuntimeError: Event loop is closed
INFO 11-10 17:09:01 multiproc_worker_utils.py:120] Killing local vLLM worker processes

youkaichao commented 23 hours ago

@freckletonj did you follow https://docs.vllm.ai/en/latest/getting_started/debugging.html to gather more information?

freckletonj commented 23 hours ago

@youkaichao I did, thanks!

Testing 1 GPU:

$ NCCL_DEBUG=TRACE torchrun --nproc-per-node=1 t01_vllm_sandbox.py 
omega:52926:52926 [0] NCCL INFO Bootstrap : Using wlp38s0:192.168.1.19<0>
omega:52926:52926 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
omega:52926:52926 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.20.5+cuda12.4
omega:52926:52973 [0] NCCL INFO NET/IB : No device found.
omega:52926:52973 [0] NCCL INFO NET/Socket : Using [0]wlp38s0:192.168.1.19<0> [1]virbr0:192.168.122.1<0>
omega:52926:52973 [0] NCCL INFO Using non-device net plugin version 0
omega:52926:52973 [0] NCCL INFO Using network Socket
omega:52926:52973 [0] NCCL INFO comm 0x634a5715f3e0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 41000 commId 0x7fc24b41a52890dc - Init START
omega:52926:52973 [0] NCCL INFO comm 0x634a5715f3e0 rank 0 nRanks 1 nNodes 1 localRanks 1 localRank 0 MNNVL 0
omega:52926:52973 [0] NCCL INFO Channel 00/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 01/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 02/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 03/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 04/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 05/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 06/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 07/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 08/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 09/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 10/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 11/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 12/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 13/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 14/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 15/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 16/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 17/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 18/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 19/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 20/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 21/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 22/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 23/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 24/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 25/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 26/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 27/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 28/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 29/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 30/32 :    0
omega:52926:52973 [0] NCCL INFO Channel 31/32 :    0
omega:52926:52973 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
omega:52926:52973 [0] NCCL INFO P2P Chunksize set to 131072
omega:52926:52973 [0] NCCL INFO Connected all rings
omega:52926:52973 [0] NCCL INFO Connected all trees
omega:52926:52973 [0] NCCL INFO 32 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
omega:52926:52973 [0] NCCL INFO comm 0x634a5715f3e0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 41000 commId 0x7fc24b41a52890dc - Init COMPLETE
PyTorch NCCL is successful!
PyTorch GLOO is successful!
[rank0]: Traceback (most recent call last):
[rank0]:   File "project/t01_vllm_sandbox.py", line 129, in <module>
[rank0]:     pynccl.all_reduce(data, stream=s)
[rank0]:   File "project/.env/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 113, in all_reduce
[rank0]:     assert tensor.device == self.device, (
[rank0]: AttributeError: 'PyNcclCommunicator' object has no attribute 'device'
E1110 17:24:38.480000 126613480374272 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 52926) of binary: project/.env/bin/python3
Traceback (most recent call last):
  File "project/.env/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "project/.env/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "project/.env/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "project/.env/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "project/.env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "project/.env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
t01_vllm_sandbox.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-10_17:24:38
  host      : omega
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 52926)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Testing 2 GPUs on 1 node hangs after this:

$ NCCL_DEBUG=TRACE torchrun --nproc-per-node=2 t01_vllm_sandbox.py 
W1110 17:26:36.056000 124020523139072 torch/distributed/run.py:779] 
W1110 17:26:36.056000 124020523139072 torch/distributed/run.py:779] *****************************************
W1110 17:26:36.056000 124020523139072 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1110 17:26:36.056000 124020523139072 torch/distributed/run.py:779] *****************************************
omega:53277:53277 [0] NCCL INFO Bootstrap : Using wlp38s0:192.168.1.19<0>
omega:53277:53277 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
omega:53277:53277 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.20.5+cuda12.4
omega:53278:53278 [1] NCCL INFO cudaDriverVersion 12060
omega:53278:53278 [1] NCCL INFO Bootstrap : Using wlp38s0:192.168.1.19<0>
omega:53278:53278 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
omega:53277:53296 [0] NCCL INFO NET/IB : No device found.
omega:53277:53296 [0] NCCL INFO NET/Socket : Using [0]wlp38s0:192.168.1.19<0> [1]virbr0:192.168.122.1<0>
omega:53277:53296 [0] NCCL INFO Using non-device net plugin version 0
omega:53277:53296 [0] NCCL INFO Using network Socket
omega:53278:53297 [1] NCCL INFO NET/IB : No device found.
omega:53278:53297 [1] NCCL INFO NET/Socket : Using [0]wlp38s0:192.168.1.19<0> [1]virbr0:192.168.122.1<0>
omega:53278:53297 [1] NCCL INFO Using non-device net plugin version 0
omega:53278:53297 [1] NCCL INFO Using network Socket
omega:53278:53297 [1] NCCL INFO comm 0x64a2ffdb59c0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 61000 commId 0x80e29f72406c67e2 - Init START
omega:53277:53296 [0] NCCL INFO comm 0x557964d7cb80 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 41000 commId 0x80e29f72406c67e2 - Init START
omega:53278:53297 [1] NCCL INFO comm 0x64a2ffdb59c0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
omega:53277:53296 [0] NCCL INFO comm 0x557964d7cb80 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
omega:53277:53296 [0] NCCL INFO Channel 00/04 :    0   1
omega:53278:53297 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
omega:53277:53296 [0] NCCL INFO Channel 01/04 :    0   1
omega:53277:53296 [0] NCCL INFO Channel 02/04 :    0   1
omega:53278:53297 [1] NCCL INFO P2P Chunksize set to 131072
omega:53277:53296 [0] NCCL INFO Channel 03/04 :    0   1
omega:53277:53296 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
omega:53277:53296 [0] NCCL INFO P2P Chunksize set to 131072
omega:53278:53297 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
omega:53277:53296 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
omega:53278:53297 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
omega:53277:53296 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
omega:53278:53297 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
omega:53277:53296 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
omega:53278:53297 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
omega:53277:53296 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
omega:53278:53297 [1] NCCL INFO Connected all rings
omega:53278:53297 [1] NCCL INFO Connected all trees
omega:53278:53297 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
omega:53278:53297 [1] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
omega:53277:53296 [0] NCCL INFO Connected all rings
omega:53277:53296 [0] NCCL INFO Connected all trees
omega:53277:53296 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
omega:53277:53296 [0] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
omega:53277:53296 [0] NCCL INFO comm 0x557964d7cb80 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 41000 commId 0x80e29f72406c67e2 - Init COMPLETE
omega:53278:53297 [1] NCCL INFO comm 0x64a2ffdb59c0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 61000 commId 0x80e29f72406c67e2 - Init COMPLETE

System Details:

$ nvidia-smi
Sun Nov 10 17:29:31 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+

$ python

>>> import vllm, torch
>>> vllm.__version__
'0.6.3.post1'
>>> torch.__version__
'2.4.0+cu121'

freckletonj commented 23 hours ago

I will say too, I got --pipeline-parallel-size=2 to work one time, and my tok/s nearly doubled, so I think it really was working that time. Nothing changed between runs, so I'm lost!

youkaichao commented 23 hours ago

NCCL_DEBUG=TRACE torchrun --nproc-per-node=1 t01_vllm_sandbox.py

running the script with --nproc-per-node=1 is not meaningful.

NCCL_DEBUG=TRACE torchrun --nproc-per-node=2 t01_vllm_sandbox.py

It hangs when you run the 2-GPU test, which means your GPU communication may be broken. You need to contact your admin; it is possibly a hardware/driver issue.
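
One quick check outside NCCL, just to see whether raw CUDA peer-to-peer between the two GPUs works at all (a sketch, run as a plain python script on the same machine):

# sketch: check CUDA peer access and a cross-GPU copy, bypassing NCCL entirely
import torch

print("peer access 0->1:", torch.cuda.can_device_access_peer(0, 1))
print("peer access 1->0:", torch.cuda.can_device_access_peer(1, 0))

x = torch.ones(1 << 20, device="cuda:0")
y = x.to("cuda:1")  # cross-device copy (may route through host memory)
print("copy ok:", bool((y == 1).all().item()))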

freckletonj commented 23 hours ago

@youkaichao wow I appreciate the prompt responses!

I was following the debugging documentation and just plugged in my number of GPUs, but I am indeed on one node:

NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py

I'm on 1 node with 2 GPUs, and it's my local machine, so, I'm the admin :sweat_smile:

Is this not the right way to run for 1 node + 2 gpus?

vllm serve model.gguf --pipeline-parallel-size=2

youkaichao commented 23 hours ago

Sorry, I meant "it hangs when you run the 2-GPU test".

youkaichao commented 23 hours ago

vllm serve model.gguf --pipeline-parallel-size=2

The command is correct, but your GPUs cannot talk to each other.

freckletonj commented 23 hours ago

I'll continue to debug thank you! :pray:

freckletonj commented 3 hours ago

Just keeping a bug log in case it helps someone in the future.

tl;dr: it still doesn't work reliably.

I upgraded NCCL, restarted the machine, and ran with pipeline-parallel-size=2 and it worked on the first try.

BUT, then I terminated it and tried again, and it hung at the usual spot, pinning 3 CPU threads at 100% indefinitely.

I noticed it was not picking up the new version of NCCL (2.23.4), just the original old version (2.20.5).

I don't know whether vLLM bakes in those dependencies at install time, so I tried:

pip uninstall vllm torch
pip install --no-cache-dir vllm torch

The dependency tree is in a weird state: it tried installing progressively older versions of vllm and none were compatible, so I blew away the whole env and installed everything from scratch with --no-cache-dir.

Again, it can't find compatible versions. In a fresh env:

$ rm -r .env
$ python3 -m venv .env
$ source .env/bin/activate
$ pip install --no-cache-dir torch matplotlib numpy vllm transformers

<downloads a ton of versions of vllm, none work>

INFO: pip is looking at multiple versions of vllm to determine which version is compatible with other requirements. This could take a while.

Then I blow out the env again, and try just vllm on its own:

pip install --no-cache-dir vllm

That works. And then I can install successive packages successfully too.

But when I do vllm serve ... it still picks up the old NCCL version. Torch picks up the wrong one too:

>>> torch.cuda.nccl.version()
(2, 20, 5)

If I do dpkg -L libnccl2 I see the new version is installed, but torch still reports an old one, so I guess torch ships its own copy?
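
(If it helps anyone, here is a quick sketch to check whether torch is using its own bundled NCCL wheel, nvidia-nccl-cu12, rather than the system libnccl2:)

# sketch: compare the NCCL version torch reports with the pip wheel it bundles
import torch
from importlib.metadata import PackageNotFoundError, version

print("torch reports NCCL:", torch.cuda.nccl.version())
try:
    print("nvidia-nccl-cu12 wheel:", version("nvidia-nccl-cu12"))
except PackageNotFoundError:
    print("no nvidia-nccl-cu12 wheel installed")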

I notice pip install vllm forces installation of the older torch==2.4.

The vllm main branch's pyproject.toml says it wants torch==2.5, so then I tried git clone https://github.com/vllm-project/vllm and pip install -e vllm (takes ~10-15 min).

Now it has an NCCL version upgrade (finally!):

>>> import torch
>>> torch.cuda.nccl.version()
(2, 21, 5)

With vllm serve ..., it picks up the new NCCL version.

The first time running this, it successfully ran, using both cards.

The next 3 times I tried, it hung at the same step.

youkaichao commented 3 hours ago

There are some NCCL env vars you can try:

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html

Notably, export NCCL_DEBUG=TRACE, and tuning NCCL_P2P_LEVEL. Setting export NCCL_P2P_DISABLE=1 might work, at the cost of communication efficiency.
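
For example, prefixing the same serve command with them (values here are just for illustration):

$ NCCL_DEBUG=TRACE NCCL_P2P_DISABLE=1 vllm serve Mistral-Nemo-Instruct-2407-Q6_K_L.gguf --max-model-len=200000 --pipeline-parallel-size=2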

freckletonj commented 2 hours ago

Thank you!

With NCCL_DEBUG=TRACE, final logs before hang:

(VllmWorkerProcess pid=18565) INFO 11-11 14:17:08 selector.py:135] Using Flash Attention backend.
(VllmWorkerProcess pid=18565) INFO 11-11 14:17:08 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 11-11 14:17:08 utils.py:960] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=18565) INFO 11-11 14:17:08 utils.py:960] Found nccl from library libnccl.so.2
INFO 11-11 14:17:08 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=18565) INFO 11-11 14:17:08 pynccl.py:69] vLLM is using nccl==2.21.5
omega:18302:18445 [0] NCCL INFO Bootstrap : Using wlp38s0:192.168.1.19<0>
omega:18302:18445 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
omega:18302:18445 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
omega:18302:18445 [0] NCCL INFO NET/Plugin: Using internal network plugin.
omega:18302:18445 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.21.5+cuda12.4
omega:18565:18565 [1] NCCL INFO cudaDriverVersion 12060
omega:18565:18565 [1] NCCL INFO Bootstrap : Using wlp38s0:192.168.1.19<0>
omega:18565:18565 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
omega:18565:18565 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
omega:18565:18565 [1] NCCL INFO NET/Plugin: Using internal network plugin.
omega:18302:18445 [0] NCCL INFO NET/IB : No device found.
omega:18302:18445 [0] NCCL INFO NET/Socket : Using [0]wlp38s0:192.168.1.19<0> [1]virbr0:192.168.122.1<0>
omega:18302:18445 [0] NCCL INFO Using non-device net plugin version 0
omega:18302:18445 [0] NCCL INFO Using network Socket
omega:18565:18565 [1] NCCL INFO NET/IB : No device found.
omega:18565:18565 [1] NCCL INFO NET/Socket : Using [0]wlp38s0:192.168.1.19<0> [1]virbr0:192.168.122.1<0>
omega:18565:18565 [1] NCCL INFO Using non-device net plugin version 0
omega:18565:18565 [1] NCCL INFO Using network Socket
omega:18302:18445 [0] NCCL INFO ncclCommInitRank comm 0x7a36ae54e760 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 41000 commId 0x45fa55651721f85 - Init START
omega:18565:18565 [1] NCCL INFO ncclCommInitRank comm 0x5d09b7b06a90 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 61000 commId 0x45fa55651721f85 - Init START
omega:18565:18565 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
omega:18302:18445 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
omega:18565:18565 [1] NCCL INFO comm 0x5d09b7b06a90 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
omega:18302:18445 [0] NCCL INFO comm 0x7a36ae54e760 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
omega:18302:18445 [0] NCCL INFO Channel 00/04 :    0   1
omega:18565:18565 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
omega:18302:18445 [0] NCCL INFO Channel 01/04 :    0   1
omega:18565:18565 [1] NCCL INFO P2P Chunksize set to 131072
omega:18302:18445 [0] NCCL INFO Channel 02/04 :    0   1
omega:18302:18445 [0] NCCL INFO Channel 03/04 :    0   1
omega:18302:18445 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
omega:18302:18445 [0] NCCL INFO P2P Chunksize set to 131072
omega:18565:18565 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC
omega:18302:18445 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC
omega:18565:18565 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC
omega:18302:18445 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC
omega:18565:18565 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC
omega:18302:18445 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC
omega:18565:18565 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC
omega:18302:18445 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC
omega:18565:18565 [1] NCCL INFO Connected all rings
omega:18302:18445 [0] NCCL INFO Connected all rings
omega:18565:18565 [1] NCCL INFO Connected all trees
omega:18302:18445 [0] NCCL INFO Connected all trees
omega:18565:18565 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
omega:18565:18565 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
omega:18302:18445 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
omega:18302:18445 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
omega:18565:18565 [1] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
omega:18302:18445 [0] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
omega:18565:18565 [1] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
omega:18302:18445 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
omega:18565:18565 [1] NCCL INFO ncclCommInitRank comm 0x5d09b7b06a90 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 61000 commId 0x45fa55651721f85 - Init COMPLETE
omega:18302:18445 [0] NCCL INFO ncclCommInitRank comm 0x7a36ae54e760 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 41000 commId 0x45fa55651721f85 - Init COMPLETE

With NCCL_P2P_DISABLE=1 it has worked a couple of times in a row. Thanks!

youkaichao commented 2 hours ago

So the final solution is NCCL_P2P_DISABLE=1?

freckletonj commented 2 hours ago

I didn't try tuning NCCL_P2P_LEVEL. I assumed that since I'm doing pipeline parallel I don't actually need GPU-to-GPU communication (?), so if I did need it I'd still expect problems. Should I revisit tuning that? There's still an underlying bug worth solving, but hopefully my debug logs help!

With your recommended NCCL_P2P_DISABLE=1 it has launched successfully over multiple restarts, and my tok/s is ~2x, so for my current use case I'm back in business. Thank you!

youkaichao commented 2 hours ago

Pipeline parallelism also needs GPU communication: it sends a tensor from one GPU to another.
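
So the pattern worth testing is point-to-point send/recv rather than all-reduce. A minimal sketch (run with torchrun --nproc-per-node=2, same as the earlier test):

# sketch: the same kind of transfer pipeline parallel relies on, rank 0 -> rank 1
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
assert dist.get_world_size() == 2
rank = dist.get_rank()
torch.cuda.set_device(rank)
t = torch.full((1024,), float(rank), device=f"cuda:{rank}")
if rank == 0:
    dist.send(t, dst=1)   # send rank 0's tensor (all zeros) to rank 1
else:
    dist.recv(t, src=0)   # overwrite rank 1's tensor with the received one
    torch.cuda.synchronize()
    assert t.sum().item() == 0.0
print(f"rank {rank}: point-to-point OK")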

youkaichao commented 2 hours ago

I'm going to close this issue via https://github.com/vllm-project/vllm/pull/10236.