vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details) [repeated 6x across cluster] #7896

Closed: soumyasmruti closed this issue 4 days ago

soumyasmruti commented 2 weeks ago

Your current environment

The output of `python collect_env.py`:

```text
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.31

Python version: 3.10.14 (main, Apr 6 2024, 18:45:05) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.10.220-209.869.amzn2.x86_64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version: 535.183.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7R13 Processor
Stepping: 1
CPU MHz: 2861.667
BogoMIPS: 5300.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 3 MiB
L1i cache: 3 MiB
L2 cache: 48 MiB
L3 cache: 384 MiB
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid

Versions of relevant libraries:
[pip3] flashinfer==0.0.9+cu121torch2.3
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==8.9.2.26
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.555.43
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.5.82
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.0.3
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.2
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1@
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0   X    NV18  NV18  NV18  NV18  NV18  NV18  NV18  0-47,96-143    0              N/A
GPU1  NV18   X    NV18  NV18  NV18  NV18  NV18  NV18  0-47,96-143    0              N/A
GPU2  NV18  NV18   X    NV18  NV18  NV18  NV18  NV18  0-47,96-143    0              N/A
GPU3  NV18  NV18  NV18   X    NV18  NV18  NV18  NV18  0-47,96-143    0              N/A
GPU4  NV18  NV18  NV18  NV18   X    NV18  NV18  NV18  48-95,144-191  1              N/A
GPU5  NV18  NV18  NV18  NV18  NV18   X    NV18  NV18  48-95,144-191  1              N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18   X    NV18  48-95,144-191  1              N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18   X    48-95,144-191  1              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

🐛 Describe the bug

I am running an 8-node Kubernetes cluster with vLLM, deploying the Llama 3.1 405B BF16 model as 4 replicas. Three of the replicas are running in a stable state; the 4th replica keeps restarting with the following error.

Here are my startup commands for the leader and worker.

Leader command:

```bash
/vllm-workspace/ray_init.sh leader --ray_cluster_size=$RAY_CLUSTER_SIZE;
python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct \
    --tensor-parallel-size 8 --pipeline_parallel_size 2 --max-logprobs 1000 --max-num-batched-tokens 16384 \
    --enable-chunked-prefill --kv-cache-dtype fp8 --disable-log-stats --gpu-memory-utilization 0.95 \
    --device cuda --quantization fp8
```

Worker command:

```bash
/vllm-workspace/ray_init.sh worker --ray_address=$(LEADER_NAME).$(LWS_NAME).$(NAMESPACE).svc.cluster.local
```
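As a side note, one cheap pre-flight check for this kind of LWS/Ray setup is to verify from a worker pod that the leader is reachable at that address before vLLM starts. A minimal sketch, assuming the `$(LEADER_NAME)`, `$(LWS_NAME)` and `$(NAMESPACE)` placeholders above are also exposed as container environment variables of the same names, and that Ray's GCS listens on its default port 6379 (adjust if `ray_init.sh` passes a different port):

```python
# Sketch of a connectivity probe from a worker pod to the Ray head on the leader.
# Assumes LEADER_NAME / LWS_NAME / NAMESPACE are plain environment variables and
# that Ray's GCS uses its default port 6379.
import os
import socket

leader = "{LEADER_NAME}.{LWS_NAME}.{NAMESPACE}.svc.cluster.local".format_map(os.environ)
with socket.create_connection((leader, 6379), timeout=5):
    print(f"{socket.gethostname()} can reach {leader}:6379")
```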

Error trace:

```text

ERROR 08-27 05:12:18 worker_base.py:382] Error executing method init_device. This might cause deadlock in distributed execution.
ERROR 08-27 05:12:18 worker_base.py:382] Traceback (most recent call last):
ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method
ERROR 08-27 05:12:18 worker_base.py:382]     return executor(*args, **kwargs)
ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device
ERROR 08-27 05:12:18 worker_base.py:382]     init_worker_distributed_environment(self.parallel_config, self.rank,
ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 346, in init_worker_distributed_environment
ERROR 08-27 05:12:18 worker_base.py:382]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 923, in ensure_model_parallel_initialized
ERROR 08-27 05:12:18 worker_base.py:382]     initialize_model_parallel(tensor_model_parallel_size,
ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 889, in initialize_model_parallel
ERROR 08-27 05:12:18 worker_base.py:382]     _TP = init_model_parallel_group(group_ranks,
ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 732, in init_model_parallel_group
ERROR 08-27 05:12:18 worker_base.py:382]     return GroupCoordinator(
ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 176, in __init__
ERROR 08-27 05:12:18 worker_base.py:382]     self.pynccl_comm = PyNcclCommunicator(
ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
ERROR 08-27 05:12:18 worker_base.py:382]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
ERROR 08-27 05:12:18 worker_base.py:382]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
ERROR 08-27 05:12:18 worker_base.py:382]     raise RuntimeError(f"NCCL error: {error_str}")
ERROR 08-27 05:12:18 worker_base.py:382] RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 317, in <module>
[rank0]:     run_server(args)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
[rank0]:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 251, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 406, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 61, in _init_executor
[rank0]:     self._init_workers_ray(placement_group)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 233, in _init_workers_ray
[rank0]:     self._run_workers("init_device")
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 350, in _run_workers
[rank0]:     self.driver_worker.execute_method(method, *driver_args,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 383, in execute_method
[rank0]:     raise e
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method
[rank0]:     return executor(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device
[rank0]:     init_worker_distributed_environment(self.parallel_config, self.rank,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 346, in init_worker_distributed_environment
[rank0]:     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 923, in ensure_model_parallel_initialized
[rank0]:     initialize_model_parallel(tensor_model_parallel_size,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 889, in initialize_model_parallel
[rank0]:     _TP = init_model_parallel_group(group_ranks,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 732, in init_model_parallel_group
[rank0]:     return GroupCoordinator(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 176, in __init__
[rank0]:     self.pynccl_comm = PyNcclCommunicator(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
[rank0]:     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
[rank0]:     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
[rank0]:     raise RuntimeError(f"NCCL error: {error_str}")
[rank0]: RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382] Traceback (most recent call last):
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]     init_worker_distributed_environment(self.parallel_config, self.rank,
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 346, in init_worker_distributed_environment
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 923, in ensure_model_parallel_initialized
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]     initialize_model_parallel(tensor_model_parallel_size,
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 889, in initialize_model_parallel
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]     _TP = init_model_parallel_group(group_ranks,
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 732, in init_model_parallel_group
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]     return GroupCoordinator(
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 176, in __init__
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]     self.pynccl_comm = PyNcclCommunicator(
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]     raise RuntimeError(f"NCCL error: {error_str}")
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382] RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382] Error executing method init_device. This might cause deadlock in distributed execution. [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382] Traceback (most recent call last): [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]     return executor(*args, **kwargs) [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]     init_worker_distributed_environment(self.parallel_config, self.rank, [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 346, in init_worker_distributed_environment [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size, [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 923, in ensure_model_parallel_initialized [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]     initialize_model_parallel(tensor_model_parallel_size, [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 889, in initialize_model_parallel [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]     _TP = init_model_parallel_group(group_ranks, [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 732, in init_model_parallel_group [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]     return GroupCoordinator( [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__ [repeated 12x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]     self.pynccl_comm = PyNcclCommunicator( [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank( [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm), [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]     raise RuntimeError(f"NCCL error: {error_str}") [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382] RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details) [repeated 6x across cluster]
Exception ignored in: <function RayGPUExecutorAsync.__del__ at 0x7ff54849edd0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 473, in __del__
    if self.forward_dag is not None:
AttributeError: 'RayGPUExecutorAsync' object has no attribute 'forward_dag'
```


youkaichao commented 2 weeks ago

Did you try to follow https://docs.vllm.ai/en/latest/getting_started/debugging.html ?
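The gist of such a check is to take vLLM out of the loop and see whether a bare torch.distributed all-reduce works across the same GPUs and nodes. A minimal sketch of that kind of test (the file name and torchrun rendezvous arguments below are illustrative, not taken from the doc):

```python
# nccl_check.py - minimal multi-node NCCL sanity check (a sketch, not the exact
# script from the debugging doc). Launch one copy per GPU on every node, e.g.:
#   NCCL_DEBUG=INFO torchrun --nnodes=2 --nproc-per-node=8 \
#       --rdzv_backend=c10d --rdzv_endpoint=<leader-ip>:29500 nccl_check.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")        # exercises ncclCommInitRank
local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
torch.cuda.set_device(local_rank)

x = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(x)                             # sum of ones == world size
torch.cuda.synchronize()
assert x.item() == dist.get_world_size(), f"unexpected result: {x.item()}"
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: NCCL all_reduce OK")
dist.destroy_process_group()
```

If this already fails with NCCL_DEBUG=INFO set, the underlying CUDA error points at the driver, fabric, or a specific node rather than at vLLM.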

soumyasmruti commented 4 days ago

This got resolved when I removed the bad instance node. The debugging docs didn't help.

youkaichao commented 4 days ago

@soumyasmruti can you share how you found the bad instance node? We'd be happy to expand the doc to help users figure out whether a node is bad.
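Not necessarily how it was done here, but one generic way to locate a bad node is to run a single-node smoke test on each machine in isolation (no Ray, no vLLM) and look for the one that fails or hangs while the others pass. A sketch, assuming each node exposes 8 local GPUs:

```python
# single_node_check.py - sketch of a per-node smoke test: spawn one process per
# local GPU and do an intra-node NCCL all-reduce, exercising the same
# ncclCommInitRank path that fails in the trace above. Run it on each node in
# turn with NCCL_DEBUG=INFO; a node that errors or hangs is the likely bad one.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    x = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(x)
    torch.cuda.synchronize()
    assert x.item() == world_size
    print(f"GPU {rank}: OK")
    dist.destroy_process_group()


if __name__ == "__main__":
    num_gpus = torch.cuda.device_count()
    mp.spawn(worker, args=(num_gpus,), nprocs=num_gpus)
```

From there, hardware-level checks on the suspect node (for example `nvidia-smi -q` or DCGM diagnostics) can confirm the fault.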