vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details) [repeated 6x across cluster] #7896

Closed: soumyasmruti closed this issue 4 days ago

soumyasmruti commented 2 weeks ago

Your current environment

The output of `python collect_env.py`:

```text
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.31

Python version: 3.10.14 (main, Apr 6 2024, 18:45:05) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.10.220-209.869.amzn2.x86_64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version: 535.183.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7R13 Processor
Stepping: 1
CPU MHz: 2861.667
BogoMIPS: 5300.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 3 MiB
L1i cache: 3 MiB
L2 cache: 48 MiB
L3 cache: 384 MiB
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid

Versions of relevant libraries:
[pip3] flashinfer==0.0.9+cu121torch2.3
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==8.9.2.26
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.555.43
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.5.82
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.0.3
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.2
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1@
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0   X    NV18  NV18  NV18  NV18  NV18  NV18  NV18  0-47,96-143    0              N/A
GPU1  NV18   X    NV18  NV18  NV18  NV18  NV18  NV18  0-47,96-143    0              N/A
GPU2  NV18  NV18   X    NV18  NV18  NV18  NV18  NV18  0-47,96-143    0              N/A
GPU3  NV18  NV18  NV18   X    NV18  NV18  NV18  NV18  0-47,96-143    0              N/A
GPU4  NV18  NV18  NV18  NV18   X    NV18  NV18  NV18  48-95,144-191  1              N/A
GPU5  NV18  NV18  NV18  NV18  NV18   X    NV18  NV18  48-95,144-191  1              N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18   X    NV18  48-95,144-191  1              N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18   X    48-95,144-191  1              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

🐛 Describe the bug

I am running an 8-node Kubernetes cluster with vLLM, deploying the Llama 3.1 405B BF16 model as 4 replicas. Three of the replicas are running in a stable state; the 4th replica keeps restarting with the following error.

Here are my startup commands for the leader and worker.

Leader command:

```bash
/vllm-workspace/ray_init.sh leader --ray_cluster_size=$RAY_CLUSTER_SIZE;
python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct \
    --tensor-parallel-size 8 --pipeline_parallel_size 2 --max-logprobs 1000 --max-num-batched-tokens 16384 \
    --enable-chunked-prefill --kv-cache-dtype fp8 --disable-log-stats --gpu-memory-utilization 0.95 \
    --device cuda --quantization fp8
```

Worker command:

```bash
/vllm-workspace/ray_init.sh worker --ray_address=$(LEADER_NAME).$(LWS_NAME).$(NAMESPACE).svc.cluster.local
```
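As a side note, one cheap pre-flight check for this kind of LWS/Ray setup is to verify from a worker pod that the leader is reachable at that address before vLLM starts. A minimal sketch, assuming the `$(LEADER_NAME)`, `$(LWS_NAME)` and `$(NAMESPACE)` placeholders above are also exposed as container environment variables of the same names, and that Ray's GCS listens on its default port 6379 (adjust if `ray_init.sh` passes a different port):

```python
# Sketch of a connectivity probe from a worker pod to the Ray head on the leader.
# Assumes LEADER_NAME / LWS_NAME / NAMESPACE are plain environment variables and
# that Ray's GCS uses its default port 6379.
import os
import socket

leader = "{LEADER_NAME}.{LWS_NAME}.{NAMESPACE}.svc.cluster.local".format_map(os.environ)
with socket.create_connection((leader, 6379), timeout=5):
    print(f"{socket.gethostname()} can reach {leader}:6379")
```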

Error trace:

```text

ERROR 08-27 05:12:18 worker_base.py:382] Error executing method init_device. This might cause deadlock in distributed execution.
ERROR 08-27 05:12:18 worker_base.py:382] Traceback (most recent call last):
ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method
ERROR 08-27 05:12:18 worker_base.py:382]     return executor(*args, **kwargs)
ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device
ERROR 08-27 05:12:18 worker_base.py:382]     init_worker_distributed_environment(self.parallel_config, self.rank,
ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 346, in init_worker_distributed_environment
ERROR 08-27 05:12:18 worker_base.py:382]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 923, in ensure_model_parallel_initialized
ERROR 08-27 05:12:18 worker_base.py:382]     initialize_model_parallel(tensor_model_parallel_size,
ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 889, in initialize_model_parallel
ERROR 08-27 05:12:18 worker_base.py:382]     _TP = init_model_parallel_group(group_ranks,
ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 732, in init_model_parallel_group
ERROR 08-27 05:12:18 worker_base.py:382]     return GroupCoordinator(
ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 176, in __init__
ERROR 08-27 05:12:18 worker_base.py:382]     self.pynccl_comm = PyNcclCommunicator(
ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
ERROR 08-27 05:12:18 worker_base.py:382]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
ERROR 08-27 05:12:18 worker_base.py:382]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
ERROR 08-27 05:12:18 worker_base.py:382]     raise RuntimeError(f"NCCL error: {error_str}")
ERROR 08-27 05:12:18 worker_base.py:382] RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 317, in <module>
[rank0]:     run_server(args)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
[rank0]:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 251, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 406, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 61, in _init_executor
[rank0]:     self._init_workers_ray(placement_group)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 233, in _init_workers_ray
[rank0]:     self._run_workers("init_device")
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 350, in _run_workers
[rank0]:     self.driver_worker.execute_method(method, *driver_args,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 383, in execute_method
[rank0]:     raise e
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method
[rank0]:     return executor(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device
[rank0]:     init_worker_distributed_environment(self.parallel_config, self.rank,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 346, in init_worker_distributed_environment
[rank0]:     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 923, in ensure_model_parallel_initialized
[rank0]:     initialize_model_parallel(tensor_model_parallel_size,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 889, in initialize_model_parallel
[rank0]:     _TP = init_model_parallel_group(group_ranks,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 732, in init_model_parallel_group
[rank0]:     return GroupCoordinator(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 176, in __init__
[rank0]:     self.pynccl_comm = PyNcclCommunicator(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
[rank0]:     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
[rank0]:     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
[rank0]:     raise RuntimeError(f"NCCL error: {error_str}")
[rank0]: RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382] Traceback (most recent call last):
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]     init_worker_distributed_environment(self.parallel_config, self.rank,
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 346, in init_worker_distributed_environment
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 923, in ensure_model_parallel_initialized
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]     initialize_model_parallel(tensor_model_parallel_size,
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 889, in initialize_model_parallel
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]     _TP = init_model_parallel_group(group_ranks,
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 732, in init_model_parallel_group
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]     return GroupCoordinator(
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 176, in __init__
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]     self.pynccl_comm = PyNcclCommunicator(
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382]     raise RuntimeError(f"NCCL error: {error_str}")
(RayWorkerWrapper pid=11305) ERROR 08-27 05:12:18 worker_base.py:382] RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382] Error executing method init_device. This might cause deadlock in distributed execution. [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382] Traceback (most recent call last): [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 374, in execute_method [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]     return executor(*args, **kwargs) [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]     init_worker_distributed_environment(self.parallel_config, self.rank, [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 346, in init_worker_distributed_environment [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size, [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 923, in ensure_model_parallel_initialized [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]     initialize_model_parallel(tensor_model_parallel_size, [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 889, in initialize_model_parallel [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]     _TP = init_model_parallel_group(group_ranks, [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 732, in init_model_parallel_group [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]     return GroupCoordinator( [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__ [repeated 12x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]     self.pynccl_comm = PyNcclCommunicator( [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank( [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm), [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382]     raise RuntimeError(f"NCCL error: {error_str}") [repeated 6x across cluster]
(RayWorkerWrapper pid=11934) ERROR 08-27 05:12:18 worker_base.py:382] RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details) [repeated 6x across cluster]
Exception ignored in: <function RayGPUExecutorAsync.__del__ at 0x7ff54849edd0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 473, in __del__
    if self.forward_dag is not None:
AttributeError: 'RayGPUExecutorAsync' object has no attribute 'forward_dag'
```


youkaichao commented 2 weeks ago

Did you try to follow https://docs.vllm.ai/en/latest/getting_started/debugging.html ?
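The gist of such a check is to take vLLM out of the loop and see whether a bare torch.distributed all-reduce works across the same GPUs and nodes. A minimal sketch of that kind of test (the file name and torchrun rendezvous arguments below are illustrative, not taken from the doc):

```python
# nccl_check.py - minimal multi-node NCCL sanity check (a sketch, not the exact
# script from the debugging doc). Launch one copy per GPU on every node, e.g.:
#   NCCL_DEBUG=INFO torchrun --nnodes=2 --nproc-per-node=8 \
#       --rdzv_backend=c10d --rdzv_endpoint=<leader-ip>:29500 nccl_check.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")        # exercises ncclCommInitRank
local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
torch.cuda.set_device(local_rank)

x = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(x)                             # sum of ones == world size
torch.cuda.synchronize()
assert x.item() == dist.get_world_size(), f"unexpected result: {x.item()}"
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: NCCL all_reduce OK")
dist.destroy_process_group()
```

If this already fails with NCCL_DEBUG=INFO set, the underlying CUDA error points at the driver, fabric, or a specific node rather than at vLLM.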

soumyasmruti commented 4 days ago

This got resolved when I removed the bad instance node. The debugging docs didn't help.

youkaichao commented 4 days ago

@soumyasmruti can you share how you found the bad instance node? We'd be happy to expand the doc to help users figure out whether a node is bad.
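Not necessarily how it was done here, but one generic way to locate a bad node is to run a single-node smoke test on each machine in isolation (no Ray, no vLLM) and look for the one that fails or hangs while the others pass. A sketch, assuming each node exposes 8 local GPUs:

```python
# single_node_check.py - sketch of a per-node smoke test: spawn one process per
# local GPU and do an intra-node NCCL all-reduce, exercising the same
# ncclCommInitRank path that fails in the trace above. Run it on each node in
# turn with NCCL_DEBUG=INFO; a node that errors or hangs is the likely bad one.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    x = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(x)
    torch.cuda.synchronize()
    assert x.item() == world_size
    print(f"GPU {rank}: OK")
    dist.destroy_process_group()


if __name__ == "__main__":
    num_gpus = torch.cuda.device_count()
    mp.spawn(worker, args=(num_gpus,), nprocs=num_gpus)
```

From there, hardware-level checks on the suspect node (for example `nvidia-smi -q` or DCGM diagnostics) can confirm the fault.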