vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: An error during multi-node inference of the 405B model on H20 GPUs through a Ray cluster causes inference to crash #9215

Open fu1996 opened 1 week ago

fu1996 commented 1 week ago

Your current environment

The output of `python collect_env.py`:

```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.28.3
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.4.119-19.0009.28-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H20
GPU 1: NVIDIA H20
GPU 2: NVIDIA H20
GPU 3: NVIDIA H20
GPU 4: NVIDIA H20
GPU 5: NVIDIA H20
GPU 6: NVIDIA H20
GPU 7: NVIDIA H20

Nvidia driver version: 535.161.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.0.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 384
On-line CPU(s) list: 0-383
Vendor ID: AuthenticAMD
BIOS Vendor ID: Advanced Micro Devices, Inc.
Model name: AMD EPYC 9K84 96-Core Processor
BIOS Model name: AMD EPYC 9K84 96-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU max MHz: 2600.0000
CPU min MHz: 1500.0000
BogoMIPS: 5200.25
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization: AMD-V
L1d cache: 6 MiB (192 instances)
L1i cache: 6 MiB (192 instances)
L2 cache: 192 MiB (192 instances)
L3 cache: 768 MiB (24 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-95,192-287
NUMA node1 CPU(s): 96-191,288-383
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full AMD retpoline, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] nvidia-cublas-cu12==12.3.4.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-dali-cuda120==1.35.0
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] nvidia-pyindex==1.0.9
[pip3] onnx==1.15.0rc2
[pip3] optree==0.10.0
[pip3] pynvml==11.4.1
[pip3] pytorch-quantization==2.1.2
[pip3] pytorch-triton==2.2.0+e28a256d7
[pip3] pyzmq==25.1.2
[pip3] torch==2.4.0
[pip3] torch-tensorrt==2.3.0a0
[pip3] torchdata==0.7.1a0
[pip3] torchtext==0.17.0a0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.4@4db5176d9758b720b05460c50ace3c01026eb158
vLLM Build Flags:
CUDA Archs: 5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX; ROCm: Disabled; Neuron: Disabled

GPU Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  CPU Affinity    NUMA Affinity  GPU NUMA ID
GPU0  X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   0-95,192-287    0              N/A
GPU1  NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  NODE  PIX   PHB   NODE  SYS   SYS   SYS   SYS   0-95,192-287    0              N/A
GPU2  NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  NODE  PHB   PIX   NODE  SYS   SYS   SYS   SYS   0-95,192-287    0              N/A
GPU3  NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   0-95,192-287    0              N/A
GPU4  NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  96-191,288-383  1              N/A
GPU5  NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  SYS   SYS   SYS   SYS   NODE  PIX   NODE  NODE  96-191,288-383  1              N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  SYS   SYS   SYS   SYS   NODE  NODE  PIX   PHB   96-191,288-383  1              N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     SYS   SYS   SYS   SYS   NODE  NODE  PHB   PIX   96-191,288-383  1              N/A
NIC0  PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   X     NODE  NODE  NODE  SYS   SYS   SYS   SYS
NIC1  NODE  PIX   PHB   NODE  SYS   SYS   SYS   SYS   NODE  X     PHB   NODE  SYS   SYS   SYS   SYS
NIC2  NODE  PHB   PIX   NODE  SYS   SYS   SYS   SYS   NODE  PHB   X     NODE  SYS   SYS   SYS   SYS
NIC3  NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   NODE  NODE  NODE  X     SYS   SYS   SYS   SYS
NIC4  SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   X     NODE  NODE  NODE
NIC5  SYS   SYS   SYS   SYS   NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   NODE  X     NODE  NODE
NIC6  SYS   SYS   SYS   SYS   NODE  NODE  PIX   PHB   SYS   SYS   SYS   SYS   NODE  NODE  X     PHB
NIC7  SYS   SYS   SYS   SYS   NODE  NODE  PHB   PIX   SYS   SYS   SYS   SYS   NODE  NODE  PHB   X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:
  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3
  NIC4: mlx5_bond_4
  NIC5: mlx5_bond_5
  NIC6: mlx5_bond_6
  NIC7: mlx5_bond_7
```

Model Input Dumps

There are no relevant input files to attach, but I have captured the error call-stack logs:

INFO 10-10 00:24:11 async_llm_engine.py:174] Added request cmpl-ea7ce76a97a84141911213d6779c3f25-0.
end] via NET/IBext/0/GDRDMA
VM-16-5-centos:39923:78857 [0] NCCL INFO Channel 07/0 : 0[0] -> 8[0] [send] via NET/IBext/0/GDRDMA
VM-16-5-centos:39923:78857 [0] NCCL INFO Channel 09/0 : 0[0] -> 8[0] [send] via NET/IBext/0/GDRDMA
VM-16-5-centos:39923:78857 [0] NCCL INFO Channel 10/0 : 0[0] -> 8[0] [send] via NET/IBext/0/GDRDMA
VM-16-5-centos:39923:78857 [0] NCCL INFO Channel 11/0 : 0[0] -> 8[0] [send] via NET/IBext/0/GDRDMA
VM-16-5-centos:39923:78857 [0] NCCL INFO Channel 12/0 : 0[0] -> 8[0] [send] via NET/IBext/0/GDRDMA
VM-16-5-centos:39923:78857 [0] NCCL INFO Channel 13/0 : 0[0] -> 8[0] [send] via NET/IBext/0/GDRDMA
VM-16-5-centos:39923:78857 [0] NCCL INFO Channel 14/0 : 0[0] -> 8[0] [send] via NET/IBext/0/GDRDMA
VM-16-5-centos:39923:78857 [0] NCCL INFO Channel 15/0 : 0[0] -> 8[0] [send] via NET/IBext/0/GDRDMA
VM-16-5-centos:39923:78857 [0] NCCL INFO Connected NVLS tree
VM-16-5-centos:39923:78857 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
VM-16-5-centos:39923:78857 [0] NCCL INFO 16 coll channels, 0 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
VM-16-5-centos:39923:78857 [0] NCCL INFO comm 0x56501150c320 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 3000 commId 0x1c6daa4f776f8e93 - Init COMPLETE
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 08/1 : 15[7] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 09/1 : 15[7] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 08/1 : 14[6] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 09/1 : 14[6] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 08/1 : 13[5] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 09/1 : 13[5] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 08/1 : 12[4] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 09/1 : 12[4] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 08/1 : 11[3] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 09/1 : 11[3] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 08/1 : 10[2] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 09/1 : 10[2] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 08/1 : 9[1] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 09/1 : 9[1] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 08/1 : 8[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 09/1 : 8[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
[rank0]:[E1010 00:24:11.022425711 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8eefd77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8eefd26d10 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8ef010cf08 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f8ea1d533e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f8ea1d58600 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f8ea1d5f2ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8ea1d616fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7f8eef4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f8ef10a6ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f8ef1138850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

ERROR 10-10 00:24:11 worker_base.py:386] Error executing method execute_model. This might cause deadlock in distributed execution.
ERROR 10-10 00:24:11 worker_base.py:386] Traceback (most recent call last):
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 378, in execute_method
ERROR 10-10 00:24:11 worker_base.py:386]     return executor(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 273, in execute_model
ERROR 10-10 00:24:11 worker_base.py:386]     output = self.model_runner.execute_model(
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 10-10 00:24:11 worker_base.py:386]     return func(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1363, in execute_model
ERROR 10-10 00:24:11 worker_base.py:386]     hidden_or_intermediate_states = model_executable(
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 00:24:11 worker_base.py:386]     return self._call_impl(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 00:24:11 worker_base.py:386]     return forward_call(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 422, in forward
ERROR 10-10 00:24:11 worker_base.py:386]     model_output = self.model(input_ids, positions, kv_caches,
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 00:24:11 worker_base.py:386]     return self._call_impl(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 00:24:11 worker_base.py:386]     return forward_call(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 322, in forward
ERROR 10-10 00:24:11 worker_base.py:386]     hidden_states, residual = layer(
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 00:24:11 worker_base.py:386]     return self._call_impl(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 00:24:11 worker_base.py:386]     return forward_call(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 245, in forward
ERROR 10-10 00:24:11 worker_base.py:386]     hidden_states = self.self_attn(
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 00:24:11 worker_base.py:386]     return self._call_impl(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 00:24:11 worker_base.py:386]     return forward_call(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 172, in forward
ERROR 10-10 00:24:11 worker_base.py:386]     qkv, _ = self.qkv_proj(hidden_states)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 00:24:11 worker_base.py:386]     return self._call_impl(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 00:24:11 worker_base.py:386]     return forward_call(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 334, in forward
ERROR 10-10 00:24:11 worker_base.py:386]     output_parallel = self.quant_method.apply(self, input_, bias)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 122, in apply
ERROR 10-10 00:24:11 worker_base.py:386]     return F.linear(x, layer.weight, bias)
ERROR 10-10 00:24:11 worker_base.py:386] RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
[2024-10-10 00:24:11,686 E 39923 41512] logging.cc:108: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8eefd77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8eefd26d10 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8ef010cf08 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f8ea1d533e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f8ea1d58600 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f8ea1d5f2ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8ea1d616fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7f8eef4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f8ef10a6ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f8ef1138850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8eefd77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f8ea19eaa84 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xdc253 (0x7f8eef4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x7f8ef10a6ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7f8ef1138850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

ERROR 10-10 00:24:11 async_llm_engine.py:57] Engine background task failed
ERROR 10-10 00:24:11 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return_value = task.result()
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 642, in run_engine_loop
ERROR 10-10 00:24:11 async_llm_engine.py:57]     result = task.result()
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 585, in engine_step
ERROR 10-10 00:24:11 async_llm_engine.py:57]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 254, in step_async
ERROR 10-10 00:24:11 async_llm_engine.py:57]     output = await self.model_executor.execute_model_async(
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 470, in execute_model_async
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return await super().execute_model_async(execute_model_req)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 175, in execute_model_async
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return await self._driver_execute_model_async(execute_model_req)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 486, in _driver_execute_model_async
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return await self.driver_exec_method("execute_model",
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 10-10 00:24:11 async_llm_engine.py:57]     result = self.fn(*self.args, **self.kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 387, in execute_method
ERROR 10-10 00:24:11 async_llm_engine.py:57]     raise e
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 378, in execute_method
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return executor(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 273, in execute_model
ERROR 10-10 00:24:11 async_llm_engine.py:57]     output = self.model_runner.execute_model(
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return func(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1363, in execute_model
ERROR 10-10 00:24:11 async_llm_engine.py:57]     hidden_or_intermediate_states = model_executable(
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 422, in forward
ERROR 10-10 00:24:11 async_llm_engine.py:57]     model_output = self.model(input_ids, positions, kv_caches,
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 322, in forward
ERROR 10-10 00:24:11 async_llm_engine.py:57]     hidden_states, residual = layer(
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 245, in forward
ERROR 10-10 00:24:11 async_llm_engine.py:57]     hidden_states = self.self_attn(
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 172, in forward
ERROR 10-10 00:24:11 async_llm_engine.py:57]     qkv, _ = self.qkv_proj(hidden_states)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 334, in forward
ERROR 10-10 00:24:11 async_llm_engine.py:57]     output_parallel = self.quant_method.apply(self, input_, bias)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 122, in apply
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return F.linear(x, layer.weight, bias)
ERROR 10-10 00:24:11 async_llm_engine.py:57] RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
ERROR:asyncio:Exception in callback _log_task_completion(error_callback=<bound method...7f8d7429f970>>)(<Task finishe...TENSOR_OP)`')>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:37
handle: <Handle _log_task_completion(error_callback=<bound method...7f8d7429f970>>)(<Task finishe...TENSOR_OP)`')>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:37>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 642, in run_engine_loop
    result = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 585, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 254, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 470, in execute_model_async
    return await super().execute_model_async(execute_model_req)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 175, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 486, in _driver_execute_model_async
    return await self.driver_exec_method("execute_model",
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 387, in execute_method
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 378, in execute_method
    return executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 273, in execute_model
    output = self.model_runner.execute_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1363, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 422, in forward
    model_output = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 322, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 245, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 172, in forward
    qkv, _ = self.qkv_proj(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 334, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 122, in apply
    return F.linear(x, layer.weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 59, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 10-10 00:24:11 async_llm_engine.py:181] Aborted request cmpl-7179337ad6fe4efaa13236d16aa59ec1-0.
INFO 10-10 00:24:11 async_llm_engine.py:181] Aborted request cmpl-5b9eabe0d8dc43178d4f3bb359d17737-0.
INFO 10-10 00:24:11 async_llm_engine.py:181] Aborted request cmpl-ea7ce76a97a84141911213d6779c3f25-0.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 555, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f86df6e2fe0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 754, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 774, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 295, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
    raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 555, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f86df6e3c40

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 754, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 774, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 295, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
    raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
[2024-10-10 00:24:11,693 E 39923 41512] logging.cc:115: Stack trace: 
 /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x107b84a) [0x7f8d86daf84a] ray::operator<<()
/usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x107ead2) [0x7f8d86db2ad2] ray::TerminateHandler()
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7f8eef48220c]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7f8eef482277]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae1fe) [0x7f8eef4821fe]
/usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe5ab35) [0x7f8ea19eab35] c10d::ProcessGroupNCCL::ncclCommWatchdog()
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f8eef4b0253]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f8ef10a6ac3]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f8ef1138850]

*** SIGABRT received at time=1728491051 on cpu 150 ***
PC: @     0x7f8ef10a89fc  (unknown)  pthread_kill
    @     0x7f8ef1054520  (unknown)  (unknown)
[2024-10-10 00:24:11,693 E 39923 41512] logging.cc:440: *** SIGABRT received at time=1728491051 on cpu 150 ***
[2024-10-10 00:24:11,693 E 39923 41512] logging.cc:440: PC: @     0x7f8ef10a89fc  (unknown)  pthread_kill
[2024-10-10 00:24:11,693 E 39923 41512] logging.cc:440:     @     0x7f8ef1054520  (unknown)  (unknown)
Fatal Python error: Aborted

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, charset_normalizer.md, yaml._yaml, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, cython.cimports.libc.math, PIL._imaging, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pyarrow.lib, pyarrow._hdfsio, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, _cffi_backend, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._s3fs, xxhash._xxhash, pyarrow._json, markupsafe._speedups, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils (total: 109)
INFO 10-10 00:24:11 logger.py:36] Received request cmpl-5c3567835cf143ce89abfb6abde149e7-0: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n阅读下面的CONTEXT,并完成TASK<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n<CONTEXT>\n SAIGroup appointed Michael Healy as senior investment partner to focus on healthcare and other sectors. Healy has over a decade of experience investing in high-growth healthcare companies, leading $4 billion+ in transactions.\n</CONTEXT>\n\n<TASK>\n请抽取上面文段中的所有适应症名称,比如肺癌、胃癌、结直肠癌、脑胶质瘤、NHL、特应性皮炎、糖尿病、肥胖等,以列表的形式返回\n- disease_list=["适应症名1", "适应症名2", "适应症名3", ...]\n</TASK>\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n- disease_list=', params: SamplingParams(n=1, best_of=1, presence_penalty=2.0, frequency_penalty=0.2, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['<|eot_id|>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=800, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128000, 128006, 9125, 128007, 271, 108414, 17297, 115070, 99465, 91495, 61648, 66913, 128009, 128006, 882, 128007, 271, 27, 99465, 397, 16998, 1953, 896, 21489, 8096, 1283, 5893, 439, 10195, 9341, 8427, 311, 5357, 389, 18985, 323, 1023, 26593, 13, 1283, 5893, 706, 927, 264, 13515, 315, 3217, 26012, 304, 1579, 2427, 19632, 18985, 5220, 11, 6522, 400, 19, 7239, 10, 304, 14463, 627, 524, 99465, 1363, 3203, 7536, 397, 15225, 116602, 18655, 17905, 28190, 17161, 38574, 105363, 56438, 108562, 51611, 111571, 31091, 126900, 30624, 57942, 118, 23706, 234, 5486, 91939, 225, 23706, 234, 5486, 37985, 74245, 57942, 254, 23706, 234, 5486, 108851, 123199, 103706, 114431, 97, 5486, 45, 13793, 5486, 66378, 51611, 34171, 105871, 114052, 5486, 117587, 126017, 103429, 5486, 117178, 91939, 244, 50667, 105610, 45277, 9554, 115707, 32626, 198, 12, 8624, 2062, 29065, 108562, 51611, 111571, 13372, 16, 498, 330, 108562, 51611, 111571, 13372, 17, 498, 330, 108562, 51611, 111571, 13372, 18, 498, 2564, 933, 524, 66913, 397, 128009, 128006, 78191, 128007, 271, 12, 8624, 2062, 28], lora_request: None, prompt_adapter_request: None.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

🐛 Describe the bug

The server was launched across the Ray cluster with tensor parallel size 16 (two 8×H20 nodes):

```bash
MODEL_PATH="/data/ckpts/405B-instruct"
nohup python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_PATH \
    --swap-space 32 \
    --tensor-parallel-size 16 \
    --served-model-name llama3-1-405B \
    --host 0.0.0.0 \
    --port 8081 \
    --max-num-seqs 1024 \
    --max-num-batched-tokens 8192 \
    --gpu-memory-utilization 0.9 \
    --enforce-eager >> /tmp/model_server_api_pre.log 2>&1 &
```
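Since the crash surfaces as an illegal memory access inside a NCCL process group spanning 16 ranks, it can help to test the cross-node fabric in isolation before blaming vLLM. Below is a minimal sanity-check sketch, not from the original report; the script name `nccl_check.py` and the `torchrun` launch flags are assumptions, and `<head-node-ip>` is a placeholder.

```python
# nccl_check.py -- a minimal cross-node NCCL sanity check (a sketch).
# Launch on each node with torchrun, e.g.:
#   torchrun --nnodes=2 --nproc-per-node=8 --node-rank=<0|1> \
#            --master-addr=<head-node-ip> --master-port=29500 nccl_check.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun exports RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR/PORT,
    # so init_process_group can read everything from the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # One all-reduce over all ranks: each rank contributes 1, so every rank
    # should end up holding the world size (16 here). A hang or an illegal
    # memory access at this point implicates the NCCL/IB fabric, not vLLM.
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
    torch.cuda.synchronize()
    print(f"rank {dist.get_rank()}: all_reduce ok, value={x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```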


yudian0504 commented 1 day ago

Try upgrading the cuBLAS wheel:

```bash
pip install nvidia-cublas-cu12==12.4.5.8
```
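For context, the environment dump above shows `nvidia-cublas-cu12==12.3.4.1`, so this pins a newer cuBLAS build. A quick sketch (standard `importlib.metadata`, nothing vLLM-specific) to confirm which versions the environment actually resolves after the upgrade:

```python
# Confirm the installed cuBLAS wheel and the CUDA toolkit torch was
# built against -- a sketch for verifying the suggested upgrade.
from importlib.metadata import version

import torch

print("nvidia-cublas-cu12 wheel:", version("nvidia-cublas-cu12"))
print("torch:", torch.__version__, "| built with CUDA:", torch.version.cuda)
```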