vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: An error during multi-node inference of the 405B model on H20 GPUs through a Ray cluster causes inference to crash #9215

Open fu1996 opened 1 week ago

fu1996 commented 1 week ago

Your current environment

The output of `python collect_env.py`:

```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.28.3
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.4.119-19.0009.28-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H20
GPU 1: NVIDIA H20
GPU 2: NVIDIA H20
GPU 3: NVIDIA H20
GPU 4: NVIDIA H20
GPU 5: NVIDIA H20
GPU 6: NVIDIA H20
GPU 7: NVIDIA H20

Nvidia driver version: 535.161.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.0.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 384
On-line CPU(s) list: 0-383
Vendor ID: AuthenticAMD
BIOS Vendor ID: Advanced Micro Devices, Inc.
Model name: AMD EPYC 9K84 96-Core Processor
BIOS Model name: AMD EPYC 9K84 96-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU max MHz: 2600.0000
CPU min MHz: 1500.0000
BogoMIPS: 5200.25
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization: AMD-V
L1d cache: 6 MiB (192 instances)
L1i cache: 6 MiB (192 instances)
L2 cache: 192 MiB (192 instances)
L3 cache: 768 MiB (24 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-95,192-287
NUMA node1 CPU(s): 96-191,288-383
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full AMD retpoline, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] nvidia-cublas-cu12==12.3.4.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-dali-cuda120==1.35.0
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] nvidia-pyindex==1.0.9
[pip3] onnx==1.15.0rc2
[pip3] optree==0.10.0
[pip3] pynvml==11.4.1
[pip3] pytorch-quantization==2.1.2
[pip3] pytorch-triton==2.2.0+e28a256d7
[pip3] pyzmq==25.1.2
[pip3] torch==2.4.0
[pip3] torch-tensorrt==2.3.0a0
[pip3] torchdata==0.7.1a0
[pip3] torchtext==0.17.0a0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.4@4db5176d9758b720b05460c50ace3c01026eb158
vLLM Build Flags:
CUDA Archs: 5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX; ROCm: Disabled; Neuron: Disabled

GPU Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  CPU Affinity    NUMA Affinity  GPU NUMA ID
GPU0  X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   0-95,192-287    0              N/A
GPU1  NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  NODE  PIX   PHB   NODE  SYS   SYS   SYS   SYS   0-95,192-287    0              N/A
GPU2  NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  NODE  PHB   PIX   NODE  SYS   SYS   SYS   SYS   0-95,192-287    0              N/A
GPU3  NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   0-95,192-287    0              N/A
GPU4  NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  96-191,288-383  1              N/A
GPU5  NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  SYS   SYS   SYS   SYS   NODE  PIX   NODE  NODE  96-191,288-383  1              N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  SYS   SYS   SYS   SYS   NODE  NODE  PIX   PHB   96-191,288-383  1              N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     SYS   SYS   SYS   SYS   NODE  NODE  PHB   PIX   96-191,288-383  1              N/A
NIC0  PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   X     NODE  NODE  NODE  SYS   SYS   SYS   SYS
NIC1  NODE  PIX   PHB   NODE  SYS   SYS   SYS   SYS   NODE  X     PHB   NODE  SYS   SYS   SYS   SYS
NIC2  NODE  PHB   PIX   NODE  SYS   SYS   SYS   SYS   NODE  PHB   X     NODE  SYS   SYS   SYS   SYS
NIC3  NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   NODE  NODE  NODE  X     SYS   SYS   SYS   SYS
NIC4  SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   X     NODE  NODE  NODE
NIC5  SYS   SYS   SYS   SYS   NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   NODE  X     NODE  NODE
NIC6  SYS   SYS   SYS   SYS   NODE  NODE  PIX   PHB   SYS   SYS   SYS   SYS   NODE  NODE  X     PHB
NIC7  SYS   SYS   SYS   SYS   NODE  NODE  PHB   PIX   SYS   SYS   SYS   SYS   NODE  NODE  PHB   X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:
  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3
  NIC4: mlx5_bond_4
  NIC5: mlx5_bond_5
  NIC6: mlx5_bond_6
  NIC7: mlx5_bond_7
```

Model Input Dumps

There are no relevant input files to attach, but I have captured the error call-stack logs:

INFO 10-10 00:24:11 async_llm_engine.py:174] Added request cmpl-ea7ce76a97a84141911213d6779c3f25-0.
end] via NET/IBext/0/GDRDMA
VM-16-5-centos:39923:78857 [0] NCCL INFO Channel 07/0 : 0[0] -> 8[0] [send] via NET/IBext/0/GDRDMA
VM-16-5-centos:39923:78857 [0] NCCL INFO Channel 09/0 : 0[0] -> 8[0] [send] via NET/IBext/0/GDRDMA
VM-16-5-centos:39923:78857 [0] NCCL INFO Channel 10/0 : 0[0] -> 8[0] [send] via NET/IBext/0/GDRDMA
VM-16-5-centos:39923:78857 [0] NCCL INFO Channel 11/0 : 0[0] -> 8[0] [send] via NET/IBext/0/GDRDMA
VM-16-5-centos:39923:78857 [0] NCCL INFO Channel 12/0 : 0[0] -> 8[0] [send] via NET/IBext/0/GDRDMA
VM-16-5-centos:39923:78857 [0] NCCL INFO Channel 13/0 : 0[0] -> 8[0] [send] via NET/IBext/0/GDRDMA
VM-16-5-centos:39923:78857 [0] NCCL INFO Channel 14/0 : 0[0] -> 8[0] [send] via NET/IBext/0/GDRDMA
VM-16-5-centos:39923:78857 [0] NCCL INFO Channel 15/0 : 0[0] -> 8[0] [send] via NET/IBext/0/GDRDMA
VM-16-5-centos:39923:78857 [0] NCCL INFO Connected NVLS tree
VM-16-5-centos:39923:78857 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
VM-16-5-centos:39923:78857 [0] NCCL INFO 16 coll channels, 0 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
VM-16-5-centos:39923:78857 [0] NCCL INFO comm 0x56501150c320 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 3000 commId 0x1c6daa4f776f8e93 - Init COMPLETE
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 08/1 : 15[7] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 09/1 : 15[7] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 08/1 : 14[6] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 09/1 : 14[6] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 08/1 : 13[5] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 09/1 : 13[5] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 08/1 : 12[4] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 09/1 : 12[4] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 08/1 : 11[3] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 09/1 : 11[3] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 08/1 : 10[2] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 09/1 : 10[2] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 08/1 : 9[1] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 09/1 : 9[1] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 08/1 : 8[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
VM-16-5-centos:39923:78896 [0] NCCL INFO Channel 09/1 : 8[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA/Shared
[rank0]:[E1010 00:24:11.022425711 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8eefd77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8eefd26d10 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8ef010cf08 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f8ea1d533e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f8ea1d58600 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f8ea1d5f2ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8ea1d616fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7f8eef4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f8ef10a6ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f8ef1138850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

ERROR 10-10 00:24:11 worker_base.py:386] Error executing method execute_model. This might cause deadlock in distributed execution.
ERROR 10-10 00:24:11 worker_base.py:386] Traceback (most recent call last):
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 378, in execute_method
ERROR 10-10 00:24:11 worker_base.py:386]     return executor(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 273, in execute_model
ERROR 10-10 00:24:11 worker_base.py:386]     output = self.model_runner.execute_model(
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 10-10 00:24:11 worker_base.py:386]     return func(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1363, in execute_model
ERROR 10-10 00:24:11 worker_base.py:386]     hidden_or_intermediate_states = model_executable(
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 00:24:11 worker_base.py:386]     return self._call_impl(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 00:24:11 worker_base.py:386]     return forward_call(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 422, in forward
ERROR 10-10 00:24:11 worker_base.py:386]     model_output = self.model(input_ids, positions, kv_caches,
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 00:24:11 worker_base.py:386]     return self._call_impl(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 00:24:11 worker_base.py:386]     return forward_call(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 322, in forward
ERROR 10-10 00:24:11 worker_base.py:386]     hidden_states, residual = layer(
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 00:24:11 worker_base.py:386]     return self._call_impl(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 00:24:11 worker_base.py:386]     return forward_call(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 245, in forward
ERROR 10-10 00:24:11 worker_base.py:386]     hidden_states = self.self_attn(
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 00:24:11 worker_base.py:386]     return self._call_impl(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 00:24:11 worker_base.py:386]     return forward_call(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 172, in forward
ERROR 10-10 00:24:11 worker_base.py:386]     qkv, _ = self.qkv_proj(hidden_states)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 00:24:11 worker_base.py:386]     return self._call_impl(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 00:24:11 worker_base.py:386]     return forward_call(*args, **kwargs)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 334, in forward
ERROR 10-10 00:24:11 worker_base.py:386]     output_parallel = self.quant_method.apply(self, input_, bias)
ERROR 10-10 00:24:11 worker_base.py:386]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 122, in apply
ERROR 10-10 00:24:11 worker_base.py:386]     return F.linear(x, layer.weight, bias)
ERROR 10-10 00:24:11 worker_base.py:386] RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
[2024-10-10 00:24:11,686 E 39923 41512] logging.cc:108: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8eefd77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f8eefd26d10 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f8ef010cf08 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f8ea1d533e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f8ea1d58600 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f8ea1d5f2ba in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8ea1d616fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7f8eef4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f8ef10a6ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f8ef1138850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8eefd77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f8ea19eaa84 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xdc253 (0x7f8eef4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x7f8ef10a6ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7f8ef1138850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

ERROR 10-10 00:24:11 async_llm_engine.py:57] Engine background task failed
ERROR 10-10 00:24:11 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return_value = task.result()
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 642, in run_engine_loop
ERROR 10-10 00:24:11 async_llm_engine.py:57]     result = task.result()
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 585, in engine_step
ERROR 10-10 00:24:11 async_llm_engine.py:57]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 254, in step_async
ERROR 10-10 00:24:11 async_llm_engine.py:57]     output = await self.model_executor.execute_model_async(
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 470, in execute_model_async
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return await super().execute_model_async(execute_model_req)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 175, in execute_model_async
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return await self._driver_execute_model_async(execute_model_req)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 486, in _driver_execute_model_async
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return await self.driver_exec_method("execute_model",
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 10-10 00:24:11 async_llm_engine.py:57]     result = self.fn(*self.args, **self.kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 387, in execute_method
ERROR 10-10 00:24:11 async_llm_engine.py:57]     raise e
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 378, in execute_method
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return executor(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 273, in execute_model
ERROR 10-10 00:24:11 async_llm_engine.py:57]     output = self.model_runner.execute_model(
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return func(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1363, in execute_model
ERROR 10-10 00:24:11 async_llm_engine.py:57]     hidden_or_intermediate_states = model_executable(
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 422, in forward
ERROR 10-10 00:24:11 async_llm_engine.py:57]     model_output = self.model(input_ids, positions, kv_caches,
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 322, in forward
ERROR 10-10 00:24:11 async_llm_engine.py:57]     hidden_states, residual = layer(
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 245, in forward
ERROR 10-10 00:24:11 async_llm_engine.py:57]     hidden_states = self.self_attn(
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 172, in forward
ERROR 10-10 00:24:11 async_llm_engine.py:57]     qkv, _ = self.qkv_proj(hidden_states)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 334, in forward
ERROR 10-10 00:24:11 async_llm_engine.py:57]     output_parallel = self.quant_method.apply(self, input_, bias)
ERROR 10-10 00:24:11 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 122, in apply
ERROR 10-10 00:24:11 async_llm_engine.py:57]     return F.linear(x, layer.weight, bias)
ERROR 10-10 00:24:11 async_llm_engine.py:57] RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
ERROR:asyncio:Exception in callback _log_task_completion(error_callback=<bound method...7f8d7429f970>>)(<Task finishe...TENSOR_OP)`')>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:37
handle: <Handle _log_task_completion(error_callback=<bound method...7f8d7429f970>>)(<Task finishe...TENSOR_OP)`')>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:37>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 642, in run_engine_loop
    result = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 585, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 254, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 470, in execute_model_async
    return await super().execute_model_async(execute_model_req)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 175, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 486, in _driver_execute_model_async
    return await self.driver_exec_method("execute_model",
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 387, in execute_method
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 378, in execute_method
    return executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 273, in execute_model
    output = self.model_runner.execute_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1363, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 422, in forward
    model_output = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 322, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 245, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 172, in forward
    qkv, _ = self.qkv_proj(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 334, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 122, in apply
    return F.linear(x, layer.weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 59, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 10-10 00:24:11 async_llm_engine.py:181] Aborted request cmpl-7179337ad6fe4efaa13236d16aa59ec1-0.
INFO 10-10 00:24:11 async_llm_engine.py:181] Aborted request cmpl-5b9eabe0d8dc43178d4f3bb359d17737-0.
INFO 10-10 00:24:11 async_llm_engine.py:181] Aborted request cmpl-ea7ce76a97a84141911213d6779c3f25-0.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 555, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f86df6e2fe0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 754, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 774, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 295, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
    raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 555, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f86df6e3c40

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 754, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 774, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 295, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
    raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
[2024-10-10 00:24:11,693 E 39923 41512] logging.cc:115: Stack trace: 
 /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x107b84a) [0x7f8d86daf84a] ray::operator<<()
/usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x107ead2) [0x7f8d86db2ad2] ray::TerminateHandler()
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7f8eef48220c]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7f8eef482277]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae1fe) [0x7f8eef4821fe]
/usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe5ab35) [0x7f8ea19eab35] c10d::ProcessGroupNCCL::ncclCommWatchdog()
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f8eef4b0253]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f8ef10a6ac3]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f8ef1138850]

*** SIGABRT received at time=1728491051 on cpu 150 ***
PC: @     0x7f8ef10a89fc  (unknown)  pthread_kill
    @     0x7f8ef1054520  (unknown)  (unknown)
[2024-10-10 00:24:11,693 E 39923 41512] logging.cc:440: *** SIGABRT received at time=1728491051 on cpu 150 ***
[2024-10-10 00:24:11,693 E 39923 41512] logging.cc:440: PC: @     0x7f8ef10a89fc  (unknown)  pthread_kill
[2024-10-10 00:24:11,693 E 39923 41512] logging.cc:440:     @     0x7f8ef1054520  (unknown)  (unknown)
Fatal Python error: Aborted

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, charset_normalizer.md, yaml._yaml, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, cython.cimports.libc.math, PIL._imaging, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pyarrow.lib, pyarrow._hdfsio, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, _cffi_backend, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._s3fs, xxhash._xxhash, pyarrow._json, markupsafe._speedups, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils (total: 109)
INFO 10-10 00:24:11 logger.py:36] Received request cmpl-5c3567835cf143ce89abfb6abde149e7-0: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n阅读下面的CONTEXT,并完成TASK<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n<CONTEXT>\n SAIGroup appointed Michael Healy as senior investment partner to focus on healthcare and other sectors. Healy has over a decade of experience investing in high-growth healthcare companies, leading $4 billion+ in transactions.\n</CONTEXT>\n\n<TASK>\n请抽取上面文段中的所有适应症名称,比如肺癌、胃癌、结直肠癌、脑胶质瘤、NHL、特应性皮炎、糖尿病、肥胖等,以列表的形式返回\n- disease_list=["适应症名1", "适应症名2", "适应症名3", ...]\n</TASK>\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n- disease_list=', params: SamplingParams(n=1, best_of=1, presence_penalty=2.0, frequency_penalty=0.2, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['<|eot_id|>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=800, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128000, 128006, 9125, 128007, 271, 108414, 17297, 115070, 99465, 91495, 61648, 66913, 128009, 128006, 882, 128007, 271, 27, 99465, 397, 16998, 1953, 896, 21489, 8096, 1283, 5893, 439, 10195, 9341, 8427, 311, 5357, 389, 18985, 323, 1023, 26593, 13, 1283, 5893, 706, 927, 264, 13515, 315, 3217, 26012, 304, 1579, 2427, 19632, 18985, 5220, 11, 6522, 400, 19, 7239, 10, 304, 14463, 627, 524, 99465, 1363, 3203, 7536, 397, 15225, 116602, 18655, 17905, 28190, 17161, 38574, 105363, 56438, 108562, 51611, 111571, 31091, 126900, 30624, 57942, 118, 23706, 234, 5486, 91939, 225, 23706, 234, 5486, 37985, 74245, 57942, 254, 23706, 234, 5486, 108851, 123199, 103706, 114431, 97, 5486, 45, 13793, 5486, 66378, 51611, 34171, 105871, 114052, 5486, 117587, 126017, 103429, 5486, 117178, 91939, 244, 50667, 105610, 45277, 9554, 115707, 32626, 198, 12, 8624, 2062, 29065, 108562, 51611, 111571, 13372, 16, 498, 330, 108562, 51611, 111571, 13372, 17, 498, 330, 108562, 51611, 111571, 13372, 18, 498, 2564, 933, 524, 66913, 397, 128009, 128006, 78191, 128007, 271, 12, 8624, 2062, 28], lora_request: None, prompt_adapter_request: None.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

🐛 Describe the bug

The server was launched across the Ray cluster with tensor parallel size 16 (two 8×H20 nodes):

```bash
MODEL_PATH="/data/ckpts/405B-instruct"
nohup python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_PATH \
    --swap-space 32 \
    --tensor-parallel-size 16 \
    --served-model-name llama3-1-405B \
    --host 0.0.0.0 \
    --port 8081 \
    --max-num-seqs 1024 \
    --max-num-batched-tokens 8192 \
    --gpu-memory-utilization 0.9 \
    --enforce-eager >> /tmp/model_server_api_pre.log 2>&1 &
```
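Since the crash surfaces as an illegal memory access inside a NCCL process group spanning 16 ranks, it can help to test the cross-node fabric in isolation before blaming vLLM. Below is a minimal sanity-check sketch, not from the original report; the script name `nccl_check.py` and the `torchrun` launch flags are assumptions, and `<head-node-ip>` is a placeholder.

```python
# nccl_check.py -- a minimal cross-node NCCL sanity check (a sketch).
# Launch on each node with torchrun, e.g.:
#   torchrun --nnodes=2 --nproc-per-node=8 --node-rank=<0|1> \
#            --master-addr=<head-node-ip> --master-port=29500 nccl_check.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun exports RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR/PORT,
    # so init_process_group can read everything from the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # One all-reduce over all ranks: each rank contributes 1, so every rank
    # should end up holding the world size (16 here). A hang or an illegal
    # memory access at this point implicates the NCCL/IB fabric, not vLLM.
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
    torch.cuda.synchronize()
    print(f"rank {dist.get_rank()}: all_reduce ok, value={x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```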


yudian0504 commented 1 day ago

Try upgrading the cuBLAS wheel:

```bash
pip install nvidia-cublas-cu12==12.4.5.8
```
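For context, the environment dump above shows `nvidia-cublas-cu12==12.3.4.1`, so this pins a newer cuBLAS build. A quick sketch (standard `importlib.metadata`, nothing vLLM-specific) to confirm which versions the environment actually resolves after the upgrade:

```python
# Confirm the installed cuBLAS wheel and the CUDA toolkit torch was
# built against -- a sketch for verifying the suggested upgrade.
from importlib.metadata import version

import torch

print("nvidia-cublas-cu12 wheel:", version("nvidia-cublas-cu12"))
print("torch:", torch.__version__, "| built with CUDA:", torch.version.cuda)
```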