[Bug]: Engine iteration timed out. This should never happen!

UPDATE on 2024-05-23

Workaround: Use the --disable-custom-all-reduce flag when starting the vLLM instance. Thanks @ywang96 !

Following is the original post

🐛 Describe the bug

Summary

A model execution thread hangs at _random_sample (vllm/model_executor/layers/sampler.py:292) mysteriously during inference, and the corresponding code at that line is random_samples = random_samples.cpu()

What happened

We upgraded vLLM from v0.3.3 to 0.4.x, but found vLLM occasionally got stuck and refused to serve requests. From the vLLM log, we saw that a request never got finished. After we dug deeper, we found that it was because a thread got stuck during execution.

Your current environment

vLLM was running inside a docker container. The following was collected from inside the container.

Collecting environment information...
PyTorch version: 2.2.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-3.10.0-1062.9.1.el7.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A800-SXM4-80GB
GPU 1: NVIDIA A800-SXM4-80GB
GPU 2: NVIDIA A800-SXM4-80GB
GPU 3: NVIDIA A800-SXM4-80GB

Nvidia driver version: 525.85.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          128
On-line CPU(s) list:             0-127
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz
CPU family:                      6
Model:                           106
Thread(s) per core:              2
Core(s) per socket:              32
Socket(s):                       2
Stepping:                        6
Frequency boost:                 enabled
CPU max MHz:                     3500.0000
CPU min MHz:                     800.0000
BogoMIPS:                        5200.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 invpcid_single intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq md_clear pconfig spec_ctrl intel_stibp flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       3 MiB (64 instances)
L1i cache:                       2 MiB (64 instances)
L2 cache:                        80 MiB (64 instances)
L3 cache:                        96 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-31,64-95
NUMA node1 CPU(s):               32-63,96-127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; Load fences, usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable, IBPB
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.19.3
[pip3] torch==2.2.1
[pip3] triton==2.2.0
[pip3] vllm-nccl-cu12==2.18.1.0.3.0
[conda] Could not collectROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    CPU Affinity    NUMA Affinity
GPU0     X  NV8 NV8 NV8 PXB PXB NODE    NODE    SYS SYS SYS SYS 0-31,64-95  0
GPU1    NV8  X  NV8 NV8 PXB PXB NODE    NODE    SYS SYS SYS SYS 0-31,64-95  0
GPU2    NV8 NV8  X  NV8 NODE    NODE    PXB PXB SYS SYS SYS SYS 0-31,64-95  0
GPU3    NV8 NV8 NV8  X  NODE    NODE    PXB PXB SYS SYS SYS SYS 0-31,64-95  0
NIC0    PXB PXB NODE    NODE     X  PIX NODE    NODE    SYS SYS SYS SYS
NIC1    PXB PXB NODE    NODE    PIX  X  NODE    NODE    SYS SYS SYS SYS
NIC2    NODE    NODE    PXB PXB NODE    NODE     X  PIX SYS SYS SYS SYS
NIC3    NODE    NODE    PXB PXB NODE    NODE    PIX  X  SYS SYS SYS SYS
NIC4    SYS SYS SYS SYS SYS SYS SYS SYS  X  PIX NODE    NODE
NIC5    SYS SYS SYS SYS SYS SYS SYS SYS PIX  X  NODE    NODE
NIC6    SYS SYS SYS SYS SYS SYS SYS SYS NODE    NODE     X  PIX
NIC7    SYS SYS SYS SYS SYS SYS SYS SYS NODE    NODE    PIX  X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7

Stacktrace

ERROR 04-28 16:01:15 async_llm_engine.py:499] Engine iteration timed out. This should never happen!
ERROR 04-28 16:01:15 async_llm_engine.py:43] Engine background task failed
ERROR 04-28 16:01:15 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 04-28 16:01:15 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 470, in engine_step
ERROR 04-28 16:01:15 async_llm_engine.py:43]     request_outputs = await self.engine.step_async()
ERROR 04-28 16:01:15 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
ERROR 04-28 16:01:15 async_llm_engine.py:43]     output = await self.model_executor.execute_model_async(
ERROR 04-28 16:01:15 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 418, in execute_model_async
ERROR 04-28 16:01:15 async_llm_engine.py:43]     all_outputs = await self._run_workers_async(
ERROR 04-28 16:01:15 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 408, in _run_workers_async
ERROR 04-28 16:01:15 async_llm_engine.py:43]     all_outputs = await asyncio.gather(*coros)
ERROR 04-28 16:01:15 async_llm_engine.py:43] asyncio.exceptions.CancelledError
ERROR 04-28 16:01:15 async_llm_engine.py:43]
ERROR 04-28 16:01:15 async_llm_engine.py:43] During handling of the above exception, another exception occurred:
ERROR 04-28 16:01:15 async_llm_engine.py:43]
ERROR 04-28 16:01:15 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 04-28 16:01:15 async_llm_engine.py:43]   File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
ERROR 04-28 16:01:15 async_llm_engine.py:43]     return fut.result()
ERROR 04-28 16:01:15 async_llm_engine.py:43] asyncio.exceptions.CancelledError
ERROR 04-28 16:01:15 async_llm_engine.py:43]
ERROR 04-28 16:01:15 async_llm_engine.py:43] The above exception was the direct cause of the following exception:
ERROR 04-28 16:01:15 async_llm_engine.py:43]
ERROR 04-28 16:01:15 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 04-28 16:01:15 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
ERROR 04-28 16:01:15 async_llm_engine.py:43]     task.result()
ERROR 04-28 16:01:15 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 496, in run_engine_loop
ERROR 04-28 16:01:15 async_llm_engine.py:43]     has_requests_in_progress = await asyncio.wait_for(
ERROR 04-28 16:01:15 async_llm_engine.py:43]   File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
ERROR 04-28 16:01:15 async_llm_engine.py:43]     raise exceptions.TimeoutError() from exc
ERROR 04-28 16:01:15 async_llm_engine.py:43] asyncio.exceptions.TimeoutError
ERROR:asyncio:Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7fdd6c983370>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fddf0f1eb60>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7fdd6c983370>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fddf0f1eb60>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 470, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 418, in execute_model_async
    all_outputs = await self._run_workers_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 408, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 496, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 45, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 04-28 16:01:15 async_llm_engine.py:154] Aborted request cmpl-c2bd7aff0e9141b685c9e33e8e7135cb-0.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 568, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fd74838cdc0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 678, in __aexit__
    raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)

Request Log

...
INFO 04-28 16:24:30 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 04-28 16:24:40 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 04-28 16:24:50 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 04-28 16:25:00 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 04-28 16:25:10 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
...

The Running: 1 reqs never changed to Running: 0 reqs

NCCL Error

After some time, it complained that there was an NCCL timeout issue.

(RayWorkerWrapper pid=7156) [rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5447, OpType=GATHER, NumelIn=9872, NumelOut=0, Timeout(ms)=600000) ran for 600195 milliseconds before timing out.
(RayWorkerWrapper pid=7156) [rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(RayWorkerWrapper pid=7156) [rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
(RayWorkerWrapper pid=7156) [rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5447, OpType=GATHER, NumelIn=9872, NumelOut=0, Timeout(ms)=600000) ran for 600195 milliseconds before timing out.
(RayWorkerWrapper pid=7156) Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
(RayWorkerWrapper pid=7156) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7fd00ced87 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=7156) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f7b1dcab6e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=7156) frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f7b1dcaec3d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=7156) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f7b1dcaf839 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=7156) frame #4: <unknown function> + 0xdc253 (0x7f7fd5cbd253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
(RayWorkerWrapper pid=7156) frame #5: <unknown function> + 0x94ac3 (0x7f7fd7affac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=7156) frame #6: clone + 0x44 (0x7f7fd7b90a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=7156)
(RayWorkerWrapper pid=7156) [2024-04-28 16:10:16,129 E 7156 7629] logging.cc:101: Unhandled exception: N3c1016DistBackendErrorE. what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5447, OpType=GATHER, NumelIn=9872, NumelOut=0, Timeout(ms)=600000) ran for 600195 milliseconds before timing out.
(RayWorkerWrapper pid=7156) Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
(RayWorkerWrapper pid=7156) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7fd00ced87 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=7156) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f7b1dcab6e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=7156) frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f7b1dcaec3d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=7156) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f7b1dcaf839 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=7156) frame #4: <unknown function> + 0xdc253 (0x7f7fd5cbd253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
(RayWorkerWrapper pid=7156) frame #5: <unknown function> + 0x94ac3 (0x7f7fd7affac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=7156) frame #6: clone + 0x44 (0x7f7fd7b90a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=7156)
(RayWorkerWrapper pid=7156) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
(RayWorkerWrapper pid=7156) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7fd00ced87 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=7156) frame #1: <unknown function> + 0xdf6b11 (0x7f7b1da05b11 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=7156) frame #2: <unknown function> + 0xdc253 (0x7f7fd5cbd253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
(RayWorkerWrapper pid=7156) frame #3: <unknown function> + 0x94ac3 (0x7f7fd7affac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=7156) frame #4: clone + 0x44 (0x7f7fd7b90a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=7156)
(RayWorkerWrapper pid=7156) [2024-04-28 16:10:16,138 E 7156 7629] logging.cc:108: Stack trace:
(RayWorkerWrapper pid=7156)  /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0xfe64fa) [0x7f7fd6df34fa] ray::operator<<()
(RayWorkerWrapper pid=7156) /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0xfe8fb8) [0x7f7fd6df5fb8] ray::TerminateHandler()
(RayWorkerWrapper pid=7156) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7f7fd5c8f20c]
(RayWorkerWrapper pid=7156) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7f7fd5c8f277]
(RayWorkerWrapper pid=7156) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae1fe) [0x7f7fd5c8f1fe]
(RayWorkerWrapper pid=7156) /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xdf6bcc) [0x7f7b1da05bcc] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(RayWorkerWrapper pid=7156) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f7fd5cbd253]
(RayWorkerWrapper pid=7156) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f7fd7affac3]
(RayWorkerWrapper pid=7156) /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7f7fd7b90a04] __clone
(RayWorkerWrapper pid=7156)
(RayWorkerWrapper pid=7156) *** SIGABRT received at time=1714291816 on cpu 51 ***
(RayWorkerWrapper pid=7156) PC: @     0x7f7fd7b019fc  (unknown)  pthread_kill
(RayWorkerWrapper pid=7156)     @     0x7f7fd7aad520  (unknown)  (unknown)
(RayWorkerWrapper pid=7156) [2024-04-28 16:10:16,138 E 7156 7629] logging.cc:365: *** SIGABRT received at time=1714291816 on cpu 51 ***
(RayWorkerWrapper pid=7156) [2024-04-28 16:10:16,138 E 7156 7629] logging.cc:365: PC: @     0x7f7fd7b019fc  (unknown)  pthread_kill
(RayWorkerWrapper pid=7156) [2024-04-28 16:10:16,138 E 7156 7629] logging.cc:365:     @     0x7f7fd7aad520  (unknown)  (unknown)
(RayWorkerWrapper pid=7156) Fatal Python error: Aborted
(RayWorkerWrapper pid=7156)
(RayWorkerWrapper pid=7156)
(RayWorkerWrapper pid=7156) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, simplejson._speedups, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, sentencepiece._sentencepiece, pyarrow.lib, pyarrow._json, PIL._imaging, __triton_launcher, cuda_utils (total: 36)
(RayWorkerWrapper pid=7458) [W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:172.17.0.3]:47044 (errno: 97 - Address family not supported by protocol). [repeated 2x across cluster]
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffcb1cf47f2bc1c19ce70ffdfe01000000 Worker ID: 037b4db9866efdc63cfb62664ff3e6aa96ef26495b3ad2cbdc1dca92 Node ID: 5926d48174b708f57246402a749a9c5025218f0a2d1439d8aaaa28e7 Worker IP address: 172.17.0.3 Worker port: 46404 Worker PID: 7156 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(RayWorkerWrapper pid=7458) INFO 04-28 15:52:38 custom_all_reduce.py:246] Registering 5635 cuda graph addresses [repeated 2x across cluster]
(RayWorkerWrapper pid=7458) INFO 04-28 15:52:38 model_runner.py:1057] Graph capturing finished in 31 secs. [repeated 2x across cluster]

Thread stack

We dumped the thread and found that it got stuck during sampling.

# py-spy dump --pid 1
Process 1: python3 -m vllm.entrypoints.openai.api_server --model /models/my-model --tensor-parallel-size 4 --gpu-memory-utilization 0.9 --enable-prefix-caching
Python v3.10.12 (/usr/bin/python3.10)

Thread 0x7F9BB9AFE480 (active): "MainThread"
    run (asyncio/runners.py:44)
    run (uvicorn/server.py:65)
    run (uvicorn/main.py:575)
    <module> (vllm/entrypoints/openai/api_server.py:169)
    _run_code (runpy.py:86)
    _run_module_as_main (runpy.py:196)
Thread 7027 (idle): "ray_listen_error_messages"
    listen_error_messages (ray/_private/worker.py:2136)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 7028 (idle): "ray_print_logs"
    print_logs (ray/_private/worker.py:898)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 7718 (idle): "Thread-1 (_report_usage_worker)"
    _report_continous_usage (vllm/usage/usage_lib.py:186)
    _report_usage_worker (vllm/usage/usage_lib.py:137)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 0x7F8B217EB640 (active): "ThreadPoolExecutor-0_0"
    _random_sample (vllm/model_executor/layers/sampler.py:292)
    _sample_with_torch (vllm/model_executor/layers/sampler.py:495)
    _sample (vllm/model_executor/layers/sampler.py:593)
    forward (vllm/model_executor/layers/sampler.py:90)
    _call_impl (torch/nn/modules/module.py:1520)
    _wrapped_call_impl (torch/nn/modules/module.py:1511)
    sample (vllm/model_executor/models/llama.py:375)
    execute_model (vllm/worker/model_runner.py:858)
    decorate_context (torch/utils/_contextlib.py:115)
    execute_model (vllm/worker/worker.py:249)
    decorate_context (torch/utils/_contextlib.py:115)
    execute_method (vllm/worker/worker_base.py:149)
    run (concurrent/futures/thread.py:58)
    _worker (concurrent/futures/thread.py:83)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)

Host software

GPUs: A800 x 8 (single node, multi-GPU) NVIDIA Driver: 525.85.12 NVIDIA GPU plugin related software:

libnvidia-container-tools.x86_64 1.8.0-1                       
libnvidia-container1.x86_64      1.8.0-1                    
nvidia-container-toolkit.x86_64  1.8.0-1                 
nvidia-docker2.noarch            2.9.0-1                        
nvidia-fabric-manager.x86_64     525.85.12-1

How easy is it to reproduce the issue?

Also, Is it possible to reproduce it with CUDA_LAUNCH_BLOCKING=1 and show us the line?

How easy is it to reproduce the issue?

It's about 1/10 I think. It seemed to be very random, at least not directly caused by request concurrency, nor prompt length per our observation.

Also, Is it possible to reproduce it with CUDA_LAUNCH_BLOCKING=1 and show us the line?

We just tried. Here's the stacktrace with the env variable

ERROR 04-30 11:35:13 async_llm_engine.py:499] Engine iteration timed out. This should never happen!
ERROR 04-30 11:35:13 async_llm_engine.py:43] Engine background task failed
ERROR 04-30 11:35:13 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 04-30 11:35:13 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 470, in engine_step
ERROR 04-30 11:35:13 async_llm_engine.py:43]     request_outputs = await self.engine.step_async()
ERROR 04-30 11:35:13 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
ERROR 04-30 11:35:13 async_llm_engine.py:43]     output = await self.model_executor.execute_model_async(
ERROR 04-30 11:35:13 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 418, in execute_model_async
ERROR 04-30 11:35:13 async_llm_engine.py:43]     all_outputs = await self._run_workers_async(
ERROR 04-30 11:35:13 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 408, in _run_workers_async
ERROR 04-30 11:35:13 async_llm_engine.py:43]     all_outputs = await asyncio.gather(*coros)
ERROR 04-30 11:35:13 async_llm_engine.py:43] asyncio.exceptions.CancelledError
ERROR 04-30 11:35:13 async_llm_engine.py:43]
ERROR 04-30 11:35:13 async_llm_engine.py:43] During handling of the above exception, another exception occurred:
ERROR 04-30 11:35:13 async_llm_engine.py:43]
ERROR 04-30 11:35:13 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 04-30 11:35:13 async_llm_engine.py:43]   File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
ERROR 04-30 11:35:13 async_llm_engine.py:43]     return fut.result()
ERROR 04-30 11:35:13 async_llm_engine.py:43] asyncio.exceptions.CancelledError
ERROR 04-30 11:35:13 async_llm_engine.py:43]
ERROR 04-30 11:35:13 async_llm_engine.py:43] The above exception was the direct cause of the following exception:
ERROR 04-30 11:35:13 async_llm_engine.py:43]
ERROR 04-30 11:35:13 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 04-30 11:35:13 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
ERROR 04-30 11:35:13 async_llm_engine.py:43]     task.result()
ERROR 04-30 11:35:13 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 496, in run_engine_loop
ERROR 04-30 11:35:13 async_llm_engine.py:43]     has_requests_in_progress = await asyncio.wait_for(
ERROR 04-30 11:35:13 async_llm_engine.py:43]   File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
ERROR 04-30 11:35:13 async_llm_engine.py:43]     raise exceptions.TimeoutError() from exc
ERROR 04-30 11:35:13 async_llm_engine.py:43] asyncio.exceptions.TimeoutError
ERROR:asyncio:Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f98ef38f370>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f9973926b60>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f98ef38f370>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f9973926b60>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 470, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 418, in execute_model_async
    all_outputs = await self._run_workers_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 408, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 496, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 45, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 04-30 11:35:13 async_llm_engine.py:154] Aborted request cmpl-613f9fd468a145a6905faa7a123f4de1-0.
INFO:     10.32.76.62:49926 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 470, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 418, in execute_model_async
    all_outputs = await self._run_workers_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 408, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 105, in create_completion
    generator = await openai_serving_completion.create_completion(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 153, in create_completion
    async for i, res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 228, in consumer
    raise item
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 213, in producer
    async for item in iterator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 661, in generate
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 655, in generate
    async for request_output in stream:
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 77, in __anext__
    raise result
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 496, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
INFO 04-30 11:35:16 async_llm_engine.py:524] Received request cmpl-fbdfba046c0d4be8b48977620f4469ac-0: prompt: ..., lora_request: None.
INFO 04-30 11:35:16 async_llm_engine.py:154] Aborted request cmpl-fbdfba046c0d4be8b48977620f4469ac-0.
INFO:     10.32.76.62:50712 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 470, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 418, in execute_model_async
    all_outputs = await self._run_workers_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 408, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 105, in create_completion
    generator = await openai_serving_completion.create_completion(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 153, in create_completion
    async for i, res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 228, in consumer
    raise item
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 213, in producer
    async for item in iterator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 661, in generate
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 655, in generate
    async for request_output in stream:
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 77, in __anext__
    raise result
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 496, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 105, in create_completion
    generator = await openai_serving_completion.create_completion(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 153, in create_completion
    async for i, res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 228, in consumer
    raise item
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 213, in producer
    async for item in iterator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 661, in generate
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 645, in generate
    stream = await self.add_request(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 532, in add_request
    self.start_background_loop()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 406, in start_background_loop
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.

And this time, the thread stack is different:

# py-spy dump --pid 1
Process 1: python3 -m vllm.entrypoints.openai.api_server --model /models/my-model --tensor-parallel-size 4 --gpu-memory-utilization 0.9 --enable-prefix-caching
Python v3.10.12 (/usr/bin/python3.10)

Thread 0x7F9A4880F480 (active): "MainThread"
    run (asyncio/runners.py:44)
    run (uvicorn/server.py:65)
    run (uvicorn/main.py:575)
    <module> (vllm/entrypoints/openai/api_server.py:169)
    _run_code (runpy.py:86)
    _run_module_as_main (runpy.py:196)
Thread 7029 (idle): "ray_listen_error_messages"
    listen_error_messages (ray/_private/worker.py:2136)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 7030 (idle): "ray_print_logs"
    print_logs (ray/_private/worker.py:898)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 7717 (idle): "Thread-1 (_report_usage_worker)"
    _report_continous_usage (vllm/usage/usage_lib.py:186)
    _report_usage_worker (vllm/usage/usage_lib.py:137)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 0x7F8989DEE640 (active): "ThreadPoolExecutor-0_0"
    _prune_hidden_states (vllm/model_executor/layers/logits_processor.py:78)
    forward (vllm/model_executor/layers/logits_processor.py:47)
    _call_impl (torch/nn/modules/module.py:1520)
    _wrapped_call_impl (torch/nn/modules/module.py:1511)
    compute_logits (vllm/model_executor/models/llama.py:366)
    execute_model (vllm/worker/model_runner.py:851)
    decorate_context (torch/utils/_contextlib.py:115)
    execute_model (vllm/worker/worker.py:249)
    decorate_context (torch/utils/_contextlib.py:115)
    execute_method (vllm/worker/worker_base.py:149)
    run (concurrent/futures/thread.py:58)
    _worker (concurrent/futures/thread.py:83)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)

FYI, we actually deployed several instances. They're running on different envs.

The following instances have been running for more than 5 days without any problem:

vLLM 0.4.0post1, tp=4 (70B model) @ 4xA100, kubernetes with containerd runtime, nvidia-container-toolkit 1.14.5-1
vLLM 0.4.0post1, tp=2 (13B model) @ 2xA100, kubernetes with containerd runtime, nvidia-container-toolkit 1.14.5-1

The problematic instance was running on:

vLLM 0.4.1(also tried 0.4.0post1), tp=4 (70B model) @ 4xA800, docker 20.10.22, nvidia nvidia-container-toolkit-1.8.0

vLLM 0.3.3 could run on the same env without any problem. We replaced the host with one with the same env (A800 + docker 20.10.22 + nvidia-container-toolkit 1.8.0) and the problem still existed.

The GPU drivers of all the above hosts are 525.85.12. The container images used are all officially released ones.

Can you also share the stacktrace of workers that are not stuck? (or is all workers stuck at the same line?)

Also, is there code I can try reproducing it in our env?

Can you also share the stacktrace of workers that are not stuck? (or is all workers stuck at the same line?)

Not sure whether the following is what we need, but here we go

root@585c3300a5f9:/vllm-workspace# ps aux | grep RayWorker
root        7065  0.2  0.0 24370672 421272 ?     SNl  11:25   0:43 ray::RayWorkerWrapper
root        7155 99.0  0.6 130614300 7332572 ?   RNl  11:25 250:02 ray::RayWorkerWrapper.execute_method
root        7310 99.0  0.6 130614300 7363044 ?   RNl  11:25 250:00 ray::RayWorkerWrapper.execute_method
root        7457 99.0  0.6 130221220 7369628 ?   RNl  11:25 249:58 ray::RayWorkerWrapper.execute_method

So I guess I need to dump the thread stacks of the processes with pid 7065, 7155, 7310 and 7457

PID 7065

root@585c3300a5f9:/vllm-workspace# py-spy dump --pid 7065
Process 7065: ray::RayWorkerWrapper
Python v3.10.12 (/usr/bin/python3.10)

Thread 7065 (idle): "MainThread"
    main_loop (ray/_private/worker.py:876)
    <module> (ray/_private/workers/default_worker.py:289)

PID 7155

root@585c3300a5f9:/vllm-workspace# py-spy dump --pid 7155
Process 7155: ray::RayWorkerWrapper.execute_method
Python v3.10.12 (/usr/bin/python3.10)

Thread 7155 (active): "MainThread"
    _prune_hidden_states (vllm/model_executor/layers/logits_processor.py:78)
    forward (vllm/model_executor/layers/logits_processor.py:47)
    _call_impl (torch/nn/modules/module.py:1520)
    _wrapped_call_impl (torch/nn/modules/module.py:1511)
    compute_logits (vllm/model_executor/models/llama.py:366)
    execute_model (vllm/worker/model_runner.py:851)
    decorate_context (torch/utils/_contextlib.py:115)
    execute_model (vllm/worker/worker.py:249)
    decorate_context (torch/utils/_contextlib.py:115)
    execute_method (vllm/worker/worker_base.py:149)
    _resume_span (ray/util/tracing/tracing_helper.py:467)
    actor_method_executor (ray/_private/function_manager.py:691)
    main_loop (ray/_private/worker.py:876)
    <module> (ray/_private/workers/default_worker.py:289)

PID 7310

root@585c3300a5f9:/vllm-workspace# py-spy dump --pid 7310
Process 7310: ray::RayWorkerWrapper.execute_method
Python v3.10.12 (/usr/bin/python3.10)

Thread 7310 (active): "MainThread"
    _prune_hidden_states (vllm/model_executor/layers/logits_processor.py:78)
    forward (vllm/model_executor/layers/logits_processor.py:47)
    _call_impl (torch/nn/modules/module.py:1520)
    _wrapped_call_impl (torch/nn/modules/module.py:1511)
    compute_logits (vllm/model_executor/models/llama.py:366)
    execute_model (vllm/worker/model_runner.py:851)
    decorate_context (torch/utils/_contextlib.py:115)
    execute_model (vllm/worker/worker.py:249)
    decorate_context (torch/utils/_contextlib.py:115)
    execute_method (vllm/worker/worker_base.py:149)
    _resume_span (ray/util/tracing/tracing_helper.py:467)
    actor_method_executor (ray/_private/function_manager.py:691)
    main_loop (ray/_private/worker.py:876)
    <module> (ray/_private/workers/default_worker.py:289)

PID 7457

root@585c3300a5f9:/vllm-workspace# py-spy dump --pid 7457
Process 7457: ray::RayWorkerWrapper.execute_method
Python v3.10.12 (/usr/bin/python3.10)

Thread 7457 (active): "MainThread"
    _prune_hidden_states (vllm/model_executor/layers/logits_processor.py:78)
    forward (vllm/model_executor/layers/logits_processor.py:47)
    _call_impl (torch/nn/modules/module.py:1520)
    _wrapped_call_impl (torch/nn/modules/module.py:1511)
    compute_logits (vllm/model_executor/models/llama.py:366)
    execute_model (vllm/worker/model_runner.py:851)
    decorate_context (torch/utils/_contextlib.py:115)
    execute_model (vllm/worker/worker.py:249)
    decorate_context (torch/utils/_contextlib.py:115)
    execute_method (vllm/worker/worker_base.py:149)
    _resume_span (ray/util/tracing/tracing_helper.py:467)
    actor_method_executor (ray/_private/function_manager.py:691)
    main_loop (ray/_private/worker.py:876)
    <module> (ray/_private/workers/default_worker.py:289)

Also, is there code I can try reproducing it in our env?

We're were sending requests directly to the vllm container using curl, without any in-house code.

The container was started with the following command

$ sudo docker run -d --restart always --name LLAMA2-BASED-MODEL -e NVIDIA_VISIBLE_DEVICES=0,1,2,3 --shm-size 16g -p 8020:8000 -v LOCAL_70B_MODEL_PATH:/models/my-model -v /etc/localtime:/etc/localtime  vllm/vllm-openai:v0.4.1 --model /models/my-model  --tensor-parallel-size 4 --gpu-memory-utilization 0.9 --enable-prefix-caching

Our testing prompt was in Chinese. Other prompts can also trigger the bug, and it appears that the longer the prompt is, the greater the likelihood of the bug occurring.

curl -X POST http://IP:PORT/v1/completions -H 'Content-Type: application/json' -d '{"model": "/models/my-model", "prompt": " 按照我国养老金改革的时间表，今年10月份，我国养老金双轨制改革将结束10年的过渡期，实现最终并轨。很多人都关心，这到底会带来什么样实质性的影响。\n\n我国的养老金主要分为两大体系，一是机关事业单位，二是企业职工养老，也就是人们俗称的体制内和体制外。一直以来，我国体制内外的养老金存在较大的差距，主要体现为，体制内的工作人员不需要单独自己缴纳养老金，但是退休后可以拿到更多退休金，而企业职工需要自己缴纳养老金，退休后的养老金反而更低。\n\n以养老金替代率这个指标来看，企业职工的养老替代率只有40%多，而体制内的养老金替代率则在80%-90%，也就是说，如果在退休前工资为1万元，企业职工退休后能拿4000多元养老金，而机关事业的人员能拿八九千元。很多体制内人员的退休收入，比年轻人的正常工资还高，这曾长期引发讨论。\n\n由于各界对于养老金双轨制改革的呼声很大，2014年10月，国务院发布文件，正式对机关事业单位工作人员的养老保险制度进行改革，改革的目的是“建立更加公平、可持续的养老保险制度”。\n\n当时最大的改革力度在于，让体制内的工作人员和企业职工一样，实行社会统筹与个人账户相结合的基本养老保险制度。以前体制内工作人员不需要自己缴纳养老金，改革之后需要自己缴费。\n\n为了避免并轨对体制内的工作人员退休待遇带来明显冲击，改革将体制内的工作人员分为三类。改革之前（2014年10月）退休的“老人”，完全不受影响，待遇照旧；改革之后参加工作的“新人”，完全按照新办法执行，需要自己为养老金账户缴费。\n\n最复杂的情况是“中人”，也就是改革前（2014年10月）参加工作，改革后陆续退休的工作人员。对这部分“中人”的退休金，当时制定了一个“保低限高”的政策，也就是按照新老退休方法作对比，如果新办法计算的养老金待遇更低，那就按照老办法发放，确保养老金待遇不下降；如果新办法计算的养老金待遇更高，超出的部分每年按照一定比例发放，比如第一年发放10%，第二年发放20%，到2024年发放100%。以此来看，当时的改革对于“中人”而言，其实并没有让退休待遇下降，反而会让更多人退休待遇提升。\n\n那么，2024年的养老金并轨又是什么概念呢？\n\n2014年10月，我国对机关事业单位养老金改革时，对于“中人”的养老金设置了10年过渡期（2014年10月1日到2024年9月30日），10年过渡期结束后，也就是到了2024年10月，机关事业单位工作人员就将全部按照改革后的新办法领取养老金，老办法正式终结，这就是大家通常说的养老金并轨。\n\n那么，养老金并轨之后，是否就意味着体制内的养老金会大幅下降，体制内外的养老金彻底一致了呢？或者说，以前体制内的养老金替代率高达80%—90%，而企业职工养老金替代率只有40%多的现象就将彻底改变？\n\n显然不是。养老金双轨制并轨，并不意味着体制内外的养老金就会完全一样，和企业养老金相比，体制内的养老金仍然具有优势。\n\n从体制内“中人”养老待遇的变化其实就能看出来。2014年养老金改革之后，体制内的“中人”领取养老金，开始新老办法并用。从过去10年过渡期的统计数据来看，绝大部分体制内人员都是按照新办法发放退休待遇，因为新办法计算的待遇比老办法更高。今年10月过渡期结束之后，很多体制内工作人员的养老金将不再受到“保低限高”的约束，每年也不用被限制在一定涨幅之内。\n\n为什么养老金改革后还是无法抹平体制内外的养老金差距呢？\n\n首先，很多企业的养老金都是按照最低标准来缴纳，尤其是一些小微企业，只是为了应付检查，而体制内的养老金，大都是按照实际工资足额缴纳，当做工作人员的福利，所以，缴费基准不同，决定了最终领取养老金的差距。\n\n其次，机关事业养老金改革之后，为了保证退休待遇不下降，当时的文件规定“机关事业单位在参加基本养老保险的基础上，应当为其工作人员建立职业年金。单位按本单位工资总额的8%缴费，个人按本人缴费工资的4%缴费。工作人员退休后，按月领取职业年金待遇。”\n\n这意味着，机关事业单位的职业年金属于强制性，所以短期之内就迅速形成高覆盖。按照人社部在2019年的数据，当年我国机关事业单位的职业年金覆盖率已经高达82%，但是运行时间更长的企业年金，由于不具备强制性，覆盖率还不到10%，且主要集中于国企和央企，绝大多数普通私企职工无法享受企业年金。\n\n一个国家成熟的养老体系主要由三支柱构成，也就是政府主导的第一支柱，单位主导的第二支柱，以及个人缴费的第三支柱。发达国家养老金替代率很高，主要就是靠三大支柱共同发力。\n\n从这三大支柱来看，我国的个人养老金还处于刚刚起步阶段，大部分企业职工的养老只能依靠第一支柱，所以养老金替代率很低。而机关事业单位除了第一支柱外，现在已经迅速建立起了第二支柱。\n\n所以，虽然我国体制内养老金在10年前就开始并轨改革，但这并没有影响年轻人加入体制内的热情，最近几年考公越来越热，就已经足以证明体制内依然对年轻人具有巨大的吸引力。\n\n\n总结一下以上内容，同时回答一下问题：1、我国养老金体系是怎样的，2、最新的养老金制度有哪些变化，3、对于未来养老金制度有哪些建议", "max_tokens": 2000, "temperature": 0.9}'

In our test, we could see the bug after we repeatedly sent the same request around 10 times. The requests were sent one after another from the same console, without any parallelism.

Hmm it is actually interesting PID 7065 is running nothing. It might be the root cause of hanging. Since around that logit access code, all the workers need to call gather to gather the logit to the driver worker, but it seems like driver worker is basically idle?

I am curious if there was any exception from python you are seeing from any logs?

also one interesting thing is you use --enable-prefix-caching. Does it still hang without this flag? (can you just check)? I can try reproducing it on my end in a few days

Hmm it is actually interesting PID 7065 is running nothing. It might be the root cause of hanging. Since around that logit access code, all the workers need to call gather to gather the logit to the driver worker, but it seems like driver worker is basically idle?

I am curious if there was any exception from python you are seeing from any logs?

The gather operation timed out and the thread resumed idling. Clues could be found from my previous replies.

also one interesting thing is you use --enable-prefix-caching. Does it still hang without this flag? (can you just check)? I can try reproducing it on my end in a few days

Just removed --enable-prefix-caching and restarted the container. The bug appeared again in less than 10 requests.

🐛 Describe the bug

Summary

What happened

Your current environment

vLLM was running inside a docker container. The following was collected from inside the container.

Collecting environment information...
PyTorch version: 2.2.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-3.10.0-1062.9.1.el7.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A800-SXM4-80GB
GPU 1: NVIDIA A800-SXM4-80GB
GPU 2: NVIDIA A800-SXM4-80GB
GPU 3: NVIDIA A800-SXM4-80GB

Nvidia driver version: 525.85.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          128
On-line CPU(s) list:             0-127
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz
CPU family:                      6
Model:                           106
Thread(s) per core:              2
Core(s) per socket:              32
Socket(s):                       2
Stepping:                        6
Frequency boost:                 enabled
CPU max MHz:                     3500.0000
CPU min MHz:                     800.0000
BogoMIPS:                        5200.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 invpcid_single intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq md_clear pconfig spec_ctrl intel_stibp flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       3 MiB (64 instances)
L1i cache:                       2 MiB (64 instances)
L2 cache:                        80 MiB (64 instances)
L3 cache:                        96 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-31,64-95
NUMA node1 CPU(s):               32-63,96-127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; Load fences, usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable, IBPB
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.19.3
[pip3] torch==2.2.1
[pip3] triton==2.2.0
[pip3] vllm-nccl-cu12==2.18.1.0.3.0
[conda] Could not collectROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0  GPU1    GPU2    GPU3    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    CPU Affinity    NUMA Affinity
GPU0   X  NV8 NV8 NV8 PXB PXB NODE    NODE    SYS SYS SYS SYS 0-31,64-95  0
GPU1  NV8  X  NV8 NV8 PXB PXB NODE    NODE    SYS SYS SYS SYS 0-31,64-95  0
GPU2  NV8 NV8  X  NV8 NODE    NODE    PXB PXB SYS SYS SYS SYS 0-31,64-95  0
GPU3  NV8 NV8 NV8  X  NODE    NODE    PXB PXB SYS SYS SYS SYS 0-31,64-95  0
NIC0  PXB PXB NODE    NODE     X  PIX NODE    NODE    SYS SYS SYS SYS
NIC1  PXB PXB NODE    NODE    PIX  X  NODE    NODE    SYS SYS SYS SYS
NIC2  NODE    NODE    PXB PXB NODE    NODE     X  PIX SYS SYS SYS SYS
NIC3  NODE    NODE    PXB PXB NODE    NODE    PIX  X  SYS SYS SYS SYS
NIC4  SYS SYS SYS SYS SYS SYS SYS SYS  X  PIX NODE    NODE
NIC5  SYS SYS SYS SYS SYS SYS SYS SYS PIX  X  NODE    NODE
NIC6  SYS SYS SYS SYS SYS SYS SYS SYS NODE    NODE     X  PIX
NIC7  SYS SYS SYS SYS SYS SYS SYS SYS NODE    NODE    PIX  X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7

Stacktrace

ERROR 04-28 16:01:15 async_llm_engine.py:499] Engine iteration timed out. This should never happen!
ERROR 04-28 16:01:15 async_llm_engine.py:43] Engine background task failed
ERROR 04-28 16:01:15 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 04-28 16:01:15 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 470, in engine_step
ERROR 04-28 16:01:15 async_llm_engine.py:43]     request_outputs = await self.engine.step_async()
ERROR 04-28 16:01:15 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
ERROR 04-28 16:01:15 async_llm_engine.py:43]     output = await self.model_executor.execute_model_async(
ERROR 04-28 16:01:15 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 418, in execute_model_async
ERROR 04-28 16:01:15 async_llm_engine.py:43]     all_outputs = await self._run_workers_async(
ERROR 04-28 16:01:15 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 408, in _run_workers_async
ERROR 04-28 16:01:15 async_llm_engine.py:43]     all_outputs = await asyncio.gather(*coros)
ERROR 04-28 16:01:15 async_llm_engine.py:43] asyncio.exceptions.CancelledError
ERROR 04-28 16:01:15 async_llm_engine.py:43]
ERROR 04-28 16:01:15 async_llm_engine.py:43] During handling of the above exception, another exception occurred:
ERROR 04-28 16:01:15 async_llm_engine.py:43]
ERROR 04-28 16:01:15 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 04-28 16:01:15 async_llm_engine.py:43]   File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
ERROR 04-28 16:01:15 async_llm_engine.py:43]     return fut.result()
ERROR 04-28 16:01:15 async_llm_engine.py:43] asyncio.exceptions.CancelledError
ERROR 04-28 16:01:15 async_llm_engine.py:43]
ERROR 04-28 16:01:15 async_llm_engine.py:43] The above exception was the direct cause of the following exception:
ERROR 04-28 16:01:15 async_llm_engine.py:43]
ERROR 04-28 16:01:15 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 04-28 16:01:15 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
ERROR 04-28 16:01:15 async_llm_engine.py:43]     task.result()
ERROR 04-28 16:01:15 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 496, in run_engine_loop
ERROR 04-28 16:01:15 async_llm_engine.py:43]     has_requests_in_progress = await asyncio.wait_for(
ERROR 04-28 16:01:15 async_llm_engine.py:43]   File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
ERROR 04-28 16:01:15 async_llm_engine.py:43]     raise exceptions.TimeoutError() from exc
ERROR 04-28 16:01:15 async_llm_engine.py:43] asyncio.exceptions.TimeoutError
ERROR:asyncio:Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7fdd6c983370>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fddf0f1eb60>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7fdd6c983370>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fddf0f1eb60>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 470, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 418, in execute_model_async
    all_outputs = await self._run_workers_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 408, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 496, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 45, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 04-28 16:01:15 async_llm_engine.py:154] Aborted request cmpl-c2bd7aff0e9141b685c9e33e8e7135cb-0.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 568, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fd74838cdc0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 678, in __aexit__
    raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)

Request Log

...
INFO 04-28 16:24:30 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 04-28 16:24:40 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 04-28 16:24:50 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 04-28 16:25:00 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 04-28 16:25:10 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
...

The Running: 1 reqs never changed to Running: 0 reqs

NCCL Error

After some time, it complained that there was an NCCL timeout issue.

(RayWorkerWrapper pid=7156) [rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5447, OpType=GATHER, NumelIn=9872, NumelOut=0, Timeout(ms)=600000) ran for 600195 milliseconds before timing out.
(RayWorkerWrapper pid=7156) [rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(RayWorkerWrapper pid=7156) [rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
(RayWorkerWrapper pid=7156) [rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5447, OpType=GATHER, NumelIn=9872, NumelOut=0, Timeout(ms)=600000) ran for 600195 milliseconds before timing out.
(RayWorkerWrapper pid=7156) Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
(RayWorkerWrapper pid=7156) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7fd00ced87 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=7156) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f7b1dcab6e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=7156) frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f7b1dcaec3d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=7156) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f7b1dcaf839 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=7156) frame #4: <unknown function> + 0xdc253 (0x7f7fd5cbd253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
(RayWorkerWrapper pid=7156) frame #5: <unknown function> + 0x94ac3 (0x7f7fd7affac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=7156) frame #6: clone + 0x44 (0x7f7fd7b90a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=7156)
(RayWorkerWrapper pid=7156) [2024-04-28 16:10:16,129 E 7156 7629] logging.cc:101: Unhandled exception: N3c1016DistBackendErrorE. what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5447, OpType=GATHER, NumelIn=9872, NumelOut=0, Timeout(ms)=600000) ran for 600195 milliseconds before timing out.
(RayWorkerWrapper pid=7156) Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
(RayWorkerWrapper pid=7156) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7fd00ced87 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=7156) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f7b1dcab6e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=7156) frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f7b1dcaec3d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=7156) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f7b1dcaf839 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=7156) frame #4: <unknown function> + 0xdc253 (0x7f7fd5cbd253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
(RayWorkerWrapper pid=7156) frame #5: <unknown function> + 0x94ac3 (0x7f7fd7affac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=7156) frame #6: clone + 0x44 (0x7f7fd7b90a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=7156)
(RayWorkerWrapper pid=7156) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
(RayWorkerWrapper pid=7156) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7fd00ced87 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(RayWorkerWrapper pid=7156) frame #1: <unknown function> + 0xdf6b11 (0x7f7b1da05b11 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
(RayWorkerWrapper pid=7156) frame #2: <unknown function> + 0xdc253 (0x7f7fd5cbd253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
(RayWorkerWrapper pid=7156) frame #3: <unknown function> + 0x94ac3 (0x7f7fd7affac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=7156) frame #4: clone + 0x44 (0x7f7fd7b90a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(RayWorkerWrapper pid=7156)
(RayWorkerWrapper pid=7156) [2024-04-28 16:10:16,138 E 7156 7629] logging.cc:108: Stack trace:
(RayWorkerWrapper pid=7156)  /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0xfe64fa) [0x7f7fd6df34fa] ray::operator<<()
(RayWorkerWrapper pid=7156) /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0xfe8fb8) [0x7f7fd6df5fb8] ray::TerminateHandler()
(RayWorkerWrapper pid=7156) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7f7fd5c8f20c]
(RayWorkerWrapper pid=7156) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7f7fd5c8f277]
(RayWorkerWrapper pid=7156) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae1fe) [0x7f7fd5c8f1fe]
(RayWorkerWrapper pid=7156) /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xdf6bcc) [0x7f7b1da05bcc] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(RayWorkerWrapper pid=7156) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f7fd5cbd253]
(RayWorkerWrapper pid=7156) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f7fd7affac3]
(RayWorkerWrapper pid=7156) /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7f7fd7b90a04] __clone
(RayWorkerWrapper pid=7156)
(RayWorkerWrapper pid=7156) *** SIGABRT received at time=1714291816 on cpu 51 ***
(RayWorkerWrapper pid=7156) PC: @     0x7f7fd7b019fc  (unknown)  pthread_kill
(RayWorkerWrapper pid=7156)     @     0x7f7fd7aad520  (unknown)  (unknown)
(RayWorkerWrapper pid=7156) [2024-04-28 16:10:16,138 E 7156 7629] logging.cc:365: *** SIGABRT received at time=1714291816 on cpu 51 ***
(RayWorkerWrapper pid=7156) [2024-04-28 16:10:16,138 E 7156 7629] logging.cc:365: PC: @     0x7f7fd7b019fc  (unknown)  pthread_kill
(RayWorkerWrapper pid=7156) [2024-04-28 16:10:16,138 E 7156 7629] logging.cc:365:     @     0x7f7fd7aad520  (unknown)  (unknown)
(RayWorkerWrapper pid=7156) Fatal Python error: Aborted
(RayWorkerWrapper pid=7156)
(RayWorkerWrapper pid=7156)
(RayWorkerWrapper pid=7156) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, simplejson._speedups, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, sentencepiece._sentencepiece, pyarrow.lib, pyarrow._json, PIL._imaging, __triton_launcher, cuda_utils (total: 36)
(RayWorkerWrapper pid=7458) [W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:172.17.0.3]:47044 (errno: 97 - Address family not supported by protocol). [repeated 2x across cluster]
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffcb1cf47f2bc1c19ce70ffdfe01000000 Worker ID: 037b4db9866efdc63cfb62664ff3e6aa96ef26495b3ad2cbdc1dca92 Node ID: 5926d48174b708f57246402a749a9c5025218f0a2d1439d8aaaa28e7 Worker IP address: 172.17.0.3 Worker port: 46404 Worker PID: 7156 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(RayWorkerWrapper pid=7458) INFO 04-28 15:52:38 custom_all_reduce.py:246] Registering 5635 cuda graph addresses [repeated 2x across cluster]
(RayWorkerWrapper pid=7458) INFO 04-28 15:52:38 model_runner.py:1057] Graph capturing finished in 31 secs. [repeated 2x across cluster]

Thread stack

We dumped the thread and found that it got stuck during sampling.

# py-spy dump --pid 1
Process 1: python3 -m vllm.entrypoints.openai.api_server --model /models/my-model --tensor-parallel-size 4 --gpu-memory-utilization 0.9 --enable-prefix-caching
Python v3.10.12 (/usr/bin/python3.10)

Thread 0x7F9BB9AFE480 (active): "MainThread"
    run (asyncio/runners.py:44)
    run (uvicorn/server.py:65)
    run (uvicorn/main.py:575)
    <module> (vllm/entrypoints/openai/api_server.py:169)
    _run_code (runpy.py:86)
    _run_module_as_main (runpy.py:196)
Thread 7027 (idle): "ray_listen_error_messages"
    listen_error_messages (ray/_private/worker.py:2136)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 7028 (idle): "ray_print_logs"
    print_logs (ray/_private/worker.py:898)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 7718 (idle): "Thread-1 (_report_usage_worker)"
    _report_continous_usage (vllm/usage/usage_lib.py:186)
    _report_usage_worker (vllm/usage/usage_lib.py:137)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 0x7F8B217EB640 (active): "ThreadPoolExecutor-0_0"
    _random_sample (vllm/model_executor/layers/sampler.py:292)
    _sample_with_torch (vllm/model_executor/layers/sampler.py:495)
    _sample (vllm/model_executor/layers/sampler.py:593)
    forward (vllm/model_executor/layers/sampler.py:90)
    _call_impl (torch/nn/modules/module.py:1520)
    _wrapped_call_impl (torch/nn/modules/module.py:1511)
    sample (vllm/model_executor/models/llama.py:375)
    execute_model (vllm/worker/model_runner.py:858)
    decorate_context (torch/utils/_contextlib.py:115)
    execute_model (vllm/worker/worker.py:249)
    decorate_context (torch/utils/_contextlib.py:115)
    execute_method (vllm/worker/worker_base.py:149)
    run (concurrent/futures/thread.py:58)
    _worker (concurrent/futures/thread.py:83)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)

Host software

GPUs: A800 x 8 (single node, multi-GPU) NVIDIA Driver: 525.85.12 NVIDIA GPU plugin related software:

libnvidia-container-tools.x86_64 1.8.0-1                       
libnvidia-container1.x86_64      1.8.0-1                    
nvidia-container-toolkit.x86_64  1.8.0-1                 
nvidia-docker2.noarch            2.9.0-1                        
nvidia-fabric-manager.x86_64     525.85.12-1

We also encounter the same issue when deploying Qwen 72b with Tensor Parallelism. Inspired by these interesting findings, we further discover something interesting: while deploying with tp=4, CPU RAM usage continuously increases. However, when a model is deployed on a single GPU(with out tensor parallelism), the CPU RAM usage remains steady, even after processing thousands of requests.

with tp=4, CPU RAM usage continuously increases.

Can you create a separate issue for this?

Regarding the issue itself, can you guys try the latest matser? There was one issue that caused blocking that we fixed lately https://github.com/vllm-project/vllm/pull/4557. Just want to make sure this was not the root cause. I will have time to try repro the issue this week

FYI - this might have something to do with the custom all reduce operation. We have observed this same issue but it went away after specifying --disable-custom-all-reduce when launching the server.

with tp=4, CPU RAM usage continuously increases.

Can you create a separate issue for this?

Regarding the issue itself, can you guys try the latest matser? There was one issue that caused blocking that we fixed lately #4557. Just want to make sure this was not the root cause. I will have time to try repro the issue this week

--disable-custom-all-reduce = True
--enforce-eager = True (may be unnecessary)
update to the https://github.com/vllm-project/vllm/pull/4557 This three can solve the watchdog problem for me

before this, nccl watchdog error happens several times per day, and now, it works well

@changyuanzhangchina do you need to set all of them, or just one of them fixes it?

I wonder if 3 itself is sufficient to fix the issue espeically

@changyuanzhangchina do you need to set all of them, or just one of them fixes it?

I wonder if 3 itself is sufficient to fix the issue espeically

condition 3 has great probability for the root cause

while for condition 1, we don't know is there any problem for all the serving envs.

Thus, we also disable it.

As to condition 2, we have found memory leakage several months ago, while this may have also been fixed.

with tp=4, CPU RAM usage continuously increases.

Can you create a separate issue for this? Regarding the issue itself, can you guys try the latest matser? There was one issue that caused blocking that we fixed lately #4557. Just want to make sure this was not the root cause. I will have time to try repro the issue this week

--disable-custom-all-reduce = True

--enforce-eager = True (may be unnecessary)

update to the [Core] Ignore infeasible swap requests. #4557 This three can solve the watchdog problem for me

before this, nccl watchdog error happens several times per day, and now, it works well

use --disable-custom-all-reduce = True； The vllm service cannot be started

you mean using that flag gives you the error?

I encountered the same error when using command "vllm.entrypoints.openai.api_server" to start my server. The model server would get stuck everytime I post some concurrent and long requests.

But I temporarily got it solved by rewriting openai format http services and downgraded vllm into 0.3.0.(formely 0.4.0) i.e., I initialized my vllm server using AsyncLLMEngine, and wrote my own openai format routers as my entrypoint.

I have tried nealy one hundred requests and the error seemed to disappear, not sure where the problem lies though.

I'll continously share updates if anything new.

it'd be also great to try the latest master to see if it fixes the issue (or after 0.4.3 is released) because https://github.com/vllm-project/vllm/pull/4557 could be the root cause if you see hangs from long context size

Disabling the custom all-reduce functionality with the --disable-custom-all-reduce flag resolves the issue. We've tested this successfully with both the vllm:0.4.1 and vllm:0.4.2 Docker images, without needing to enable eager mode or switch to the main branch.

It's worth noting that this issue might not be directly related to https://github.com/vllm-project/vllm/pull/4557 for a couple of reasons:

We've been running another two instances on 4xA100 GPUs with 80GB of memory each for weeks without encountering any problems, and we haven't had to use the --disable-custom-all-reduce flag.
Our test requests are sent sequentially, and according to vLLM's statistics, GPU KV cache usage stays below 1%, while CPU KV Cache usage remains at 0%.

UPDATE on 2024-05-23

Workaround: Use the --disable-custom-all-reduce flag when starting the vLLM instance. Thanks @ywang96 !

@itechbear Glad that this does resolve your issue - I suspect it has something to do with the topology of the GPUs when not all of them in the box are used for serving.

循着问题找到这里，在0.5.0.post1版本上，按照仓库的dockerfile编译的docker镜像，qwen14B运行在4张4090上，--disable-custom-all-reduce没有能够解决我的问题

with tp=4, CPU RAM usage continuously increases.

Can you create a separate issue for this? Regarding the issue itself, can you guys try the latest matser? There was one issue that caused blocking that we fixed lately #4557. Just want to make sure this was not the root cause. I will have time to try repro the issue this week

--disable-custom-all-reduce = True

--enforce-eager = True (may be unnecessary)

update to the [Core] Ignore infeasible swap requests. #4557 This three can solve the watchdog problem for me

before this, nccl watchdog error happens several times per day, and now, it works well

Today is July 5th. The above three strategies cannot solve this problem. Is there any other tips to deal with it?

for us, this three is enough you can also update the newest main branch to check whether this exists again.

with tp=4, CPU RAM usage continuously increases.

Can you create a separate issue for this? Regarding the issue itself, can you guys try the latest matser? There was one issue that caused blocking that we fixed lately #4557. Just want to make sure this was not the root cause. I will have time to try repro the issue this week

--disable-custom-all-reduce = True

--enforce-eager = True (may be unnecessary)

update to the [Core] Ignore infeasible swap requests. #4557 This three can solve the watchdog problem for me

before this, nccl watchdog error happens several times per day, and now, it works well

Today is July 5th. The above three strategies cannot solve this problem. Is there any other tips to deal with it?

--disable-custom-all-reduce does not work for me either. Still testing for other workarounds.

can this problem fixed?

vllm-project / vllm