vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: AsyncEngineDeadError: Task finished unexpectedly with Gemma2 9B #6190

Open nelyajizi opened 1 month ago

nelyajizi commented 1 month ago

Your current environment

Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.6
Libc version: glibc-2.35

Python version: 3.9.19 (main, May  6 2024, 19:43:03)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 555.85
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             28
On-line CPU(s) list:                0-27
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Core(TM) i7-14700K
CPU family:                         6
Model:                              183
Thread(s) per core:                 2
Core(s) per socket:                 14
Socket(s):                          1
Stepping:                           1
BogoMIPS:                           6835.19
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Virtualization:                     VT-x
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          672 KiB (14 instances)
L1i cache:                          448 KiB (14 instances)
L2 cache:                           28 MiB (14 instances)
L3 cache:                           33 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] flashinfer==0.0.8+cu121torch2.3
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] onnx==1.16.1
[pip3] onnxruntime==1.18.1
[pip3] sentence-transformers==3.0.1
[pip3] torch==2.3.0
[pip3] torchaudio==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.42.3
[pip3] triton==2.3.0
[conda] flashinfer                0.0.8+cu121torch2.3          pypi_0    pypi
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] sentence-transformers     3.0.1                    pypi_0    pypi
[conda] torch                     2.3.0                    pypi_0    pypi
[conda] torchaudio                2.3.0                    pypi_0    pypi
[conda] torchvision               0.18.0                   pypi_0    pypi
[conda] transformers              4.42.3                   pypi_0    pypi
[conda] triton                    2.3.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X                              N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

I am using the latest vLLM and FlashInfer (flashinfer-0.0.8+cu121torch2.3-cp39-cp39-linux_x86_64.whl). When running Gemma2 (model_id: ModelCloud/gemma-2-9b-it-gptq-4bit) with the following code:

    engine_args = AsyncEngineArgs(model=model_id,
                                  disable_log_requests=True,
                                  disable_log_stats=True,
                                  trust_remote_code=True,
                                  gpu_memory_utilization=0.9,
                                  dtype=torch.float16,
                                  quantization=quantization,
                                  load_format=load_format)

    self.engine = AsyncLLMEngine.from_engine_args(engine_args)
    results_generator = self.engine.generate(formatted_prompt, self.sampling_params, request_id)
    index, num_tokens = 0, 0
    first_token_time = None
    buffer = ""
    output_text = ""
    async for request_output in results_generator:
        if request_output.outputs[0].text and "\ufffd" == request_output.outputs[0].text[-1]:
            continue
        if first_token_time is None:
            first_token_time = time.monotonic_ns()
        text_delta = request_output.outputs[0].text[index:]
        index = len(request_output.outputs[0].text)
        num_tokens = len(request_output.outputs[0].token_ids)

        output_text += text_delta
        buffer += text_delta
        if len(buffer) > 100:  # Adjust buffer size as needed
            print(buffer, end='', flush=True)
            buffer = ""

    # Print remaining buffer
    if buffer:
        print(buffer, end='', flush=True)
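
For completeness, the imports and values the snippet relies on are set up roughly as follows (the prompt, sampling settings, quantization and load_format values below are illustrative stand-ins, not my exact ones; sampling_params is stored on self in the snippet above):

    import time
    import uuid

    import torch
    from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

    # Illustrative values only; the real prompt is produced by the Gemma2 chat
    # template and the sampling settings may differ.
    model_id = "ModelCloud/gemma-2-9b-it-gptq-4bit"
    quantization, load_format = "gptq", "auto"
    sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=200)
    formatted_prompt = "Show me how to perform a linear regression analysis using scikit-learn."
    request_id = str(uuid.uuid4())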

After the first successful test generation, I got the following error:

Prompt length: 927, New tokens: 200, Time to first: 0.08s, Prompt tokens per second: 10961.46 tps, New tokens per second: 64.38 tps
Input: Show me how to perform a linear regression analysis using scikit-learn and plot the results using matplotlib.

ERROR 07-07 10:35:29 async_llm_engine.py:616] Engine iteration timed out. This should never happen!
ERROR 07-07 10:35:29 async_llm_engine.py:53] Engine background task failed
ERROR 07-07 10:35:29 async_llm_engine.py:53] Traceback (most recent call last):
ERROR 07-07 10:35:29 async_llm_engine.py:53]   File "/home/lawnel/miniconda3/envs/llm/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 593, in run_engine_loop
ERROR 07-07 10:35:29 async_llm_engine.py:53]     await asyncio.sleep(0)
ERROR 07-07 10:35:29 async_llm_engine.py:53]   File "/home/lawnel/miniconda3/envs/llm/lib/python3.9/asyncio/tasks.py", line 641, in sleep
ERROR 07-07 10:35:29 async_llm_engine.py:53]     await __sleep0()
ERROR 07-07 10:35:29 async_llm_engine.py:53]   File "/home/lawnel/miniconda3/envs/llm/lib/python3.9/asyncio/tasks.py", line 630, in __sleep0
ERROR 07-07 10:35:29 async_llm_engine.py:53]     yield
ERROR 07-07 10:35:29 async_llm_engine.py:53] asyncio.exceptions.CancelledError
ERROR 07-07 10:35:29 async_llm_engine.py:53]
ERROR 07-07 10:35:29 async_llm_engine.py:53] During handling of the above exception, another exception occurred:
ERROR 07-07 10:35:29 async_llm_engine.py:53]
ERROR 07-07 10:35:29 async_llm_engine.py:53] Traceback (most recent call last):
ERROR 07-07 10:35:29 async_llm_engine.py:53]   File "/home/lawnel/miniconda3/envs/llm/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 43, in _log_task_completion
ERROR 07-07 10:35:29 async_llm_engine.py:53]     return_value = task.result()
ERROR 07-07 10:35:29 async_llm_engine.py:53]   File "/home/lawnel/miniconda3/envs/llm/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 593, in run_engine_loop
ERROR 07-07 10:35:29 async_llm_engine.py:53]     await asyncio.sleep(0)
ERROR 07-07 10:35:29 async_llm_engine.py:53]   File "/home/lawnel/miniconda3/envs/llm/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 07-07 10:35:29 async_llm_engine.py:53]     self._do_exit(exc_type)
ERROR 07-07 10:35:29 async_llm_engine.py:53]   File "/home/lawnel/miniconda3/envs/llm/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 07-07 10:35:29 async_llm_engine.py:53]     raise asyncio.TimeoutError
ERROR 07-07 10:35:29 async_llm_engine.py:53] asyncio.exceptions.TimeoutError

Exception in callback _log_task_completion(error_callback=>)(<Task finishe...imeoutError()>) at /home/lawnel/miniconda3/envs/llm/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py:33
handle: <Handle _log_task_completion(error_callback=>)(<Task finishe...imeoutError()>) at /home/lawnel/miniconda3/envs/llm/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py:33>

Traceback (most recent call last):
  File "/home/lawnel/miniconda3/envs/llm/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 593, in run_engine_loop
    await asyncio.sleep(0)
  File "/home/lawnel/miniconda3/envs/llm/lib/python3.9/asyncio/tasks.py", line 641, in sleep
    await __sleep0()
  File "/home/lawnel/miniconda3/envs/llm/lib/python3.9/asyncio/tasks.py", line 630, in __sleep0
    yield
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/lawnel/miniconda3/envs/llm/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 43, in _log_task_completion
    return_value = task.result()
  File "/home/lawnel/miniconda3/envs/llm/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 593, in run_engine_loop
    await asyncio.sleep(0)
  File "/home/lawnel/miniconda3/envs/llm/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
    self._do_exit(exc_type)
  File "/home/lawnel/miniconda3/envs/llm/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/lawnel/miniconda3/envs/llm/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/lawnel/miniconda3/envs/llm/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 55, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
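
As a stopgap on the caller side, the failure can at least be caught around the streaming loop so the whole service does not go down with it. A minimal sketch (the wrapper name is made up; AsyncEngineDeadError comes from the import path shown in the traceback, and the engine still has to be rebuilt with AsyncLLMEngine.from_engine_args afterwards):

    import asyncio

    from vllm.engine.async_llm_engine import AsyncEngineDeadError

    async def stream_with_guard(engine, prompt, sampling_params, request_id):
        """Stream generated text, but surface engine death instead of crashing the caller."""
        try:
            async for request_output in engine.generate(prompt, sampling_params, request_id):
                yield request_output.outputs[0].text
        except (AsyncEngineDeadError, asyncio.TimeoutError) as exc:
            # At this point the background loop is dead; the engine object must be
            # recreated with AsyncLLMEngine.from_engine_args before serving again.
            print(f"Engine died while serving request {request_id}: {exc!r}")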

DarkLight1337 commented 1 month ago

We have a tracking issue (#5901) for this. Please provide more details there so we can better troubleshoot the underlying cause.