Kelcin2 opened this issue 3 months ago (Open)
This is a known issue. Please provide more details about your error in #5901 so we can better investigate the cause!
I built an image from Dockerfile.cpu.
I added an extra environment variable:
VLLM_CPU_KVCACHE_SPACE=4
and started the server with:
python3 -m vllm.entrypoints.openai.api_server --served-model-name Meta-Llama-3.1-8B-Instruct --model /workspace/models/Meta-Llama-3.1-8B-Instruct --max-model-len 32768
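For reference, a minimal sketch of launching the image directly with Docker for a local test (the host model path /path/to/models, the port mapping, and the default port 8000 are placeholders/assumptions for illustration, not part of my actual setup):
docker run --rm --shm-size=20g \
  -e VLLM_CPU_KVCACHE_SPACE=4 \
  -v /path/to/models:/workspace/models \
  -p 8000:8000 \
  kelcin-vllm-cpu-env:2.0.1 \
  python3 -m vllm.entrypoints.openai.api_server \
    --served-model-name Meta-Llama-3.1-8B-Instruct \
    --model /workspace/models/Meta-Llama-3.1-8B-Instruct \
    --max-model-len 32768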
Then I sent a chat request; the request payload is below (a curl equivalent is sketched after it):
{
  "model": "Meta-Llama-3.1-8B-Instruct",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Who won the world series in 2020?"
    }
  ]
}
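For completeness, the request can be reproduced with curl against the OpenAI-compatible endpoint (a sketch assuming the server is reachable on the default port 8000):
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Meta-Llama-3.1-8B-Instruct",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Who won the world series in 2020?"}
        ]
      }'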
After waiting a while, the server threw the exceptions mentioned above.
I can provide some extra environment information.
I started a minikube Kubernetes cluster with:
minikube start -n 1 --memory=24g --cpus=10 --kubernetes-version=v1.30.0 --disk-size=120g --namespace kelcin --extra-config=apiserver.service-node-port-range=1-65535
and locally built an image tagged kelcin-vllm-cpu-env:2.0.1 from Dockerfile.cpu. The build command was:
docker build -f Dockerfile.cpu -t kelcin-vllm-cpu-env:2.0.1 --shm-size=20g .
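Because the image is built against the local Docker daemon, it may also need to be made visible to the minikube node; one way to do that (assuming the minikube docker-env is not already in use) is:
minikube image load kelcin-vllm-cpu-env:2.0.1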
I also created a Helm chart: vllm-helm-0.1.0.zip
Then I ran the helm install command (don't forget to put the model files into the target directory):
helm install vllm ./vllm-helm-0.1.0.tgz
and waited until the pod was ready (see the readiness check below).
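A quick way to watch for readiness (the kelcin namespace comes from the minikube command above; the app=vllm label selector depends on the chart and is only an assumption here):
kubectl get pods -n kelcin -w
# or, if the chart labels the pod with app=vllm:
kubectl wait --for=condition=Ready pod -l app=vllm -n kelcin --timeout=10m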
Calling the /v1/models API returns a correct result, but calling /v1/chat/completions raises the exception mentioned above (the exact calls are sketched after the payload). The request payload is:
{
  "model": "Meta-Llama-3.1-8B-Instruct",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Who won the world series in 2020?"
    }
  ]
}
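A sketch of how the endpoints are hit from outside the cluster (the service name vllm and port 8000 are assumptions based on the Helm release name and the vLLM default port; payload.json is a hypothetical file holding the JSON payload above):
kubectl port-forward -n kelcin svc/vllm 8000:8000 &
# /v1/models responds correctly:
curl -s http://localhost:8000/v1/models
# /v1/chat/completions fails with the exception below:
curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d @payload.json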
The exception is:
INFO 07-28 07:34:43 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO: 10.244.0.1:47570 - "GET /v1/models HTTP/1.1" 200 OK
INFO: 10.244.0.1:47582 - "GET /v1/models HTTP/1.1" 200 OK
INFO 07-28 07:34:53 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 07-28 07:35:03 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 07-28 07:35:13 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO: 10.244.0.1:33206 - "GET /v1/models HTTP/1.1" 200 OK
INFO 07-28 07:35:23 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 07-28 07:35:29 logger.py:36] Received request chat-abdc43944b884d3d853965bceac6f110: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWho won the world series in 2020?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=None, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 2675, 527, 264, 11190, 18328, 13, 128009, 128006, 882, 128007, 271, 15546, 2834, 279, 1917, 4101, 304, 220, 2366, 15, 30, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.
INFO 07-28 07:35:29 async_llm_engine.py:173] Added request chat-abdc43944b884d3d853965bceac6f110.
INFO 07-28 07:35:33 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 07-28 07:35:43 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO: 10.244.0.1:57848 - "GET /v1/models HTTP/1.1" 200 OK
INFO: 10.244.0.1:57836 - "GET /v1/models HTTP/1.1" 200 OK
INFO 07-28 07:35:53 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 07-28 07:36:03 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 07-28 07:36:13 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO: 10.244.0.1:52104 - "GET /v1/models HTTP/1.1" 200 OK
INFO 07-28 07:36:23 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
ERROR 07-28 07:36:29 async_llm_engine.py:658] Engine iteration timed out. This should never happen!
ERROR 07-28 07:36:29 async_llm_engine.py:56] Engine background task failed
ERROR 07-28 07:36:29 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 07-28 07:36:29 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 631, in run_engine_loop
ERROR 07-28 07:36:29 async_llm_engine.py:56] done, _ = await asyncio.wait(
ERROR 07-28 07:36:29 async_llm_engine.py:56] File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 07-28 07:36:29 async_llm_engine.py:56] return await _wait(fs, timeout, return_when, loop)
ERROR 07-28 07:36:29 async_llm_engine.py:56] File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
ERROR 07-28 07:36:29 async_llm_engine.py:56] await waiter
ERROR 07-28 07:36:29 async_llm_engine.py:56] asyncio.exceptions.CancelledError
ERROR 07-28 07:36:29 async_llm_engine.py:56]
ERROR 07-28 07:36:29 async_llm_engine.py:56] During handling of the above exception, another exception occurred:
ERROR 07-28 07:36:29 async_llm_engine.py:56]
ERROR 07-28 07:36:29 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 07-28 07:36:29 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
ERROR 07-28 07:36:29 async_llm_engine.py:56] return_value = task.result()
ERROR 07-28 07:36:29 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 630, in run_engine_loop
ERROR 07-28 07:36:29 async_llm_engine.py:56] async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 07-28 07:36:29 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 07-28 07:36:29 async_llm_engine.py:56] self._do_exit(exc_type)
ERROR 07-28 07:36:29 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 07-28 07:36:29 async_llm_engine.py:56] raise asyncio.TimeoutError
ERROR 07-28 07:36:29 async_llm_engine.py:56] asyncio.exceptions.TimeoutError
2024-07-28 07:36:29,312 - base_events.py - asyncio - ERROR - Exception in callback _log_task_completion(error_callback=<bound method...7f54d868d4e0>>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py:36
handle: <Handle _log_task_completion(error_callback=<bound method...7f54d868d4e0>>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py:36>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 631, in run_engine_loop
done, _ = await asyncio.wait(
File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
return await _wait(fs, timeout, return_when, loop)
File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
await waiter
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
return_value = task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 630, in run_engine_loop
async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 95, in __aexit__
self._do_exit(exc_type)
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 58, in _log_task_completion
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 07-28 07:36:29 async_llm_engine.py:180] Aborted request chat-abdc43944b884d3d853965bceac6f110.
INFO: 127.0.0.1:37684 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 631, in run_engine_loop
done, _ = await asyncio.wait(
File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
return await _wait(fs, timeout, return_when, loop)
File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
await waiter
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/api_server.py", line 130, in create_chat_completion
generator = await openai_serving_chat.create_chat_completion(
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/serving_chat.py", line 197, in create_chat_completion
return await self.chat_completion_full_generator(
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/serving_chat.py", line 448, in chat_completion_full_generator
async for res in result_generator:
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 772, in generate
async for output in self._process_request(
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 888, in _process_request
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 884, in _process_request
async for request_output in stream:
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 93, in __anext__
raise result
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
return_value = task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 630, in run_engine_loop
async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 95, in __aexit__
self._do_exit(exc_type)
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
INFO 07-28 07:36:33 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 07-28 07:36:43 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
Has this problem been solved? I'm hitting the same error as you, and it has been bothering me for two weeks.
@TianSongS No, it's a bug. Waiting for the maintainers to fix it.
Is your CPU perhaps not compatible with AVX512? It runs if the CPU supports AVX512; on a CPU without AVX512 support, inference outputs at roughly 0.05 tokens per second, which leads to the timeout. That's my guess.
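To check whether the host CPU exposes AVX-512, something like this can be run on the node (assuming a Linux host):
lscpu | grep -o 'avx512[a-z0-9_]*' | sort -u
# or count logical CPUs reporting the avx512f flag:
grep -c avx512f /proc/cpuinfo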