Open cduk opened 5 months ago
Further investigation reveals that this happens for requests with a very long context, where the engine times out but doesn't manage to clean up:
```
ERROR 04-11 09:28:30 async_llm_engine.py:482] Engine iteration timed out. This should never happen!
ERROR 04-11 09:28:30 async_llm_engine.py:43] Engine background task failed
ERROR 04-11 09:28:30 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 04-11 09:28:30 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 453, in engine_step
ERROR 04-11 09:28:30 async_llm_engine.py:43]     request_outputs = await self.engine.step_async()
ERROR 04-11 09:28:30 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
ERROR 04-11 09:28:30 async_llm_engine.py:43]     output = await self.model_executor.execute_model_async(
ERROR 04-11 09:28:30 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 125, in execute_model_async
ERROR 04-11 09:28:30 async_llm_engine.py:43]     output = await make_async(self.driver_worker.execute_model)(
ERROR 04-11 09:28:30 async_llm_engine.py:43] asyncio.exceptions.CancelledError
ERROR 04-11 09:28:30 async_llm_engine.py:43]
ERROR 04-11 09:28:30 async_llm_engine.py:43] During handling of the above exception, another exception occurred:
ERROR 04-11 09:28:30 async_llm_engine.py:43]
ERROR 04-11 09:28:30 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 04-11 09:28:30 async_llm_engine.py:43]   File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
ERROR 04-11 09:28:30 async_llm_engine.py:43]     return fut.result()
ERROR 04-11 09:28:30 async_llm_engine.py:43] asyncio.exceptions.CancelledError
ERROR 04-11 09:28:30 async_llm_engine.py:43]
ERROR 04-11 09:28:30 async_llm_engine.py:43] The above exception was the direct cause of the following exception:
ERROR 04-11 09:28:30 async_llm_engine.py:43]
ERROR 04-11 09:28:30 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 04-11 09:28:30 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
ERROR 04-11 09:28:30 async_llm_engine.py:43]     task.result()
ERROR 04-11 09:28:30 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 479, in run_engine_loop
ERROR 04-11 09:28:30 async_llm_engine.py:43]     has_requests_in_progress = await asyncio.wait_for(
ERROR 04-11 09:28:30 async_llm_engine.py:43]   File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
ERROR 04-11 09:28:30 async_llm_engine.py:43]     raise exceptions.TimeoutError() from exc
ERROR 04-11 09:28:30 async_llm_engine.py:43] asyncio.exceptions.TimeoutError
Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f9cd1428550>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f9cd2599cf0>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f9cd1428550>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f9cd2599cf0>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 453, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 125, in execute_model_async
    output = await make_async(self.driver_worker.execute_model)(
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 479, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 45, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 04-11 09:28:30 async_llm_engine.py:154] Aborted request cmpl-2a9dc0ac88b247b2b1e635db502c36b2.
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 568, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f9c560ed780

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 678, in __aexit__
    raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
INFO 04-11 09:28:32 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 16.0%, CPU KV cache usage: 0.0%
```
At the end, the terminated request is still shown as running.
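For anyone tracing this: the timeout comes from the `asyncio.wait_for` around each engine step in `run_engine_loop` (visible in the traceback above). When a step overruns, the step task gets cancelled and the waiter re-raises the cancellation as `TimeoutError`, which kills the background loop while the request stays booked as running. A minimal sketch of that asyncio pattern, not vLLM code, just the mechanism:

```python
import asyncio

ENGINE_ITERATION_TIMEOUT_S = 1  # vLLM's default is 60 s; shrunk here for the demo


async def engine_step():
    # Stand-in for a long prompt-processing step that overruns the timeout.
    await asyncio.sleep(10)


async def run_engine_loop():
    try:
        # wait_for cancels engine_step() when the timeout expires...
        await asyncio.wait_for(engine_step(), ENGINE_ITERATION_TIMEOUT_S)
    except asyncio.TimeoutError:
        # ...and re-raises the inner CancelledError as TimeoutError here.
        # In vLLM this propagates out of the background task, so the engine
        # is marked dead while the request is still tracked as "running".
        print("engine iteration timed out")


asyncio.run(run_engine_loop())
```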
I think I get the same one :/
Can you share instructions to reproduce the issue on our side?
I saw this with Mixtral, TP-4 on 4xA10, and prefix caching enabled.
By setting a timeout on the client requests, closing the socket on the client side, and then load testing the server, I saw a similar error message logged. After the load test concluded and no clients were connected, the "running" request count stayed permanently above zero.
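Roughly this kind of client pattern, as a hedged sketch in case it helps reproduce (endpoint URL, model name, prompt length, and timeout values are illustrative assumptions): stream completions against the OpenAI-compatible endpoint with a short client-side timeout so the socket is closed while the server is still busy with the prompt.

```python
import concurrent.futures

import requests

URL = "http://localhost:8000/v1/completions"  # assumed server address
PAYLOAD = {
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",  # assumed model name
    "prompt": "word " * 8000,                         # long prompt -> slow prefill
    "max_tokens": 256,
    "stream": True,
}


def one_request():
    try:
        # Short timeout: the client gives up and closes the socket while the
        # server is still processing the prompt.
        with requests.post(URL, json=PAYLOAD, stream=True, timeout=5) as resp:
            for _ in resp.iter_lines():
                pass
    except requests.exceptions.RequestException as exc:
        return repr(exc)
    return "completed"


# Fire a handful of these concurrently to mimic the load test.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for outcome in pool.map(lambda _: one_request(), range(16)):
        print(outcome)
```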
Confirming: is this the OpenAI server (`entrypoints.openai.api_server`), not the simple demo server (`entrypoints.api_server`)? I believe the cancellation handling is different.
Additionally, knowing whether the client is using streaming would help us debug.
I confirm this is the OpenAI server. I think there are two ways to reproduce it:
@simon-mo Streaming: yes. For the server, we were using whichever one starts up with the Docker container, which I believe is the OpenAI API server, since we were using the OpenAI API.
Both cases were streaming.
Just to add: it seems to be the timeout that triggers when prompt processing takes too long. You can reproduce it with a large model on a slow GPU (or, more easily, just reduce the timeout in the code to simulate it; the default is 60 seconds).
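For the "reduce the timeout in the code" route, one way to shrink it without editing the installed package is a small wrapper script that patches the constant before starting the server. The constant name below (`ENGINE_ITERATION_TIMEOUT_S` in `vllm/engine/async_llm_engine.py`) is from memory, so treat it as an assumption and check your vLLM version:

```python
# run_with_short_timeout.py -- sketch only; assumes the constant name below
# exists in your vLLM version (the 60 s default referred to above).
import runpy

import vllm.engine.async_llm_engine as async_llm_engine

# Shrink the engine-iteration timeout so the failure triggers quickly.
async_llm_engine.ENGINE_ITERATION_TIMEOUT_S = 5

# Launch the OpenAI-compatible server in-process so the patch takes effect.
# Usage: python run_with_short_timeout.py --model <model> [other vLLM flags]
runpy.run_module("vllm.entrypoints.openai.api_server", run_name="__main__")
```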
Your current environment
When testing vLLM, I noticed that sometimes when a client makes a request and then terminates abnormally, the request is still shown as running on the vLLM server. Is there a way to send a signal to vLLM to purge the existing queue and reset?

Of course, I can just kill the server and restart it, but because vLLM takes a long time to start up (almost an hour), I want to find a way to avoid that.
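(For reference, when driving `AsyncLLMEngine` directly rather than through the HTTP server, one possible workaround is to abort stuck request IDs yourself. A rough sketch, assuming you track the request IDs you submit; the model name is just a placeholder:)

```python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="facebook/opt-125m")  # small model just for the sketch
)


async def generate_and_cleanup(prompt: str, request_id: str):
    results = engine.generate(prompt, SamplingParams(max_tokens=64), request_id)
    final = None
    try:
        async for output in results:
            final = output
        return final
    except asyncio.CancelledError:
        # If the caller disconnects / cancels, explicitly drop the request so
        # it does not linger in the scheduler as "running".
        await engine.abort(request_id)
        raise
```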