vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: How to purge zombie requests #4000

Open cduk opened 5 months ago

cduk commented 5 months ago

Your current environment

When testing vLLM, I noticed that sometimes when a client makes a request and then terminates abnormally, the request is still shown as running on the vLLM server. Is there a way to send a signal to vLLM to purge the existing queue and reset?

Of course, I could just kill the server and restart it, but because vLLM takes a long time to start up (almost an hour), I want to find a way to avoid that.
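To illustrate the kind of purge I have in mind, here is a rough sketch (my own illustration, not an official API), assuming you can reach the running AsyncLLMEngine instance (e.g. from a custom entrypoint) and that you track the request ids you submitted; AsyncLLMEngine.abort() appears to be the same path the server takes when it detects a client disconnect:

```python
# Sketch only: purge_requests and the request-id bookkeeping are hypothetical;
# only AsyncLLMEngine.abort(request_id) is an actual engine method.
from vllm.engine.async_llm_engine import AsyncLLMEngine


async def purge_requests(engine: AsyncLLMEngine, request_ids: list[str]) -> None:
    """Abort every request id we believe is stuck; ignore ones already gone."""
    for request_id in request_ids:
        try:
            await engine.abort(request_id)
        except Exception:
            # The request may have finished or been aborted already.
            pass
```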


cduk commented 5 months ago

Further investigation reveals that this happens for requests with a very long context: the engine times out but doesn't manage to clean up:

ERROR 04-11 09:28:30 async_llm_engine.py:482] Engine iteration timed out. This should never happen!
ERROR 04-11 09:28:30 async_llm_engine.py:43] Engine background task failed
ERROR 04-11 09:28:30 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 04-11 09:28:30 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 453, in engine_step
ERROR 04-11 09:28:30 async_llm_engine.py:43]     request_outputs = await self.engine.step_async()
ERROR 04-11 09:28:30 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
ERROR 04-11 09:28:30 async_llm_engine.py:43]     output = await self.model_executor.execute_model_async(
ERROR 04-11 09:28:30 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 125, in execute_model_async
ERROR 04-11 09:28:30 async_llm_engine.py:43]     output = await make_async(self.driver_worker.execute_model)(
ERROR 04-11 09:28:30 async_llm_engine.py:43] asyncio.exceptions.CancelledError
ERROR 04-11 09:28:30 async_llm_engine.py:43]
ERROR 04-11 09:28:30 async_llm_engine.py:43] During handling of the above exception, another exception occurred:
ERROR 04-11 09:28:30 async_llm_engine.py:43]
ERROR 04-11 09:28:30 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 04-11 09:28:30 async_llm_engine.py:43]   File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
ERROR 04-11 09:28:30 async_llm_engine.py:43]     return fut.result()
ERROR 04-11 09:28:30 async_llm_engine.py:43] asyncio.exceptions.CancelledError
ERROR 04-11 09:28:30 async_llm_engine.py:43]
ERROR 04-11 09:28:30 async_llm_engine.py:43] The above exception was the direct cause of the following exception:
ERROR 04-11 09:28:30 async_llm_engine.py:43]
ERROR 04-11 09:28:30 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 04-11 09:28:30 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
ERROR 04-11 09:28:30 async_llm_engine.py:43]     task.result()
ERROR 04-11 09:28:30 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 479, in run_engine_loop
ERROR 04-11 09:28:30 async_llm_engine.py:43]     has_requests_in_progress = await asyncio.wait_for(
ERROR 04-11 09:28:30 async_llm_engine.py:43]   File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
ERROR 04-11 09:28:30 async_llm_engine.py:43]     raise exceptions.TimeoutError() from exc
ERROR 04-11 09:28:30 async_llm_engine.py:43] asyncio.exceptions.TimeoutError
Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f9cd1428550>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f9cd2599cf0>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f9cd1428550>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f9cd2599cf0>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 453, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 125, in execute_model_async
    output = await make_async(self.driver_worker.execute_model)(
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 479, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 45, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 04-11 09:28:30 async_llm_engine.py:154] Aborted request cmpl-2a9dc0ac88b247b2b1e635db502c36b2.
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 568, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f9c560ed780

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 678, in __aexit__
    raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
INFO 04-11 09:28:32 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 16.0%, CPU KV cache usage: 0.0%

At the end, the terminated request is still shown as running.

zzmicer commented 5 months ago

I think I get the same one :/

robertgshaw2-neuralmagic commented 5 months ago

Can you share instructions to reproduce the issue on our side?

AaronFriel commented 5 months ago

I saw this with Mixtral, TP-4 on 4xA10, and prefix caching enabled.

After setting a timeout on the client requests, closing the socket on the client side, and then load testing the server, I saw a similar error message logged. After the load test concluded and no clients were connected, the number of "Running" requests stayed permanently above zero.
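Roughly what the load test looked like, as a simplified sketch (endpoint, model name, prompt size, and concurrency are placeholders):

```python
# Sketch of the load test described above: fire many streaming requests with a
# short client-side timeout, let them fail, then check whether the server's
# "Running" gauge drops back to zero.
import asyncio
import aiohttp

BASE_URL = "http://localhost:8000/v1/completions"  # placeholder endpoint


async def one_request(session: aiohttp.ClientSession) -> None:
    payload = {
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder model
        "prompt": "some long prompt " * 2000,  # long context to slow down prefill
        "max_tokens": 128,
        "stream": True,
    }
    # The short client-side timeout makes the client give up and close its socket early.
    async with session.post(BASE_URL, json=payload,
                            timeout=aiohttp.ClientTimeout(total=2)) as resp:
        async for _ in resp.content:
            pass


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            *(one_request(session) for _ in range(64)), return_exceptions=True)
    failed = sum(isinstance(r, Exception) for r in results)
    # After this finishes, check the server metrics: "Running" should return to 0.
    print(f"{failed}/64 requests timed out or failed")


asyncio.run(main())
```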

simon-mo commented 5 months ago

Can you confirm this is the OpenAI server (entrypoints.openai.api_server) and not the simple demo server (entrypoints.api_server)? I believe the cancellation handling is different.

Additionally, knowing whether the client is using streaming would help us debug.

cduk commented 5 months ago

I confirm this is the OpenAI server. I think there are two ways to reproduce:

  1. Where inference takes longer than 60 seconds. I see there is a 60-second timeout (is there any way to change this? Maybe add an engine parameter).
  2. Another time I saw this: connect with a client in streaming mode and kill or time out the client before the first streaming chunk is sent, though I'm not sure how easy or reliable it is to reproduce this way (a rough client-side sketch follows below).
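
For case 2, something like this client-side sketch should do it (base URL, model name, and prompt length are placeholders; untested):

```python
# Untested sketch: start a streaming completion, then abandon the connection
# before consuming any output, simulating a client that dies early.
import requests

try:
    resp = requests.post(
        "http://localhost:8000/v1/completions",  # placeholder base URL
        json={
            "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder model
            "prompt": "long prompt " * 4000,  # long context so prefill is slow
            "max_tokens": 256,
            "stream": True,
        },
        stream=True,
        timeout=1,  # give up quickly, before the first streamed chunk
    )
    resp.close()  # drop the socket without reading the stream
except requests.exceptions.Timeout:
    pass
# Afterwards, the abandoned request may still show as "Running" on the server.
```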

AaronFriel commented 5 months ago

@simon-mo Streaming, yes. For the server, we were using whichever one starts up with the Docker container, which I believe is the OpenAI API server, since we were using the OpenAI API.

cduk commented 5 months ago

Both cases were streaming.

cduk commented 5 months ago

Just to add: it seems to be triggered by the timeout when prompt processing takes too long. You can reproduce it with a large model and a slow GPU (or, more easily, just reduce the timeout in the code to simulate it; the default is 60 seconds).

cduk commented 5 months ago

I think the relevant code is here:

from vllm.usage.usage_lib import UsageContext

logger = init_logger(__name__)
ENGINE_ITERATION_TIMEOUT_S = int(
    os.environ.get("VLLM_ENGINE_ITERATION_TIMEOUT_S", "60"))
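
If that is right, then as a possible workaround (my assumption, untested) the timeout could be raised via that environment variable, as long as it is set before the engine module is imported, e.g.:

```python
# My assumption (untested): ENGINE_ITERATION_TIMEOUT_S is read from the
# environment at import time, so the variable must be set before
# vllm.engine.async_llm_engine is imported (or exported in the shell/Docker
# environment before starting the OpenAI server).
import os
os.environ["VLLM_ENGINE_ITERATION_TIMEOUT_S"] = "300"  # 5 minutes instead of 60 s

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Placeholder model name; replace with whatever you serve.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="mistralai/Mixtral-8x7B-Instruct-v0.1"))
```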