heungson opened this issue 4 months ago (status: Open)
I encountered a similar error, which was a serious bug in production.
Hi! I got the same error!
I'm facing the same error in production.
IIRC, this has been fixed by #4363, which should be in the next release. I haven't rigorously tested whether it specifically fixes this problem though. Doesn't look like it, based on the newly opened issues.
Related issues:
I also encountered this serious bug. It's impossible to deploy in prod since it fails unexpectedly and doesn't even restart the system. I tried the 0.4.3 pre-release, but the bug still persists 😭
Just adding some more info: I can call the endpoint from three terminals at the same time and it seems to survive, but the bug comes back when calling the endpoint from 4 terminals. So it's problematic to deploy something like this in production, where multiple simultaneous calls can happen.
Edit: additional info
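To make the repro concrete, this is roughly the script I used to hit the endpoint from N "terminals" at once. The URL, model name, and prompt are placeholders for my setup; adjust them before running:

```python
# Hedged repro sketch: fire N identical completion requests at the
# OpenAI-compatible endpoint concurrently. With n=3 my server survived;
# with n=4 the background loop died. URL/MODEL are placeholders.
import asyncio
import json
import urllib.request

URL = "http://localhost:8000/v1/completions"      # adjust to your server
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"     # placeholder model name


def build_payload(prompt: str, max_tokens: int = 256) -> bytes:
    """Serialize one completion request body as JSON bytes."""
    return json.dumps({
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode()


def send_one(payload: bytes) -> int:
    """Blocking POST; returns the HTTP status (500 once the loop is dead)."""
    req = urllib.request.Request(
        URL, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.status


async def hammer(n: int = 4) -> list:
    """Send n requests concurrently via a thread pool."""
    payload = build_payload("Write a long story about a lighthouse.")
    loop = asyncio.get_running_loop()
    tasks = [loop.run_in_executor(None, send_one, payload) for _ in range(n)]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(asyncio.run(hammer()))
```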
I am facing the same error, wonder if it's solved in v0.5.0, anyone tested on it?
I am experiencing a similar issue, with Llama3
I've also been experiencing the same issue, using Llama 3 70b, in v0.5.0.
Also hitting this with Llama 3 70b in v0.5.0. All the times it has been triggered have been when using guided_regex (via the OpenAI API), fwiw, where it happens very frequently.
EDIT: Actually just hit it without the guided_regex argument.
Facing the same error, but it seems to be related to long context length. When set to around 98k and above, the avg generation throughput stays at 0.0 tokens/s, and after 6 messages there's a loop error from the engine. It looks like the server needs more time to process the long request and the API itself cuts it off.
Update: Turns out the image I used had an outdated version installed. I upgraded vLLM to version 0.5.0.post1, and the error hasn't recurred.
More background/log info on this: this is from an Nvidia A10, with a Llama 3 8B base model and a fine-tuned QLoRA adapter (trained with Unsloth).
The symptom is a stopped background loop, with no ability to recover.
Log:
- 2024-06-22T05:51:06.490+00:00 INFO: 10.42.20.211:38860 - "GET /health HTTP/1.1" 500 Internal Server Error
- 2024-06-22T05:51:06.490+00:00 ERROR: Exception in ASGI application
- 2024-06-22T05:51:06.490+00:00 Traceback (most recent call last):
- 2024-06-22T05:51:06.490+00:00 File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
- 2024-06-22T05:51:06.490+00:00 result = await app( # type: ignore[func-returns-value]
- 2024-06-22T05:51:06.490+00:00 File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
- 2024-06-22T05:51:06.490+00:00 return await self.app(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00 File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
- 2024-06-22T05:51:06.490+00:00 await super().__call__(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
- 2024-06-22T05:51:06.490+00:00 await self.middleware_stack(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
- 2024-06-22T05:51:06.490+00:00 raise exc
- 2024-06-22T05:51:06.490+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
- 2024-06-22T05:51:06.490+00:00 await self.app(scope, receive, _send)
- 2024-06-22T05:51:06.490+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
- 2024-06-22T05:51:06.490+00:00 await self.app(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
- 2024-06-22T05:51:06.490+00:00 await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
- 2024-06-22T05:51:06.490+00:00 raise exc
- 2024-06-22T05:51:06.490+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
- 2024-06-22T05:51:06.490+00:00 await app(scope, receive, sender)
- 2024-06-22T05:51:06.490+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
- 2024-06-22T05:51:06.490+00:00 await self.middleware_stack(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
- 2024-06-22T05:51:06.490+00:00 await route.handle(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
- 2024-06-22T05:51:06.490+00:00 await self.app(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
- 2024-06-22T05:51:06.490+00:00 await wrap_app_handling_exceptions(app, request)(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
- 2024-06-22T05:51:06.490+00:00 raise exc
- 2024-06-22T05:51:06.490+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
- 2024-06-22T05:51:06.490+00:00 await app(scope, receive, sender)
- 2024-06-22T05:51:06.490+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
- 2024-06-22T05:51:06.490+00:00 response = await func(request)
- 2024-06-22T05:51:06.490+00:00 File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
- 2024-06-22T05:51:06.491+00:00 raw_response = await run_endpoint_function(
- 2024-06-22T05:51:06.491+00:00 File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
- 2024-06-22T05:51:06.491+00:00 return await dependant.call(**values)
- 2024-06-22T05:51:06.491+00:00 File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 71, in health
- 2024-06-22T05:51:06.491+00:00 await openai_serving_chat.engine.check_health()
- 2024-06-22T05:51:06.491+00:00 File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 711, in check_health
- 2024-06-22T05:51:06.491+00:00 raise AsyncEngineDeadError("Background loop is stopped.")
- 2024-06-22T05:51:06.491+00:00 vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
- 2024-06-22T05:51:08.799+00:00 INFO 06-22 05:51:08 metrics.py:229] Avg prompt throughput: 149.9 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 43.4%, CPU KV cache usage: 0.0%
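Since the engine never recovers once the loop dies, the practical workaround for us was an external watchdog that polls /health and restarts the serving process. A minimal sketch, assuming the server is at localhost:8000; the restart command is a placeholder for whatever supervises your deployment (systemd, Docker, Kubernetes, etc.):

```python
# Watchdog sketch: poll vLLM's /health endpoint and restart the server
# when it returns 5xx (the "Background loop is stopped" symptom) or
# stops answering entirely. HEALTH_URL/RESTART_CMD are placeholders.
import subprocess
import time
import urllib.error
import urllib.request
from typing import Optional

HEALTH_URL = "http://localhost:8000/health"     # adjust to your server
RESTART_CMD = ["systemctl", "restart", "vllm"]  # placeholder restart hook


def needs_restart(status: Optional[int]) -> bool:
    """Restart on HTTP 5xx, or when the server gave no response at all."""
    return status is None or status >= 500


def check_health(url: str = HEALTH_URL) -> Optional[int]:
    """Return the /health status code, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code           # e.g. the 500 from a dead engine loop
    except OSError:
        return None               # connection refused / timeout


def watch(interval_s: float = 30.0) -> None:
    """Poll forever; trigger the restart hook whenever health fails."""
    while True:
        if needs_restart(check_health()):
            subprocess.run(RESTART_CMD, check=False)
        time.sleep(interval_s)
```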
Here's the up-to-date error, which occurs sporadically, also on version 0.5.0.post1.
This is from an Nvidia A10, with a Llama 3 8B base model and a fine-tuned QLoRA adapter (trained with Unsloth). The symptom is a stopped background loop, with no ability to recover.
- 2024-06-22T20:17:30.186+00:00 INFO: 10.42.15.50:60988 - "GET /health HTTP/1.1" 500 Internal Server Error
- 2024-06-22T20:17:30.187+00:00 ERROR: Exception in ASGI application
- 2024-06-22T20:17:30.187+00:00 Traceback (most recent call last):
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
- 2024-06-22T20:17:30.187+00:00 result = await app( # type: ignore[func-returns-value]
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
- 2024-06-22T20:17:30.187+00:00 return await self.app(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
- 2024-06-22T20:17:30.187+00:00 await super().__call__(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
- 2024-06-22T20:17:30.187+00:00 await self.middleware_stack(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
- 2024-06-22T20:17:30.187+00:00 raise exc
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
- 2024-06-22T20:17:30.187+00:00 await self.app(scope, receive, _send)
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
- 2024-06-22T20:17:30.187+00:00 await self.app(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
- 2024-06-22T20:17:30.187+00:00 await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
- 2024-06-22T20:17:30.187+00:00 raise exc
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
- 2024-06-22T20:17:30.187+00:00 await app(scope, receive, sender)
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
- 2024-06-22T20:17:30.187+00:00 await self.middleware_stack(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
- 2024-06-22T20:17:30.187+00:00 await route.handle(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
- 2024-06-22T20:17:30.187+00:00 await self.app(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
- 2024-06-22T20:17:30.187+00:00 await wrap_app_handling_exceptions(app, request)(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
- 2024-06-22T20:17:30.187+00:00 raise exc
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
- 2024-06-22T20:17:30.187+00:00 await app(scope, receive, sender)
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
- 2024-06-22T20:17:30.187+00:00 response = await func(request)
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
- 2024-06-22T20:17:30.187+00:00 raw_response = await run_endpoint_function(
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
- 2024-06-22T20:17:30.187+00:00 return await dependant.call(**values)
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 84, in health
- 2024-06-22T20:17:30.187+00:00 await openai_serving_chat.engine.check_health()
- 2024-06-22T20:17:30.187+00:00 File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 842, in check_health
- 2024-06-22T20:17:30.187+00:00 raise AsyncEngineDeadError("Background loop is stopped.")
- 2024-06-22T20:17:30.187+00:00 vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
And here is the cause for it - CUDA out of memory:
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] Engine background task failed
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] Traceback (most recent call last):
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] return_value = task.result()
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] has_requests_in_progress = await asyncio.wait_for(
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] return fut.result()
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 506, in engine_step
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] request_outputs = await self.engine.step_async()
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] output = await self.model_executor.execute_model_async(
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] output = await make_async(self.driver_worker.execute_model
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] result = self.fn(*self.args, **self.kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] return func(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 280, in execute_model
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] output = self.model_runner.execute_model(seq_group_metadata_list,
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] return func(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 749, in execute_model
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] hidden_states = model_executable(
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] return self._call_impl(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] return forward_call(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 371, in forward
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] hidden_states = self.model(input_ids, positions, kv_caches,
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] return self._call_impl(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] return forward_call(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 288, in forward
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] hidden_states, residual = layer(
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] return self._call_impl(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] return forward_call(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 237, in forward
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] hidden_states = self.mlp(hidden_states)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] return self._call_impl(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] return forward_call(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 80, in forward
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] x = self.act_fn(gate_up)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] return self._call_impl(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] return forward_call(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/custom_op.py", line 13, in forward
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] return self._forward_method(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/activation.py", line 36, in forward_cuda
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] out = torch.empty(output_shape, dtype=x.dtype, device=x.device)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 216.00 MiB. GPU
- 2024-06-22T20:16:34.700+00:00 Exception in callback functools.partial(<function _log_task_completion at 0x7f10e052ecb0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f10c8a0e530>>)
Could there be a memory leak?
My card uses 21GB out of 24GB.
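In case it helps others hitting OOM near full VRAM: one mitigation is to reserve less of the GPU for vLLM at launch, leaving more headroom for activations and the LoRA adapter. A launch sketch; the model name is a placeholder and 0.85 is just a guess for a 24 GB card, not a tested recommendation:

```shell
# Reserve less VRAM for the KV cache/weights (default is 0.9).
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --gpu-memory-utilization 0.85
```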
You could try setting gpu_memory_utilization to a lower value (the default is 0.9).
Is there any update on this?
Hello! I am systematically tracking AsyncEngineDeadError in #5901.
To help us understand what is going on, I need to reproduce the errors on my side.
If you can share:
It is much easier for me to look into what is going on.
cc @Warrior0x1, @BKsirius, @agahEbrahimi, @mavericb, @Penglikai, @tommyil, @albertsokol, @tmostak, @valeriylo, @trislee02
Thank you for your prompt reply. Here's what I ran:
1. Server launch command
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-72B-Instruct-AWQ --tensor-parallel-size 2 --enforce-eager --quantization awq --gpu-memory-utilization 0.98 --max-model-len 77500
In particular, I enabled YaRN as instructed here to process long context (beyond the original 32K).
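For anyone else following along: enabling YaRN for Qwen2 means adding a rope_scaling entry to the model's config.json. The shape below is what I took from the Qwen2 model card; the factor of 4.0 is their published example for extending the 32K window, so double-check it against your target --max-model-len before relying on it:

```json
{
  "rope_scaling": {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
  }
}
```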
2. Example requests that cause the server to crash
A long request of approximately 76,188 tokens (using the OpenAI tokenizer) is attached below: long_request.txt
The error I got:
INFO: 129.126.125.252:12223 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 506, in engine_step
request_outputs = await self.engine.step_async()
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
output = await self.model_executor.execute_model_async(
File "/opt/conda/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 166, in execute_model_async
return await self._driver_execute_model_async(execute_model_req)
File "/opt/conda/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 149, in _driver_execute_model_async
return await self.driver_exec_model(execute_model_req)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
return fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/opt/conda/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
return await self.app(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 103, in create_chat_completion
generator = await openai_serving_chat.create_chat_completion(
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 282, in create_chat_completion
return await self.chat_completion_full_generator(
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 482, in chat_completion_full_generator
async for res in result_generator:
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 673, in generate
async for output in self._process_request(
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 780, in _process_request
raise e
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 776, in _process_request
async for request_output in stream:
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 89, in __anext__
raise result
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
return_value = task.result()
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/opt/conda/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
return await self.app(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 103, in create_chat_completion
generator = await openai_serving_chat.create_chat_completion(
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 282, in create_chat_completion
return await self.chat_completion_full_generator(
File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 482, in chat_completion_full_generator
async for res in result_generator:
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 673, in generate
async for output in self._process_request(
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 767, in _process_request
stream = await self.add_request(
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 572, in add_request
self.start_background_loop()
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 443, in start_background_loop
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.
Before this error, it showed the input tokens, as in this excerpt:
1030, 894, 1290, 311, 6286, 70164, 594, 829, 382, 61428, 49056, 0, 70164, 0, 70164, 8958, 43443, 17618, 13, 17426, 13, 330, 40, 3278, 1977, 432, 15356, 358, 1366, 311, 0, 70164, 0, 79123, 2293, 1837, 42246, 264, 2805, 707, 83, 7203, 8364, 84190, 14422, 1059, 19142, 448, 806, 1787, 1424, 382, 12209, 1052, 1033, 35177, 52884, 5193, 279, 14852, 6422, 11, 323, 3198, 594, 23314, 1136, 14995, 11, 323, 1550, 916, 279, 21340, 264, 1293, 10865, 289, 604, 315, 6646, 13, 4392, 13, 25636, 2127, 264, 95328, 504, 806, 653, 2986, 323, 3855, 304, 264, 294, 9832, 8841, 279, 6006, 13, 3197, 566, 1030, 8048, 4279, 1616, 566, 6519, 2163, 323, 44035, 518, 279, 6109, 2293, 25235, 7403, 323, 41563, 1136, 14995, 323, 1585, 84569, 438, 807, 49057, 1588, 323, 1052, 4221, 279, 38213, 14549, 448, 9709, 315, 12296, 11, 323, 279, 45896, 287, 7071, 389, 279, 26148, 34663, 19660, 4402, 323, 4460, 311, 8865, 264, 2975, 315, 330, 68494, 350, 4626, 1, 916, 279, 16971, 4617, 16065, 315, 24209, 86355, 13, 5005, 4392, 13, 25636, 2127, 6519, 323, 8570, 389, 700, 279, 6006, 13, 35825, 847, 8896, 504, 279, 521, 78121, 358, 8110, 382, 1, 28851, 311, 15786, 1045, 1899, 1335, 566, 11827, 11, 438, 582, 10487, 51430, 1495, 304, 279, 38636, 382, 1, 9064, 11974, 1, 73325, 2217, 1, 19434, 697, 6078, 1007, 279, 27505, 1335, 47010, 279, 38636, 8171, 382, 7044, 2148, 697, 64168, 1335, 1053, 4392, 13, 25636, 2127, 448, 37829, 11, 330, 40, 3207, 944, 1414, 358, 572, 30587, 432, 2217, 67049, 1290, 1335, 358, 7230, 11, 330, 40, 3278, 387, 15713, 311, 2217, 13, 659, 659, 358, 572, 11259, 29388, 806, 4845, 323, 566, 572, 11699, 705, 1948, 279, 24140, 11, 82758, 304, 806, 54144, 11, 448, 264, 2244, 19565, 304, 806, 6078, 382, 1, 93809, 323, 279, 33182, 659, 659, 659, 48181, 54698, 659, 659, 659, 10621, 95581, 33292, 659, 659, 659, 15605, 43786, 19874, 659, 659, 659, 659, 1837, 12209, 358, 572, 20446, 4279, 32073, 304, 279, 9255, 4722, 2188, 315, 279, 19771, 16629, 11, 36774, 518, 279, 6556, 330, 51, 1897, 2886, 1, 323, 
8580, 369, 279, 3040, 297, 62410, 5426, 382, 13874, 19324, 4340, 1657, 31365, 4278, 525, 1052, 304, 419, 2197, 30, 10479, 6139, 1105, 30, 151645, 198, 151644, 77091, 198], lora_request: None.
After that, it kept showing Running: 1 reqs, but no generation.
INFO 06-28 15:41:57 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 86.2%, CPU KV cache usage: 0.0%.
INFO 06-28 15:42:07 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 86.2%, CPU KV cache usage: 0.0%.
INFO 06-28 15:42:17 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 86.2%, CPU KV cache usage: 0.0%.
INFO 06-28 15:42:27 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 86.2%, CPU KV cache usage: 0.0%.
Thank you in advance!
I can trigger this error reliably when sending requests with larger numbers of tokens. I've reproduced this on both meta-llama/Meta-Llama-3-8B-Instruct and mistralai/Mistral-7B-Instruct-v0.1.
In my situation, I'm deploying vLLM on a CPU-only 32GB Intel system, and then running inference through the OpenAI endpoint.
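For anyone trying to reproduce the long-prompt trigger, here is a minimal sketch. The endpoint URL, API key, model name, and repeat count are all assumptions; adjust them to your own deployment. The prompt builder is deliberately separable from the network call so it can be sanity-checked offline:

```python
import asyncio

# Hypothetical endpoint details -- adjust to your own deployment.
BASE_URL = "http://localhost:8000/v1"
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"

def make_long_prompt(n_repeats: int) -> str:
    """Build a prompt large enough to stress the prefill phase."""
    return "Summarize the following text. " + "lorem ipsum " * n_repeats

async def fire_request(prompt: str) -> str:
    # Requires `pip install openai`; imported lazily so the prompt
    # builder stays usable without a live server.
    from openai import AsyncOpenAI
    client = AsyncOpenAI(base_url=BASE_URL, api_key="token-abc123")
    resp = await client.completions.create(model=MODEL, prompt=prompt, max_tokens=64)
    return resp.choices[0].text

async def main() -> None:
    # Several concurrent long requests is the pattern reported to trigger the hang.
    prompts = [make_long_prompt(8000) for _ in range(4)]
    await asyncio.gather(*(fire_request(p) for p in prompts))

if __name__ == "__main__":
    pass  # asyncio.run(main())  # uncomment against a live server
```

If the bug is present, the server log should show the stalled "Avg generation throughput: 0.0 tokens/s, Running: N reqs" pattern quoted above.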
Getting the same error; here is my launch command:
python -m vllm.entrypoints.openai.api_server --model Model_Files --dtype bfloat16 --chat-template chat_template.jinja --device cuda --enable-prefix-caching
I did some additional experimentation:
I'm using the same Docker image of vLLM, reasonably close to tip, and built using the Dockerfile.cpu, with meta-llama/Meta-Llama-3-8B-Instruct.
Is there any update on this? Thanks.
Same problem, but when I reduced max-num-seqs from 256 to 16, the error disappeared.
I've had some success by increasing ENGINE_ITERATION_TIMEOUT_S.
It appears the offending code is here: https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py#L630. When the engine takes too long, it times out, but then leaves the engine in a dead state. I'm not familiar enough with the internals of vLLM to suggest a fix.
That said, if someone does have an idea of how to fix it, I'm happy to try to implement a fix.
Same Error!!!
Same error: https://github.com/vllm-project/vllm/issues/6689#issuecomment-2272718218. I only see it on Llama 3.1 70B and 405B-FP8; Llama 3 70B is fine. Using the vllm package directly has no problem:
from vllm import LLM
llm = LLM("/mnt/models/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8, max_model_len=10600)
Serving it over the API server is what goes wrong:
vllm serve /mnt/models/Meta-Llama-3-70B-Instruct --tensor-parallel-size 8 --api-key token-abc123 --host 0.0.0.0 --port 8899 --max-model-len 81920
same error!!
I've had some success by increasing ENGINE_ITERATION_TIMEOUT_S. It appears the offending code is here: https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py#L630. When the engine takes too long, it times out, but then leaves the engine in a dead state. I'm not familiar enough with the internals of vLLM to suggest a fix.
That said, if someone does have an idea of how to fix it, I'm happy to try to implement a fix.
This works. I set ENGINE_ITERATION_TIMEOUT_S to 180 to align with the GraphRAG default (timeout=configuration.request_timeout or 180.0). The default value of 60 is not enough sometimes.
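For reference, ENGINE_ITERATION_TIMEOUT_S is read from the environment, so it has to be set in the server process's environment before vLLM starts. A minimal sketch of launching the API server with the longer timeout; the model path is a placeholder, and the actual launch line is left commented since it needs a machine with vLLM installed:

```python
import os
import subprocess

# Child environment with the longer engine timeout (180 s, matching GraphRAG).
env = dict(os.environ, ENGINE_ITERATION_TIMEOUT_S="180")

cmd = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
]

if __name__ == "__main__":
    # subprocess.run(cmd, env=env)  # uncomment on a machine with vLLM installed
    print(env["ENGINE_ITERATION_TIMEOUT_S"])
```

The equivalent when launching from a shell is simply exporting the variable before running the serve command.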
I have met the same issue. From my point of view, it happens in two scenarios: one is under heavy request pressure (like GraphRAG); the other is uncertain. I deployed the service to the production environment, and this kind of error appeared after about 20 days, even though there was no such pressure at that time. Such errors are hard to reproduce. Initially I suspected unstable network connections, but I quickly ruled out that possibility. I believe setting ENGINE_ITERATION_TIMEOUT_S is effective in the first scenario, but it may not necessarily work in the second.
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 506, in engine_step
request_outputs = await self.engine.step_async()
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
output = await self.model_executor.execute_model_async(
File "/usr/local/lib/python3.9/site-packages/vllm/executor/distributed_gpu_executor.py", line 166, in execute_model_async
return await self._driver_execute_model_async(execute_model_req)
File "/usr/local/lib/python3.9/site-packages/vllm/executor/multiproc_gpu_executor.py", line 149, in _driver_execute_model_async
return await self.driver_exec_model(execute_model_req)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
return_value = task.result()
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
File "/usr/local/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/starlette/responses.py", line 261, in wrap
await func()
File "/usr/local/lib/python3.9/site-packages/starlette/responses.py", line 250, in stream_response
async for chunk in self.body_iterator:
File "/root/Futuregene/FastChat/fastchat/serve/vllm_worker.py", line 196, in generate_stream
async for request_output in results_generator:
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 673, in generate
async for output in self._process_request(
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 767, in _process_request
stream = await self.add_request(
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 572, in add_request
self.start_background_loop()
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 443, in start_background_loop
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.
It seems to be a VRAM leak issue. When your VRAM is insufficient to run the model inference, the server will stop functioning, but the client won’t detect it.
I read a bunch of the code a couple months ago. When you send a request to vLLM, it gets queued for processing. There is a timeout associated with this request, governed by ENGINE_ITERATION_TIMEOUT_S. When a request exceeds the timeout, an AsyncEngineDeadError is thrown. I put together a hacky patch that simply removed the request from the queue, returning an error to the caller. This way the caller can then choose how it wants to handle a 500 response (retry, ignore, etc.). I did ping a few vLLM folks to review my patch, but never heard back from them.
So hopefully someone who is more familiar with vLLM internals than me can investigate. I'm not sure if there is a VRAM leak issue or not (I certainly got the error frequently enough on new instances, which suggests it's not a leak), but I do think the semantics of the queue are incorrect.
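Until the queue semantics are fixed, callers can at least guard themselves. A rough client-side retry sketch with exponential backoff; the flaky coroutine is a stand-in for whatever call your code makes to the server, and catching bare Exception is a simplification (in practice you would match the 500 / AsyncEngineDeadError cases specifically):

```python
import asyncio

async def with_retries(coro_fn, *, attempts: int = 3, backoff_s: float = 1.0):
    """Retry a coroutine factory on failure with exponential backoff.

    coro_fn is called fresh on each attempt; the last exception is re-raised
    once attempts are exhausted, so the caller can still decide to give up.
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            return await coro_fn()
        except Exception as exc:  # simplification: match your 500s specifically
            last_exc = exc
            await asyncio.sleep(backoff_s * (2 ** attempt))
    raise last_exc

# Demo with a stand-in that fails twice before succeeding.
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("500: Background loop has errored already")
    return "ok"

result = asyncio.run(with_retries(flaky, attempts=4, backoff_s=0.01))
print(result)  # -> ok
```

Note this only helps once the server actually returns an error; it does nothing for the dead-engine state where the server stops responding entirely.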
Hi, I am using Triton server to host my engine and am getting the same issue. Can someone explain how to set ENGINE_ITERATION_TIMEOUT_S in Triton server?
@ashwin-js were you able to figure it out?
same error👀
@khayamgondal No, but I have a workaround: I export the Triton metrics and keep GPU utilization below 85%, and I am not seeing any error.
In model.json I kept gpu_memory_utilization at 0.90.
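One way to implement that kind of guard is to poll nvidia-smi and hold off on new requests while GPU memory is tight. A sketch; the 85% threshold mirrors the comment above, the query flags are standard nvidia-smi options, and the parser is kept separate so it can be exercised on canned output without a GPU:

```python
import subprocess

def gpu_memory_fraction(csv_line: str) -> float:
    """Parse one line of `nvidia-smi --query-gpu=memory.used,memory.total
    --format=csv,noheader,nounits` into a used/total fraction."""
    used, total = (float(x) for x in csv_line.split(","))
    return used / total

def query_gpu() -> float:
    # Requires an NVIDIA driver; returns the fraction for GPU 0.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return gpu_memory_fraction(out.splitlines()[0])

def should_admit(fraction: float, threshold: float = 0.85) -> bool:
    """Admit new work only while GPU memory stays under the threshold."""
    return fraction < threshold

# Offline demo on a canned nvidia-smi line (20480 MiB used of 24564 MiB).
frac = gpu_memory_fraction("20480, 24564")
print(should_admit(frac))  # -> True (about 83% used)
```

A dispatcher can call query_gpu() before each batch and sleep or shed load when should_admit() returns False.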
I have stopped the service requests, but there is still a ghost request that continues to run, and the GPU KV cache usage keeps increasing until it reaches 100%. My startup command is:
vllm serve /root/Llama3.1-8b-Instruct --dtype auto --api-key token-abc123 --max_model_len 39680 --served-model-name "llama3.1-8b"
The vllm version is 0.6.1.post2, and the requests are made as concurrent batch requests (using from openai import AsyncOpenAI); the GPU is a 4090.
@Silas-Xu Same as you. Have you resolved it?
I have tried various methods, including upgrading to the latest version and using different parameters, but none have been successful. It is said that using a smaller gpu_memory_utilization can solve this problem, but it may cause an inability to start, with a message indicating insufficient GPU memory.
I'm running the glm-4-9b-gptq-int4 model on an RTX 4090 with gpu_memory_utilization=0.5, but the model still reports this error.
Your current environment
docker image: vllm/vllm-openai:0.4.2 Model: https://huggingface.co/alpindale/c4ai-command-r-plus-GPTQ GPUs: RTX8000 * 2
🐛 Describe the bug
The model works fine until the following error is raised.
INFO 05-26 22:28:18 async_llm_engine.py:529] Received request cmpl-10dff83cb4b6422ba8c64213942a7e46: prompt: '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>"Question: Is Korea the name of a Nation?\nGuideline: No explanation.\nFormat: {"Answer": "<your yes/no answer>"}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>', sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['---'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [5, 5, 255000, 255006, 9, 60478, 33, 3294, 13489, 1690, 2773, 1719, 1671, 20611, 38, 206, 46622, 7609, 33, 3679, 33940, 21, 206, 8961, 33, 19586, 61664, 2209, 31614, 28131, 20721, 22, 3598, 11205, 37, 22631, 255001, 255000, 255007], lora_request: None.
INFO 05-26 22:28:18 async_llm_engine.py:154] Aborted request cmpl-10dff83cb4b6422ba8c64213942a7e46.
INFO: 10.11.3.150:6231 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 475, in engine_step
request_outputs = await self.engine.step_async()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 221, in step_async
output = await self.model_executor.execute_model_async(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 110, in execute_model_async
all_outputs = await self._run_workers_async("execute_model",
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 326, in _run_workers_async
all_outputs = await asyncio.gather(*coros)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
return fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 99, in create_chat_completion
generator = await openai_serving_chat.create_chat_completion(
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 138, in create_chat_completion
return await self.chat_completion_full_generator(
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 301, in chat_completion_full_generator
async for res in result_generator:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 666, in generate
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 660, in generate
async for request_output in stream:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 77, in __anext__
raise result
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 99, in create_chat_completion
generator = await openai_serving_chat.create_chat_completion(
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 138, in create_chat_completion
return await self.chat_completion_full_generator(
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 301, in chat_completion_full_generator
async for res in result_generator:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 666, in generate
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 650, in generate
stream = await self.add_request(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 537, in add_request
self.start_background_loop()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 411, in start_background_loop
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already