heungson commented 4 months ago

Your current environment

docker image: vllm/vllm-openai:0.4.2 Model: https://huggingface.co/alpindale/c4ai-command-r-plus-GPTQ GPUs: RTX8000 * 2

🐛 Describe the bug

The model works fine until the following error is raised.

INFO 05-26 22:28:18 async_llm_engine.py:529] Received request cmpl-10dff83cb4b6422ba8c64213942a7e46: prompt: '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>"Question: Is Korea the name of a Nation?\nGuideline: No explanation.\nFormat: {"Answer": "<your yes/no answer>"}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>', sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['---'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [5, 5, 255000, 255006, 9, 60478, 33, 3294, 13489, 1690, 2773, 1719, 1671, 20611, 38, 206, 46622, 7609, 33, 3679, 33940, 21, 206, 8961, 33, 19586, 61664, 2209, 31614, 28131, 20721, 22, 3598, 11205, 37, 22631, 255001, 255000, 255007], lora_request: None. INFO 05-26 22:28:18 async_llm_engine.py:154] Aborted request cmpl-10dff83cb4b6422ba8c64213942a7e46. INFO: 10.11.3.150:6231 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error ERROR: Exception in ASGI application Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 475, in engine_step request_outputs = await self.engine.step_async() File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 221, in step_async output = await self.model_executor.execute_model_async( File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 110, in execute_model_async all_outputs = await self._run_workers_async("execute_model", File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 326, in _run_workers_async all_outputs = await asyncio.gather(*coros) asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for return fut.result() asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi result = await app( # type: ignore[func-returns-value] File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in call return await self.app(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in call await super().call(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in call await self.middleware_stack(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in call raise exc File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in call await self.app(scope, receive, _send) File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in call await self.app(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in call await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app raise exc File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app await app(scope, receive, sender) File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in call await self.middleware_stack(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app await route.handle(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle await self.app(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app await wrap_app_handling_exceptions(app, request)(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app raise exc File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app await app(scope, receive, sender) File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app response = await func(request) File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app raw_response = await run_endpoint_function( File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function return await dependant.call(**values) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 99, in create_chat_completion generator = await openai_serving_chat.create_chat_completion( File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 138, in create_chat_completion return await self.chat_completion_full_generator( File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 301, in chat_completion_full_generator async for res in result_generator: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 666, in generate raise e File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 660, in generate async for request_output in stream: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 77, in anext raise result File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish task.result() File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop has_requests_in_progress = await asyncio.wait_for( File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for raise exceptions.TimeoutError() from exc asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi result = await app( # type: ignore[func-returns-value] File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in call return await self.app(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in call await super().call(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in call await self.middleware_stack(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in call raise exc File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in call await self.app(scope, receive, _send) File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in call await self.app(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in call await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app raise exc File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app await app(scope, receive, sender) File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in call await self.middleware_stack(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app await route.handle(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle await self.app(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app await wrap_app_handling_exceptions(app, request)(scope, receive, send) File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app raise exc File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app await app(scope, receive, sender) File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app response = await func(request) File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app raw_response = await run_endpoint_function( File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function return await dependant.call(**values) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 99, in create_chat_completion generator = await openai_serving_chat.create_chat_completion( File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 138, in create_chat_completion return await self.chat_completion_full_generator( File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 301, in chat_completion_full_generator async for res in result_generator: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 666, in generate raise e File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 650, in generate stream = await self.add_request( File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 537, in add_request self.start_background_loop() File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 411, in start_background_loop raise AsyncEngineDeadError( vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already

Warrior0x1 commented 4 months ago

I encountered a similar error which was a serious bug in production

BKsirius commented 4 months ago

Hi! I got the same error.! img_v3_02b9_3f9c6c3b-342c-4e01-aa94-fa6b1139e80g

agahEbrahimi commented 4 months ago

I'm facing the same error in production.

DarkLight1337 commented 4 months ago

~~IIRC, this has been fixed by #4363, which should be in the next release. I haven't rigorously tested whether it specifically fixes this problem though.~~ Doesn't look like it, based on the newly opened issues.

Related issues:

2000
3310
3839
4000
4135
4293
5443
5732
5822

mavericb commented 4 months ago

I also encountered this serious bug. It's impossible to deploy in prod since it fails unexpectedly and doesn't even restart the system. I tried the pre-release in 0.4.3, but the bug still persists 😭

Just adding some more info. I can call the endpoint from three terminals at the same time and it seems to survive, but the bug comes again when calling the endpoint with 4 terminals. So, it's problematic to deploy something like this in production where multiple simultaneous calls can happen.

Edit: additional info

Penglikai commented 3 months ago

I am facing the same error, wonder if it's solved in v0.5.0, anyone tested on it?

tommyil commented 3 months ago

I am experiencing a similar issue, with Llama3

albertsokol commented 3 months ago

I am facing the same error, wonder if it's solved in v0.5.0, anyone tested on it?

I've also been experiencing the same issue, using Llama 3 70b, in v0.5.0.

tmostak commented 3 months ago

Also hitting this with Llama 3 70b in v0.5.0. All the times it has been triggered have been when using guided_regex (via the OpenAI API), fwiw, where it happens very frequently.

EDIT: Actually just hit it without the guided_regex argument.

valeriylo commented 3 months ago

Facing the same error, but it seems it is related with long context length. When setting at around 98k and over, the Avg generation throughput stays at 0.0 tokens/s and after 6 messages there's loop error from engine. Looks like it needs more time to process long request and api itself cuts off

tommyil commented 3 months ago

Update: Turns out the image I used had an outdated version installed. I upgraded vLLM to version 0.5.0.post1, and the error hasn't recurred.

More background/log info on this: This is from Nvidia A10, with a Llama3 8B foundation, and a fine-tuned Qlora adapter (trained with unsloth).

The symptoms are a stopped background loop, without the ability to recover.

Log:

- 2024-06-22T05:51:06.490+00:00 INFO:     10.42.20.211:38860 - "GET /health HTTP/1.1" 500 Internal Server Error
- 2024-06-22T05:51:06.490+00:00 ERROR:    Exception in ASGI application
- 2024-06-22T05:51:06.490+00:00 Traceback (most recent call last):
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
- 2024-06-22T05:51:06.490+00:00     result = await app(  # type: ignore[func-returns-value]
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
- 2024-06-22T05:51:06.490+00:00     return await self.app(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
- 2024-06-22T05:51:06.490+00:00     await super().__call__(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
- 2024-06-22T05:51:06.490+00:00     await self.middleware_stack(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
- 2024-06-22T05:51:06.490+00:00     raise exc
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
- 2024-06-22T05:51:06.490+00:00     await self.app(scope, receive, _send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
- 2024-06-22T05:51:06.490+00:00     await self.app(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
- 2024-06-22T05:51:06.490+00:00     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
- 2024-06-22T05:51:06.490+00:00     raise exc
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
- 2024-06-22T05:51:06.490+00:00     await app(scope, receive, sender)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
- 2024-06-22T05:51:06.490+00:00     await self.middleware_stack(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
- 2024-06-22T05:51:06.490+00:00     await route.handle(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
- 2024-06-22T05:51:06.490+00:00     await self.app(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
- 2024-06-22T05:51:06.490+00:00     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
- 2024-06-22T05:51:06.490+00:00     raise exc
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
- 2024-06-22T05:51:06.490+00:00     await app(scope, receive, sender)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
- 2024-06-22T05:51:06.490+00:00     response = await func(request)
- 2024-06-22T05:51:06.490+00:00   File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
- 2024-06-22T05:51:06.491+00:00     raw_response = await run_endpoint_function(
- 2024-06-22T05:51:06.491+00:00   File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
- 2024-06-22T05:51:06.491+00:00     return await dependant.call(**values)
- 2024-06-22T05:51:06.491+00:00   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 71, in health
- 2024-06-22T05:51:06.491+00:00     await openai_serving_chat.engine.check_health()
- 2024-06-22T05:51:06.491+00:00   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 711, in check_health
- 2024-06-22T05:51:06.491+00:00     raise AsyncEngineDeadError("Background loop is stopped.")
- 2024-06-22T05:51:06.491+00:00 vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
- 2024-06-22T05:51:08.799+00:00 INFO 06-22 05:51:08 metrics.py:229] Avg prompt throughput: 149.9 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 6 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 43.4%, CPU KV cache usage: 0.0%

tommyil commented 3 months ago

Here's the up-to-date error, that occurs sporadically, also on version 0.5.0post1.

This is from Nvidia A10, with a Llama3 8B foundation, and a fine-tuned Qlora adapter (trained with unsloth). The symptoms are a stopped background loop, without the ability to recover.

- 2024-06-22T20:17:30.186+00:00 INFO:     10.42.15.50:60988 - "GET /health HTTP/1.1" 500 Internal Server Error
- 2024-06-22T20:17:30.187+00:00 ERROR:    Exception in ASGI application
- 2024-06-22T20:17:30.187+00:00 Traceback (most recent call last):
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
- 2024-06-22T20:17:30.187+00:00     result = await app(  # type: ignore[func-returns-value]
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
- 2024-06-22T20:17:30.187+00:00     return await self.app(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
- 2024-06-22T20:17:30.187+00:00     await super().__call__(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
- 2024-06-22T20:17:30.187+00:00     await self.middleware_stack(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
- 2024-06-22T20:17:30.187+00:00     raise exc
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
- 2024-06-22T20:17:30.187+00:00     await self.app(scope, receive, _send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
- 2024-06-22T20:17:30.187+00:00     await self.app(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
- 2024-06-22T20:17:30.187+00:00     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
- 2024-06-22T20:17:30.187+00:00     raise exc
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
- 2024-06-22T20:17:30.187+00:00     await app(scope, receive, sender)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
- 2024-06-22T20:17:30.187+00:00     await self.middleware_stack(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
- 2024-06-22T20:17:30.187+00:00     await route.handle(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
- 2024-06-22T20:17:30.187+00:00     await self.app(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
- 2024-06-22T20:17:30.187+00:00     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
- 2024-06-22T20:17:30.187+00:00     raise exc
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
- 2024-06-22T20:17:30.187+00:00     await app(scope, receive, sender)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
- 2024-06-22T20:17:30.187+00:00     response = await func(request)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
- 2024-06-22T20:17:30.187+00:00     raw_response = await run_endpoint_function(
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
- 2024-06-22T20:17:30.187+00:00     return await dependant.call(**values)
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 84, in health
- 2024-06-22T20:17:30.187+00:00     await openai_serving_chat.engine.check_health()
- 2024-06-22T20:17:30.187+00:00   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 842, in check_health
- 2024-06-22T20:17:30.187+00:00     raise AsyncEngineDeadError("Background loop is stopped.")
- 2024-06-22T20:17:30.187+00:00 vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.

And here is the cause for that - Cuda out of memory:

- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] Engine background task failed
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] Traceback (most recent call last):
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return_value = task.result()
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     has_requests_in_progress = await asyncio.wait_for(
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return fut.result()
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 506, in engine_step
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     request_outputs = await self.engine.step_async()
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     output = await self.model_executor.execute_model_async(
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     output = await make_async(self.driver_worker.execute_model
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     result = self.fn(*self.args, **self.kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return func(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 280, in execute_model
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     output = self.model_runner.execute_model(seq_group_metadata_list,
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return func(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 749, in execute_model
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     hidden_states = model_executable(
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 371, in forward
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     hidden_states = self.model(input_ids, positions, kv_caches,
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 288, in forward
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     hidden_states, residual = layer(
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 237, in forward
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     hidden_states = self.mlp(hidden_states)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 80, in forward
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     x = self.act_fn(gate_up)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return self._call_impl(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return forward_call(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/custom_op.py", line 13, in forward
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     return self._forward_method(*args, **kwargs)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/activation.py", line 36, in forward_cuda
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52]     out = torch.empty(output_shape, dtype=x.dtype, device=x.device)
- 2024-06-22T20:16:34.700+00:00 ERROR 06-22 20:16:34 async_llm_engine.py:52] torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 216.00 MiB. GPU 
- 2024-06-22T20:16:34.700+00:00 Exception in callback functools.partial(<function _log_task_completion at 0x7f10e052ecb0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f10c8a0e530>>)

Could there be a memory leak?

My card uses 21GB out of 24GB.

DarkLight1337 commented 3 months ago

Here's the up-to-date error, that occurs sporadically, also on version 0.5.0post1.

This is from Nvidia A10, with a Llama3 8B foundation, and a fine-tuned Qlora adapter (trained with unsloth). The symptoms are a stopped background loop, without the ability to recover.

And here is the cause for that - Cuda out of memory:

Could there be a memory leak?

My card uses 21GB out of 24GB.

5355 may fix your particular problem. To temporarily circumvent this issue, you can set `gpu_memory_utilization` to a lower value (the default is 0.9).

trislee02 commented 3 months ago

Is there any update on this?

robertgshaw2-neuralmagic commented 3 months ago

Hello! I am systematically tracking AsyncEngineDeadError in #5901

To help us understand what is going on, I need to reproduce the errors on my side.

If you can share:

Server launch command
Example requests that cause the server to crash

It is much easier for me to look into what is going on.

robertgshaw2-neuralmagic commented 3 months ago

cc @Warrior0x1, @BKsirius, @agahEbrahimi, @mavericb, @Penglikai, @tommyil, @albertsokol, @tmostak, @valeriylo, @trislee02

trislee02 commented 3 months ago

Thank you for your prompt reply. Here's what I ran: 1. Server launch command

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-72B-Instruct-AWQ --tensor-parallel-size 2 --enforce-eager --quantization awq --gpu-memory-utilization 0.98 --max-model-len 77500

In particular, I enabled YARN as instructed here to process long context (over the original 32K).

2. Example requests that cause the server to crash A long request of approximately 76,188 tokens (using OpenAI tokenizer) is attached below. long_request.txt

The error I got:

INFO:     129.126.125.252:12223 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 506, in engine_step
    request_outputs = await self.engine.step_async()
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
    output = await self.model_executor.execute_model_async(
  File "/opt/conda/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 166, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
  File "/opt/conda/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 149, in _driver_execute_model_async
    return await self.driver_exec_model(execute_model_req)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/opt/conda/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 103, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 282, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 482, in chat_completion_full_generator
    async for res in result_generator:
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 673, in generate
    async for output in self._process_request(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 780, in _process_request
    raise e
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 776, in _process_request
    async for request_output in stream:
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 89, in __anext__
    raise result
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
    return_value = task.result()
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/opt/conda/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/conda/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/opt/conda/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 103, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 282, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 482, in chat_completion_full_generator
    async for res in result_generator:
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 673, in generate
    async for output in self._process_request(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 767, in _process_request
    stream = await self.add_request(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 572, in add_request
    self.start_background_loop()
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 443, in start_background_loop
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.

Before this error, it showed the input token as this exerpt:

1030, 894, 1290, 311, 6286, 70164, 594, 829, 382, 61428, 49056, 0, 70164, 0, 70164, 8958, 43443, 17618, 13, 17426, 13, 330, 40, 3278, 1977, 432, 15356, 358, 1366, 311, 0, 70164, 0, 79123, 2293, 1837, 42246, 264, 2805, 707, 83, 7203, 8364, 84190, 14422, 1059, 19142, 448, 806, 1787, 1424, 382, 12209, 1052, 1033, 35177, 52884, 5193, 279, 14852, 6422, 11, 323, 3198, 594, 23314, 1136, 14995, 11, 323, 1550, 916, 279, 21340, 264, 1293, 10865, 289, 604, 315, 6646, 13, 4392, 13, 25636, 2127, 264, 95328, 504, 806, 653, 2986, 323, 3855, 304, 264, 294, 9832, 8841, 279, 6006, 13, 3197, 566, 1030, 8048, 4279, 1616, 566, 6519, 2163, 323, 44035, 518, 279, 6109, 2293, 25235, 7403, 323, 41563, 1136, 14995, 323, 1585, 84569, 438, 807, 49057, 1588, 323, 1052, 4221, 279, 38213, 14549, 448, 9709, 315, 12296, 11, 323, 279, 45896, 287, 7071, 389, 279, 26148, 34663, 19660, 4402, 323, 4460, 311, 8865, 264, 2975, 315, 330, 68494, 350, 4626, 1, 916, 279, 16971, 4617, 16065, 315, 24209, 86355, 13, 5005, 4392, 13, 25636, 2127, 6519, 323, 8570, 389, 700, 279, 6006, 13, 35825, 847, 8896, 504, 279, 521, 78121, 358, 8110, 382, 1, 28851, 311, 15786, 1045, 1899, 1335, 566, 11827, 11, 438, 582, 10487, 51430, 1495, 304, 279, 38636, 382, 1, 9064, 11974, 1, 73325, 2217, 1, 19434, 697, 6078, 1007, 279, 27505, 1335, 47010, 279, 38636, 8171, 382, 7044, 2148, 697, 64168, 1335, 1053, 4392, 13, 25636, 2127, 448, 37829, 11, 330, 40, 3207, 944, 1414, 358, 572, 30587, 432, 2217, 67049, 1290, 1335, 358, 7230, 11, 330, 40, 3278, 387, 15713, 311, 2217, 13, 659, 659, 358, 572, 11259, 29388, 806, 4845, 323, 566, 572, 11699, 705, 1948, 279, 24140, 11, 82758, 304, 806, 54144, 11, 448, 264, 2244, 19565, 304, 806, 6078, 382, 1, 93809, 323, 279, 33182, 659, 659, 659, 48181, 54698, 659, 659, 659, 10621, 95581, 33292, 659, 659, 659, 15605, 43786, 19874, 659, 659, 659, 659, 1837, 12209, 358, 572, 20446, 4279, 32073, 304, 279, 9255, 4722, 2188, 315, 279, 19771, 16629, 11, 36774, 518, 279, 6556, 330, 51, 1897, 2886, 1, 323, 8580, 369, 279, 3040, 297, 62410, 5426, 382, 13874, 19324, 4340, 1657, 31365, 4278, 525, 1052, 304, 419, 2197, 30, 10479, 6139, 1105, 30, 151645, 198, 151644, 77091, 198], lora_request: None.

After that, it kept showing Running: 1 reqs, but no generation.

INFO 06-28 15:41:57 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 86.2%, CPU KV cache usage: 0.0%.
INFO 06-28 15:42:07 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 86.2%, CPU KV cache usage: 0.0%.
INFO 06-28 15:42:17 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 86.2%, CPU KV cache usage: 0.0%.
INFO 06-28 15:42:27 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 86.2%, CPU KV cache usage: 0.0%.

Thank you in advance!

richarddli commented 2 months ago

I can trigger this error reliably when sending requests with larger amounts of tokens. I've reproduced this on both meta-llama/Meta-Llama-3-8B-Instruct and mistralai/Mistral-7B-Instruct-v0.1.

In my situation, I'm deploying vLLM on a CPU-only 32GB Intel system, and then running inference through the OpenAI endpoint.

Aillian commented 2 months ago

getting the same error, here is my launch command: python -m vllm.entrypoints.openai.api_server --model Model_Files --dtype bfloat16 --chat-template chat_template.jinja --device cuda --enable-prefix-caching

richarddli commented 2 months ago

I did some additional experimentation:

On a 64GB VM, CPU only, I was able to successfully trigger the error with a 351 token prompt.
On a 128GB VM, CPU only, the 351 token prompt did not trigger an error.
On the 128GB VM, CPU only, a 604 token prompt does trigger the error.

I'm using the same Docker image of vLLM, reasonably close to tip, and built using the Dockerfile.cpu, with meta-llama/Meta-Llama-3-8B-Instruct.

trislee02 commented 2 months ago

Is there any update on this? Thanks.

shangzyu commented 2 months ago

Same problem. But when I reduce the max-num-seqs from 256 to 16. This error disappeared.

richarddli commented 2 months ago

I've had some success by increasing ENGINE_ITERATION_TIMEOUT_S.

It appears the offending code is here: (see https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py#L630). When the engine takes too long, it times out, but then leaves the engine in a dead state. I'm not familiar enough with the internals of vLLM to suggest a fix.

That said, if someone does have an idea of how to fix it, I'm happy to try to implement a fix.

TangJiakai commented 2 months ago

Same Error!!!

mangomatrix commented 1 month ago

Same error: https://github.com/vllm-project/vllm/issues/6689#issuecomment-2272718218 Only find on Lllma 3.1, 70B 405B-fp8, llama3 70B is right! Useing vllm package has no problem:

from vllm import LLM

llm = LLM("/mnt/models/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8, max_model_len=10600)

server wrong:

vllm serve /mnt/models/Meta-Llama-3-70B-Instruct --tensor-parallel-size 8 --api-key token-abc123 --host 0.0.0.0 --port 8899 --max-model-len 81920

yitianlian commented 1 month ago

same error!!

caoxu915683474 commented 1 month ago

same error!!

yckbilly1929 commented 1 month ago

I've had some success by increasing ENGINE_ITERATION_TIMEOUT_S.

It appears the offending code is here: (see https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py#L630). When the engine takes too long, it times out, but then leaves the engine in a dead state. I'm not familiar enough with the internals of vLLM to suggest a fix.

That said, if someone does have an idea of how to fix it, I'm happy to try to implement a fix.

This works. I set ENGINE_ITERATION_TIMEOUT_S to 180 to align with GraphRAG default value timeout=configuration.request_timeout or 180.0,. The default value of 60 is not enough sometimes.

endNone commented 3 weeks ago

I have meet the same issue. From my point of view, it happens in two scenes: One situation is under heavy request pressure (like a graphrag), the other is situation is uncertain. I deployed the service to the production environment, and this kind of error appears after about 20 days, even though there was no such pressure at that time. Such errors are hard to reproduce. Initially, I suspected it was due to unstable network connections, but I quickly ruled out that possibility.I believe that setting ENGINE_ITERATION_TIMEOUT_S is effective in the first situation, but it may not necessarily work in the second one.

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 506, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 235, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.9/site-packages/vllm/executor/distributed_gpu_executor.py", line 166, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
  File "/usr/local/lib/python3.9/site-packages/vllm/executor/multiproc_gpu_executor.py", line 149, in _driver_execute_model_async
    return await self.driver_exec_model(execute_model_req)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 42, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 532, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.9/site-packages/starlette/responses.py", line 250, in stream_response
    async for chunk in self.body_iterator:
  File "/root/Futuregene/FastChat/fastchat/serve/vllm_worker.py", line 196, in generate_stream
    async for request_output in results_generator:
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 673, in generate
    async for output in self._process_request(
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 767, in _process_request
    stream = await self.add_request(
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 572, in add_request
    self.start_background_loop()
  File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 443, in start_background_loop
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.

mangomatrix commented 3 weeks ago

It seems to be a VRAM leak issue. When your VRAM is insufficient to run the model inference, the server will stop functioning, but the client won’t detect it.

richarddli commented 3 weeks ago

It seems to be a VRAM leak issue. When your VRAM is insufficient to run the model inference, the server will stop functioning, but the client won’t detect it.

I read a bunch of the code a couple months ago. When you send a request to vLLM, it gets queued for processing. There is a timeout associated with this request, that is governed by ENGINE_ITERATION_TIMEOUT_S. When a request exceeds the timeout, an AsyncEngineDeadError is thrown. I put together a hacky patch that simply removed the request from queue, returning an error to the caller. This way the caller can then choose how it wants to handle a 500 response (retry, ignore, etc.). I did ping a few vLLM folks to review my patch, but never heard back from them.

So hopefully someone who is more familiar with vLLM internals than me can investigate. I'm not sure if there is a VRAM leak issue or not (I certainly got the error frequently enough on new instances, which suggests it's not a leak), but I do think the semantics of the queue are incorrect.

ashwin-js commented 2 weeks ago

Hi I am using Triton server to host my engine. I am getting the same issue. Can some explain how to set ENGINE_ITERATION_TIMEOUT_S triton server ?

khayamgondal commented 2 weeks ago

@ashwin-js were you able to figure it out?

Silas-Xu commented 2 weeks ago

same error👀

ashwin-js commented 2 weeks ago

@khayamgondal No but I got a workaround, I exported the trtion metrics and Keeping the GPU utils below 85%. I am not seeing any error. and in model.json I kept the gpu_memory_utilization to 0.90

Silas-Xu commented 2 weeks ago

I have stopped the service requests, but there is still a ghost request that continues to run, and the GPU KV cache usage keeps increasing until it reaches 100%.

My startup command is:

vllm serve /root/Llama3.1-8b-Instruct --dtype auto --api-key token-abc123 --max_model_len 39680 --served-model-name "llama3.1-8b"

The vllm version is 0.6.1.post2, and the requests are made using concurrent batch requests (using from openai import AsyncOpenAI), GPU is 4090.

ericalduo commented 1 week ago

I have stopped the service requests, but there is still a ghost request that continues to run, and the GPU KV cache usage keeps increasing until it reaches 100%.

My startup command is:
vllm serve /root/Llama3.1-8b-Instruct --dtype auto --api-key token-abc123 --max_model_len 39680 --served-model-name "llama3.1-8b"
The vllm version is 0.6.1.post2, and the requests are made using concurrent batch requests (using from openai import AsyncOpenAI), GPU is 4090.

@Silas-Xu Same as you. Have you resolved it?

Silas-Xu commented 1 week ago

I have stopped the service requests, but there is still a ghost request that continues to run, and the GPU KV cache usage keeps increasing until it reaches 100%. My startup command is:
vllm serve /root/Llama3.1-8b-Instruct --dtype auto --api-key token-abc123 --max_model_len 39680 --served-model-name "llama3.1-8b"
The vllm version is 0.6.1.post2, and the requests are made using concurrent batch requests (using from openai import AsyncOpenAI), GPU is 4090.
@Silas-Xu Same as you. Have you resolved it?

I have tried various methods, including upgrading to the latest version and using different parameters, but none have been successful. It is said that using a smaller gpu_memory_utilization parameter can solve this problem, but it may cause an inability to start, with a message indicating insufficient GPU memory.

ericalduo commented 1 week ago

I have stopped the service requests, but there is still a ghost request that continues to run, and the GPU KV cache usage keeps increasing until it reaches 100%. My startup command is:
vllm serve /root/Llama3.1-8b-Instruct --dtype auto --api-key token-abc123 --max_model_len 39680 --served-model-name "llama3.1-8b"
The vllm version is 0.6.1.post2, and the requests are made using concurrent batch requests (using from openai import AsyncOpenAI), GPU is 4090.
@Silas-Xu Same as you. Have you resolved it?
I have tried various methods, including upgrading to the latest version and using different parameters, but none have been successful. It is said that using a smaller gpu_memory_utilization parameter can solve this problem, but it may cause an inability to start, with a message indicating insufficient GPU memory.

i running ‘glm-4-9b-gptq-int4’ model on RTX 4090 with gpu_memory_utilization=0.5, But the model still reports this error.

vllm-project / vllm

[Bug]: vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already. #5060

Your current environment

🐛 Describe the bug

The model works fine until the following error is raised.

2000

3310

3839

4000

4135

4293

5443

5732

5822

5355 may fix your particular problem. To temporarily circumvent this issue, you can set `gpu_memory_utilization` to a lower value (the default is 0.9).

vllm-project / vllm

[Bug]: vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already. #5060

Your current environment

🐛 Describe the bug

The model works fine until the following error is raised.

2000

3310

3839

4000

4135

4293

5443

5732

5822

5355 may fix your particular problem. To temporarily circumvent this issue, you can set gpu_memory_utilization to a lower value (the default is 0.9).

5355 may fix your particular problem. To temporarily circumvent this issue, you can set `gpu_memory_utilization` to a lower value (the default is 0.9).