vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Using CPU for inference, an error occurred. [Engine iteration timed out. This should never happen! ] #7722

Open liuzhipengchd opened 3 weeks ago

liuzhipengchd commented 3 weeks ago

I compiled vLLM 0.5.4 for a CPU that does not support AVX512. After compiling, I entered the container and ran the command below to start the Llama 3 8B model, then queried it with the OpenAI client script that follows.

```bash
python3 -m vllm.entrypoints.openai.api_server \
    --model /data/WiNGPT2-Llama-3-8B-Chat \
    --served-model-name Llama3-8B \
    --port 8001
```

```python
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8001/v1"

role = 'You are an intelligent assistant for processing text.'
text = 'Who are you?'

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Build the chat messages from the system prompt and the user text
messages = [
    {"role": "system", "content": role},
    {"role": "user", "content": text},
]

result = client.chat.completions.create(
    model="Llama3-8B",
    messages=messages,
    temperature=0.3,
    stream=True,
)

for chunk in result:
    if chunk.choices[0].delta.content is not None:
        content = chunk.choices[0].delta.content
        print(content)
```

🐛 Describe the bug

```text
3, 112471, 128001, 198, 72803, 5232], lora_request: None, prompt_adapter_request: None.
INFO: 127.0.0.1:48252 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 08-21 07:31:22 async_llm_engine.py:174] Added request chat-0eefb9c0183b4a2197d1408cd47717ce.
INFO 08-21 07:31:28 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-21 07:31:38 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-21 07:31:48 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-21 07:31:58 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-21 07:32:08 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-21 07:32:18 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
ERROR 08-21 07:32:22 async_llm_engine.py:663] Engine iteration timed out. This should never happen!
ERROR 08-21 07:32:22 async_llm_engine.py:57] Engine background task failed
ERROR 08-21 07:32:22 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.4+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 636, in run_engine_loop
ERROR 08-21 07:32:22 async_llm_engine.py:57]     done, _ = await asyncio.wait(
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 08-21 07:32:22 async_llm_engine.py:57]     return await _wait(fs, timeout, return_when, loop)
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
ERROR 08-21 07:32:22 async_llm_engine.py:57]     await waiter
ERROR 08-21 07:32:22 async_llm_engine.py:57] asyncio.exceptions.CancelledError
ERROR 08-21 07:32:22 async_llm_engine.py:57]
ERROR 08-21 07:32:22 async_llm_engine.py:57] During handling of the above exception, another exception occurred:
ERROR 08-21 07:32:22 async_llm_engine.py:57]
ERROR 08-21 07:32:22 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.4+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
ERROR 08-21 07:32:22 async_llm_engine.py:57]     return_value = task.result()
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.4+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
ERROR 08-21 07:32:22 async_llm_engine.py:57]     async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.4+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 08-21 07:32:22 async_llm_engine.py:57]     self._do_exit(exc_type)
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.4+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 08-21 07:32:22 async_llm_engine.py:57]     raise asyncio.TimeoutError
ERROR 08-21 07:32:22 async_llm_engine.py:57] asyncio.exceptions.TimeoutError
INFO 08-21 07:32:22 async_llm_engine.py:181] Aborted request chat-0eefb9c0183b4a2197d1408cd47717ce.
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 555, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fc250d76260

During handling of the above exception, another exception occurred:
```

liuzhipengchd commented 3 weeks ago

How can I solve this problem? Does my CPU not support inference?

ilya-lavrenov commented 3 weeks ago

Hi @liuzhipengchd, you can try running on CPU via the OpenVINO backend: https://docs.vllm.ai/en/latest/getting_started/openvino-installation.html
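
For reference, a minimal sketch of what switching to the OpenVINO backend could look like, assuming a from-source build as described in the linked guide (the exact requirements files and flags may differ between vLLM versions, so verify against the documentation):

```bash
# Build vLLM from source with the OpenVINO backend selected
# (paraphrased from the linked installation guide; check the current docs
# for the exact requirements files and extra index URLs)
pip install --upgrade pip
VLLM_TARGET_DEVICE=openvino pip install -v .

# Then start the server the same way as before
python3 -m vllm.entrypoints.openai.api_server \
    --model /data/WiNGPT2-Llama-3-8B-Chat \
    --served-model-name Llama3-8B \
    --port 8001
```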

liuzhipengchd commented 3 weeks ago

@ilya-lavrenov Thank you for your help, it's very useful to me. I have another question: if I quantize Llama 3 to int4 and use the CPU for inference, will vLLM's throughput be higher?

ilya-lavrenov commented 3 weeks ago

> @ilya-lavrenov Thank you for your help, it's very useful to me. I have another question: if I quantize Llama 3 to int4 and use the CPU for inference, will vLLM's throughput be higher?

> @HPUedCSLearner Hi, if VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS is not enabled, will the inference speed be very slow? What about the concurrency?

You can see the difference in https://github.com/vllm-project/vllm/pull/5379#issue-2344007685 if you open the spoilers with plots; the FP16 model actually performs even better.

Generally, int4 / int8 should give better performance, but we currently have extra optimizations for FP16 weights, which is why FP16 comes out ahead. We are working on enabling dynamic quantization using AMX, which will fully utilize the compressed weights.
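
As a rough sketch, assuming VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS is an environment variable set when launching the server (its accepted values should be confirmed in the OpenVINO backend docs), enabling it would look something like:

```bash
# Sketch only: launch the server with OpenVINO weight compression enabled.
# The exact accepted values for this variable (e.g. ON / 1) should be
# confirmed against the OpenVINO backend documentation.
VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
    python3 -m vllm.entrypoints.openai.api_server \
    --model /data/WiNGPT2-Llama-3-8B-Chat \
    --served-model-name Llama3-8B \
    --port 8001
```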