vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Using CPU for inference, an error occurred. [Engine iteration timed out. This should never happen! ] #7722

Open liuzhipengchd opened 3 weeks ago

liuzhipengchd commented 3 weeks ago

I compiled vLLM 0.5.4 for a CPU that does not support AVX512. After compiling, I entered the container and ran the command below to start the Llama 3 8B model, then queried it with the OpenAI client script that follows.

```bash
python3 -m vllm.entrypoints.openai.api_server \
    --model /data/WiNGPT2-Llama-3-8B-Chat \
    --served-model-name Llama3-8B \
    --port 8001
```

```python
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8001/v1"

role = 'You are an intelligent assistant for processing text.'
text = 'Who are you?'

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Build the chat messages from the system prompt and the user text
messages = [
    {"role": "system", "content": role},
    {"role": "user", "content": text},
]

result = client.chat.completions.create(
    model="Llama3-8B",
    messages=messages,
    temperature=0.3,
    stream=True,
)

for chunk in result:
    if chunk.choices[0].delta.content is not None:
        content = chunk.choices[0].delta.content
        print(content)
```

🐛 Describe the bug

```text
3, 112471, 128001, 198, 72803, 5232], lora_request: None, prompt_adapter_request: None.
INFO: 127.0.0.1:48252 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 08-21 07:31:22 async_llm_engine.py:174] Added request chat-0eefb9c0183b4a2197d1408cd47717ce.
INFO 08-21 07:31:28 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-21 07:31:38 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-21 07:31:48 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-21 07:31:58 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-21 07:32:08 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-21 07:32:18 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
ERROR 08-21 07:32:22 async_llm_engine.py:663] Engine iteration timed out. This should never happen!
ERROR 08-21 07:32:22 async_llm_engine.py:57] Engine background task failed
ERROR 08-21 07:32:22 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.4+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 636, in run_engine_loop
ERROR 08-21 07:32:22 async_llm_engine.py:57]     done, _ = await asyncio.wait(
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 08-21 07:32:22 async_llm_engine.py:57]     return await _wait(fs, timeout, return_when, loop)
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
ERROR 08-21 07:32:22 async_llm_engine.py:57]     await waiter
ERROR 08-21 07:32:22 async_llm_engine.py:57] asyncio.exceptions.CancelledError
ERROR 08-21 07:32:22 async_llm_engine.py:57]
ERROR 08-21 07:32:22 async_llm_engine.py:57] During handling of the above exception, another exception occurred:
ERROR 08-21 07:32:22 async_llm_engine.py:57]
ERROR 08-21 07:32:22 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.4+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
ERROR 08-21 07:32:22 async_llm_engine.py:57]     return_value = task.result()
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.4+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
ERROR 08-21 07:32:22 async_llm_engine.py:57]     async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.4+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 08-21 07:32:22 async_llm_engine.py:57]     self._do_exit(exc_type)
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.4+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 08-21 07:32:22 async_llm_engine.py:57]     raise asyncio.TimeoutError
ERROR 08-21 07:32:22 async_llm_engine.py:57] asyncio.exceptions.TimeoutError
INFO 08-21 07:32:22 async_llm_engine.py:181] Aborted request chat-0eefb9c0183b4a2197d1408cd47717ce.
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 555, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fc250d76260

During handling of the above exception, another exception occurred:
```

liuzhipengchd commented 3 weeks ago

How can I solve this problem? Does my CPU not support inference?

ilya-lavrenov commented 3 weeks ago

Hi @liuzhipengchd, you can try running on CPU via the OpenVINO backend: https://docs.vllm.ai/en/latest/getting_started/openvino-installation.html
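
For reference, a minimal sketch of what switching to the OpenVINO backend could look like, assuming a from-source build as described in the linked guide (the exact requirements files and flags may differ between vLLM versions, so verify against the documentation):

```bash
# Build vLLM from source with the OpenVINO backend selected
# (paraphrased from the linked installation guide; check the current docs
# for the exact requirements files and extra index URLs)
pip install --upgrade pip
VLLM_TARGET_DEVICE=openvino pip install -v .

# Then start the server the same way as before
python3 -m vllm.entrypoints.openai.api_server \
    --model /data/WiNGPT2-Llama-3-8B-Chat \
    --served-model-name Llama3-8B \
    --port 8001
```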

liuzhipengchd commented 3 weeks ago

@ilya-lavrenov Thank you for your help, it's very useful to me. I have another question: if I quantize Llama 3 to int4 and use the CPU for inference, will vLLM's throughput be higher?

ilya-lavrenov commented 3 weeks ago

> @ilya-lavrenov Thank you for your help, it's very useful to me. I have another question: if I quantize Llama 3 to int4 and use the CPU for inference, will vLLM's throughput be higher?

> @HPUedCSLearner Hi, if VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS is not enabled, will the inference speed be very slow? What about the concurrency?

You can see the difference in https://github.com/vllm-project/vllm/pull/5379#issue-2344007685 if you open the spoilers with plots; the FP16 model actually performs even better.

Generally, int4 / int8 should give better performance, but we currently have extra optimizations for FP16 weights, which is why FP16 comes out ahead. We are working on enabling dynamic quantization using AMX, which will fully utilize the compressed weights.
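
As a rough sketch, assuming VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS is an environment variable set when launching the server (its accepted values should be confirmed in the OpenVINO backend docs), enabling it would look something like:

```bash
# Sketch only: launch the server with OpenVINO weight compression enabled.
# The exact accepted values for this variable (e.g. ON / 1) should be
# confirmed against the OpenVINO backend documentation.
VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
    python3 -m vllm.entrypoints.openai.api_server \
    --model /data/WiNGPT2-Llama-3-8B-Chat \
    --served-model-name Llama3-8B \
    --port 8001
```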