[Open] liuzhipengchd opened this issue 2 months ago
How can I solve this problem? Does my CPU not support inference?
Hi @liuzhipengchd, you can try running on CPU via the OpenVINO backend: https://docs.vllm.ai/en/latest/getting_started/openvino-installation.html
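For reference, a minimal sketch of the source build that doc describes and then starting the OpenAI-compatible server (the requirement file names, `VLLM_TARGET_DEVICE`/`PIP_EXTRA_INDEX_URL` flags, and the exact model id are assumptions from memory and may differ between vLLM versions; treat the linked guide as authoritative):

```bash
# Hedged sketch: build vLLM with the OpenVINO backend from source,
# then start the OpenAI-compatible server on CPU.
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install --upgrade pip
pip install -r requirements-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" VLLM_TARGET_DEVICE=openvino pip install -v .

# Model id assumed for illustration; use whichever checkpoint you actually serve.
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct
```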
@ilya-lavrenov Thank you for your help, it's very useful to me. I have another question: if I quantize llama3 to int4 and use the CPU for inference, will the throughput of vllm be higher?
@HPUedCSLearner Hi, if VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS is not enabled, will the inference speed be very slow? What about concurrency?
You can see the difference here: https://github.com/vllm-project/vllm/pull/5379#issue-2344007685 if you open the spoilers with plots. You can see that the FP16 model actually performs better.
Generally, int4 / int8 should give better performance, but currently we have extra optimizations for FP16 weights, which is why FP16 performs better. We are in the process of enabling dynamic quantization using AMX, which will fully utilize compressed weights.
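If you still want to try compressed weights on the OpenVINO backend, a hedged example of how the toggle is typically set before launching the server (the accepted value, ON vs 1, and the exact server command may depend on your vLLM build):

```bash
# Assumption: VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS turns on weight compression in the
# OpenVINO backend; leave it unset to keep the (currently faster) FP16 weights.
export VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct
```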
Setting the environment variable ENGINE_ITERATION_TIMEOUT_S to a value greater than 60 increases the async engine iteration timeout, which may be what is triggering the failure here, since inference on CPU is very slow.
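For example, something like the following before starting the server (in recent vLLM builds the variable is read as VLLM_ENGINE_ITERATION_TIMEOUT_S; check vllm/envs.py for the exact name your version expects, and the 600 s value is just an illustration):

```bash
# Raise the per-iteration engine timeout from the 60 s default so slow CPU
# prefill/decode steps are not aborted with "Engine iteration timed out".
export VLLM_ENGINE_ITERATION_TIMEOUT_S=600
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct
```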
I compiled vLLM 0.5.4 for CPU; my CPU does not support AVX512. After compiling, I entered the container and ran the command to start the llama3-8b model.
🐛 Describe the bug
```
3, 112471, 128001, 198, 72803, 5232], lora_request: None, prompt_adapter_request: None.
INFO:     127.0.0.1:48252 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 08-21 07:31:22 async_llm_engine.py:174] Added request chat-0eefb9c0183b4a2197d1408cd47717ce.
INFO 08-21 07:31:28 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-21 07:31:38 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-21 07:31:48 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-21 07:31:58 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-21 07:32:08 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-21 07:32:18 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
ERROR 08-21 07:32:22 async_llm_engine.py:663] Engine iteration timed out. This should never happen!
ERROR 08-21 07:32:22 async_llm_engine.py:57] Engine background task failed
ERROR 08-21 07:32:22 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.4+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 636, in run_engine_loop
ERROR 08-21 07:32:22 async_llm_engine.py:57]     done, _ = await asyncio.wait(
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 08-21 07:32:22 async_llm_engine.py:57]     return await _wait(fs, timeout, return_when, loop)
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
ERROR 08-21 07:32:22 async_llm_engine.py:57]     await waiter
ERROR 08-21 07:32:22 async_llm_engine.py:57] asyncio.exceptions.CancelledError
ERROR 08-21 07:32:22 async_llm_engine.py:57]
ERROR 08-21 07:32:22 async_llm_engine.py:57] During handling of the above exception, another exception occurred:
ERROR 08-21 07:32:22 async_llm_engine.py:57]
ERROR 08-21 07:32:22 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.4+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
ERROR 08-21 07:32:22 async_llm_engine.py:57]     return_value = task.result()
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.4+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
ERROR 08-21 07:32:22 async_llm_engine.py:57]     async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.4+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 08-21 07:32:22 async_llm_engine.py:57]     self._do_exit(exc_type)
ERROR 08-21 07:32:22 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.4+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 08-21 07:32:22 async_llm_engine.py:57]     raise asyncio.TimeoutError
ERROR 08-21 07:32:22 async_llm_engine.py:57] asyncio.exceptions.TimeoutError
INFO 08-21 07:32:22 async_llm_engine.py:181] Aborted request chat-0eefb9c0183b4a2197d1408cd47717ce.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 555, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fc250d76260

During handling of the above exception, another exception occurred:
```