Kelcin2 opened this issue 3 months ago (Open)
This is a known issue. Please provide more details about your error in #5901 so we can better investigate the cause!
I built an image from Dockerfile.cpu.
I added an extra environment variable:
VLLM_CPU_KVCACHE_SPACE=4
and started the server with:
python3 -m vllm.entrypoints.openai.api_server --served-model-name Meta-Llama-3.1-8B-Instruct --model /workspace/models/Meta-Llama-3.1-8B-Instruct --max-model-len 32768
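For reference, a minimal sketch of launching the image directly with Docker for a local test (the host model path /path/to/models, the port mapping, and the default port 8000 are placeholders/assumptions for illustration, not part of my actual setup):
docker run --rm --shm-size=20g \
  -e VLLM_CPU_KVCACHE_SPACE=4 \
  -v /path/to/models:/workspace/models \
  -p 8000:8000 \
  kelcin-vllm-cpu-env:2.0.1 \
  python3 -m vllm.entrypoints.openai.api_server \
    --served-model-name Meta-Llama-3.1-8B-Instruct \
    --model /workspace/models/Meta-Llama-3.1-8B-Instruct \
    --max-model-len 32768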
Then I sent a chat request; the request payload is below (a curl equivalent is sketched after it):
{
  "model": "Meta-Llama-3.1-8B-Instruct",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Who won the world series in 2020?"
    }
  ]
}
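For completeness, the request can be reproduced with curl against the OpenAI-compatible endpoint (a sketch assuming the server is reachable on the default port 8000):
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Meta-Llama-3.1-8B-Instruct",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Who won the world series in 2020?"}
        ]
      }'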
After waiting a while, the server threw the exceptions mentioned above.
I can provide some extra environment information.
I started a minikube Kubernetes cluster with:
minikube start -n 1 --memory=24g --cpus=10 --kubernetes-version=v1.30.0 --disk-size=120g --namespace kelcin --extra-config=apiserver.service-node-port-range=1-65535
and locally built an image tagged kelcin-vllm-cpu-env:2.0.1 from Dockerfile.cpu. The build command was:
docker build -f Dockerfile.cpu -t kelcin-vllm-cpu-env:2.0.1 --shm-size=20g .
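Because the image is built against the local Docker daemon, it may also need to be made visible to the minikube node; one way to do that (assuming the minikube docker-env is not already in use) is:
minikube image load kelcin-vllm-cpu-env:2.0.1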
I also created a Helm chart: vllm-helm-0.1.0.zip
Then I ran the helm install command (don't forget to put the model files into the target directory):
helm install vllm ./vllm-helm-0.1.0.tgz
and waited until the pod was ready (see the readiness check below).
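A quick way to watch for readiness (the kelcin namespace comes from the minikube command above; the app=vllm label selector depends on the chart and is only an assumption here):
kubectl get pods -n kelcin -w
# or, if the chart labels the pod with app=vllm:
kubectl wait --for=condition=Ready pod -l app=vllm -n kelcin --timeout=10m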
Calling the /v1/models API returns a correct result, but calling /v1/chat/completions raises the exception mentioned above (the exact calls are sketched after the payload). The request payload is:
{
  "model": "Meta-Llama-3.1-8B-Instruct",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Who won the world series in 2020?"
    }
  ]
}
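A sketch of how the endpoints are hit from outside the cluster (the service name vllm and port 8000 are assumptions based on the Helm release name and the vLLM default port; payload.json is a hypothetical file holding the JSON payload above):
kubectl port-forward -n kelcin svc/vllm 8000:8000 &
# /v1/models responds correctly:
curl -s http://localhost:8000/v1/models
# /v1/chat/completions fails with the exception below:
curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d @payload.json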
The exception is:
INFO 07-28 07:34:43 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO: 10.244.0.1:47570 - "GET /v1/models HTTP/1.1" 200 OK
INFO: 10.244.0.1:47582 - "GET /v1/models HTTP/1.1" 200 OK
INFO 07-28 07:34:53 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 07-28 07:35:03 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 07-28 07:35:13 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO: 10.244.0.1:33206 - "GET /v1/models HTTP/1.1" 200 OK
INFO 07-28 07:35:23 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 07-28 07:35:29 logger.py:36] Received request chat-abdc43944b884d3d853965bceac6f110: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWho won the world series in 2020?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=None, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 2675, 527, 264, 11190, 18328, 13, 128009, 128006, 882, 128007, 271, 15546, 2834, 279, 1917, 4101, 304, 220, 2366, 15, 30, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.
INFO 07-28 07:35:29 async_llm_engine.py:173] Added request chat-abdc43944b884d3d853965bceac6f110.
INFO 07-28 07:35:33 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 07-28 07:35:43 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO: 10.244.0.1:57848 - "GET /v1/models HTTP/1.1" 200 OK
INFO: 10.244.0.1:57836 - "GET /v1/models HTTP/1.1" 200 OK
INFO 07-28 07:35:53 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 07-28 07:36:03 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 07-28 07:36:13 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO: 10.244.0.1:52104 - "GET /v1/models HTTP/1.1" 200 OK
INFO 07-28 07:36:23 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
ERROR 07-28 07:36:29 async_llm_engine.py:658] Engine iteration timed out. This should never happen!
ERROR 07-28 07:36:29 async_llm_engine.py:56] Engine background task failed
ERROR 07-28 07:36:29 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 07-28 07:36:29 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 631, in run_engine_loop
ERROR 07-28 07:36:29 async_llm_engine.py:56] done, _ = await asyncio.wait(
ERROR 07-28 07:36:29 async_llm_engine.py:56] File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 07-28 07:36:29 async_llm_engine.py:56] return await _wait(fs, timeout, return_when, loop)
ERROR 07-28 07:36:29 async_llm_engine.py:56] File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
ERROR 07-28 07:36:29 async_llm_engine.py:56] await waiter
ERROR 07-28 07:36:29 async_llm_engine.py:56] asyncio.exceptions.CancelledError
ERROR 07-28 07:36:29 async_llm_engine.py:56]
ERROR 07-28 07:36:29 async_llm_engine.py:56] During handling of the above exception, another exception occurred:
ERROR 07-28 07:36:29 async_llm_engine.py:56]
ERROR 07-28 07:36:29 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 07-28 07:36:29 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
ERROR 07-28 07:36:29 async_llm_engine.py:56] return_value = task.result()
ERROR 07-28 07:36:29 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 630, in run_engine_loop
ERROR 07-28 07:36:29 async_llm_engine.py:56] async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 07-28 07:36:29 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 07-28 07:36:29 async_llm_engine.py:56] self._do_exit(exc_type)
ERROR 07-28 07:36:29 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 07-28 07:36:29 async_llm_engine.py:56] raise asyncio.TimeoutError
ERROR 07-28 07:36:29 async_llm_engine.py:56] asyncio.exceptions.TimeoutError
2024-07-28 07:36:29,312 - base_events.py - asyncio - ERROR - Exception in callback _log_task_completion(error_callback=<bound method...7f54d868d4e0>>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py:36
handle: <Handle _log_task_completion(error_callback=<bound method...7f54d868d4e0>>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py:36>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 631, in run_engine_loop
done, _ = await asyncio.wait(
File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
return await _wait(fs, timeout, return_when, loop)
File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
await waiter
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
return_value = task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 630, in run_engine_loop
async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 95, in __aexit__
self._do_exit(exc_type)
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 58, in _log_task_completion
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 07-28 07:36:29 async_llm_engine.py:180] Aborted request chat-abdc43944b884d3d853965bceac6f110.
INFO: 127.0.0.1:37684 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 631, in run_engine_loop
done, _ = await asyncio.wait(
File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
return await _wait(fs, timeout, return_when, loop)
File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
await waiter
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/api_server.py", line 130, in create_chat_completion
generator = await openai_serving_chat.create_chat_completion(
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/serving_chat.py", line 197, in create_chat_completion
return await self.chat_completion_full_generator(
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/serving_chat.py", line 448, in chat_completion_full_generator
async for res in result_generator:
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 772, in generate
async for output in self._process_request(
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 888, in _process_request
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 884, in _process_request
async for request_output in stream:
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 93, in __anext__
raise result
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
return_value = task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 630, in run_engine_loop
async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 95, in __aexit__
self._do_exit(exc_type)
File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
INFO 07-28 07:36:33 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 07-28 07:36:43 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
Has this problem been solved? I'm hitting the same error as you, and it has been bothering me for two weeks.
@TianSongS No, it's a bug. Waiting for the maintainers to fix it.
Is your CPU perhaps not compatible with AVX512? It runs if the CPU supports AVX512; on a CPU without AVX512 support, inference outputs at roughly 0.05 tokens per second, which leads to the timeout. That's my guess.
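To check whether the host CPU exposes AVX-512, something like this can be run on the node (assuming a Linux host):
lscpu | grep -o 'avx512[a-z0-9_]*' | sort -u
# or count logical CPUs reporting the avx512f flag:
grep -c avx512f /proc/cpuinfo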