vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Engine iteration timed out. This should never happen! #6790

Open Kelcin2 opened 3 months ago

Kelcin2 commented 3 months ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.1+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.14.0-472.el9.x86_64-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        45 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               12
On-line CPU(s) list:                  0-11
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
CPU family:                           6
Model:                                158
Thread(s) per core:                   1
Core(s) per socket:                   2
Socket(s):                            6
Stepping:                             10
BogoMIPS:                             5184.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves arat md_clear flush_l1d arch_capabilities
Hypervisor vendor:                    VMware
Virtualization type:                  full
L1d cache:                            384 KiB (12 instances)
L1i cache:                            384 KiB (12 instances)
L2 cache:                             3 MiB (12 instances)
L3 cache:                             54 MiB (6 instances)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-11
Vulnerability Gather data sampling:   Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:          KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                   Mitigation; PTE Inversion
Vulnerability Mds:                    Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:               Mitigation; PTI
Vulnerability Mmio stale data:        Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; IBRS
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI SW loop, KVM SW loop
Vulnerability Srbds:                  Unknown: Dependent on hypervisor status
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==2.3.100+git0eb3473
[pip3] numpy==1.26.4
[pip3] torch==2.3.1+cpu
[pip3] torchvision==0.18.1+cpu
[pip3] transformers==4.43.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

🐛 Describe the bug

The following exceptions are thrown:

INFO 07-25 15:46:17 logger.py:36] Received request chat-bbbaddce58a249d994a645d9ce8bf7ae: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWho won the world series in 2020?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=None, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 2675, 527, 264, 11190, 18328, 13, 128009, 128006, 882, 128007, 271, 15546, 2834, 279, 1917, 4101, 304, 220, 2366, 15, 30, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.
INFO 07-25 15:46:17 async_llm_engine.py:173] Added request chat-bbbaddce58a249d994a645d9ce8bf7ae.
INFO 07-25 15:46:19 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 07-25 15:46:29 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO:     10.244.0.1:57656 - "GET /v1/models HTTP/1.1" 200 OK
INFO 07-25 15:46:39 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 07-25 15:46:49 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 07-25 15:46:59 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO:     10.244.0.1:49152 - "GET /v1/models HTTP/1.1" 200 OK
INFO 07-25 15:47:09 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
ERROR 07-25 15:47:17 async_llm_engine.py:658] Engine iteration timed out. This should never happen!
ERROR 07-25 15:47:17 async_llm_engine.py:56] Engine background task failed
ERROR 07-25 15:47:17 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 07-25 15:47:17 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 631, in run_engine_loop
ERROR 07-25 15:47:17 async_llm_engine.py:56]     done, _ = await asyncio.wait(
ERROR 07-25 15:47:17 async_llm_engine.py:56]   File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 07-25 15:47:17 async_llm_engine.py:56]     return await _wait(fs, timeout, return_when, loop)
ERROR 07-25 15:47:17 async_llm_engine.py:56]   File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
ERROR 07-25 15:47:17 async_llm_engine.py:56]     await waiter
ERROR 07-25 15:47:17 async_llm_engine.py:56] asyncio.exceptions.CancelledError
ERROR 07-25 15:47:17 async_llm_engine.py:56] 
ERROR 07-25 15:47:17 async_llm_engine.py:56] During handling of the above exception, another exception occurred:
ERROR 07-25 15:47:17 async_llm_engine.py:56] 
ERROR 07-25 15:47:17 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 07-25 15:47:17 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
ERROR 07-25 15:47:17 async_llm_engine.py:56]     return_value = task.result()
ERROR 07-25 15:47:17 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 630, in run_engine_loop
ERROR 07-25 15:47:17 async_llm_engine.py:56]     async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 07-25 15:47:17 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 07-25 15:47:17 async_llm_engine.py:56]     self._do_exit(exc_type)
ERROR 07-25 15:47:17 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 07-25 15:47:17 async_llm_engine.py:56]     raise asyncio.TimeoutError
ERROR 07-25 15:47:17 async_llm_engine.py:56] asyncio.exceptions.TimeoutError
2024-07-25 15:47:17,939 - base_events.py - asyncio - ERROR - Exception in callback _log_task_completion(error_callback=<bound method...7fb66aea9690>>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py:36
handle: <Handle _log_task_completion(error_callback=<bound method...7fb66aea9690>>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py:36>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 631, in run_engine_loop
    done, _ = await asyncio.wait(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
    return await _wait(fs, timeout, return_when, loop)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
INFO 07-25 15:47:17 async_llm_engine.py:180] Aborted request chat-bbbaddce58a249d994a645d9ce8bf7ae.
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 630, in run_engine_loop
    async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 95, in __aexit__
    self._do_exit(exc_type)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 178, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 58, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for theactual cause.
INFO:     127.0.0.1:49340 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 631, in run_engine_loop
    done, _ = await asyncio.wait(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
    return await _wait(fs, timeout, return_when, loop)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/api_server.py", line 130, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/serving_chat.py", line 197, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/serving_chat.py", line 448, in chat_completion_full_generator
    async for res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 772, in generate
    async for output in self._process_request(
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 888, in _process_request
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 884, in _process_request
    async for request_output in stream:
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 93, in __anext__
    raise result
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 630, in run_engine_loop
    async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 95, in __aexit__
    self._do_exit(exc_type)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 178, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
DarkLight1337 commented 3 months ago

This is a known issue. Please provide more details about your error in #5901 so we can better investigate the cause!

Kelcin2 commented 3 months ago

This is a known issue. Please provide more details about your error in #5901 so we can better investigate the cause!

I built an image from Dockerfile.cpu.

I added an extra environment variable: VLLM_CPU_KVCACHE_SPACE=4

and extra startup parameters: python3 -m vllm.entrypoints.openai.api_server --served-model-name Meta-Llama-3.1-8B-Instruct --model /workspace/models/Meta-Llama-3.1-8B-Instruct --max-model-len 32768
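
For illustration, the environment variable and startup parameters above combine into a plain docker run along these lines; the port mapping and host model path are placeholders I'm assuming, and the actual deployment ran in Kubernetes via Helm as described in the next comment:

docker run --rm -p 8000:8000 \
    -v /path/to/models:/workspace/models \
    -e VLLM_CPU_KVCACHE_SPACE=4 \
    kelcin-vllm-cpu-env:2.0.1 \
    python3 -m vllm.entrypoints.openai.api_server \
        --served-model-name Meta-Llama-3.1-8B-Instruct \
        --model /workspace/models/Meta-Llama-3.1-8B-Instruct \
        --max-model-len 32768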

Then I sent a chat request; the request payload is below:

{
    "model": "Meta-Llama-3.1-8B-Instruct",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Who won the world series in 2020?"
        }
    ]
}
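
For reproduction, the same request can be sent with a curl call along these lines (the host and port are assumptions based on vLLM's default of 8000; adjust them to the actual service endpoint):

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Meta-Llama-3.1-8B-Instruct",
          "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
          ]
        }'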

After a while, it threw the exceptions mentioned above.

Kelcin2 commented 3 months ago

I can provide some extra environment information. I started a Kubernetes environment with minikube: minikube start -n 1 --memory=24g --cpus=10 --kubernetes-version=v1.30.0 --disk-size=120g --namespace kelcin --extra-config=apiserver.service-node-port-range=1-65535

Then I locally built an image tagged kelcin-vllm-cpu-env:2.0.1 from Dockerfile.cpu, using the build command: docker build -f Dockerfile.cpu -t kelcin-vllm-cpu-env:2.0.1 --shm-size=20g . and created a Helm chart, vllm-helm-0.1.0.zip

Then I ran the helm install command (don't forget to put the model files into the target directory): helm install vllm ./vllm-helm-0.1.0.tgz

I waited for a while until the pod was ready.
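
A readiness check along these lines confirms it (the pod name is a placeholder; the original screenshot is omitted here):

kubectl get pods -n kelcin               # wait until the vllm pod shows READY 1/1
kubectl logs -n kelcin <vllm-pod-name>   # confirm the OpenAI API server has started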

Calling the /v1/models API returns the correct result.
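
For example (the service address is a placeholder; the screenshot of the response is omitted here):

curl http://<service-address>/v1/models
# returns a model list that includes "Meta-Llama-3.1-8B-Instruct"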

but calling the /v1/chat/completions API raises the exception mentioned above:

request payload is:

{
    "model": "Meta-Llama-3.1-8B-Instruct",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Who won the world series in 2020?"
        }
    ]
}

The exception is:

INFO 07-28 07:34:43 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     10.244.0.1:47570 - "GET /v1/models HTTP/1.1" 200 OK
INFO:     10.244.0.1:47582 - "GET /v1/models HTTP/1.1" 200 OK
INFO 07-28 07:34:53 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 07-28 07:35:03 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 07-28 07:35:13 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO:     10.244.0.1:33206 - "GET /v1/models HTTP/1.1" 200 OK
INFO 07-28 07:35:23 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 07-28 07:35:29 logger.py:36] Received request chat-abdc43944b884d3d853965bceac6f110: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWho won the world series in 2020?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=None, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 2675, 527, 264, 11190, 18328, 13, 128009, 128006, 882, 128007, 271, 15546, 2834, 279, 1917, 4101, 304, 220, 2366, 15, 30, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.
INFO 07-28 07:35:29 async_llm_engine.py:173] Added request chat-abdc43944b884d3d853965bceac6f110.
INFO 07-28 07:35:33 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 07-28 07:35:43 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO:     10.244.0.1:57848 - "GET /v1/models HTTP/1.1" 200 OK
INFO:     10.244.0.1:57836 - "GET /v1/models HTTP/1.1" 200 OK
INFO 07-28 07:35:53 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 07-28 07:36:03 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 07-28 07:36:13 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO:     10.244.0.1:52104 - "GET /v1/models HTTP/1.1" 200 OK
INFO 07-28 07:36:23 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
ERROR 07-28 07:36:29 async_llm_engine.py:658] Engine iteration timed out. This should never happen!
ERROR 07-28 07:36:29 async_llm_engine.py:56] Engine background task failed
ERROR 07-28 07:36:29 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 07-28 07:36:29 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 631, in run_engine_loop
ERROR 07-28 07:36:29 async_llm_engine.py:56]     done, _ = await asyncio.wait(
ERROR 07-28 07:36:29 async_llm_engine.py:56]   File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 07-28 07:36:29 async_llm_engine.py:56]     return await _wait(fs, timeout, return_when, loop)
ERROR 07-28 07:36:29 async_llm_engine.py:56]   File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
ERROR 07-28 07:36:29 async_llm_engine.py:56]     await waiter
ERROR 07-28 07:36:29 async_llm_engine.py:56] asyncio.exceptions.CancelledError
ERROR 07-28 07:36:29 async_llm_engine.py:56] 
ERROR 07-28 07:36:29 async_llm_engine.py:56] During handling of the above exception, another exception occurred:
ERROR 07-28 07:36:29 async_llm_engine.py:56] 
ERROR 07-28 07:36:29 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 07-28 07:36:29 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
ERROR 07-28 07:36:29 async_llm_engine.py:56]     return_value = task.result()
ERROR 07-28 07:36:29 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 630, in run_engine_loop
ERROR 07-28 07:36:29 async_llm_engine.py:56]     async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 07-28 07:36:29 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 07-28 07:36:29 async_llm_engine.py:56]     self._do_exit(exc_type)
ERROR 07-28 07:36:29 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 07-28 07:36:29 async_llm_engine.py:56]     raise asyncio.TimeoutError
ERROR 07-28 07:36:29 async_llm_engine.py:56] asyncio.exceptions.TimeoutError
2024-07-28 07:36:29,312 - base_events.py - asyncio - ERROR - Exception in callback _log_task_completion(error_callback=<bound method...7f54d868d4e0>>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py:36
handle: <Handle _log_task_completion(error_callback=<bound method...7f54d868d4e0>>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py:36>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 631, in run_engine_loop
    done, _ = await asyncio.wait(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
    return await _wait(fs, timeout, return_when, loop)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 630, in run_engine_loop
    async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 95, in __aexit__
    self._do_exit(exc_type)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 178, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 58, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for theactual cause.
INFO 07-28 07:36:29 async_llm_engine.py:180] Aborted request chat-abdc43944b884d3d853965bceac6f110.
INFO:     127.0.0.1:37684 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 631, in run_engine_loop
    done, _ = await asyncio.wait(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
    return await _wait(fs, timeout, return_when, loop)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/api_server.py", line 130, in create_chat_completion
    generator = await openai_serving_chat.create_chat_completion(
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/serving_chat.py", line 197, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/serving_chat.py", line 448, in chat_completion_full_generator
    async for res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 772, in generate
    async for output in self._process_request(
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 888, in _process_request
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 884, in _process_request
    async for request_output in stream:
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 93, in __anext__
    raise result
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 630, in run_engine_loop
    async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 95, in __aexit__
    self._do_exit(exc_type)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.5.3.post1+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_timeout.py", line 178, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
INFO 07-28 07:36:33 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 07-28 07:36:43 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
TianSongS commented 2 months ago

Has this problem been solved? I'm hitting the same error as you, and it has been bothering me for two weeks.

Kelcin2 commented 2 months ago

Has this problem been solved? I'm hitting the same error as you, and it has been bothering me for two weeks.

@TianSongS No, it's a bug. Waiting for the maintainers to fix it.

liuzhipengchd commented 2 months ago

Is your CPU perhaps not compatible with AVX-512? vLLM runs when the CPU supports AVX-512, but inference on a CPU without AVX-512 support outputs at about 0.05 tokens per second, which leads to the timeout. That's my guess.
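
For what it's worth, the CPU flags in the environment report at the top of this issue include avx2 but no avx512* flags, which is consistent with this guess. A quick way to check on any Linux host (standard tooling, nothing vLLM-specific):

grep -o 'avx512[a-z_]*' /proc/cpuinfo | sort -u
# or: lscpu | grep -o 'avx512[a-z_]*' | sort -u
# No output means the CPU (or the hypervisor exposing it) provides no AVX-512.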