vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Engine iteration timed out (when using qwen2-vl-7b) #10123

Open HuiyuanYan opened 2 weeks ago

HuiyuanYan commented 2 weeks ago

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

I tried deploying qwen2-vl-7b with vLLM using the following command:

VLLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=4,5,6,7 nohup python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-VL-7B-Instruct --model MY_MODEL_PATH --pipeline-parallel-size 4  --gpu-memory-utilization 0.8 --limit-mm-per-prompt image=4 --port 11435 --disable-custom-all-reduce >> nohup_logs/vllm.log 2>&1 &

Please forgive the complex parameter settings; it took many searches and attempts to get the deployment working, and I added these flags along the way to make sure it runs.

My device configuration is as follows:

System:
Ubuntu 20.04.4 LTS

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          96
On-line CPU(s) list:             0-95
Thread(s) per core:              2
Core(s) per socket:              24
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
Stepping:                        7
CPU MHz:                         3283.879
CPU max MHz:                     4000.0000
CPU min MHz:                     1200.0000
BogoMIPS:                        6000.00
Virtualization:                  VT-x
L1d cache:                       1.5 MiB
L1i cache:                       1.5 MiB
L2 cache:                        48 MiB
L3 cache:                        71.5 MiB
NUMA node0 CPU(s):               0-23,48-71
NUMA node1 CPU(s):               24-47,72-95

GPU:
NVIDIA RTX A6000 * 8 (using 4 of them)

The package list of my Python virtual environment is as follows:

accelerate                        1.0.1
aiohappyeyeballs                  2.4.2
aiohttp                           3.10.8
aiolimiter                        1.1.0
aiosignal                         1.3.1
altair                            5.4.1
annotated-types                   0.7.0
antlr4-python3-runtime            4.9.3
anyio                             4.6.0
asttokens                         2.4.1
async-timeout                     4.0.3
attrs                             24.2.0
av                                13.0.0
beautifulsoup4                    4.12.3
bleach                            6.1.0
blessed                           1.20.0
blinker                           1.8.2
blis                              0.7.11
braceexpand                       0.1.7
cachetools                        5.5.0
catalogue                         2.0.10
certifi                           2024.8.30
cfgv                              3.4.0
charset-normalizer                3.3.2
click                             8.1.7
cloudpathlib                      0.19.0
cloudpickle                       3.0.0
compressed-tensors                0.6.0
confection                        0.1.5
contexttimer                      0.3.3
contourpy                         1.3.0
cycler                            0.12.1
cymem                             2.0.8
datasets                          3.0.1
decorator                         5.1.1
decord                            0.6.0
dill                              0.3.8
diskcache                         5.6.3
distlib                           0.3.8
distro                            1.9.0
einops                            0.8.0
exceptiongroup                    1.2.2
executing                         2.1.0
fairscale                         0.4.4
fastapi                           0.115.0
filelock                          3.16.1
flash-attn                        2.6.3
fonttools                         4.54.1
frozenlist                        1.4.1
fsspec                            2024.6.1
ftfy                              6.2.3
gguf                              0.10.0
gitdb                             4.0.11
GitPython                         3.1.43
gpustat                           1.1.1
h11                               0.14.0
httpcore                          1.0.5
httptools                         0.6.1
httpx                             0.27.2
huggingface-hub                   0.25.2
identify                          2.6.1
idna                              3.10
imageio                           2.35.1
importlib_metadata                8.5.0
interegular                       0.3.3
iopath                            0.1.10
ipython                           8.27.0
jedi                              0.19.1
Jinja2                            3.1.4
jiter                             0.5.0
joblib                            1.4.2
jsonschema                        4.23.0
jsonschema-specifications         2023.12.1
kaggle                            1.6.17
kiwisolver                        1.4.7
langcodes                         3.4.1
language_data                     1.2.0
lark                              1.2.2
lazy_loader                       0.4
llvmlite                          0.43.0
lm-format-enforcer                0.10.6
marisa-trie                       1.2.0
markdown-it-py                    3.0.0
MarkupSafe                        2.1.5
matplotlib                        3.9.2
matplotlib-inline                 0.1.7
mdurl                             0.1.2
mistral_common                    1.4.4
modelscope                        1.16.1
mpmath                            1.3.0
msgpack                           1.1.0
msgspec                           0.18.6
multidict                         6.1.0
multiprocess                      0.70.16
murmurhash                        1.0.10
narwhals                          1.8.3
nest-asyncio                      1.6.0
networkx                          3.3
nltk                              3.9.1
nodeenv                           1.9.1
numba                             0.60.0
numpy                             1.26.4
nvidia-cublas-cu12                12.1.3.1
nvidia-cuda-cupti-cu12            12.1.105
nvidia-cuda-nvrtc-cu12            12.1.105
nvidia-cuda-runtime-cu12          12.1.105
nvidia-cudnn-cu12                 9.1.0.70
nvidia-cufft-cu12                 11.0.2.54
nvidia-curand-cu12                10.3.2.106
nvidia-cusolver-cu12              11.4.5.107
nvidia-cusparse-cu12              12.1.0.106
nvidia-ml-py                      12.560.30
nvidia-nccl-cu12                  2.20.5
nvidia-nvjitlink-cu12             12.6.68
nvidia-nvtx-cu12                  12.1.105
omegaconf                         2.3.0
openai                            1.50.2
opencv-python-headless            4.5.5.64
opendatasets                      0.1.22
outlines                          0.0.46
packaging                         24.1
pandas                            2.2.3
parso                             0.8.4
partial-json-parser               0.2.1.1.post4
peft                              0.13.1
pexpect                           4.9.0
pillow                            10.4.0
pip                               24.2
platformdirs                      4.3.6
plotly                            5.24.1
portalocker                       2.10.1
pre-commit                        3.8.0
preshed                           3.0.9
prometheus_client                 0.21.0
prometheus-fastapi-instrumentator 7.0.0
prompt_toolkit                    3.0.48
protobuf                          5.28.2
psutil                            6.0.0
ptyprocess                        0.7.0
pure_eval                         0.2.3
py-cpuinfo                        9.0.0
py-spy                            0.3.14
pyairports                        2.1.1
pyarrow                           17.0.0
pycocoevalcap                     1.2
pycocotools                       2.0.8
pycountry                         24.6.1
pydantic                          2.9.2
pydantic_core                     2.23.4
pydeck                            0.9.1
Pygments                          2.18.0
pyparsing                         3.1.4
python-dateutil                   2.9.0.post0
python-dotenv                     1.0.1
python-magic                      0.4.27
python-slugify                    8.0.4
pytz                              2024.2
PyYAML                            6.0.2
pyzmq                             26.2.0
qwen-vl-utils                     0.0.8
ray                               2.37.0
referencing                       0.35.1
regex                             2024.9.11
requests                          2.32.3
rich                              13.8.1
rpds-py                           0.20.0
safetensors                       0.4.5
salesforce-lavis                  1.0.2
scikit-image                      0.24.0
scikit-learn                      1.5.2
scipy                             1.14.1
sentencepiece                     0.2.0
setuptools                        75.1.0
shellingham                       1.5.4
six                               1.16.0
smart-open                        7.0.4
smmap                             5.0.1
sniffio                           1.3.1
soupsieve                         2.6
spacy                             3.7.6
spacy-legacy                      3.0.12
spacy-loggers                     1.0.5
srsly                             2.4.8
stack-data                        0.6.3
starlette                         0.38.6
streamlit                         1.38.0
sympy                             1.13.3
tenacity                          8.5.0
text-generation                   0.7.0
text-unidecode                    1.3
thinc                             8.2.5
threadpoolctl                     3.5.0
tifffile                          2024.9.20
tiktoken                          0.7.0
timm                              0.4.12
tokenizers                        0.20.3
toml                              0.10.2
torch                             2.4.0
torchvision                       0.19.0
tornado                           6.4.1
tqdm                              4.66.5
traitlets                         5.14.3
transformers                      4.46.2
triton                            3.0.0
typer                             0.12.5
typing_extensions                 4.12.2
tzdata                            2024.2
urllib3                           2.2.3
uvicorn                           0.31.0
uvloop                            0.20.0
virtualenv                        20.26.5
vllm                              0.6.3.post1
vllm-flash-attn                   2.6.1
wasabi                            1.1.3
watchdog                          4.0.2
watchfiles                        0.24.0
wcwidth                           0.2.13
weasel                            0.4.1
webdataset                        0.2.100
webencodings                      0.5.1
websockets                        13.1
wheel                             0.44.0
wrapt                             1.16.0
xformers                          0.0.27.post2
xxhash                            3.5.0
yarl                              1.13.1
zipp                              3.20.2

Here is the problem I ran into:

During inference, if the prompt is short (or only a few images are passed in), the model usually runs normally. However, when the prompt is long (or many images are passed in), vLLM either gets stuck or fails with the following error:

ERROR 11-07 23:24:51 async_llm_engine.py:865] Engine iteration timed out. This should never happen!
INFO 11-07 23:24:51 metrics.py:349] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.9%, CPU KV cache usage: 0.0%.
ERROR 11-07 23:24:51 async_llm_engine.py:64] Engine background task failed
ERROR 11-07 23:24:51 async_llm_engine.py:64] Traceback (most recent call last):
ERROR 11-07 23:24:51 async_llm_engine.py:64]   File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 849, in run_engine_loop
ERROR 11-07 23:24:51 async_llm_engine.py:64]     await asyncio.sleep(0)
ERROR 11-07 23:24:51 async_llm_engine.py:64]   File "/home/.conda/envs/python/lib/python3.10/asyncio/tasks.py", line 596, in sleep
ERROR 11-07 23:24:51 async_llm_engine.py:64]     await __sleep0()
ERROR 11-07 23:24:51 async_llm_engine.py:64]   File "/home/.conda/envs/python/lib/python3.10/asyncio/tasks.py", line 590, in __sleep0
ERROR 11-07 23:24:51 async_llm_engine.py:64]     yield
ERROR 11-07 23:24:51 async_llm_engine.py:64] asyncio.exceptions.CancelledError
ERROR 11-07 23:24:51 async_llm_engine.py:64] 
ERROR 11-07 23:24:51 async_llm_engine.py:64] During handling of the above exception, another exception occurred:
ERROR 11-07 23:24:51 async_llm_engine.py:64] 
ERROR 11-07 23:24:51 async_llm_engine.py:64] Traceback (most recent call last):
ERROR 11-07 23:24:51 async_llm_engine.py:64]   File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 54, in _log_task_completion
ERROR 11-07 23:24:51 async_llm_engine.py:64]     return_value = task.result()
ERROR 11-07 23:24:51 async_llm_engine.py:64]   File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 844, in run_engine_loop
ERROR 11-07 23:24:51 async_llm_engine.py:64]     async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 11-07 23:24:51 async_llm_engine.py:64]   File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 11-07 23:24:51 async_llm_engine.py:64]     self._do_exit(exc_type)
ERROR 11-07 23:24:51 async_llm_engine.py:64]   File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 11-07 23:24:51 async_llm_engine.py:64]     raise asyncio.TimeoutError
ERROR 11-07 23:24:51 async_llm_engine.py:64] asyncio.exceptions.TimeoutError
Exception in callback functools.partial(<function _log_task_completion at 0x7fb78e7bf400>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fb78213a170>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7fb78e7bf400>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fb78213a170>>)>
Traceback (most recent call last):
  File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 849, in run_engine_loop
    await asyncio.sleep(0)
  File "/home/.conda/envs/python/lib/python3.10/asyncio/tasks.py", line 596, in sleep
    await __sleep0()
  File "/home/.conda/envs/python/lib/python3.10/asyncio/tasks.py", line 590, in __sleep0
    yield
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 54, in _log_task_completion
    return_value = task.result()
  File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 844, in run_engine_loop
    async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
  File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
    self._do_exit(exc_type)
  File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 66, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO:     127.0.0.1:55016 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 849, in run_engine_loop
    await asyncio.sleep(0)
  File "/home/.conda/envs/python/lib/python3.10/asyncio/tasks.py", line 596, in sleep
    await __sleep0()
  File "/home/.conda/envs/python/lib/python3.10/asyncio/tasks.py", line 590, in __sleep0
    yield
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/.conda/envs/python/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/.conda/envs/python/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/home/.conda/envs/python/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/home/.conda/envs/python/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/home/.conda/envs/python/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/home/.conda/envs/python/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 315, in create_chat_completion
    generator = await chat(raw_request).create_chat_completion(
  File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 268, in create_chat_completion
    return await self.chat_completion_full_generator(
  File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_chat.py", line 624, in chat_completion_full_generator
    async for res in result_generator:
  File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/utils.py", line 458, in iterate_with_cancellation
    item = await awaits[0]
  File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 1029, in generate
    async for output in await self.add_request(
  File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 112, in generator
    raise result
  File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 54, in _log_task_completion
    return_value = task.result()
  File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 844, in run_engine_loop
    async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
  File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
    self._do_exit(exc_type)
  File "/home/.conda/envs/python/lib/python3.10/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
CRITICAL 11-07 23:24:52 launcher.py:88] AsyncLLMEngine is already dead, terminating server process
INFO:     127.0.0.1:58202 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [787966]

The vLLM subprocesses also do not terminate automatically and keep occupying the GPUs unless I kill them manually.

I have tried various measures, including but not limited to: switching from tensor parallelism (-tp) to pipeline parallelism (-pp), adding --disable-custom-all-reduce, reducing --gpu-memory-utilization, and upgrading vLLM, but the situation has not improved.

All I hope is that someone can help me get qwen2-vl-7b to run inference stably within its full context length (32768 tokens, including image tokens), without llm_engine crashes or similar issues. :penguin:
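For reference, my requests look roughly like this (a minimal sketch; the image paths, prompt, and helper name are placeholders):

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11435/v1", api_key="EMPTY")

def to_data_url(path: str) -> str:
    # Inline the image as a base64 data URL, as accepted by the
    # OpenAI-compatible chat completions endpoint.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return f"data:image/jpeg;base64,{b64}"

# Up to 4 images per prompt, matching --limit-mm-per-prompt image=4.
content = [{"type": "text", "text": "Describe these images in detail."}]
for path in ["img1.jpg", "img2.jpg", "img3.jpg", "img4.jpg"]:
    content.append({"type": "image_url", "image_url": {"url": to_data_url(path)}})

response = client.chat.completions.create(
    model="Qwen2-VL-7B-Instruct",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)

Requests with long prompts and several large images are the ones that reliably hit the timeout.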


DarkLight1337 commented 2 weeks ago

I think this is potentially caused by the long processing time, as documented in #9238. You can try preprocessing the images to be smaller before passing them to vLLM, and/or setting max_pixels via --mm-processor-kwargs.
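For example, relaunch the server with a pixel cap (a sketch based on your original command; the min_pixels/max_pixels values follow the Qwen2-VL example in the vLLM docs and are only a starting point):

VLLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server \
    --served-model-name Qwen2-VL-7B-Instruct --model MY_MODEL_PATH \
    --pipeline-parallel-size 4 --gpu-memory-utilization 0.8 \
    --limit-mm-per-prompt image=4 --port 11435 --disable-custom-all-reduce \
    --mm-processor-kwargs '{"min_pixels": 784, "max_pixels": 1003520}'

Alternatively (or additionally), downscale the images on the client before encoding them. A minimal sketch, where shrink and max_side are hypothetical names/values:

from PIL import Image

def shrink(path: str, out_path: str, max_side: int = 1024) -> None:
    # Downscale so the Qwen2-VL processor produces far fewer image
    # tokens, which shortens multimodal preprocessing per request.
    img = Image.open(path)
    img.thumbnail((max_side, max_side))  # resizes in place, keeps aspect ratio
    img.save(out_path)

Fewer image tokens per request should keep each engine iteration comfortably under the timeout.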