vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0
29.67k stars 4.48k forks source link

[Bug]: Crash after few multi image calls #8369

Closed Patrick10203 closed 1 month ago

Patrick10203 commented 1 month ago

Your current environment

Environment was set up by pulling the main branch and building the Dockerfile. Hardware was 4xA100 with an Azure Instance (Standard NC96ads A100 v4). Server image is: ubuntu-hpc (2204)

Startup: python3 -m vllm.entrypoints.openai.api_server --port=8000 --host=0.0.0.0 --chat-template="/docker_share/models/internVL2-template.jinja" --model="/fine_tunes/internvl2_76b_hermes2_llama3_70b_dynamic_res_2nd_finetune" --tensor-parallel-size=4 --max-model-len=8192 --trust_remote_code --enforce-eager --max-lora-rank 128 --limit-mm-per-prompt image=4

🐛 Describe the bug

I have build from source with the current main branch to use online multi image inference with internVL2 76B (finetuned). First few inferences work with no issue. After like 10 calls the server crashes with following stack trace

The issue occurs when callen multithreaded and single threaded. Somehow the bug doesnt happen when i remove --max-lora-rank 128 and set --max-model-len=6000

Stack trace ```text ERROR 09-11 05:24:13 async_llm_engine.py:63] Engine background task failed ERROR 09-11 05:24:13 async_llm_engine.py:63] Traceback (most recent call last): ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion ERROR 09-11 05:24:13 async_llm_engine.py:63] return_value = task.result() ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^ ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop ERROR 09-11 05:24:13 async_llm_engine.py:63] result = task.result() ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^ ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step ERROR 09-11 05:24:13 async_llm_engine.py:63] request_outputs = await self.engine.step_async(virtual_engine) ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 345, in step_async ERROR 09-11 05:24:13 async_llm_engine.py:63] outputs = await self.model_executor.execute_model_async( ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async ERROR 09-11 05:24:13 async_llm_engine.py:63] return await self._driver_execute_model_async(execute_model_req) ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 231, in _driver_execute_model_async ERROR 09-11 05:24:13 async_llm_engine.py:63] return await self.driver_exec_model(execute_model_req) ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run ERROR 09-11 05:24:13 async_llm_engine.py:63] result = self.fn(*self.args, **self.kwargs) ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 303, in execute_model ERROR 09-11 05:24:13 async_llm_engine.py:63] inputs = self.prepare_input(execute_model_req) ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 291, in prepare_input ERROR 09-11 05:24:13 async_llm_engine.py:63] return self._get_driver_input_and_broadcast(execute_model_req) ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 253, in _get_driver_input_and_broadcast ERROR 09-11 05:24:13 async_llm_engine.py:63] self.model_runner.prepare_model_input( ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1380, in prepare_model_input ERROR 09-11 05:24:13 async_llm_engine.py:63] model_input = self._prepare_model_input_tensors( ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1038, in _prepare_model_input_tensors ERROR 09-11 05:24:13 async_llm_engine.py:63] builder.add_seq_group(seq_group_metadata) ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 664, in add_seq_group ERROR 09-11 05:24:13 async_llm_engine.py:63] per_seq_group_fn(inter_data, seq_group_metadata) ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 636, in _compute_multi_modal_input ERROR 09-11 05:24:13 async_llm_engine.py:63] mm_kwargs = self.multi_modal_input_mapper(mm_data) ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/registry.py", line 125, in map_input ERROR 09-11 05:24:13 async_llm_engine.py:63] input_dict = plugin.map_input(model_config, data_value) ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/base.py", line 265, in map_input ERROR 09-11 05:24:13 async_llm_engine.py:63] return mapper(InputContext(model_config), data) ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/internvl.py", line 279, in input_mapper_for_internvl ERROR 09-11 05:24:13 async_llm_engine.py:63] data = torch.stack(data) ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^ ERROR 09-11 05:24:13 async_llm_engine.py:63] RuntimeError: stack expects each tensor to be equal size, but got [7, 3, 448, 448] at entry 0 and [13, 3, 448, 448] at entry 1 Exception in callback functools.partial(, error_callback=>) handle: , error_callback=>)> Traceback (most recent call last): File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion return_value = task.result() ^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop result = task.result() ^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step request_outputs = await self.engine.step_async(virtual_engine) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 345, in step_async outputs = await self.model_executor.execute_model_async( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async return await self._driver_execute_model_async(execute_model_req) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 231, in _driver_execute_model_async return await self.driver_exec_model(execute_model_req) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run result = self.fn(*self.args, **self.kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 303, in execute_model inputs = self.prepare_input(execute_model_req) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 291, in prepare_input return self._get_driver_input_and_broadcast(execute_model_req) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 253, in _get_driver_input_and_broadcast self.model_runner.prepare_model_input( File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1380, in prepare_model_input model_input = self._prepare_model_input_tensors( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1038, in _prepare_model_input_tensors builder.add_seq_group(seq_group_metadata) File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 664, in add_seq_group per_seq_group_fn(inter_data, seq_group_metadata) File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 636, in _compute_multi_modal_input mm_kwargs = self.multi_modal_input_mapper(mm_data) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/registry.py", line 125, in map_input input_dict = plugin.map_input(model_config, data_value) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/base.py", line 265, in map_input return mapper(InputContext(model_config), data) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/internvl.py", line 279, in input_mapper_for_internvl data = torch.stack(data) ^^^^^^^^^^^^^^^^^ RuntimeError: stack expects each tensor to be equal size, but got [7, 3, 448, 448] at entry 0 and [13, 3, 448, 448] at entry 1 The above exception was the direct cause of the following exception: Traceback (most recent call last): File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 65, in _log_task_completion raise AsyncEngineDeadError( vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause. ERROR 09-11 05:24:13 client.py:266] Got Unhealthy response from RPC Server ERROR 09-11 05:24:13 client.py:412] AsyncEngineDeadError('Background loop is stopped.') ERROR 09-11 05:24:13 client.py:412] Traceback (most recent call last): ERROR 09-11 05:24:13 client.py:412] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate ERROR 09-11 05:24:13 client.py:412] await self.check_health(socket=socket) ERROR 09-11 05:24:13 client.py:412] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 429, in check_health ERROR 09-11 05:24:13 client.py:412] await self._send_one_way_rpc_request( ERROR 09-11 05:24:13 client.py:412] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 267, in _send_one_way_rpc_request ERROR 09-11 05:24:13 client.py:412] raise response ERROR 09-11 05:24:13 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped. ERROR 09-11 05:24:13 client.py:266] Got Unhealthy response from RPC Server ERROR 09-11 05:24:13 client.py:412] AsyncEngineDeadError('Background loop is stopped.') ERROR 09-11 05:24:13 client.py:412] Traceback (most recent call last): ERROR 09-11 05:24:13 client.py:412] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate ERROR 09-11 05:24:13 client.py:412] await self.check_health(socket=socket) ERROR 09-11 05:24:13 client.py:412] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 429, in check_health ERROR 09-11 05:24:13 client.py:412] await self._send_one_way_rpc_request( ERROR 09-11 05:24:13 client.py:412] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 267, in _send_one_way_rpc_request ERROR 09-11 05:24:13 client.py:412] raise response ERROR 09-11 05:24:13 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped. CRITICAL 09-11 05:24:13 launcher.py:82] AsyncLLMEngine has failed, terminating server process INFO: 10.151.92.18:51372 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error CRITICAL 09-11 05:24:13 launcher.py:82] AsyncLLMEngine has failed, terminating server process INFO: 10.151.92.18:51378 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error INFO: Shutting down INFO: Waiting for application shutdown. INFO: Application shutdown complete. INFO: Finished server process [1009] INFO 09-11 05:24:13 server.py:228] vLLM ZMQ RPC Server was interrupted. Future exception was never retrieved future: Traceback (most recent call last): File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate async for request_output in results_generator: File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 1073, in generate async for output in await self.add_request( File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 111, in generator raise result File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate async for request_output in results_generator: File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 1073, in generate async for output in await self.add_request( File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 111, in generator raise result File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion return_value = task.result() ^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop result = task.result() ^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step request_outputs = await self.engine.step_async(virtual_engine) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 345, in step_async outputs = await self.model_executor.execute_model_async( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async return await self._driver_execute_model_async(execute_model_req) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 231, in _driver_execute_model_async return await self.driver_exec_model(execute_model_req) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run result = self.fn(*self.args, **self.kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 303, in execute_model inputs = self.prepare_input(execute_model_req) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 291, in prepare_input return self._get_driver_input_and_broadcast(execute_model_req) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 253, in _get_driver_input_and_broadcast self.model_runner.prepare_model_input( File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1380, in prepare_model_input model_input = self._prepare_model_input_tensors( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1038, in _prepare_model_input_tensors builder.add_seq_group(seq_group_metadata) File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 664, in add_seq_group per_seq_group_fn(inter_data, seq_group_metadata) File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 636, in _compute_multi_modal_input mm_kwargs = self.multi_modal_input_mapper(mm_data) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/registry.py", line 125, in map_input input_dict = plugin.map_input(model_config, data_value) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/base.py", line 265, in map_input return mapper(InputContext(model_config), data) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/internvl.py", line 279, in input_mapper_for_internvl data = torch.stack(data) ^^^^^^^^^^^^^^^^^ RuntimeError: stack expects each tensor to be equal size, but got [7, 3, 448, 448] at entry 0 and [13, 3, 448, 448] at entry 1 ERROR 09-11 05:24:14 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 1215 died, exit code: -15 INFO 09-11 05:24:14 multiproc_worker_utils.py:123] Killing local vLLM worker processes root@fee87fa97dfb:/vllm-workspace# /usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d ' ```

Before submitting a new issue...

DarkLight1337 commented 1 month ago

This is similar to #8361. @Isotr0py can you look into this? I think the issue stems from different images potentially having different sizes even after postprocessing.

Isotr0py commented 1 month ago

Seems that it's caused by different num_patches from different image size, similar to #7392.