HuiyuanYan opened this issue 2 weeks ago
I think this is potentially caused by the long processing time, as documented in #9238. You can try preprocessing the images to be smaller before passing them to vLLM and/or set max_pixels via --mm-processor-kwargs.
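A minimal sketch of both suggestions, assuming the offline LLM entry point and the Qwen2-VL prompt format used in vLLM's multimodal examples; the pixel budget and image-side limit below are illustrative values, not recommendations. For the OpenAI-compatible server, the same processor setting would be passed on the command line as `--mm-processor-kwargs '{"max_pixels": 1003520}'`.

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Cap the image processor's pixel budget so each image expands into fewer
# vision tokens. 1280 * 28 * 28 is an illustrative value; with `vllm serve`
# the same setting would go through:
#   --mm-processor-kwargs '{"max_pixels": 1003520}'
llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    mm_processor_kwargs={"max_pixels": 1280 * 28 * 28},
)

def downscale(path: str, max_side: int = 1024) -> Image.Image:
    """Shrink an image before handing it to vLLM, preserving aspect ratio."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side))  # in place; never upscales
    return img

prompt = (
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "Describe the image.<|im_end|>\n<|im_start|>assistant\n"
)
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": downscale("example.jpg")}},
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```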
Your current environment
How would you like to use vllm
I tried deploying qwen2-vl-7b with vLLM using the commands below. Please forgive the complex parameter settings; I went through numerous searches and attempts to get the deployment working and added many parameters to make sure it works.
And my device configuration is as follows:
The package list of my python virtual environment is as follows:
Here is the exception I found:
During inference, I found that if the prompt is short (or few images are passed in), the model usually runs normally. However, when the prompt is long (or many images are passed in), the vLLM process gets stuck or crashes with the following error message:
The vLLM subprocesses also do not terminate automatically and keep occupying the GPU unless I kill them manually.
I have tried various measures, including but not limited to: changing the -tp parameter to -pp, adding the --disable-custom-all-reduce flag, reducing --gpu-memory-utilization, and upgrading the vllm version, but the situation has not improved. All I hope is that someone can help me make sure qwen2-vl-7b can perform inference stably within its context length range (32768 tokens, including image tokens) without llm_engine crashes or similar issues. :penguin:
Before submitting a new issue...