vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance]: Qwen2-VL-7B AWQ model performance #9863

Open zzf2grx opened 4 weeks ago

zzf2grx commented 4 weeks ago

Proposal to improve performance

Hi! I find that the inference time of Qwen2-VL-7B AWQ is not much improved compared to Qwen2-VL-7B. Do you have any suggestions for improving performance? Thank you!


DarkLight1337 commented 4 weeks ago

I think the inference time may be dominated by the preprocessing, so it might not be related to the model itself. See #9238 for more details.

zzf2grx commented 3 weeks ago

> I think the inference time may be dominated by the preprocessing, so it might not be related to the model itself. See #9238 for more details.

But in lmdeploy, AWQ-quantized models are about 2x as fast as fp16 models. Is there any way to improve the speed of AWQ or other quantized models?

DarkLight1337 commented 3 weeks ago

> > I think the inference time may be dominated by the preprocessing, so it might not be related to the model itself. See #9238 for more details.
>
> But in lmdeploy, AWQ-quantized models are about 2x as fast as fp16 models. Is there any way to improve the speed of AWQ or other quantized models?

This is only a problem for Qwen2-VL in particular, because their image preprocessing is very slow. It should not be a problem for other AWQ models.
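To check whether preprocessing rather than the quantized model dominates end-to-end latency, you can time the two stages separately. A minimal sketch with hypothetical stand-in stage functions (the real calls would be the HF processor call and `llm.generate`, which are not invoked here):

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Accumulate wall-clock time per pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

# Hypothetical stand-ins for the real pipeline stages:
def preprocess_image():
    time.sleep(0.05)  # e.g. processor(images=..., text=...)

def generate():
    time.sleep(0.01)  # e.g. llm.generate(...)

with timed("preprocess"):
    preprocess_image()
with timed("generate"):
    generate()

total = sum(timings.values())
for stage, t in timings.items():
    print(f"{stage}: {t:.3f}s ({100 * t / total:.0f}%)")
```

If the preprocess share dominates, AWQ weight quantization cannot shrink end-to-end latency by much, which matches the behavior reported above.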

zzf2grx commented 3 weeks ago

> > > I think the inference time may be dominated by the preprocessing, so it might not be related to the model itself. See #9238 for more details.
> >
> > But in lmdeploy, AWQ-quantized models are about 2x as fast as fp16 models. Is there any way to improve the speed of AWQ or other quantized models?
>
> This is only a problem for Qwen2-VL in particular, because their image preprocessing is very slow. It should not be a problem for other AWQ models.

So is there any advice on how to improve the speed of image preprocessing?

DarkLight1337 commented 3 weeks ago

> So is there any advice on how to improve the speed of image preprocessing?

You can try passing smaller images to the model.
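Since preprocessing cost grows with image resolution, downscaling inputs before they reach the model is the simplest lever. A sketch using Pillow (the `cap_pixels` helper and the 512*512 pixel budget are illustrative choices, not a vLLM API):

```python
from PIL import Image

def cap_pixels(img: Image.Image, max_pixels: int = 512 * 512) -> Image.Image:
    """Downscale an image so that width * height <= max_pixels,
    preserving the aspect ratio. Images under the budget pass through."""
    w, h = img.size
    if w * h <= max_pixels:
        return img
    scale = (max_pixels / (w * h)) ** 0.5
    new_size = (max(1, int(w * scale)), max(1, int(h * scale)))
    return img.resize(new_size, Image.BICUBIC)

# A large synthetic image gets shrunk before being sent to the model.
big = Image.new("RGB", (4096, 2048))
small = cap_pixels(big)
print(small.size)  # → (724, 362)
```

Qwen2-VL's HF processor also exposes `min_pixels`/`max_pixels` limits that cap resolution at the preprocessing stage, which achieves a similar effect without resizing images yourself.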