vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Qwen2VL model mrope implementation in cuda graph #9546

Open gujiewen opened 22 hours ago

gujiewen commented 22 hours ago

Anything you want to discuss about vllm.

In Qwen2VL's mrope implementation, vLLM decides at RUNTIME whether the input positions are for multimodal input. [screenshot]

So, when the input is text-only, the input positions have shape (seq_len,). However, vLLM's CUDA graph uses positions of shape (3, seq_len). [screenshot]

Does that mean we cannot use CUDA graph for Qwen2VL with text-only input? Otherwise, we get positions of shape (seq_len,), but the CUDA graph treats them as (3, seq_len)?

However, I ran some tests, and there seems to be no difference in the final results between CUDA graph and eager mode with text-only input, so I was wondering why. P.S. I used nsys to profile the whole process; CUDA graph DOES have two more kernels than eager mode. Left is CUDA graph, right is eager. [screenshot]
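One plausible explanation for the identical results (a sketch, not vLLM's actual code): in mrope, the (3, seq_len) positions hold one row each for the temporal, height, and width axes, and for text-only input all three rows are identical (0, 1, 2, ...). When the rows are equal, the mrope rotary angles reduce to standard 1-D RoPE angles, so capturing the graph with (3, seq_len) positions and feeding replicated text positions would be a semantic no-op. The `sections` split below is an illustrative assumption, not vLLM's exact frequency partitioning:

```python
# Hypothetical sketch: why text-only results can match between eager
# mode and a CUDA graph captured with mrope-shaped (3, seq_len) positions.
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard 1-D RoPE angles for positions of shape (seq_len,)."""
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)  # (seq_len, dim // 2)

def mrope_angles(positions_3d, dim, sections, base=10000.0):
    """mrope-style angles: each frequency slice takes its positions
    from one of the three rows (temporal/height/width), with the
    slice widths given by `sections` (an assumed partitioning)."""
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)
    angles = np.empty((positions_3d.shape[1], inv_freq.shape[0]))
    start = 0
    for row, width in enumerate(sections):
        angles[:, start:start + width] = np.outer(
            positions_3d[row], inv_freq[start:start + width])
        start += width
    return angles

seq_len, dim = 8, 16                 # dim // 2 = 8 frequency slots
pos_1d = np.arange(seq_len, dtype=np.float64)
pos_3d = np.tile(pos_1d, (3, 1))     # text-only: all three rows identical

a = rope_angles(pos_1d, dim)
b = mrope_angles(pos_3d, dim, sections=(3, 3, 2))
print(np.allclose(a, b))             # equal rows => mrope collapses to RoPE
```

This would explain the matching outputs; the two extra kernels seen in nsys would then just be the replication/reshaping of positions into the (3, seq_len) buffer the graph was captured with, rather than anything that changes the math.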


DarkLight1337 commented 19 hours ago

I believe this isn't implemented yet. @alex-jw-brooks do you have time to take this on? Oops, wrong issue.

DarkLight1337 commented 19 hours ago

I'm not really experienced with CUDA graph, perhaps @youkaichao can help answer this question?