vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Qwen2VL model mrope implementation in cuda graph #9546

Open gujiewen opened 22 hours ago

gujiewen commented 22 hours ago

Anything you want to discuss about vllm.

In Qwen2VL's mrope implementation, vLLM decides at RUNTIME whether the input positions are for multimodal input. [screenshot]

So, when the input is text-only, the input positions have shape (seq_len,). However, vLLM's CUDA graph uses positions of shape (3, seq_len). [screenshot]

Does that mean we cannot use CUDA graph for Qwen2VL with text-only input? Otherwise, we get positions of shape (seq_len,), but the CUDA graph treats them as (3, seq_len)?

However, I ran some tests, and there seems to be no difference in the final results between CUDA graph and eager mode with text-only input, so I was wondering why. P.S. I used nsys to profile the whole process; CUDA graph DOES have two more kernels than eager mode. Left is CUDA graph, right is eager. [screenshot]
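One plausible explanation for the identical results (a sketch, not vLLM's actual code): in mrope, the (3, seq_len) positions hold one row each for the temporal, height, and width axes, and for text-only input all three rows are identical (0, 1, 2, ...). When the rows are equal, the mrope rotary angles reduce to standard 1-D RoPE angles, so capturing the graph with (3, seq_len) positions and feeding replicated text positions would be a semantic no-op. The `sections` split below is an illustrative assumption, not vLLM's exact frequency partitioning:

```python
# Hypothetical sketch: why text-only results can match between eager
# mode and a CUDA graph captured with mrope-shaped (3, seq_len) positions.
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard 1-D RoPE angles for positions of shape (seq_len,)."""
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)  # (seq_len, dim // 2)

def mrope_angles(positions_3d, dim, sections, base=10000.0):
    """mrope-style angles: each frequency slice takes its positions
    from one of the three rows (temporal/height/width), with the
    slice widths given by `sections` (an assumed partitioning)."""
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)
    angles = np.empty((positions_3d.shape[1], inv_freq.shape[0]))
    start = 0
    for row, width in enumerate(sections):
        angles[:, start:start + width] = np.outer(
            positions_3d[row], inv_freq[start:start + width])
        start += width
    return angles

seq_len, dim = 8, 16                 # dim // 2 = 8 frequency slots
pos_1d = np.arange(seq_len, dtype=np.float64)
pos_3d = np.tile(pos_1d, (3, 1))     # text-only: all three rows identical

a = rope_angles(pos_1d, dim)
b = mrope_angles(pos_3d, dim, sections=(3, 3, 2))
print(np.allclose(a, b))             # equal rows => mrope collapses to RoPE
```

This would explain the matching outputs; the two extra kernels seen in nsys would then just be the replication/reshaping of positions into the (3, seq_len) buffer the graph was captured with, rather than anything that changes the math.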


DarkLight1337 commented 19 hours ago

I believe this isn't implemented yet. @alex-jw-brooks do you have time to take this on? Oops, wrong issue.

DarkLight1337 commented 19 hours ago

I'm not really experienced with CUDA graph, perhaps @youkaichao can help answer this question?