In Qwen2-VL's M-RoPE implementation, vLLM decides at runtime whether the input positions are multimodal. So when the input is text-only, the input positions have shape (seqlen);
however, vLLM's CUDA graph uses positions of shape (3, seqlen).
Does that mean we cannot use CUDA graphs for Qwen2-VL with text-only input? Otherwise, we would pass positions of shape (seqlen), but the CUDA graph would treat them as (3, seqlen).
However, I ran some tests, and there seems to be no difference in the final results between CUDA graph and eager mode with text-only input, so I was wondering why.
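For what it's worth, one plausible explanation (a toy sketch in NumPy, not vLLM's actual code, with made-up dimensions) is that for text-only input the single 1-D position row can simply be broadcast to all three M-RoPE rows (t/h/w). Since the three rows are then identical, the section-wise frequency selection degenerates to plain 1-D RoPE, so a graph captured with (3, seqlen) positions could still produce the same result:

```python
import numpy as np

# Toy sizes -- not vLLM's real configuration.
head_dim = 8                 # rotary dimension
sections = [2, 1, 1]         # hypothetical t/h/w split of the half-dim
seqlen = 5
inv_freq = 1.0 / (10000 ** (np.arange(0, head_dim, 2) / head_dim))  # (4,)

def rope_angles_1d(pos_1d):
    # Standard RoPE: angle[i, j] = pos[i] * inv_freq[j]
    return np.outer(pos_1d, inv_freq)                      # (seqlen, 4)

def mrope_angles(pos_3d):
    # M-RoPE sketch: the frequency dims are split into sections, and each
    # section takes its angles from a different position row (t, h, w).
    angles = np.outer(pos_3d.reshape(-1), inv_freq).reshape(3, seqlen, -1)
    chunks = np.split(angles, np.cumsum(sections)[:-1], axis=-1)
    return np.concatenate([chunks[i][i] for i in range(3)], axis=-1)

pos = np.arange(seqlen)
# Text-only case: all three rows are identical, so picking a different row
# per section changes nothing and we recover plain 1-D RoPE.
assert np.allclose(mrope_angles(np.tile(pos, (3, 1))), rope_angles_1d(pos))
```

If vLLM builds the text-only positions by broadcasting the 1-D row to (3, seqlen) before graph replay, that would be consistent with the identical outputs I observed.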
PS: I used nsys to profile the whole process, and the CUDA-graph run DOES have two more kernels than eager mode.
Left is CUDA graph, right is eager.
Before submitting a new issue...
[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.