We also ran llava2 on A20 and saw a similar case.
I think we have found the reason: all PyTorch CUDA operations are asynchronous, which means the torch functions are just wrappers that launch kernels, and torch ensures the correct ordering with its own synchronization. See the "Asynchronous execution" section of the PyTorch CUDA semantics documentation for more details.
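A minimal standalone sketch (not vLLM code) of how this skews wall-clock timing. The kernel launch returns almost immediately, and the GPU time is charged to the first call that has to wait for the result:

```python
import time
import torch

# A large matmul so the kernel takes a measurable amount of GPU time.
x = torch.randn(8192, 8192, device="cuda")

t0 = time.perf_counter()
y = x @ x                   # returns immediately: the kernel is only launched here
t1 = time.perf_counter()
head = y[0, :4].tolist()    # needs the result on the CPU, so it blocks until the matmul finishes
t2 = time.perf_counter()

print(f"matmul launch (looks cheap):   {(t1 - t0) * 1e3:.3f} ms")
print(f"tolist (absorbs the GPU time): {(t2 - t1) * 1e3:.3f} ms")
```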
This teaches us a lesson: for all torch CUDA-related operations, use torch.cuda.Event to measure time. time.perf_counter() should never be used for this; it is only suitable for end-to-end measurement.
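For example, a minimal sketch of the event-based measurement:

```python
import torch

x = torch.randn(8192, 8192, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()              # the events are enqueued on the same CUDA stream as the kernel
y = x @ x
end.record()
torch.cuda.synchronize()    # wait until both events have actually been reached on the GPU

print(f"matmul: {start.elapsed_time(end):.3f} ms")  # elapsed_time returns milliseconds
```

Alternatively, calling torch.cuda.synchronize() right before reading time.perf_counter() gives comparable numbers, at the cost of stalling the pipeline.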
Misc discussion on performance
I am trying to understand how prefix caching helps with prompts of different lengths, so I ran benchmark_prefix_caching.py with different prompt lengths (basically PROMPT * n in the code). I also changed num_prompts to 1 and set output_len to 1, and I modified model_runner.py to understand where the benefit comes from. IMO, the benefit should come from model execution, because prefix caching reduces the complexity of attention from O(n^2) to O(n). I added timestamps there to measure the time spent on model execution, logit computation, and sampling, respectively.
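The instrumentation was essentially naive wall-clock timers (time.perf_counter()) around each stage, roughly of this shape. This is a simplified illustration, not the exact code, and the call names in the comment (model_executable, compute_logits, sample) are written from memory and may differ across vLLM versions:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage_timer(name: str, results: dict):
    """Naive wall-clock timer around one stage (it does not synchronize with the GPU)."""
    t0 = time.perf_counter()
    yield
    results[name] = (time.perf_counter() - t0) * 1e3  # milliseconds

# Illustrative placement inside ModelRunner.execute_model:
#
#   timings = {}
#   with stage_timer("model", timings):
#       hidden_states = model_executable(**execute_model_kwargs)
#   with stage_timer("logits", timings):
#       logits = self.model.compute_logits(hidden_states, sampling_metadata)
#   with stage_timer("sample", timings):
#       output = self.model.sample(logits=logits, sampling_metadata=sampling_metadata)
#   print(timings)
```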
I set n=40 (prompt len = 30121) and tested a llama-7b variant which supports a long context window. The commands I ran are as follows:
The result I got when I enabled prefix caching is:
Without prefix caching, it is:
I wonder why the sampling time decreases dramatically rather than the time spent when model_executable is invoked. I further identified the place in sampler.py which costs the most time. By default, vLLM uses the greedy sampler, and the following place in greedy_sampler is the most expensive:
This is very different from what I thought. One reason I can think of is that there is a future which is only realized when tolist() is called, and the model executable is not actually executing the CUDA kernels synchronously. I guessed this might be related to CUDA graphs, but I saw the above result with CUDA graphs both enabled and disabled.
I also tried different models (llama-2-7b, llama-2-13b) with different prompt lengths and saw the same pattern: most of the time reduction comes from sampling rather than model_executable, so I am really confused now. Has anyone looked into the execution with prefix caching in detail? Thanks a lot!
Your current environment (if you think it is necessary)