soacker opened this issue 3 weeks ago
Are you running with Xformers backend or FlashAttention backend?
Yeah, I use the FlashAttention backend.
I encountered a similar problem. Greedy sampling, prompt length 2000, output_len=1, and the input requests share no common prefix at all (all requests are different). The inference results of the first 7 requests are identical with and without prefix caching, and the inference speed is also similar (3.4 tokens/s). From the 8th request onward, the results differ between the two settings, and the inference speed becomes much faster (17.6 tokens/s). Please take a look at this bug, thank you. vLLM version: 0.5.0, 1x A100-40G, llama2-13b.
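For reference, a minimal sketch of how such a comparison could be reproduced offline; the model id, random prompt generation, and overall structure are my assumptions based on the description above, not the reporter's actual script:

```python
# Hedged reproduction sketch: compare greedy outputs with and without
# prefix caching. Model path and prompt contents are placeholders.
import random
import string
import time

from vllm import LLM, SamplingParams

def random_prompt(n_chars: int = 8000) -> str:
    # Long prompts with no shared prefix (assumption: random text)
    return "".join(random.choices(string.ascii_lowercase + " ", k=n_chars))

prompts = [random_prompt() for _ in range(16)]
greedy = SamplingParams(temperature=0.0, max_tokens=1)  # output_len=1, greedy

results = {}
for use_prefix_cache in (False, True):
    # In practice each configuration is better run in a separate process
    # so the previous engine's GPU memory is fully released.
    llm = LLM(model="meta-llama/Llama-2-13b-hf",
              enable_prefix_caching=use_prefix_cache)
    start = time.perf_counter()
    outputs = llm.generate(prompts, greedy)
    elapsed = time.perf_counter() - start
    results[use_prefix_cache] = [o.outputs[0].text for o in outputs]
    print(f"prefix_caching={use_prefix_cache}: {elapsed:.2f}s total")

# With greedy sampling the two runs should produce identical tokens;
# the report above says they diverge from the 8th request onward.
for i, (a, b) in enumerate(zip(results[False], results[True])):
    if a != b:
        print(f"request {i}: outputs differ")
```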
Report of performance regression
I load Llama-3-70B onto 4 GPUs.
I recorded common_computed_block_nums and its context_lens_tensor, and also the model execution time. Interestingly, I find that (1) when a request first hits the common cached prefix, model execution is slow, while every later request that hits the cached prefix is fast; and (2) not hitting the cached prefix at all seems even faster.
I know that with a prefix-cache hit the prefill stage runs the context-attention forward (the new tokens attend over the cached KV), while without a cache hit it runs the full attention forward.
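To make that cost difference concrete, here is a rough back-of-the-envelope sketch; the prompt length, cached-prefix length, and the simplified cost model are my assumptions, not measured values:

```python
# Simplified cost model (assumption): prefill compute scales with the number
# of query tokens pushed through the model, and attention-score work scales
# with query_tokens * context_len.
prompt_len = 2000          # total prompt tokens (assumed)
cached_prefix_len = 1500   # tokens already present in the prefix cache (assumed)

# No cache hit: every prompt token is a query token during prefill.
full_query_tokens = prompt_len
full_attention_pairs = prompt_len * prompt_len   # upper bound, ignoring the causal mask

# Cache hit: only the uncached suffix runs through the forward pass
# ("context attention"), but each suffix token still attends over the
# whole 2000-token context via the cached KV blocks.
hit_query_tokens = prompt_len - cached_prefix_len
hit_attention_pairs = hit_query_tokens * prompt_len

print(f"query tokens processed: {full_query_tokens} vs {hit_query_tokens}")
print(f"attention pairs:        {full_attention_pairs} vs {hit_attention_pairs}")
# With these assumed numbers the linear-layer work drops ~4x on a warm hit,
# which explains why repeated hits are fast; it does not by itself explain
# why the *first* hit is slower than running with no cache at all.
```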
No prefix cache hit.
First prefix cache hit.
Second prefix cache hit.
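For completeness, a hedged sketch of how the three cases above could be timed in one script; the model id, shared prefix, and prompts are placeholders, and I am assuming the "no hit" case is the first request, which populates the cache:

```python
import time

from vllm import LLM, SamplingParams

# Sketch: observe "no hit", "first hit", "second hit" latency with prefix
# caching enabled. Model id and prompt contents are placeholders.
llm = LLM(model="meta-llama/Meta-Llama-3-70B",   # assumed model id
          tensor_parallel_size=4,
          enable_prefix_caching=True)

shared_prefix = "some long shared document ... " * 200   # common prefix (assumed)
suffixes = ["question A?", "question B?", "question C?"]
params = SamplingParams(temperature=0.0, max_tokens=1)

labels = ["no hit (request populates the cache)", "first hit", "second hit"]
for label, suffix in zip(labels, suffixes):
    start = time.perf_counter()
    llm.generate([shared_prefix + suffix], params)
    print(f"{label}: {time.perf_counter() - start:.3f}s")
```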
Can somebody explain the reason for this, and are there any possible improvements?