vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance]: About the use of flash_attn_varlen_func() #5702

Open xwentian2020 opened 1 week ago

xwentian2020 commented 1 week ago

Proposal to improve performance

No response

Report of performance regression

No response

Misc discussion on performance

Hi, vllm developers,

I read the code and noticed the use of FlashAttention. I assume this algorithm is used in vLLM to run the prefill stage more quickly. Am I right in thinking so? Also, the vLLM code uses flash_attn_varlen_func() rather than the other FlashAttention entry points, e.g., flash_attn_func, flash_attn_kvpacked_func, flash_attn_qkvpacked_func, flash_attn_varlen_kvpacked_func, flash_attn_varlen_qkvpacked_func, and flash_attn_with_kvcache. Could you share the reasoning behind this choice? Was it selected because it is faster than the other implementations? Is there a notable difference between it and the other variants in vLLM's setting?
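
For context, my understanding is that the varlen interface takes prompts of different lengths concatenated into a single packed tensor and described by cumulative-length offsets, instead of a padded `(batch, seqlen, ...)` batch as with flash_attn_func. Below is a minimal sketch of how I believe it is called, following the flash-attn documentation; the toy shapes and sequence lengths are just for illustration, not taken from the vLLM code:

```python
# Minimal sketch of the packed/varlen calling convention (toy shapes, not vLLM's code).
import torch
from itertools import accumulate
from flash_attn import flash_attn_varlen_func

nheads, headdim = 8, 64
seqlens = [5, 3, 7]          # three prompts of different lengths
total = sum(seqlens)         # 15 tokens packed along one "total tokens" axis

q = torch.randn(total, nheads, headdim, device="cuda", dtype=torch.float16)
k = torch.randn(total, nheads, headdim, device="cuda", dtype=torch.float16)
v = torch.randn(total, nheads, headdim, device="cuda", dtype=torch.float16)

# Cumulative offsets mark where each sequence starts/ends in the packed tensor: [0, 5, 8, 15]
cu_seqlens = torch.tensor([0] + list(accumulate(seqlens)),
                          dtype=torch.int32, device="cuda")

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max(seqlens), max_seqlen_k=max(seqlens),
    causal=True,
)
# out has shape (total, nheads, headdim): one output row per packed token, no padding.
```

I can see how this packed layout would fit continuous batching, where the prompts in a batch rarely share the same length, but I would like to confirm whether that is the actual motivation.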

Thanks.

Your current environment (if you think it is necessary)

The output of `python collect_env.py`
youkaichao commented 1 week ago

cc @WoosukKwon for flash attention