Proposal to improve performance
No response
Report of performance regression
No response
Misc discussion on performance
Hi, vLLM developers,
I read the code and noticed the use of FlashAttention. My understanding is that this algorithm is used in vLLM mainly to run the prefill stage more quickly. Am I right in thinking so? Also, the vLLM code uses flash_attn_varlen_func() rather than the other FlashAttention entry points, e.g., flash_attn_func, flash_attn_kvpacked_func, flash_attn_qkvpacked_func, flash_attn_varlen_kvpacked_func, flash_attn_varlen_qkvpacked_func, and flash_attn_with_kvcache. Could you share the considerations behind this choice? Was it also selected because it is faster than the other implementations? Is there a marked difference between it and the other variants in vLLM's setting?
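For context, here is a minimal sketch of how I understand the varlen calling convention (this is not vLLM's actual code; the sequence lengths and shapes are my own illustration, assuming flash-attn 2.x on a CUDA GPU): prompts of different lengths are packed into one tensor without padding, and cu_seqlens marks the sequence boundaries, which seems to match vLLM's batching of variable-length requests.

```python
# Minimal sketch of the flash_attn_varlen_func calling convention.
# Assumes flash-attn 2.x on a CUDA GPU; seq_lens and shapes are illustrative.
import torch
from flash_attn import flash_attn_varlen_func

num_heads, head_dim = 8, 64
seq_lens = [5, 13, 7]                 # three prompts of different lengths
total_tokens = sum(seq_lens)          # packed together, no padding tokens

# One packed tensor per projection: (total_tokens, num_heads, head_dim)
q = torch.randn(total_tokens, num_heads, head_dim,
                dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Cumulative sequence lengths delimit each prompt: [0, 5, 18, 25]
cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32, device="cuda")
cu_seqlens[1:] = torch.cumsum(
    torch.tensor(seq_lens, dtype=torch.int32, device="cuda"), dim=0)

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens,
    cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max(seq_lens),
    max_seqlen_k=max(seq_lens),
    causal=True,                      # causal mask, as in prefill
)
# out has shape (total_tokens, num_heads, head_dim)
```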
Thanks.
Your current environment (if you think it is necessary)