vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance]: How to use vllm.attention.ops.triton_flash_attention to replace the flash_attn package #5534

Open. Arcmoon-Hu opened this issue 3 months ago

Arcmoon-Hu commented 3 months ago

Proposal to improve performance

My GPU is too old to install the flash_attn package, so I want to use vllm.attention.ops.triton_flash_attention as a replacement for flash_attn.

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`
simon-mo commented 3 months ago

The Triton kernel is tailored to AMD GPUs at the moment. I would recommend setting VLLM_ATTENTION_BACKEND=xformers instead.
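
Following up on that suggestion, here is a minimal sketch of selecting the xFormers backend via the environment variable before vLLM is imported. The model name is only a placeholder for illustration, and the backend value is typically written in uppercase (XFORMERS):

```python
import os

# Select the xFormers attention backend instead of flash_attn.
# Set this before importing vllm, since the backend is picked when the engine initializes.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM, SamplingParams

# Placeholder model; substitute whatever model you are actually serving.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```

The same variable can also be exported in the shell when launching the OpenAI-compatible server, e.g. `VLLM_ATTENTION_BACKEND=XFORMERS python -m vllm.entrypoints.openai.api_server --model <model>`.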