vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance]: How to use vllm.attention.ops.triton_flash_attention to replace the flash_attn package #5534

Open. Arcmoon-Hu opened this issue 3 months ago

Arcmoon-Hu commented 3 months ago

Proposal to improve performance

My GPU is too old to install the flash_attn package, so I want to use vllm.attention.ops.triton_flash_attention as a replacement for flash_attn.

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`
simon-mo commented 3 months ago

The Triton kernel is tailored to AMD GPUs at the moment. I would recommend setting VLLM_ATTENTION_BACKEND=xformers instead.
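
Following up on that suggestion, here is a minimal sketch of selecting the xFormers backend via the environment variable before vLLM is imported. The model name is only a placeholder for illustration, and the backend value is typically written in uppercase (XFORMERS):

```python
import os

# Select the xFormers attention backend instead of flash_attn.
# Set this before importing vllm, since the backend is picked when the engine initializes.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM, SamplingParams

# Placeholder model; substitute whatever model you are actually serving.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```

The same variable can also be exported in the shell when launching the OpenAI-compatible server, e.g. `VLLM_ATTENTION_BACKEND=XFORMERS python -m vllm.entrypoints.openai.api_server --model <model>`.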