Open · wlwqq opened 2 months ago
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
If you encounter this error, you are not using the flashinfer backend; the message is reported by the flash-attn package.
Try setting VLLM_ATTENTION_BACKEND=FLASHINFER and running the script again.
see https://github.com/vllm-project/vllm/issues/8189#issuecomment-2332350166
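For reference, a minimal sketch of forcing the backend from Python is below; the environment variable can equally be exported in the shell before launching the script. The model name and max_model_len are assumptions for this issue's T4 setup, not values confirmed in the thread.

```python
import os

# Must be set before vllm is imported so the backend choice takes effect.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

# Assumed model and context length for a 16 GB T4 (not taken from the report).
llm = LLM(model="google/gemma-2-2b", dtype="half", max_model_len=2048)
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```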
@yzh119 we do have some TODOs for the flashinfer backend; until those are done, flashinfer still depends on flash attention for prefill :(
Hi @youkaichao, thanks for letting me know! flashinfer v0.1.7 will be fully JIT, and I'll make it a PyPI package that can be set as a vllm dependency. I'll keep you posted on the progress.
Your current environment

The output of `python collect_env.py` was not attached; the reported versions are:

```text
vllm == 0.5.5
FlashInfer == 0.1.6+cu121torch2.4
```

🐛 Describe the bug
When I use vLLM 0.5.5 and FlashInfer 0.1.6 to run Gemma-2-2b on a T4, I hit the RuntimeError shown above, even though FlashInfer 0.1.6 supports the T4: https://github.com/flashinfer-ai/flashinfer/releases
I'm not sure if it's a problem with vLLM's integration with FlashInfer. @youkaichao @LiuXiaoxuanPKU
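Not part of the original report, but a quick way to confirm the failure comes from the GPU generation rather than from FlashInfer itself is to check the compute capability: flash-attn requires Ampere (8.0 or newer), while the T4 is Turing (7.5). A minimal sketch:

```python
import torch

# flash-attn needs compute capability >= 8.0 (Ampere); a T4 reports 7.5.
major, minor = torch.cuda.get_device_capability()
print(f"GPU compute capability: {major}.{minor}")
if (major, minor) < (8, 0):
    print("flash-attn prefill is unsupported on this GPU; "
          "set VLLM_ATTENTION_BACKEND=FLASHINFER instead.")
```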