This happened for me when using an RTX 4090 with a compute capability <8.0. If you go with an A100 and compute capability >8.0, as recommended, you should avoid this problem.
Hi @yipy0005, you can use the flag --flash_attention_implementation=xla (as outlined in our performance docs) to disable flash attention. As @championsnet says, this will work on GPUs with compute capability <8.0. Closing the issue for now, but please feel free to re-open if you still run into any issues! :)
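For reference, a minimal sketch of the constraint being described, not the project's actual code (the helper name is just for illustration, and it assumes a CUDA-enabled JAX install whose GPU devices expose a compute_capability attribute):

```python
import jax


def pick_flash_attention_implementation() -> str:
    """Sketch: Triton flash attention needs compute capability >= 8.0, otherwise fall back to XLA."""
    try:
        gpu = jax.local_devices(backend="gpu")[0]
    except RuntimeError:
        # No CUDA backend is available at all, so the portable XLA path is the only option.
        return "xla"
    # CUDA devices report their compute capability as a string such as "8.9" or "7.5".
    return "triton" if float(gpu.compute_capability) >= 8.0 else "xla"


print(pick_flash_attention_implementation())
```

Whatever it returns is the value you would pass as --flash_attention_implementation=... on the command line.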
Just wanted to flag that we improved the error message for this in 3599612
> This happened for me when using an RTX 4090 with a compute capability <8.0. If you go with an A100 and compute capability >8.0, as recommended, you should avoid this problem.
Just to point out that the GeForce RTX 4090 has a compute capability > 8.0. Not that it matters, but in fact it has a higher compute capability (8.9) than the A100 (8.0). https://developer.nvidia.com/cuda-gpus
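If you want to see what your own card reports, a quick check from the same JAX install (again assuming a CUDA build that exposes these device attributes):

```python
import jax

# Prints each visible GPU with its name and compute capability,
# e.g. "NVIDIA GeForce RTX 4090 8.9" or "NVIDIA A100-SXM4-40GB 8.0".
for device in jax.local_devices(backend="gpu"):
    print(device.device_kind, device.compute_capability)
```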
I was running the following:
and I got the error
```
ValueError: implementation='triton' is unsupported on this GPU generation.
```

when it was running model inference. I'd appreciate some help with this. Thank you! 😁