triton-lang / triton

Development repository for the Triton language and compiler
https://triton-lang.org/
MIT License

Why is fused attention only applicable to Ampere GPUs? #1279

rayleizhu opened this issue 1 year ago (status: Open)

rayleizhu commented 1 year ago

Hi, I'm writing my own operator using the fused attention kernel as a template. However, I found that fused attention requires the Ampere architecture:

https://github.com/openai/triton/blob/d376020f90002757eea3ea9475d4f7cfc2ec5ead/python/triton/ops/flash_attention.py#L200

I don't understand why this restriction exists.
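For context, my understanding is that the guard at the linked line is a compute-capability check along these lines (a rough sketch of what I think it does, not the exact source):

```python
import torch

# Ampere GPUs report compute capability (8, x); older architectures are rejected.
capability = torch.cuda.get_device_capability()
if capability[0] < 8:
    raise RuntimeError(
        "flash attention is only supported on GPUs with compute capability >= 8.0"
    )
```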

Also, it seems that only head_dim=64 is supported, right? How can I adapt it for the head_dim=32 case? A minimal sketch of what I'm trying to run is below.

https://github.com/openai/triton/blob/d376020f90002757eea3ea9475d4f7cfc2ec5ead/python/triton/ops/flash_attention.py#L207
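To make the question concrete, here is roughly what I am trying to run (the shapes are hypothetical, and I am assuming the op is exposed as triton.ops.attention; the exact entry point and signature may differ):

```python
import torch
import triton.ops  # assuming the fused attention op lives here

# Hypothetical shapes for my use case: head_dim = 32 instead of the expected 64.
B, H, N_CTX, D_HEAD = 4, 8, 1024, 32
q = torch.randn(B, H, N_CTX, D_HEAD, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
sm_scale = D_HEAD ** -0.5

# This falls outside the supported configuration, since the kernel appears to
# hard-code a 64-wide head dimension at the linked line.
out = triton.ops.attention(q, k, v, sm_scale)
```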

ptillet commented 1 year ago

There is some more information in https://github.com/openai/triton/issues/616.