Open jenkspt opened 1 month ago
Flash Attention 3 makes use of new features of the Hopper architecture.

Are these all things that can currently (or in the future) be optimized automatically by the Triton compiler? And could the fused attention implementation from https://triton-lang.org/main/getting-started/tutorials/06-fused-attention.html make use of these without changes?

Triton OSS currently uses WGMMA (async) on Hopper. TMA support is still experimental, but work is ongoing to improve the descriptors. For computation overlapping, I am trying to see whether we can modify the existing software pipelining (SWP) pass to allow specifying stages/clusters for computation ops.
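For readers unfamiliar with the software pipelining (SWP) mentioned above: the idea is to issue the memory load for the next tile while the current tile is being computed, hiding memory latency behind compute. The following is a conceptual sketch in plain Python, not Triton's actual pass; `load` stands in for an async global-to-shared copy (e.g. via TMA) and `compute` for the MMA work (e.g. WGMMA), both hypothetical placeholders.

```python
def load(tiles, i):
    # Placeholder for an async global->shared-memory copy
    # (on Hopper this could be a TMA transfer).
    return tiles[i]

def compute(acc, tile):
    # Placeholder for the matrix-multiply work
    # (on Hopper this could be an async WGMMA).
    return acc + sum(tile)

def pipelined_sum(tiles):
    """Two-stage pipeline: prefetch tile i+1 while 'computing' tile i."""
    acc = 0
    nxt = load(tiles, 0)              # prologue: prefetch the first tile
    for i in range(len(tiles)):
        cur = nxt
        if i + 1 < len(tiles):
            nxt = load(tiles, i + 1)  # overlaps with the compute below
        acc = compute(acc, cur)       # consume the previously loaded tile
    return acc

print(pipelined_sum([[1, 2], [3, 4], [5, 6]]))  # 21
```

In a real kernel the load and compute are genuinely asynchronous, so the overlap saves time; Triton's SWP pass performs this restructuring automatically for `tl.load`/`tl.dot` inside a kernel loop, with the number of in-flight stages controlled by the `num_stages` meta-parameter.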