Open jenkspt opened 1 month ago
Flash Attention 3 makes use of new features of the Hopper architecture.

Are these all things that can currently (or in the future) be optimized automatically by the Triton compiler? And could the fused attention implementation from https://triton-lang.org/main/getting-started/tutorials/06-fused-attention.html make use of these without changes?

Triton OSS currently uses WGMMA (async) on Hopper. TMA support is still experimental, but work is ongoing to improve the descriptors. For computation overlapping, I am trying to see whether we can modify the existing software pipelining (SWP) pass to allow specifying stages/clusters for computation ops.
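For readers unfamiliar with the software pipelining (SWP) mentioned above: the idea is to issue the memory load for the next tile while the current tile is being computed, hiding memory latency behind compute. The following is a conceptual sketch in plain Python, not Triton's actual pass; `load` stands in for an async global-to-shared copy (e.g. via TMA) and `compute` for the MMA work (e.g. WGMMA), both hypothetical placeholders.

```python
def load(tiles, i):
    # Placeholder for an async global->shared-memory copy
    # (on Hopper this could be a TMA transfer).
    return tiles[i]

def compute(acc, tile):
    # Placeholder for the matrix-multiply work
    # (on Hopper this could be an async WGMMA).
    return acc + sum(tile)

def pipelined_sum(tiles):
    """Two-stage pipeline: prefetch tile i+1 while 'computing' tile i."""
    acc = 0
    nxt = load(tiles, 0)              # prologue: prefetch the first tile
    for i in range(len(tiles)):
        cur = nxt
        if i + 1 < len(tiles):
            nxt = load(tiles, i + 1)  # overlaps with the compute below
        acc = compute(acc, cur)       # consume the previously loaded tile
    return acc

print(pipelined_sum([[1, 2], [3, 4], [5, 6]]))  # 21
```

In a real kernel the load and compute are genuinely asynchronous, so the overlap saves time; Triton's SWP pass performs this restructuring automatically for `tl.load`/`tl.dot` inside a kernel loop, with the number of in-flight stages controlled by the `num_stages` meta-parameter.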