manman-ren opened 1 week ago
@pawelszczerbuk The frontend is an annotation on the loop; inside the LoopSchedule pass, we use the annotation to check whether the ttgir matches the specific schedule, and if it does, we perform the corresponding <stage, cluster> assignment.
I understand that you are working on further refactoring and possibly on a frontend design for specifying a loop schedule. This PR is mostly meant to share the performance numbers and the preliminary implementation. Happy to work together on enabling this!
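For concreteness, here is a minimal sketch of what the loop-level annotation could look like from the Python frontend. The `loop_schedule` keyword and the schedule name `"FA_firstDot"` are assumptions for illustration and may not match this PR's final spelling:

```python
# Hypothetical sketch only: `loop_schedule` and "FA_firstDot" are assumed
# names, not necessarily this PR's final API. The idea is that the loop
# carries an annotation that the LoopSchedule pass matches against the
# ttgir before assigning <stage, cluster> to each op.
import triton
import triton.language as tl

@triton.jit
def attn_inner(K, V, qk_scale, N_CTX: tl.constexpr, BLOCK_N: tl.constexpr):
    for start_n in tl.range(0, N_CTX, BLOCK_N, loop_schedule="FA_firstDot"):
        # Loads and dots elided; with the annotation, the pass picks the
        # stage of each op instead of the default "loads in S0, everything
        # else in the last stage".
        pass
```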
With the recent SWP refactoring, it is much easier to support arbitrary stage assignments where computations can be separated into different stages; computation pipelining is essentially splitting the computation across stages. Take flash attention as an example. Currently the two loads are in stage 0 (S0) and all other ops are in the last stage (S2), so the steady-state loop body looks like `MMA0(i) Softmax(i) MUL(i) MMA1(i) LoadV(i+2) LoadK(i+2)`.
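To make the stage-to-offset mapping concrete: with last stage `L`, an op assigned to stage `s` operates on iteration `i + (L - s)` in the steady state, so S0 loads are prefetched furthest ahead. A small Python helper (illustrative only, not part of the patch) reproduces the loop body above:

```python
# Illustrative helper, not part of the patch: given a stage assignment,
# print the steady-state loop body. An op in stage s runs on iteration
# i + (last_stage - s), so stage-0 loads are prefetched furthest ahead.
def steady_state(stages: dict[str, int]) -> str:
    last = max(stages.values())
    return " ".join(
        f"{op}(i)" if s == last else f"{op}(i+{last - s})"
        for op, s in stages.items()
    )

# Current default: loads in S0, all other ops in the last stage (S2).
print(steady_state({"MMA0": 2, "Softmax": 2, "MUL": 2, "MMA1": 2,
                    "LoadV": 0, "LoadK": 0}))
# -> MMA0(i) Softmax(i) MUL(i) MMA1(i) LoadV(i+2) LoadK(i+2)
```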
This patch defines two different pipeline schedules for attention-like kernels (both reproduced by the helper sketch below):

1. First dot (MMA0) in S2, other computations in S3, loadK in S0, loadV in S1; the steady-state loop body becomes:
   `MMA0(i+1) Softmax(i) MUL(i) MMA1(i) loadK(i+3) loadV(i+2)`
2. Second dot (MMA1) in S3, other computations in S2, loadK in S0, loadV in S1; the steady-state loop body becomes:
   `MMA0(i+1) MMA1(i) Softmax(i+1) MUL(i+1) loadK(i+3) loadV(i+2)`
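Feeding these two assignments to the same illustrative `steady_state` helper yields exactly the loop bodies listed above; note how both schedules overlap MMA0 of iteration i+1 with the rest of iteration i:

```python
# Schedule 1: MMA0 in S2; Softmax, MUL, MMA1 in S3; loadK in S0; loadV in S1.
print(steady_state({"MMA0": 2, "Softmax": 3, "MUL": 3, "MMA1": 3,
                    "loadK": 0, "loadV": 1}))
# -> MMA0(i+1) Softmax(i) MUL(i) MMA1(i) loadK(i+3) loadV(i+2)

# Schedule 2: MMA1 in S3; MMA0, Softmax, MUL in S2; loadK in S0; loadV in S1.
print(steady_state({"MMA0": 2, "MMA1": 3, "Softmax": 2, "MUL": 2,
                    "loadK": 0, "loadV": 1}))
# -> MMA0(i+1) MMA1(i) Softmax(i+1) MUL(i+1) loadK(i+3) loadV(i+2)
```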
Preliminary performance numbers on H100 for flash attention:

| (Batch, Heads, SeqLen, Dhead) | triton_tutorial_flash_v2_opt-tflops | triton_tutorial_flash_v2_tma-tflops | triton_tutorial_flash_v2-tflops |
| --- | --- | --- | --- |
Both the implementation and the frontend are preliminary and intended for discussion.