triton-lang / triton

Development repository for the Triton language and compiler
https://triton-lang.org/
MIT License

[AMD] Add instruction schedule loop boundary guard hints #5163

Open ravil-mobile opened 1 week ago

ravil-mobile commented 1 week ago

Extended AMDGPU instruction scheduling.

ravil-mobile commented 1 week ago

@sjw, @antiagainst, @zhanglx13, could you please review the code?

zhanglx13 commented 6 days ago

The PR title is misleading. We don't need anything special for flash-attention-like kernels. All we need is a new sched_variant, in this case "guard" or some better name, such that:

  1. It can be set as a kernel argument.
  2. If set, we insert a sched.barrier at the loop boundaries of every loop that contains at least one dotOp.
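
A minimal sketch of the rule above, written as a toy Python model rather than the actual MLIR pass. The `Loop` class, the `add_loop_guards` helper, and the op-name strings are hypothetical stand-ins, and it assumes "loop boundaries" means the first and last position of the loop body:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Loop:
    """Toy stand-in for a loop body: a flat list of op names (not real MLIR)."""
    ops: List[str] = field(default_factory=list)


def add_loop_guards(loop: Loop, sched_variant: str) -> Loop:
    """Return a copy of `loop` with barriers at both ends when the rule applies.

    Assumption: the variant is selected by the string "guard", and a dot
    operation is spelled "tt.dot" in this toy representation.
    """
    if sched_variant != "guard":
        return loop  # the variant was not requested; leave the loop alone
    if "tt.dot" not in loop.ops:
        return loop  # no dot op in the loop, so no guard is inserted
    return Loop(["sched.barrier", *loop.ops, "sched.barrier"])


# Usage: only the loop that contains a dot op receives barriers.
with_dot = add_loop_guards(Loop(["tt.load", "tt.dot"]), "guard")
without_dot = add_loop_guards(Loop(["tt.load", "tt.store"]), "guard")
```

In this model a loop without a dot op is never guarded, which is the limitation the follow-up comment addresses.
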
ravil-mobile commented 6 days ago

@zhanglx13

I'd like to propose something different, i.e., a new instruction in our dialect that is dedicated to instruction scheduling guards (triton::amdgpu::InstructionSchedGuard). We would also need to introduce a dedicated conversion pass. This allows us to place guards independently of the number of tt.DotOps in a region. Moreover, the code becomes more readable because the scheduling logic and the guarding logic are separated. Guards could then be placed in three ways:

  1. Some instruction scheduling variants (e.g., local_prefetch) can add an InstructionSchedGuard to a region; the guard is lowered to the corresponding LLVM intrinsic calls later.
  2. A user can set a kernel argument to guard all scf.ForOps.
  3. A user can guard a specific for-loop via a dedicated tt.range parameter.
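
The split between placing guards and lowering them could look roughly like the toy Python model below. This is only a sketch under stated assumptions, not the proposed implementation: `ForLoop`, `place_guards`, `lower_guards`, the `guard` flag, the `guard_all` argument, and the op-name strings are all hypothetical, and it assumes the local_prefetch variant guards every loop it sees.

```python
from dataclasses import dataclass, field
from typing import List, Optional

GUARD = "amdgpu.instruction_sched_guard"  # hypothetical name for the guard op
BARRIER = "sched.barrier"                 # stand-in for the lowered intrinsic call


@dataclass
class ForLoop:
    """Toy stand-in for an scf.ForOp body (not real MLIR)."""
    ops: List[str] = field(default_factory=list)
    guard: bool = False  # hypothetical per-loop flag, as if set via tt.range


def place_guards(loops: List[ForLoop],
                 guard_all: bool = False,
                 variant: Optional[str] = None) -> List[ForLoop]:
    """Step 1: decide where guards go, independent of how many dot ops exist."""
    for loop in loops:
        wanted = guard_all or loop.guard or variant == "local_prefetch"
        if wanted and GUARD not in loop.ops:
            loop.ops = [GUARD, *loop.ops, GUARD]
    return loops


def lower_guards(loops: List[ForLoop]) -> List[ForLoop]:
    """Step 2: a separate conversion step rewrites each guard into a barrier."""
    for loop in loops:
        loop.ops = [BARRIER if op == GUARD else op for op in loop.ops]
    return loops


# Usage: a per-loop flag guards one loop; a kernel-wide flag guards all of them.
per_loop = lower_guards(place_guards(
    [ForLoop(["tt.load", "tt.store"]), ForLoop(["tt.dot"], guard=True)]))
kernel_wide = lower_guards(place_guards([ForLoop(["tt.load"])], guard_all=True))
```

Because the placement step never inspects the loop body for dot ops, a loop with no tt.DotOp can still be guarded, and the lowering step stays the same whichever of the three mechanisms requested the guard.
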