[AMD] Add instruction schedule loop boundary guard hints

ravil-mobile commented 1 week ago

Extended AMDGPU instruction scheduling.

These changes add sched.barriers (called guards) at the beginning and at the end of each scf.ForOp. The guards prevent instructions from being moved out of the basic blocks adjacent to for-loop bodies. According to test results, this improves performance for FA-like kernels because it reduces VGPR spilling.
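To make the effect of the guards concrete, here is a minimal toy sketch. It is not code from this PR: the program is flattened into a linear list of op names, `add_guards` and `scheduling_regions` are hypothetical helpers, and the model simply assumes that nothing may be reordered across a barrier.

```python
# Toy model only: ops are plain strings and the loop is flattened into a
# linear sequence. The helper names are hypothetical, not part of Triton.
BARRIER = "sched.barrier"


def add_guards(body):
    """Return the loop body with a guard as its first and last op."""
    return [BARRIER, *body, BARRIER]


def scheduling_regions(ops):
    """Split ops at barriers; in this model ops may only be reordered
    within a single region, never across a barrier."""
    regions, current = [], []
    for op in ops:
        if op == BARRIER:
            if current:
                regions.append(current)
            current = []
        else:
            current.append(op)
    if current:
        regions.append(current)
    return regions


prologue = ["global_load"]
body = ["local_load", "dot"]
epilogue = ["global_store"]

unguarded = prologue + body + epilogue
guarded = prologue + add_guards(body) + epilogue

print(scheduling_regions(unguarded))  # one region spanning everything
print(scheduling_regions(guarded))    # prologue, body, epilogue kept apart
```

Without guards, the toy scheduler sees one region and is free to mix prologue and epilogue ops with the loop body. With guards, the body forms its own region.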

- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these rules.
- [x] I have run pre-commit run --from-ref origin/main --to-ref HEAD.
- Select one of the following.
  - [x] I have added tests.
    - /test for lit tests
    - /unittest for C++ tests
    - /python/test for end-to-end tests
  - [ ] This PR does not need a test because
- Select one of the following.
  - [ ] I have not added any lit tests.
  - [x] The lit tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

ravil-mobile commented 1 week ago

@sjw, @antiagainst, @zhanglx13, could you please review the code?

zhanglx13 commented 6 days ago

The PR title is misleading. We don't need anything special for flash-attention like kernels. All we need is to add a new sched_variant, in this case "guard" or some better name, so that

- It can be set as a kernel arg.
- If set, we insert sched.barrier at loop boundaries if the loop contains at least one dotOp.
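The condition described above can be sketched as a small predicate. This is only an illustration under assumptions: `should_guard` is a hypothetical helper, the loop is modelled as a list of op-name strings, and "tt.dot" is used as a stand-in label for a dot op.

```python
# Illustrative sketch, not Triton code: a loop is a list of op-name strings.
def should_guard(sched_variant, loop_ops):
    """Guard a loop only when the "guard" variant is selected and the loop
    contains at least one dot op."""
    return sched_variant == "guard" and any(op == "tt.dot" for op in loop_ops)


print(should_guard("guard", ["tt.load", "tt.dot"]))  # variant set, dot present
print(should_guard("guard", ["tt.load"]))            # variant set, no dot
print(should_guard("none", ["tt.dot"]))              # variant not set
```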

ravil-mobile commented 6 days ago

> The PR title is misleading. We don't need anything special for flash-attention like kernels. All we need is to add a new sched_variant, in this case "guard" or some better name, so that
>
> - It can be set as a kernel arg.
> - If set, we insert sched.barrier at loop boundaries if the loop contains at least one dotOp.

@zhanglx13

I'd like to propose something different: a new operation in our dialect dedicated to instruction scheduling guards (triton::amdgpu::InstructionSchedGuard). We would also need to introduce a dedicated conversion pass. This would let us place guards regardless of the number of tt.DotOps in a region. Moreover, the code would become more readable because the scheduling logic and the guarding logic would be separated.

- Some instruction scheduling variants (e.g., local_prefetch) can add InstructionSchedGuard ops to a region; these are lowered to the corresponding LLVM intrinsic calls later.
- A user can set a kernel argument to guard all scf.ForOps.
- A user can request guards for a specific for-loop via a dedicated tt.range parameter.
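A rough sketch of the two-step flow proposed above, written as a toy Python model rather than an MLIR pass. Everything here is an assumption for illustration: the placeholder op name, the three boolean triggers, and the helper names are hypothetical, and the lowering target assumes the guard maps to llvm.amdgcn.sched.barrier with mask 0.

```python
# Toy model of the proposal; ops are strings and no real Triton API is used.
GUARD_OP = "amdgpu.instruction_sched_guard"  # hypothetical placeholder op
LOWERED = "llvm.amdgcn.sched.barrier(0)"     # assumed lowering target


def place_guards(body, guard_all_loops=False, loop_requests_guard=False,
                 variant_adds_guard=False):
    """Step 1: wrap the loop body in placeholder guards if any of the three
    triggers from the list above is set. The number of dot ops is ignored."""
    if guard_all_loops or loop_requests_guard or variant_adds_guard:
        return [GUARD_OP, *body, GUARD_OP]
    return list(body)


def lower_guards(body):
    """Step 2: a separate conversion step rewrites only the placeholders."""
    return [LOWERED if op == GUARD_OP else op for op in body]


body = ["local_load", "local_store"]  # no dot op in this loop
staged = place_guards(body, loop_requests_guard=True)
final = lower_guards(staged)
print(staged)
print(final)
```

Keeping placement and lowering apart means a scheduling variant, a kernel argument, or a per-loop request can all emit the same placeholder, and only one conversion step needs to know about the target intrinsic.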

triton-lang / triton

[AMD] Add instruction schedule loop boundary guard hints #5163