Open ravil-mobile opened 1 week ago
@sjw, @antiagainst, @zhanglx13, could you please review the code?
The PR title is misleading. We don't need anything special for flash-attention-like kernels. All we need is to add a new sched_variant, in this case "guard" or some better name, so that:
- It can be set as a kernel argument.
- If set, we insert a sched.barrier at the loop boundaries if the loop contains at least one dotOp.
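The rule in the two bullets above can be illustrated with a toy model. This is only a sketch in plain Python, not Triton or MLIR code: `Op`, `ForLoop`, and `insert_guards` are hypothetical names invented for illustration, and the sketch assumes that "loop boundaries" means the first and last position of the loop body.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Op:
    name: str  # e.g. "tt.dot", "tt.load", "sched.barrier"


@dataclass
class ForLoop:
    body: List[Op] = field(default_factory=list)


def insert_guards(loop: ForLoop, sched_variant: str) -> None:
    """Guard the loop body when the (hypothetical) "guard" variant is set.

    A sched.barrier is placed at the start and at the end of the body,
    but only if the body holds at least one tt.dot.
    """
    if sched_variant != "guard":
        return  # variant not requested: leave the loop untouched
    if not any(op.name == "tt.dot" for op in loop.body):
        return  # no dot op in the loop: nothing to guard
    loop.body.insert(0, Op("sched.barrier"))
    loop.body.append(Op("sched.barrier"))


loop = ForLoop([Op("tt.load"), Op("tt.dot")])
insert_guards(loop, "guard")
print([op.name for op in loop.body])
```

A loop without a `tt.dot`, or a kernel compiled with any other variant value, is left unchanged in this model.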
@zhanglx13, I'd like to propose something different, i.e., a new instruction in our dialect which is dedicated to instruction scheduling guards (`triton::amdgpu::InstructionSchedGuard`). We would need to introduce a dedicated conversion pass as well. This allows us to place guards independently of the number of `tt.DotOps` in a region. Moreover, the code will become more readable because of the separation of concerns between scheduling and guarding logic.
- [...] `local_prefetch`) can add `InstructionSchedGuard` to a region, which is going to be lowered to the corresponding LLVM intrinsic calls later
- [...] `scf.ForOps`
- [...] `for-loop` via a dedicated `tt.range` parameter
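The proposed separation between placing guards and lowering them can be sketched with a toy model. This is plain Python over lists of op names, not MLIR: `add_guards`, `lower_guards`, and the `GUARD` marker string are hypothetical stand-ins, and the sketch assumes the guard is eventually lowered to the `llvm.amdgcn.sched.barrier` intrinsic.

```python
from typing import List

# Stand-in name for the proposed triton::amdgpu::InstructionSchedGuard op.
GUARD = "amdgpu.instruction_sched_guard"
# Assumed lowering target for the guard.
BARRIER = "llvm.amdgcn.sched.barrier"


def add_guards(region: List[str]) -> List[str]:
    """Scheduling side: wrap a region in guard markers.

    Note that nothing here depends on whether the region holds a tt.dot.
    """
    return [GUARD, *region, GUARD]


def lower_guards(region: List[str]) -> List[str]:
    """Conversion side: rewrite each guard marker into the intrinsic call."""
    return [BARRIER if op == GUARD else op for op in region]


region = add_guards(["tt.load", "tt.store"])
print(lower_guards(region))
```

Because the marker is a separate op, any code that decides where guards go only calls `add_guards`, while the knowledge of the target intrinsic stays inside `lower_guards`.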
Extended AMDGPU instruction scheduling.
The introduced source code changes add `sched.barriers` at the beginning and at the end of each `scf.For` op (called `guards`). The guards prevent instructions from being moved from the basic blocks adjacent to the bodies of `for-loops`. According to test results, this improves performance for FA-like kernels due to a reduction in VGPR spilling.
[x] I am not making a trivial change, such as fixing a typo in a comment.
[x] I have written a PR description following these rules.
[x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
Select one of the following.
- `/test` for `lit` tests
- `/unittest` for C++ tests
- `/python/test` for end-to-end tests
Select one of the following.
[...] `lit` tests.
The `lit` tests I have added follow these best practices, including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)