Changes to SimplifyCFG to more aggressively eliminate paths with UB have resulted in a regression in the masked load path. When we have a masked load where other is 0 (the default) and the result of that load is used as the divisor of a division operation, the division op is marked UB on the masked path and deleted. This results in Triton always evaluating the unmasked load path, even if the loads were supposed to be masked. From discussions with the LLVM folks at Intel, it sounds like there won't be much community appetite to roll back this change (see https://llvm.org/docs/LangRef.html#undefined-values for more details), so we likely need to address this in Triton. AMD and NVIDIA are not affected because they use predicated load instructions/assembly that LLVM cannot see through, whereas we have to use control flow in LLVM, which of course the optimizer understands.
Here's a simple reproducer. This code simulates a masked load and store using the func argument %i. Previously the guarded (masked) path survived optimization, but after the SimplifyCFG changes the guard is folded away and the load is evaluated unconditionally; this is b/c the %i false path caused UB with the division in %z.
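The original IR listings aren't included above, so here is a minimal sketch of the shape of IR involved. All names are made up for illustration except %i, %z, and the if.end block, which come from the discussion above: the false edge feeds the phi with other = 0, so the sdiv in if.end divides by zero whenever %i is false.

```llvm
; Minimal sketch, not the original reproducer: a "masked" load guarded by %i
; whose result is the divisor of a division. With other = 0, the false path
; makes the sdiv divide by zero, which the new SimplifyCFG treats as UB.
define void @masked_div(i1 %i, ptr %src, ptr %dst) {
entry:
  br i1 %i, label %if.then, label %if.end

if.then:                                  ; mask is true: do the real load
  %v = load i32, ptr %src, align 4
  br label %if.end

if.end:                                   ; merge: false edge contributes other = 0
  %d = phi i32 [ %v, %if.then ], [ 0, %entry ]
  %z = sdiv i32 42, %d                    ; divisor is 0 whenever %i is false
  br i1 %i, label %store.then, label %exit

store.then:                               ; masked store of the result
  store i32 %z, ptr %dst, align 4
  br label %exit

exit:
  ret void
}
```

Because executing sdiv with a zero divisor is UB, the optimizer may conclude that %i can never be false at if.end and fold the branch, which leaves the load (and division) running unconditionally; that is the regression described above.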
Going forward we have a few options:

1. Modify the LLVM pass pipeline. The UB result of the false path is never used, and in fact in our desired result the false path ends up bypassing instructions in the if.end block entirely. If we can clean up the false path early in the pipeline, then the new behavior in SimplifyCFG will not affect us. We could introduce a new pass that looks for this pattern and adjusts it to remove UB, or we could create a mini pipeline before O3. But this will change the code shape, which could make optimizations later in the pass pipeline fail and degrade performance. (A sketch of the desired IR shape is shown after this list.)
2. Introduce a new load instruction that hides the masking. I'm not sure how this would work - maybe a GenISA intrinsic that we rewrite at the very end? This may cause problems later in the pipeline. Note that I am working with IGC on predicated load instruction support, which would solve this problem by making our backend the same as NVIDIA / AMD. (A hypothetical sketch is shown after this list.)
3. Try to get upstream LLVM to back out this portion of the change to SimplifyCFG. Because it happens fairly early in the O3 pipeline it is a very aggressive optimization, but Triton's use of UB here runs counter to LLVM's language manual.
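For option 1, this is a rough sketch (same made-up names as above, not actual pass output) of the IR shape a cleanup before O3 could aim for: the false path of %i bypasses the if.end block entirely, so the masked path never reaches a division with a known-zero divisor and there is no UB for SimplifyCFG to exploit.

```llvm
; Hypothetical "desired result" for option 1: the false path of %i skips the
; load, the division, and the store entirely, so no path divides by the
; constant 0 that the masked load would otherwise contribute.
define void @masked_div_cleaned(i1 %i, ptr %src, ptr %dst) {
entry:
  br i1 %i, label %if.then, label %exit

if.then:
  %v = load i32, ptr %src, align 4
  %z = sdiv i32 42, %v
  store i32 %z, ptr %dst, align 4
  br label %exit

exit:
  ret void
}
```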
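For option 2, here is a purely hypothetical illustration of what hiding the masking behind an opaque call could look like. The intrinsic names below are invented for this sketch; the real design (a GenISA intrinsic or IGC predicated loads) is still open.

```llvm
; Hypothetical sketch only: these are NOT real intrinsics. The idea is that
; an opaque call hides the relationship between %i and the loaded value, so
; the optimizer cannot prove the divisor is 0 on the masked path.
declare i32  @triton.hypothetical.masked.load.i32(ptr, i1, i32)
declare void @triton.hypothetical.masked.store.i32(ptr, i1, i32)

define void @masked_div_intrinsic(i1 %i, ptr %src, ptr %dst) {
entry:
  ; other = 0 is passed explicitly but is invisible to SimplifyCFG.
  %d = call i32 @triton.hypothetical.masked.load.i32(ptr %src, i1 %i, i32 0)
  %z = sdiv i32 42, %d
  call void @triton.hypothetical.masked.store.i32(ptr %dst, i1 %i, i32 %z)
  ret void
}
```

These calls would then be rewritten to real masked/predicated operations at the very end of the pipeline, as suggested above.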
I am planning to work on option 1 for a bit while the IGC team works on option 2. If those fail, we can consider option 3.
For a Triton reproducer for this issue, see https://github.com/intel/intel-xpu-backend-for-triton/commit/a23664ab54c8a50963d9944c4d9914a1b1bff5a1. You may need to adjust the number of runs, as the test still only fails intermittently.