intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs

Changes in LLVM SimplifyCFG result in the elimination of mask load paths when the mask is the divisor in a division instruction #2726

Open alexbaden opened 6 days ago

alexbaden commented 6 days ago

Changes to SimplifyCFG to more aggressively eliminate paths containing UB have caused a regression in the masked load path. When we have a masked load whose other value is 0 (the default) and the result of that load is used as the divisor of a division, the division is UB on the masked-off path, so the optimizer deletes that path. The result is that Triton always evaluates the unmasked load path, even when the loads were supposed to be masked.

Discussing this with the LLVM folks at Intel, it sounds like there won't be much community appetite for rolling back the change (see https://llvm.org/docs/LangRef.html#undefined-values for details), so we likely need to address it in Triton. AMD and NVIDIA are not affected because they use predicated load instructions/assembly that LLVM cannot see through, whereas we have to express masking as control flow in LLVM IR, which the optimizer of course understands.

Here's a simple reproducer. This code simulates a masked load and store using the function argument %i.

define void @phi_div_of_zero_okay(i8 noundef %x, i8 %i, ptr %v) {
; CHECK-LABEL: @phi_div_of_zero_okay(
entry:
  %cmp = icmp ult i8 %i, 9
  br i1 %cmp, label %if.then, label %if.end

if.then:                                          ; "masked" load: runs only when %i < 9
  %y = load i8, ptr %v, align 8
  br label %if.end

if.end:
  %yy = phi i8 [ %y, %if.then ], [ 0, %entry ]    ; other = 0 on the masked-off path
  %z = sdiv i8 %x, %yy                            ; UB when %yy == 0
  br i1 %cmp, label %if2.then, label %if2.end

if2.then:                                         ; "masked" store under the same mask
  store i8 %z, ptr %v, align 8
  br label %if2.end

if2.end:
  ret void
}

Previously, this was optimized to:

define void @phi_div_of_zero_okay(i8 noundef %x, i8 %i, ptr %v) {
entry:
  %cmp = icmp ult i8 %i, 9
  br i1 %cmp, label %if2.then, label %if.end

if.end:                                           ; preds = %entry
  br label %if2.end

if2.then:                                         ; preds = %entry
  %y = load i8, ptr %v, align 8
  %z2 = sdiv i8 %x, %y
  store i8 %z2, ptr %v, align 8
  br label %if2.end

if2.end:                                          ; preds = %if.end, %if2.then
  ret void
}

After the SimplifyCFG changes, we instead get:

define void @phi_div_of_zero_okay(i8 noundef %x, i8 %i, ptr %v) {
entry:
  %cmp = icmp ult i8 %i, 9
  call void @llvm.assume(i1 %cmp)
  %y = load i8, ptr %v, align 8
  %z = sdiv i8 %x, %y
  br i1 %cmp, label %if2.then, label %if2.end

if2.then:                                         ; preds = %entry
  store i8 %z, ptr %v, align 8
  br label %if2.end

if2.end:                                          ; preds = %if2.then, %entry
  ret void
}

This happens because %yy is the constant 0 on the %cmp-false path, making the division in %z immediate UB there; SimplifyCFG therefore assumes that path is never taken (hence the llvm.assume) and hoists the load and division out of the conditional.

Going forward, we have a few options:

  1. Modify the LLVM pass pipeline. The UB result on the false path is never used, and in fact in our desired output the false path bypasses the instructions in the if.end block entirely. If we can clean up the false path early in the pipeline, the new SimplifyCFG behavior will not affect us. We could introduce a new pass that looks for this pattern and rewrites it to remove the UB (see the sketch after this list), or we could build a mini pipeline that runs before O3. The risk is that changing the code shape could make later optimizations in the pass pipeline fail and degrade performance.
  2. Introduce a new load instruction that hides the masking. I'm not sure exactly how this would work; maybe a GenISA intrinsic that we rewrite at the very end? (An analogy sketch using LLVM's generic masked-load intrinsic appears further below.) This may cause problems later in the pipeline. Note that I am working with IGC on predicated load instruction support, which would solve this problem by putting our backend in the same position as NVIDIA / AMD.
  3. Try to get upstream LLVM to back out this portion of the SimplifyCFG change. It is a very aggressive optimization given that it runs fairly early in the O3 pipeline, but Triton's use of UB here is counter to LLVM's language reference.
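To make option 1 concrete, here is one shape such a cleanup could take. This is only a minimal sketch, not a settled design: it assumes we rewrite the divisor with a select so the masked-off path is well-defined. Since %z is only stored under %cmp, a dummy divisor of 1 on the false edge is never observed, but it removes the divide-by-zero that gives SimplifyCFG license to fold the paths.

; Hypothetical pre-O3 rewrite (sketch only): make the masked-off path UB-free.
define void @phi_div_of_zero_okay(i8 noundef %x, i8 %i, ptr %v) {
entry:
  %cmp = icmp ult i8 %i, 9
  br i1 %cmp, label %if.then, label %if.end

if.then:
  %y = load i8, ptr %v, align 8
  br label %if.end

if.end:
  %yy = phi i8 [ %y, %if.then ], [ 0, %entry ]
  %yy.safe = select i1 %cmp, i8 %yy, i8 1         ; nonzero on the unused path
  %z = sdiv i8 %x, %yy.safe                       ; no divide-by-zero on either path
  br i1 %cmp, label %if2.then, label %if2.end

if2.then:
  store i8 %z, ptr %v, align 8
  br label %if2.end

if2.end:
  ret void
}

With the UB gone, SimplifyCFG has no basis for assuming %cmp is true, so the load stays behind the branch; whether later passes can still sink it into the store block as in the old output is something we would need to verify.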

I am planning to work on option 1 for a bit while the IGC team works on option 2. If those fail, we can consider option 3.
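For a rough picture of what option 2 could look like, here is an analogy using LLVM's generic @llvm.masked.load intrinsic; the actual mechanism would presumably be a GenISA intrinsic that IGC lowers late, and the function name below is hypothetical. The point is that the mask and the other value become operands of a single call, so there is no control flow or phi for SimplifyCFG to reason about.

; Analogy only: a scalar masked load expressed with LLVM's generic
; masked-load intrinsic. A GenISA equivalent would be opaque to
; mid-level passes and lowered by IGC.
declare <1 x i8> @llvm.masked.load.v1i8.p0(ptr, i32, <1 x i1>, <1 x i8>)

define i8 @masked_load_sketch(i8 %i, ptr %v) {
  %cmp = icmp ult i8 %i, 9
  %mask = insertelement <1 x i1> poison, i1 %cmp, i64 0
  ; other = 0 is the passthru operand; no branch, no phi
  %y.v = call <1 x i8> @llvm.masked.load.v1i8.p0(ptr %v, i32 8, <1 x i1> %mask, <1 x i8> zeroinitializer)
  %y = extractelement <1 x i8> %y.v, i64 0
  ret i8 %y
}

One caveat: generic passes do understand @llvm.masked.load, so a truly opaque target intrinsic is what would actually match the NVIDIA / AMD situation described above.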

For a Triton reproducer of this issue, see https://github.com/intel/intel-xpu-backend-for-triton/commit/a23664ab54c8a50963d9944c4d9914a1b1bff5a1. You may need to adjust the number of runs, as the test still only fails intermittently.