LoopLoadElim: calling LoopVersioning with single-iteration loop #96656

Closed: artagnon closed this 1 month ago

artagnon commented 4 months ago

The following example:

define void @lver.check.unnecessary(ptr %arg, ptr %arg1, i1 %arg2) {
entry:
  %load = load i32, ptr %arg, align 4
  br i1 %arg2, label %noloop.exit, label %loop.ph

loop.ph:                                          ; preds = %entry
  %sext7 = sext i32 %load to i64
  %gep8 = getelementptr i8, ptr %arg1, i64 8
  br label %loop.body

loop.body:                                        ; preds = %loop.body, %loop.ph
  %phi = phi i64 [ 0, %loop.ph ], [ %add, %loop.body ]
  %mul = mul i64 %phi, %sext7
  %gep10 = getelementptr double, ptr %gep8, i64 %mul
  %load11 = load double, ptr %gep10, align 8
  store double %load11, ptr %arg1, align 8
  %add = add i64 %phi, 1
  %icmp = icmp eq i64 %phi, 0
  br i1 %icmp, label %loop.exit, label %loop.body

noloop.exit:                                      ; preds = %entry
  %sext = sext i32 %load to i64
  %gep = getelementptr double, ptr %arg1, i64 %sext
  %load5 = load double, ptr %gep, align 8
  store double %load5, ptr %arg, align 8
  ret void

loop.exit:                                        ; preds = %loop.body
  ret void
}

is left unchanged by loop-versioning before #92119. Note that %loop.body executes exactly once: %phi is 0 on the first iteration, so %icmp is true and the loop exits immediately. After that patch, however, the following diff is observed:

diff --git a/lver.before.ll b/lver.main.ll
index 9dce17d..fc81e12 100644
--- a/lver.before.ll
+++ b/lver.main.ll
@@ -4,22 +4,39 @@ source_filename = "hand-reduce.ll"
 define void @lver.check.unnecessary(ptr %arg, ptr %arg1, i1 %arg2) {
 entry:
   %load = load i32, ptr %arg, align 4
-  br i1 %arg2, label %noloop.exit, label %loop.ph
+  br i1 %arg2, label %noloop.exit, label %loop.body.lver.check

-loop.ph:                                          ; preds = %entry
+loop.body.lver.check:                             ; preds = %entry
   %sext7 = sext i32 %load to i64
   %gep8 = getelementptr i8, ptr %arg1, i64 8
+  %ident.check = icmp ne i32 %load, 1
+  br i1 %ident.check, label %loop.body.ph.lver.orig, label %loop.body.ph
+
+loop.body.ph.lver.orig:                           ; preds = %loop.body.lver.check
+  br label %loop.body.lver.orig
+
+loop.body.lver.orig:                              ; preds = %loop.body.lver.orig, %loop.body.ph.lver.orig
+  %phi.lver.orig = phi i64 [ 0, %loop.body.ph.lver.orig ], [ %add.lver.orig, %loop.body.lver.orig ]
+  %mul.lver.orig = mul i64 %phi.lver.orig, %sext7
+  %gep10.lver.orig = getelementptr double, ptr %gep8, i64 %mul.lver.orig
+  %load11.lver.orig = load double, ptr %gep10.lver.orig, align 8
+  store double %load11.lver.orig, ptr %arg1, align 8
+  %add.lver.orig = add i64 %phi.lver.orig, 1
+  %icmp.lver.orig = icmp eq i64 %phi.lver.orig, 0
+  br i1 %icmp.lver.orig, label %loop.exit.loopexit, label %loop.body.lver.orig
+
+loop.body.ph:                                     ; preds = %loop.body.lver.check
   br label %loop.body

-loop.body:                                        ; preds = %loop.body, %loop.ph
-  %phi = phi i64 [ 0, %loop.ph ], [ %add, %loop.body ]
+loop.body:                                        ; preds = %loop.body, %loop.body.ph
+  %phi = phi i64 [ 0, %loop.body.ph ], [ %add, %loop.body ]
   %mul = mul i64 %phi, %sext7
   %gep10 = getelementptr double, ptr %gep8, i64 %mul
   %load11 = load double, ptr %gep10, align 8
   store double %load11, ptr %arg1, align 8
   %add = add i64 %phi, 1
   %icmp = icmp eq i64 %phi, 0
-  br i1 %icmp, label %loop.exit, label %loop.body
+  br i1 %icmp, label %loop.exit.loopexit1, label %loop.body

 noloop.exit:                                      ; preds = %entry
   %sext = sext i32 %load to i64
@@ -28,6 +45,12 @@ noloop.exit:                                      ; preds = %entry
   store double %load5, ptr %arg, align 8
   ret void

-loop.exit:                                        ; preds = %loop.body
+loop.exit.loopexit:                               ; preds = %loop.body.lver.orig
+  br label %loop.exit
+
+loop.exit.loopexit1:                              ; preds = %loop.body
+  br label %loop.exit
+
+loop.exit:                                        ; preds = %loop.exit.loopexit1, %loop.exit.loopexit
   ret void
 }

This is a regression: since the loop body executes exactly once, the runtime %ident.check and the duplicated .lver.orig loop are pure overhead.

The underlying issue is in LoopAccessAnalysis, which produces a spurious Equal predicate. The diff in LAA's output on the example, before and after the patch, is:

diff --git a/laa.before b/laa.main
index 17b0a1b..1b3b1b0 100644
--- a/laa.before
+++ b/laa.main
@@ -7,5 +7,9 @@ Printing analysis 'Loop Access Analysis' for function 'lver.check.unnecessary':

     Non vectorizable stores to invariant address were not found in loop.
     SCEV assumptions:
+    Equal predicate: %load == 1

     Expressions re-written:
+    [PSE]  %gep10 = getelementptr double, ptr %gep8, i64 %mul:
+      {(8 + %arg1),+,(8 * (sext i32 %load to i64))<nsw>}<%loop.body>
+      --> {(8 + %arg1),+,8}<%loop.body>
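
To see where the rewritten expression comes from: %gep10 advances by 8 * sext(%load) bytes per iteration, giving the add-recurrence {(8 + %arg1),+,(8 * (sext i32 %load to i64))}<%loop.body>. Under the speculated predicate %load == 1, the step collapses to 8 * sext(1) = 8, which yields the unit-stride recurrence {(8 + %arg1),+,8}<%loop.body> shown above.
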
efriedma-quic commented 4 months ago

What makes it "unnecessary"? You have a strided load, and if the stride is 1 you can vectorize the load. I mean, in this particular case it's not really profitable because the loop can't be vectorized for other reasons, but the analysis of the load itself seems correct.

artagnon commented 4 months ago

What makes it "unnecessary"?

The fact that the loop is executed at least once. I'm not 100% sure, but I think LAA is only supposed to speculate on the stride when the trip count is unknown and may be 0, in order to version the loop and insert a Stride == 1 predicate, so that one version of the loop executes under that predicate. See also #96927.
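
For illustration, here is a minimal sketch of the kind of guard a LoopVersioning caller could apply, assuming the caller has ScalarEvolution at hand. The function name and placement are hypothetical, though the ScalarEvolution calls themselves are real API:

#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"

// Hypothetical helper: decide whether stride speculation can pay off.
static bool worthVersioning(llvm::Loop *L, llvm::ScalarEvolution &SE) {
  // If the backedge-taken count is a known constant zero, the loop body
  // runs exactly once, so a speculative Stride == 1 version cannot pay
  // for its runtime check.
  const llvm::SCEV *BTC = SE.getBackedgeTakenCount(L);
  if (auto *C = llvm::dyn_cast<llvm::SCEVConstant>(BTC))
    return !C->getAPInt().isZero();
  return true; // unknown trip count: speculation may still be profitable
}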

fhahn commented 4 months ago

This is from an end-to-end run, right? If so, who is calling LoopVersioning on the loop with a single iteration? I agree that versioning loops with a single iteration likely isn't profitable.

artagnon commented 4 months ago

Yes, it is from an end-to-end run. I'll audit the callers to fix the issue.

artagnon commented 1 month ago

After much confusion (due to blind reliance on llvm-reduce), we now have a resolution. Yes, LoopVersioning versions single-iteration loops, but LoopLoadElimination never calls LoopVersioning with a single-iteration loop: it bails out in findStoreToLoadDependences before versioning the loop. This regression is really a consequence of LLE becoming more powerful (only a small regression is seen in the end-to-end compile), since LAA now correctly turns certain dependences from Unknown into known ones.
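
Schematically, the ordering that provided this guarantee looks like the following (simplified pseudocode; apart from findStoreToLoadDependences, the helper names are illustrative, not the actual functions in LoopLoadElimination.cpp):

bool processLoop(Loop *L) {
  // Collect store->load forwarding candidates from LAA's dependence list;
  // Unknown dependences contribute no candidates.
  auto Candidates = findStoreToLoadDependences(L);
  if (Candidates.empty())
    return false; // pre-#92119 outcome on this input: bail before versioning

  // Reached only with a known dependence. If SCEV predicates or runtime
  // checks are required, LoopVersioning runs here, and the stored value is
  // forwarded to the load in the versioned loop.
  versionLoopAndForward(L, Candidates); // illustrative name
  return true;
}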

The actual reproducer for this regression is:

define void @unknown_stride_known_dependence(ptr %x, ptr %y, i1 %cond) {
; CHECK-LABEL: define void @unknown_stride_known_dependence(
; CHECK-SAME: ptr [[X:%.*]], ptr [[Y:%.*]], i1 [[COND:%.*]]) {
; CHECK-NEXT:  [[ENTRY:.*:]]
; CHECK-NEXT:    [[LOAD:%.*]] = load i32, ptr [[X]], align 4
; CHECK-NEXT:    br i1 [[COND]], label %[[NOLOOP_EXIT:.*]], label %[[LOOP_LVER_CHECK:.*]]
; CHECK:       [[LOOP_LVER_CHECK]]:
; CHECK-NEXT:    [[SEXT_X:%.*]] = sext i32 [[LOAD]] to i64
; CHECK-NEXT:    [[GEP_8:%.*]] = getelementptr i8, ptr [[Y]], i64 8
; CHECK-NEXT:    [[GEP_16:%.*]] = getelementptr i8, ptr [[Y]], i64 16
; CHECK-NEXT:    [[IDENT_CHECK:%.*]] = icmp ne i32 [[LOAD]], 1
; CHECK-NEXT:    br i1 [[IDENT_CHECK]], label %[[LOOP_PH_LVER_ORIG:.*]], label %[[LOOP_PH:.*]]
; CHECK:       [[LOOP_PH_LVER_ORIG]]:
; CHECK-NEXT:    br label %[[LOOP_LVER_ORIG:.*]]
; CHECK:       [[LOOP_LVER_ORIG]]:
; CHECK-NEXT:    [[IV_LVER_ORIG:%.*]] = phi i64 [ 0, %[[LOOP_PH_LVER_ORIG]] ], [ [[IV_NEXT_LVER_ORIG:%.*]], %[[LOOP_LVER_ORIG]] ]
; CHECK-NEXT:    [[MUL_LVER_ORIG:%.*]] = mul i64 [[IV_LVER_ORIG]], [[SEXT_X]]
; CHECK-NEXT:    [[GEP_8_MUL_LVER_ORIG:%.*]] = getelementptr double, ptr [[GEP_8]], i64 [[MUL_LVER_ORIG]]
; CHECK-NEXT:    [[LOAD_8_LVER_ORIG:%.*]] = load double, ptr [[GEP_8_MUL_LVER_ORIG]], align 8
; CHECK-NEXT:    [[GEP_16_MUL_LVER_ORIG:%.*]] = getelementptr double, ptr [[GEP_16]], i64 [[MUL_LVER_ORIG]]
; CHECK-NEXT:    store double [[LOAD_8_LVER_ORIG]], ptr [[GEP_16_MUL_LVER_ORIG]], align 8
; CHECK-NEXT:    [[IV_NEXT_LVER_ORIG]] = add i64 [[IV_LVER_ORIG]], 1
; CHECK-NEXT:    [[ICMP_LVER_ORIG:%.*]] = icmp eq i64 [[IV_LVER_ORIG]], 1
; CHECK-NEXT:    br i1 [[ICMP_LVER_ORIG]], label %[[EXIT_LOOPEXIT_LOOPEXIT:.*]], label %[[LOOP_LVER_ORIG]]
; CHECK:       [[LOOP_PH]]:
; CHECK-NEXT:    [[LOAD_INITIAL:%.*]] = load double, ptr [[GEP_8]], align 8
; CHECK-NEXT:    br label %[[LOOP:.*]]
; CHECK:       [[LOOP]]:
; CHECK-NEXT:    [[STORE_FORWARDED:%.*]] = phi double [ [[LOAD_INITIAL]], %[[LOOP_PH]] ], [ [[STORE_FORWARDED]], %[[LOOP]] ]
; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[LOOP_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
; CHECK-NEXT:    [[MUL:%.*]] = mul i64 [[IV]], [[SEXT_X]]
; CHECK-NEXT:    [[GEP_8_MUL:%.*]] = getelementptr double, ptr [[GEP_8]], i64 [[MUL]]
; CHECK-NEXT:    [[LOAD_8:%.*]] = load double, ptr [[GEP_8_MUL]], align 8
; CHECK-NEXT:    [[GEP_16_MUL:%.*]] = getelementptr double, ptr [[GEP_16]], i64 [[MUL]]
; CHECK-NEXT:    store double [[STORE_FORWARDED]], ptr [[GEP_16_MUL]], align 8
; CHECK-NEXT:    [[IV_NEXT]] = add i64 [[IV]], 1
; CHECK-NEXT:    [[ICMP:%.*]] = icmp eq i64 [[IV]], 1
; CHECK-NEXT:    br i1 [[ICMP]], label %[[EXIT_LOOPEXIT_LOOPEXIT1:.*]], label %[[LOOP]]
; CHECK:       [[NOLOOP_EXIT]]:
; CHECK-NEXT:    [[SEXT:%.*]] = sext i32 [[LOAD]] to i64
; CHECK-NEXT:    [[GEP_Y:%.*]] = getelementptr double, ptr [[Y]], i64 [[SEXT]]
; CHECK-NEXT:    [[LOAD_Y:%.*]] = load double, ptr [[GEP_Y]], align 8
; CHECK-NEXT:    store double [[LOAD_Y]], ptr [[X]], align 8
; CHECK-NEXT:    br label %[[EXIT:.*]]
; CHECK:       [[EXIT_LOOPEXIT_LOOPEXIT]]:
; CHECK-NEXT:    br label %[[EXIT_LOOPEXIT:.*]]
; CHECK:       [[EXIT_LOOPEXIT_LOOPEXIT1]]:
; CHECK-NEXT:    br label %[[EXIT_LOOPEXIT]]
; CHECK:       [[EXIT_LOOPEXIT]]:
; CHECK-NEXT:    br label %[[EXIT]]
; CHECK:       [[EXIT]]:
; CHECK-NEXT:    ret void
;
entry:
  %load = load i32, ptr %x, align 4
  br i1 %cond, label %noloop.exit, label %loop.ph

loop.ph:                                              ; preds = %entry
  %sext.x = sext i32 %load to i64
  %gep.8 = getelementptr i8, ptr %y, i64 8
  %gep.16 = getelementptr i8, ptr %y, i64 16
  br label %loop

loop:                                                 ; preds = %loop, %loop.ph
  %iv = phi i64 [ 0, %loop.ph ], [ %iv.next, %loop ]
  %mul = mul i64 %iv, %sext.x
  %gep.8.mul = getelementptr double, ptr %gep.8, i64 %mul
  %load.8 = load double, ptr %gep.8.mul, align 8
  %gep.16.mul = getelementptr double, ptr %gep.16, i64 %mul
  store double %load.8, ptr %gep.16.mul
  %iv.next = add i64 %iv, 1
  %icmp = icmp eq i64 %iv, 1
  br i1 %icmp, label %exit, label %loop

noloop.exit:                                          ; preds = %entry
  %sext = sext i32 %load to i64
  %gep.y = getelementptr double, ptr %y, i64 %sext
  %load.y = load double, ptr %gep.y
  store double %load.y, ptr %x
  br label %exit

exit:                                                 ; preds = %noloop.exit, %loop
  ret void
}

This is actually the case of BTC (backedge-taken count) = 1, where the loop needs to be versioned for correctness. Previously, LLE bailed out in findStoreToLoadDependences before versioning the loop, as the dependences were unknown. Now there is a valid forwarding candidate, and LLE is indeed more powerful.
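
To see why versioning is needed for correctness, trace the two iterations (BTC = 1) under the speculated stride %sext.x == 1:

iteration 0: load from %y + 8,  store to %y + 16
iteration 1: load from %y + 16, store to %y + 24

Iteration 1's load reads exactly what iteration 0 stored: a backward dependence at a distance of one iteration, which is what makes store-to-load forwarding applicable, and only valid under the %load == 1 predicate.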

Previously, LAA returned the following:

Printing analysis 'Loop Access Analysis' for function 'dgtsv_':
  :
    Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
Unknown data dependence.
    Dependences:
      Unknown:
          %16 = load double, ptr %15, align 8 ->
          store double %16, ptr %17, align 8

    Run-time memory checks:
    Grouped accesses:

    Non vectorizable stores to invariant address were not found in loop.
    SCEV assumptions:

    Expressions re-written:

Now, LAA returns:

Printing analysis 'Loop Access Analysis' for function 'dgtsv_':
  :
    Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
Backward loop carried data dependence.
    Dependences:
      Backward:
          %16 = load double, ptr %15, align 8 ->
          store double %16, ptr %17, align 8

    Run-time memory checks:
    Grouped accesses:

    Non vectorizable stores to invariant address were not found in loop.
    SCEV assumptions:
    Equal predicate: %4 == 1

    Expressions re-written:
    [PSE]  %15 = getelementptr double, ptr %10, i64 %14:
      {(8 + %1),+,(8 * (sext i32 %4 to i64))<nsw>}<%12>
      --> {(8 + %1),+,8}<%12>
    [PSE]  %17 = getelementptr double, ptr %11, i64 %14:
      {(16 + %1),+,(8 * (sext i32 %4 to i64))<nsw>}<%12>
      --> {(16 + %1),+,8}<%12>

Yes, there is an Equal predicate, which causes LoopVersioning to version the loop, but the output is correct, and callers of LAA are more powerful now. Hence, I would classify this bug as invalid.