artagnon closed this issue 1 month ago
What makes it "unnecessary"? You have a strided load, and if the stride is 1 you can vectorize the load. I mean, in this particular case it's not really profitable because the loop can't be vectorized for other reasons, but the analysis of the load itself seems correct.
What makes it "unnecessary"?
The fact that the loop is executed at least once. I'm not 100% sure, but I think LAA is only supposed to speculate on the stride when the trip count is unknown and may be 0: the loop is then versioned with a Stride == 1 predicate, so that one version of the loop executes under that predicate. See also #96927.
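The stride-speculation idea can be sketched in C (the function and variable names here are hypothetical, purely for illustration):

```c
#include <assert.h>

/* A loop with a runtime stride. When LAA cannot prove the stride, it may
 * speculate Stride == 1: LoopVersioning then emits a runtime check and two
 * copies of the loop, and the unit-stride copy can be vectorized. */
void copy_strided(double *dst, const double *src, long n, long stride) {
  if (stride == 1) {
    /* Versioned copy: the predicate guarantees consecutive accesses. */
    for (long i = 0; i < n; i++)
      dst[i] = src[i];
  } else {
    /* Fallback: the original strided loop. */
    for (long i = 0; i < n; i++)
      dst[i * stride] = src[i * stride];
  }
}
```

The question above is essentially whether this versioning is worth doing when the trip count is already known to be non-zero (and tiny), since the runtime check and the duplicated loop body are unlikely to pay off.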
This is from an end-to-end run, right? If so, who is calling LoopVersioning on the loop with a single iteration? I agree that versioning for loops with a single iteration likely isn't profitable.
Yes, it is from an end-to-end run. I'll audit the callers to fix the issue.
After much confusion (due to blind reliance on llvm-reduce), we now have a resolution. Yes, LoopVersioning versions single-iteration loops, but LoopLoadElimination never calls LoopVersioning on single-iteration loops, since it bails out before versioning the loop: the function responsible for this is findStoreToLoadDependences. This regression is actually about LLE becoming more powerful (only a small regression is seen in the end-to-end compile), as LAA correctly turns certain dependences from Unknown to known ones.
The actual reproducer for this regression is:
define void @unknown_stride_known_dependence(ptr %x, ptr %y, i1 %cond) {
; CHECK-LABEL: define void @unknown_stride_known_dependence(
; CHECK-SAME: ptr [[X:%.*]], ptr [[Y:%.*]], i1 [[COND:%.*]]) {
; CHECK-NEXT: [[ENTRY:.*:]]
; CHECK-NEXT: [[LOAD:%.*]] = load i32, ptr [[X]], align 4
; CHECK-NEXT: br i1 [[COND]], label %[[NOLOOP_EXIT:.*]], label %[[LOOP_LVER_CHECK:.*]]
; CHECK: [[LOOP_LVER_CHECK]]:
; CHECK-NEXT: [[SEXT_X:%.*]] = sext i32 [[LOAD]] to i64
; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr i8, ptr [[Y]], i64 8
; CHECK-NEXT: [[GEP_16:%.*]] = getelementptr i8, ptr [[Y]], i64 16
; CHECK-NEXT: [[IDENT_CHECK:%.*]] = icmp ne i32 [[LOAD]], 1
; CHECK-NEXT: br i1 [[IDENT_CHECK]], label %[[LOOP_PH_LVER_ORIG:.*]], label %[[LOOP_PH:.*]]
; CHECK: [[LOOP_PH_LVER_ORIG]]:
; CHECK-NEXT: br label %[[LOOP_LVER_ORIG:.*]]
; CHECK: [[LOOP_LVER_ORIG]]:
; CHECK-NEXT: [[IV_LVER_ORIG:%.*]] = phi i64 [ 0, %[[LOOP_PH_LVER_ORIG]] ], [ [[IV_NEXT_LVER_ORIG:%.*]], %[[LOOP_LVER_ORIG]] ]
; CHECK-NEXT: [[MUL_LVER_ORIG:%.*]] = mul i64 [[IV_LVER_ORIG]], [[SEXT_X]]
; CHECK-NEXT: [[GEP_8_MUL_LVER_ORIG:%.*]] = getelementptr double, ptr [[GEP_8]], i64 [[MUL_LVER_ORIG]]
; CHECK-NEXT: [[LOAD_8_LVER_ORIG:%.*]] = load double, ptr [[GEP_8_MUL_LVER_ORIG]], align 8
; CHECK-NEXT: [[GEP_16_MUL_LVER_ORIG:%.*]] = getelementptr double, ptr [[GEP_16]], i64 [[MUL_LVER_ORIG]]
; CHECK-NEXT: store double [[LOAD_8_LVER_ORIG]], ptr [[GEP_16_MUL_LVER_ORIG]], align 8
; CHECK-NEXT: [[IV_NEXT_LVER_ORIG]] = add i64 [[IV_LVER_ORIG]], 1
; CHECK-NEXT: [[ICMP_LVER_ORIG:%.*]] = icmp eq i64 [[IV_LVER_ORIG]], 1
; CHECK-NEXT: br i1 [[ICMP_LVER_ORIG]], label %[[EXIT_LOOPEXIT_LOOPEXIT:.*]], label %[[LOOP_LVER_ORIG]]
; CHECK: [[LOOP_PH]]:
; CHECK-NEXT: [[LOAD_INITIAL:%.*]] = load double, ptr [[GEP_8]], align 8
; CHECK-NEXT: br label %[[LOOP:.*]]
; CHECK: [[LOOP]]:
; CHECK-NEXT: [[STORE_FORWARDED:%.*]] = phi double [ [[LOAD_INITIAL]], %[[LOOP_PH]] ], [ [[STORE_FORWARDED]], %[[LOOP]] ]
; CHECK-NEXT: [[IV:%.*]] = phi i64 [ 0, %[[LOOP_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
; CHECK-NEXT: [[MUL:%.*]] = mul i64 [[IV]], [[SEXT_X]]
; CHECK-NEXT: [[GEP_8_MUL:%.*]] = getelementptr double, ptr [[GEP_8]], i64 [[MUL]]
; CHECK-NEXT: [[LOAD_8:%.*]] = load double, ptr [[GEP_8_MUL]], align 8
; CHECK-NEXT: [[GEP_16_MUL:%.*]] = getelementptr double, ptr [[GEP_16]], i64 [[MUL]]
; CHECK-NEXT: store double [[STORE_FORWARDED]], ptr [[GEP_16_MUL]], align 8
; CHECK-NEXT: [[IV_NEXT]] = add i64 [[IV]], 1
; CHECK-NEXT: [[ICMP:%.*]] = icmp eq i64 [[IV]], 1
; CHECK-NEXT: br i1 [[ICMP]], label %[[EXIT_LOOPEXIT_LOOPEXIT1:.*]], label %[[LOOP]]
; CHECK: [[NOLOOP_EXIT]]:
; CHECK-NEXT: [[SEXT:%.*]] = sext i32 [[LOAD]] to i64
; CHECK-NEXT: [[GEP_Y:%.*]] = getelementptr double, ptr [[Y]], i64 [[SEXT]]
; CHECK-NEXT: [[LOAD_Y:%.*]] = load double, ptr [[GEP_Y]], align 8
; CHECK-NEXT: store double [[LOAD_Y]], ptr [[X]], align 8
; CHECK-NEXT: br label %[[EXIT:.*]]
; CHECK: [[EXIT_LOOPEXIT_LOOPEXIT]]:
; CHECK-NEXT: br label %[[EXIT_LOOPEXIT:.*]]
; CHECK: [[EXIT_LOOPEXIT_LOOPEXIT1]]:
; CHECK-NEXT: br label %[[EXIT_LOOPEXIT]]
; CHECK: [[EXIT_LOOPEXIT]]:
; CHECK-NEXT: br label %[[EXIT]]
; CHECK: [[EXIT]]:
; CHECK-NEXT: ret void
;
entry:
%load = load i32, ptr %x, align 4
br i1 %cond, label %noloop.exit, label %loop.ph
loop.ph: ; preds = %entry
%sext.x = sext i32 %load to i64
%gep.8 = getelementptr i8, ptr %y, i64 8
%gep.16 = getelementptr i8, ptr %y, i64 16
br label %loop
loop: ; preds = %loop, %loop.ph
%iv = phi i64 [ 0, %loop.ph ], [ %iv.next, %loop ]
%mul = mul i64 %iv, %sext.x
%gep.8.mul = getelementptr double, ptr %gep.8, i64 %mul
%load.8 = load double, ptr %gep.8.mul, align 8
%gep.16.mul = getelementptr double, ptr %gep.16, i64 %mul
store double %load.8, ptr %gep.16.mul
%iv.next = add i64 %iv, 1
%icmp = icmp eq i64 %iv, 1
br i1 %icmp, label %exit, label %loop
noloop.exit: ; preds = %loop.ph
%sext = sext i32 %load to i64
%gep.y = getelementptr double, ptr %y, i64 %sext
%load.y = load double, ptr %gep.y
store double %load.y, ptr %x
br label %exit
exit: ; preds = %noloop.exit, %loop
ret void
}
This is actually the case of a backedge-taken count (BTC) of 1, where the loop needs to be versioned for correctness. Previously, LLE bailed out in findStoreToLoadDependences before versioning the loop, as the dependences were unknown. Now, there is a valid forwarding candidate, and LLE is indeed more powerful.
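The effect of LLE on the loop above can be sketched in C (an illustrative analogue, not the actual transformed code): the store in iteration i writes the address the load reads in iteration i + 1, so the in-loop load can be replaced by a value threaded through a PHI (%store.forwarded in the CHECK lines).

```c
#include <assert.h>

/* Before LLE: each iteration's load reads the location the previous
 * iteration stored, since y[1 + (i + 1)] == y[2 + i]. */
void before_lle(double *y, long n) {
  for (long i = 0; i < n; i++)
    y[2 + i] = y[1 + i];
}

/* After LLE (conceptually): a single load in the preheader
 * (%load.initial), whose value replaces the in-loop load. Here the stored
 * value is the forwarded value itself, so it never changes across
 * iterations. */
void after_lle(double *y, long n) {
  double fwd = y[1];
  for (long i = 0; i < n; i++)
    y[2 + i] = fwd;
}
```

Both versions compute the same result; the second simply has one fewer load per iteration.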
Previously, LAA returned the following:
Printing analysis 'Loop Access Analysis' for function 'dgtsv_':
:
Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
Unknown data dependence.
Dependences:
Unknown:
%16 = load double, ptr %15, align 8 ->
store double %16, ptr %17, align 8
Run-time memory checks:
Grouped accesses:
Non vectorizable stores to invariant address were not found in loop.
SCEV assumptions:
Expressions re-written:
Now, LAA returns:
Printing analysis 'Loop Access Analysis' for function 'dgtsv_':
:
Report: unsafe dependent memory operations in loop. Use #pragma clang loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop
Backward loop carried data dependence.
Dependences:
Backward:
%16 = load double, ptr %15, align 8 ->
store double %16, ptr %17, align 8
Run-time memory checks:
Grouped accesses:
Non vectorizable stores to invariant address were not found in loop.
SCEV assumptions:
Equal predicate: %4 == 1
Expressions re-written:
[PSE] %15 = getelementptr double, ptr %10, i64 %14:
{(8 + %1),+,(8 * (sext i32 %4 to i64))<nsw>}<%12>
--> {(8 + %1),+,8}<%12>
[PSE] %17 = getelementptr double, ptr %11, i64 %14:
{(16 + %1),+,(8 * (sext i32 %4 to i64))<nsw>}<%12>
--> {(16 + %1),+,8}<%12>
Yes, there is an equal predicate, due to which LoopVersioning versions the loop, but this output is correct, and callers of LAA are now more powerful. Hence, I would classify this bug as invalid.
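The Backward dependence reported above can be checked with a small byte-address model of the two rewritten AddRecs (a sketch, assuming the Equal predicate %4 == 1 holds):

```c
#include <assert.h>

/* Under %4 == 1, the load walks {(8 + %1),+,8} and the store walks
 * {(16 + %1),+,8}. The store in iteration i writes the byte the load reads
 * in iteration i + 1: a backward dependence with a distance of one double
 * (8 bytes). */
long load_addr(long base, long i)  { return base + 8 + 8 * i; }
long store_addr(long base, long i) { return base + 16 + 8 * i; }
```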
The following example:
has identical output after running loop-versioning before #92119. However, after that patch, the following diff is observed:
This is a regression.
The underlying issue is in LoopAccessAnalysis, which produces a false equal predicate. The diff before and after running LAA on the example is: