Open huihzhang opened 1 month ago
Note the corresponding GCC issue: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115363 .
LLVM-IR of the testcase
$ clang -O3 -mllvm --enable-loop-versioning-licm
https://godbolt.org/z/4efah5bG7
source_filename = "/app/example.cpp"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32"
target triple = "aarch64-unknown-linux-gnu"
define dso_local void @foo(ptr nocapture noundef %A, ptr nocapture noundef readonly %B, ptr nocapture noundef readonly %LoopBound) local_unnamed_addr #0 {
entry:
%0 = load i32, ptr %LoopBound, align 4
%cmp6 = icmp sgt i32 %0, 0
br i1 %cmp6, label %for.body, label %for.cond.cleanup
for.cond.cleanup: ; preds = %for.body, %entry
ret void
for.body: ; preds = %entry, %for.body
%indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
%arrayidx = getelementptr inbounds i32, ptr %B, i64 %indvars.iv
%1 = load i32, ptr %arrayidx, align 4
%arrayidx2 = getelementptr inbounds i32, ptr %A, i64 %indvars.iv
%2 = load i32, ptr %arrayidx2, align 4
%add = add nsw i32 %2, %1
store i32 %add, ptr %arrayidx2, align 4
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%3 = load i32, ptr %LoopBound, align 4
%4 = sext i32 %3 to i64
%cmp = icmp slt i64 %indvars.iv.next, %4
br i1 %cmp, label %for.body, label %for.cond.cleanup
}
attributes #0 = { mustprogress nofree norecurse nosync nounwind memory(argmem: readwrite) uwtable "frame-pointer"="non-leaf" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="generic" "target-features"="+fp-armv8,+neon,+outline-atomics,+v8a,-fmv" }
!llvm.module.flags = !{!0, !1, !2, !3, !4, !5}
!llvm.ident = !{!6}
!0 = !{i32 7, !"Dwarf Version", i32 4}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 8, !"PIC Level", i32 2}
!3 = !{i32 7, !"PIE Level", i32 2}
!4 = !{i32 7, !"uwtable", i32 2}
!5 = !{i32 7, !"frame-pointer", i32 1}
!6 = !{!"clang version 20.0.0git (https://github.com/llvm/llvm-project.git 01d233ff403823389f8480897e41aea84ecbb3d3)"}
!7 = !{!8, !8, i64 0}
!8 = !{!"int", !9, i64 0}
!9 = !{!"omnipotent char", !10, i64 0}
!10 = !{!"Simple C++ TBAA"}
!11 = distinct !{!11, !12}
!12 = !{!"llvm.loop.mustprogress"}
Adding restrict to the bound vectorizes it https://godbolt.org/z/bGE9EoWG3, so making the the loop versioning work for this case should trigger autovectorization.
In fact, it is not loop-versioning, as long as there is a way to establish that the pointers don't alias, the vectorization will happen. so something like
#include<algorithm>
// Return true when B does not lie in [A, A+N]
bool no_alias(int* A, int *B, int dist) {
// std::abs(std::distance(A, N)) > p1 is UB
return A > B || A+dist < B;
}
void foo(int *A, int *B, int *N) {
int p1 = *N;
if (no_alias(A, N, p1) {
for (int k = 0; k < p1; k++)
A[k] += B[k];
} else {
for (int k = 0; k < *N; k++)
A[k] += B[k];
}
}
It looks like LoopVersioningLICM bails out (LoopVersioningLICM::legalLoopStructure) as SCEV is unable to compute the loop trip count required to generate alias bounds checks.
cc: @ayalz @fhahn @alexey-bataev for guidance on how to approach this issue. Can vectorizer drive this loop versioning?
Can SCEV analyze this based on a predicate? like if vectorizer can tell SCEV to assume the loop bound doesn't alias and then compute the trip count.
There are some patterns I noticed from my previous customer codebase study: t1.cpp
t2.cpp
Loop bound in both cases requires load from pointer that may alias with memory accesses in loop.
Polly is creating runtime versioning to achieve similar effect like loop versioning. The load of loop bound is hoisted outside, thus vectorization achieved.
Both cases are good fit for LoopVersioningLICM, where load of loop bound can be hoisted for the versioned loop with no-alias assumption.
I wonder if there are enough interest in community to support vectorization for such loops. I am currently working on different projects, so might not have enough bandwidth to proceed with upstream fixes.