LoopVersioningLICM missing vectorization when loop bound is invariant under no-alias assumption

huihzhang commented 1 month ago

There are some patterns I noticed from my previous customer codebase study: t1.cpp

void foo(int *A, int *B, int *LoopBound) {
  for (int k = 0; k < *LoopBound; k++)
    A[k] += B[k];
}

t2.cpp

class Base {
protected:
  int m_totalPhase;
};
class Derived : public Base {
public:
  void foo(int* A, int *B);
};

void Derived::foo(int *A, int *B) {
  for (int k = 0; k < m_totalPhase; k++)
    A[k] += B[k];
}

Loop bound in both cases requires load from pointer that may alias with memory accesses in loop.

clang++ --target=aarch64 -mcpu=cortex-a57 -c -O3 t[1|2].cpp -Rpass-missed=loop-vectorize

t1.cpp:2:3: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
    2 |   for (int k = 0; k < *LoopBound; k++)

Polly is creating runtime versioning to achieve similar effect like loop versioning. The load of loop bound is hoisted outside, thus vectorization achieved.

clang++ --target=aarch64 -mcpu=cortex-a57 -c -O3 t[1|2].cpp -Rpass=loop-vectorize -mllvm -polly -mllvm -polly-invariant-load-hoisting -mllvm -polly-process-unprofitable

t1.cpp:2:3: remark: vectorized loop (vectorization width: 4, interleaved count: 4) [-Rpass=loop-vectorize]
    2 |   for (int k = 0; k < *LoopBound; k++)
t2.cpp:12:3: remark: vectorized loop (vectorization width: 4, interleaved count: 4) [-Rpass=loop-vectorize]
   12 |   for (int k = 0; k < m_totalPhase; k++)

Both cases are good fit for LoopVersioningLICM, where load of loop bound can be hoisted for the versioned loop with no-alias assumption.

I wonder if there are enough interest in community to support vectorization for such loops. I am currently working on different projects, so might not have enough bandwidth to proceed with upstream fixes.

pinskia commented 1 month ago

Note the corresponding GCC issue: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115363 .

appujee commented 1 week ago

LLVM-IR of the testcase

$ clang -O3 -mllvm --enable-loop-versioning-licm

https://godbolt.org/z/4efah5bG7

source_filename = "/app/example.cpp"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32"
target triple = "aarch64-unknown-linux-gnu"

define dso_local void @foo(ptr nocapture noundef %A, ptr nocapture noundef readonly %B, ptr nocapture noundef readonly %LoopBound) local_unnamed_addr #0 {
entry:
  %0 = load i32, ptr %LoopBound, align 4
  %cmp6 = icmp sgt i32 %0, 0
  br i1 %cmp6, label %for.body, label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.body, %entry
  ret void

for.body:                                         ; preds = %entry, %for.body
  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
  %arrayidx = getelementptr inbounds i32, ptr %B, i64 %indvars.iv
  %1 = load i32, ptr %arrayidx, align 4
  %arrayidx2 = getelementptr inbounds i32, ptr %A, i64 %indvars.iv
  %2 = load i32, ptr %arrayidx2, align 4
  %add = add nsw i32 %2, %1
  store i32 %add, ptr %arrayidx2, align 4
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %3 = load i32, ptr %LoopBound, align 4
  %4 = sext i32 %3 to i64
  %cmp = icmp slt i64 %indvars.iv.next, %4
  br i1 %cmp, label %for.body, label %for.cond.cleanup
}

attributes #0 = { mustprogress nofree norecurse nosync nounwind memory(argmem: readwrite) uwtable "frame-pointer"="non-leaf" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="generic" "target-features"="+fp-armv8,+neon,+outline-atomics,+v8a,-fmv" }

!llvm.module.flags = !{!0, !1, !2, !3, !4, !5}
!llvm.ident = !{!6}

!0 = !{i32 7, !"Dwarf Version", i32 4}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 8, !"PIC Level", i32 2}
!3 = !{i32 7, !"PIE Level", i32 2}
!4 = !{i32 7, !"uwtable", i32 2}
!5 = !{i32 7, !"frame-pointer", i32 1}
!6 = !{!"clang version 20.0.0git (https://github.com/llvm/llvm-project.git 01d233ff403823389f8480897e41aea84ecbb3d3)"}
!7 = !{!8, !8, i64 0}
!8 = !{!"int", !9, i64 0}
!9 = !{!"omnipotent char", !10, i64 0}
!10 = !{!"Simple C++ TBAA"}
!11 = distinct !{!11, !12}
!12 = !{!"llvm.loop.mustprogress"}

appujee commented 1 week ago

Adding restrict to the bound vectorizes it https://godbolt.org/z/bGE9EoWG3, so making the the loop versioning work for this case should trigger autovectorization.

appujee commented 1 week ago

In fact, it is not loop-versioning, as long as there is a way to establish that the pointers don't alias, the vectorization will happen. so something like

#include<algorithm>

// Return true when B does not lie in [A, A+N]
bool no_alias(int* A, int *B, int dist) {
  // std::abs(std::distance(A, N)) > p1 is UB
  return A > B || A+dist < B;
}

void foo(int *A, int *B, int *N) {
  int p1 = *N;
  if (no_alias(A, N, p1) {
    for (int k = 0; k < p1; k++)
        A[k] += B[k];
  } else {
    for (int k = 0; k < *N; k++)
        A[k] += B[k];
  }
}

https://godbolt.org/z/b1xhf3bb7

chrisj-quic commented 3 days ago

It looks like LoopVersioningLICM bails out (LoopVersioningLICM::legalLoopStructure) as SCEV is unable to compute the loop trip count required to generate alias bounds checks.

hiraditya commented 2 days ago

cc: @ayalz @fhahn @alexey-bataev for guidance on how to approach this issue. Can vectorizer drive this loop versioning?

Can SCEV analyze this based on a predicate? like if vectorizer can tell SCEV to assume the loop bound doesn't alias and then compute the trip count.

llvm / llvm-project

LoopVersioningLICM missing vectorization when loop bound is invariant under no-alias assumption #112060