llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
26.78k stars 10.97k forks source link

[Clang][LoopAccessAnalysis] Stride might be wrongly replaced by 1 in certain situation. #95086

Open Yunzezhu94 opened 3 weeks ago

Yunzezhu94 commented 3 weeks ago

Recently when I'm trying loop vectorize on llvm 17.x version, in a certain case I found when storing value into array sometimes it stored into wrong position. Here is a case reduced from the case I met with the problem:

extern int a[];
extern bool b[];
template <typename c> c d(c, c);
void e(long f, long long g[][1][1][1]) {
  if (d(g[4][4][4][0], (long long)e))
    for (int h = 0; h < 1; h += (int)f + 7)
      a[h * 3] = b[h];
}

When compile with flag: -std=c++11 --target=riscv64-unknown-linux -march=rv64imafdcv -fPIC -w -fopenmp-simd -O3 there is

%add = shl i64 %f, 32
  %sext = add i64 %add, 30064771072
  %1 = ashr exact i64 %sext, 32

and %1 is the stride equal to (int)f + 7 . When turns on debug flag there is debug log: LAA: Replacing SCEV: {@b,+,%1}<nw><%for.body> by: {@b,+,1}<nw><%for.body> and in vplan there is vp<%4> = SCALAR-STEPS vp<%2>, ir<1> That is the stride is wrongly replaced by 1.

In my opinion such versioning should be applied on the strides of symbolically striding memory accesses. So I checked the source code and found the symbolic stride is collected in getStrideFromPointer:

const SCEVUnknown *U = dyn_cast<SCEVUnknown>(V);
  if (!U) {
    const auto *C = dyn_cast<SCEVIntegralCastExpr>(V);
    if (!C)
      return nullptr;
    U = dyn_cast<SCEVUnknown>(C->getOperand());
    if (!U)
      return nullptr;

    // Match legacy behavior - this is not needed for correctness
    if (!getUniqueCastUse(U->getValue(), Lp, V->getType()))
      return nullptr;
  }

  return V;

The value V is regard as symbolic value when dyn_cast(V) returns a SCEVUnknown. However, in the case I post the stride has pattern:

        X = Shl A, n
        Y = Add X, c
        Z = AShr Y, m
        n, c and m are constants.

which is not supported in createSCEV on branch 17.x, and createSCEV return getUnknown when processing 1%, making 1% regard as symbolic stride value and replaced by 1 later.

I checked source code on main branch and found the ashr pattern is supported, so the case I post will not report error. But still in any case dyn_cast(V) returns a SCEVUnknown will return V as symbolic value, and this might be a risk. I think returning a SCEVUnknown is not equal to regarding V as symbolic value because unsupported patterns in createSCEV will also return a SCEVUnknown, and in my opinion there should be an extra check to stop these unsupported pattern value being regard as symbolic value.

fhahn commented 3 weeks ago

I might be missing something, but there should be a check added in the generated code to ensure %1 == 1. The predicate might always be false, but shouldn't result in incorrect code?

Yunzezhu94 commented 3 weeks ago

I might be missing something, but there should be a check added in the generated code to ensure %1 == 1. The predicate might always be false, but shouldn't result in incorrect code?

I checked code around replaceSymbolicStrideSCEV, but did not find such check.

const SCEV *llvm::replaceSymbolicStrideSCEV(PredicatedScalarEvolution &PSE,
                                            const DenseMap<Value *, const SCEV *> &PtrToStride,
                                            Value *Ptr) {
  const SCEV *OrigSCEV = PSE.getSCEV(Ptr);

  // If there is an entry in the map return the SCEV of the pointer with the
  // symbolic stride replaced by one.
  DenseMap<Value *, const SCEV *>::const_iterator SI = PtrToStride.find(Ptr);
  if (SI == PtrToStride.end())
    // For a non-symbolic stride, just return the original expression.
    return OrigSCEV;

  const SCEV *StrideSCEV = SI->second;
  // Note: This assert is both overly strong and overly weak.  The actual
  // invariant here is that StrideSCEV should be loop invariant.  The only
  // such invariant strides we happen to speculate right now are unknowns
  // and thus this is a reasonable proxy of the actual invariant.
  assert(isa<SCEVUnknown>(StrideSCEV) && "shouldn't be in map");

  ScalarEvolution *SE = PSE.getSE();
  const auto *CT = SE->getOne(StrideSCEV->getType());
  PSE.addPredicate(*SE->getEqualPredicate(StrideSCEV, CT));
  auto *Expr = PSE.getSCEV(Ptr);

  LLVM_DEBUG(dbgs() << "LAA: Replacing SCEV: " << *OrigSCEV
         << " by: " << *Expr << "\n");
  return Expr;
}

It looks the value of stride is not checked but a predicate that equal to 1 is add to any stride passes entry check and SCEVUnknown check, even if it does not equal to 1. When a stride not equal to 1 being wrongly replaced, it may cause error in stride storing/loading, and that is the error I met with the case.

nikic commented 3 weeks ago

@Yunzezhu94 Adding a predicate to PSE means generating a run-time check for the predicate.

Yunzezhu94 commented 3 weeks ago

@Yunzezhu94 Adding a predicate to PSE means generating a run-time check for the predicate.

Thanks! I tried a similar case on recent llvm and the vector part is skipped at run-time check when %1 is wrongly replaced by 1, so it looks such a case won't have a wrong run result. However, I noticed that the wrong vector part is still in the compiled program even if it is skipped at run-time. Skipping the whole vector loop may cause losses of performance when the stride is wrongly replaced, since the original loop may be correctly vectorized without versioning. On the other hand, I doubt if the run-time check works all time that in certain situation like compiling with openmp, the run-time check might be skipped. So I still have the point that an extra check should be added at compile time that loops have more chance to be vectorized and less possibility to get a wrong run result.

Yunzezhu94 commented 3 weeks ago

I might be missing something, but there should be a check added in the generated code to ensure %1 == 1. The predicate might always be false, but shouldn't result in incorrect code?

It looks the run-time check is not generated correctly and failed to prevent incorrect code from running at run-time on llvm 17.x. Still I think this issue needs to be repair, or an extra check should be added during compiling, since run-time check does not seem to be 100% safe.

Yunzezhu94 commented 2 weeks ago

Gentle ping.

nikic commented 2 weeks ago

Do you have an example that shows a miscompilation on current LLVM main (or at least LLVM 18)? If not, I don't think there is anything to do here.

Yunzezhu94 commented 2 weeks ago

Do you have an example that shows a miscompilation on current LLVM main (or at least LLVM 18)? If not, I don't think there is anything to do here.

Yes, here is a reduced case:

extern long arr_350[];
void a(bool b) {
  for (;;)
    for (short c; c < 4; c += b + 4)
      arr_350[c] = 0;
}

When compiling this case on LLVM main with flag: -std=c++11 -fPIC -w --target=riscv64-unknown-linux -march=rv64imafdcv -O3 -mllvm -enable-loop-distribute -mllvm -loop-distribute-non-if-convertible=true -mllvm -print-after=loop-vectorize we can get log print after loop vectorize pass:

entry:
  %0 = select i1 %b, i64 5, i64 4
  br label %for.cond

for.cond.loopexit.loopexit:                       ; preds = %middle.block, %for.body
  %indvars.iv.next.lcssa = phi i64 [ %indvars.iv.next, %for.body ], [ %ind.end, %middle.block ]
  %1 = trunc nsw i64 %indvars.iv.next.lcssa to i16
  br label %for.cond.loopexit

for.cond.loopexit:                                ; preds = %for.cond.loopexit.loopexit, %for.cond
  %c.1.lcssa = phi i16 [ %c.0, %for.cond ], [ %1, %for.cond.loopexit.loopexit ]
  br label %for.cond, !llvm.loop !8

for.cond:                                         ; preds = %for.cond.loopexit, %entry
  %c.0 = phi i16 [ undef, %entry ], [ %c.1.lcssa, %for.cond.loopexit ]
  %cmp8 = icmp slt i16 %c.0, 4
  br i1 %cmp8, label %for.body.preheader, label %for.cond.loopexit

for.body.preheader:                               ; preds = %for.cond
  %2 = sext i16 %c.0 to i64
  %3 = add i64 %0, %2
  %smax = call i64 @llvm.smax.i64(i64 %3, i64 4)
  %4 = sub i64 %smax, %3
  %umin = call i64 @llvm.umin.i64(i64 %4, i64 1)
  %5 = add i64 %umin, 1
  %6 = sub i64 %smax, %umin
  %7 = sub i64 %6, %3
  %8 = udiv i64 %7, %0
  %9 = add i64 %5, %8
  %10 = call i64 @llvm.vscale.i64()
  %11 = mul i64 %10, 2
  %min.iters.check = icmp ult i64 %9, %11
  br i1 %min.iters.check, label %scalar.ph, label %vector.scevcheck

vector.scevcheck:                                 ; preds = %for.body.preheader
  br i1 true, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %vector.scevcheck
  %12 = call i64 @llvm.vscale.i64()
  %13 = mul i64 %12, 2
  %n.mod.vf = urem i64 %9, %13
  %n.vec = sub i64 %9, %n.mod.vf
  %14 = mul i64 %n.vec, %0
  %ind.end = add i64 %2, %14
  %15 = call i64 @llvm.vscale.i64()
  %16 = mul i64 %15, 2
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %offset.idx = add i64 %2, %index
  %17 = add i64 %offset.idx, 0
  %18 = getelementptr inbounds [0 x i64], ptr @arr_350, i64 0, i64 %17
  %19 = getelementptr inbounds i64, ptr %18, i32 0
  store <vscale x 2 x i64> zeroinitializer, ptr %19, align 8, !tbaa !10
  %20 = add nsw i64 %17, 1
  %index.next = add nuw i64 %index, %16
  %21 = icmp eq i64 %index.next, %n.vec
  br i1 %21, label %middle.block, label %vector.body, !llvm.loop !14

middle.block:                                     ; preds = %vector.body
  %cmp.n = icmp eq i64 %9, %n.vec
  br i1 %cmp.n, label %for.cond.loopexit.loopexit, label %scalar.ph

scalar.ph:                                        ; preds = %vector.scevcheck, %for.body.preheader, %middle.block
  %bc.resume.val = phi i64 [ %ind.end, %middle.block ], [ %2, %for.body.preheader ], [ %2, %vector.scevcheck ]
  br label %for.body

for.body:                                         ; preds = %scalar.ph, %for.body
  %indvars.iv = phi i64 [ %bc.resume.val, %scalar.ph ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds [0 x i64], ptr @arr_350, i64 0, i64 %indvars.iv
  store i64 0, ptr %arrayidx, align 8, !tbaa !10
  %indvars.iv.next = add nsw i64 %indvars.iv, %0
  %cmp = icmp slt i64 %indvars.iv.next, 4
  br i1 %cmp, label %for.body, label %for.cond.loopexit.loopexit, !llvm.loop !17
}

This case won't get error at running since it won't run into vector part, but in the vector part of log the index increase vscale x 2 for each vector loop. In this case it chooses VF = vscale x 2 and the original stride is 0% so I think in vector part the index should increase vscale x 2 X 0% for each vector loop.

Yunzezhu94 commented 1 week ago

Do you have an example that shows a miscompilation on current LLVM main (or at least LLVM 18)? If not, I don't think there is anything to do here.

Here is another reduced case that generates wrong result: A.ii :

extern "C" void printf(...);
unsigned long long a;
void b(unsigned long long *a, long c) {
  *a ^= c + 2654435769 + (*a << 6) + (*a >> 2);
}
bool var_157;
int var_158;
char var_159;
bool var_166;
int d[4][24];
short e[1][24];
signed char f[1][24][24][24];
unsigned char g[1][24];
short h[1][24][24][24];
unsigned long long i[1][24][24];
signed char j[1][24][24][24];
unsigned long long k[1][24][24][24];
long arr_350[24][24][576];
long l[331776];
void test(int, bool, int[][24], short[][24], signed char[][24][24][24],
          unsigned char[][24], short[][24][24][24],
          unsigned long long[][24][24], signed char[][24][24][24],
          unsigned long long[][24][24][24]);
int main() {
  for (long m = 0; m < 24; ++m)
    for (long n = 0; n < 4; ++n)
      d[m][n] = -1234087353;
  for (long m = 0; m < 24; ++m)
    for (long n = 0; n < 24; ++n)
      for (long o = 0; o < 24; ++o)
        for (long p = 0; p < 24; ++p)
          arr_350[m][n][o * 24 + p] = 2245753719890332927;
  test(78403368, 1, d, e, f, g, h, i, j, k);
  for (long m = 0; m < 24; ++m)
    for (long n = 0; n < 24; ++n)
      for (long o = 0; o < 24; ++o)
        for (long p = 0; p < 24; ++p)
          b(&a, arr_350[m][n][o * 24 + p]);
  printf("%llu\n", a);
}

B.ii :

extern bool var_157;
extern int var_158;
extern char var_159;
extern int a;
extern bool var_166;
extern long arr_350[][24][24][24];
short b;
long q;
short c;
int s;
void test(int d, bool e, int f[][24], short g[][24],
          signed char h[][24][24][24], unsigned char i[][24],
          short j[][24][24][24], unsigned long long k[][24][24],
          signed char l[][24][24][24], unsigned long long m[][24][24][24]) {
  for (char n = 1; n < (char)d; n += 4)
    for (char o = 0; o < 24; o++)
      for (bool p = 0; p < 1; p = 2)
#pragma clang loop vectorize_predicate(enable)
        for (short r = 0; r < 24; r += e + 4) {
          arr_350[n + 1][o][p][r] = ~f[n][1];
          if (h[1][n][p][1] != h[p][p][p][1])
            b = k[2][p][1];
          var_157 = var_158 = var_159 = q = m[1][1][1][n] / l[1][1][1][2];
          a = c = i[n][1];
          var_166 = s = j[n][3][2][2] ? g[3][8] : 0;
        }
}

When compile with command: clang++ -std=c++11 --target=riscv64-unknown-linux -march=rv64imafdcv -O0 -o a.out A.ii B.ii, It returns 14404730799330978103 at run time. However, when compile with command: clang++ -std=c++11 -fPIC -w --target=riscv64-unknown-linux -march=rv64imafdcv -O3 -mllvm -enable-loop-distribute -mllvm -loop-distribute-non-if-convertible=true -o a.out A.ii B.ii It returns 13460671840389111898 which is incorrect.