Open Yunzezhu94 opened 3 weeks ago
I might be missing something, but there should be a check added in the generated code to ensure `%1 == 1`. The predicate might always be false, but shouldn't result in incorrect code?
> I might be missing something, but there should be a check added in the generated code to ensure `%1 == 1`. The predicate might always be false, but shouldn't result in incorrect code?
I checked the code around `replaceSymbolicStrideSCEV`, but did not find such a check.
```c++
const SCEV *llvm::replaceSymbolicStrideSCEV(PredicatedScalarEvolution &PSE,
                                            const DenseMap<Value *, const SCEV *> &PtrToStride,
                                            Value *Ptr) {
  const SCEV *OrigSCEV = PSE.getSCEV(Ptr);

  // If there is an entry in the map return the SCEV of the pointer with the
  // symbolic stride replaced by one.
  DenseMap<Value *, const SCEV *>::const_iterator SI = PtrToStride.find(Ptr);
  if (SI == PtrToStride.end())
    // For a non-symbolic stride, just return the original expression.
    return OrigSCEV;

  const SCEV *StrideSCEV = SI->second;
  // Note: This assert is both overly strong and overly weak.  The actual
  // invariant here is that StrideSCEV should be loop invariant.  The only
  // such invariant strides we happen to speculate right now are unknowns
  // and thus this is a reasonable proxy of the actual invariant.
  assert(isa<SCEVUnknown>(StrideSCEV) && "shouldn't be in map");

  ScalarEvolution *SE = PSE.getSE();
  const auto *CT = SE->getOne(StrideSCEV->getType());
  PSE.addPredicate(*SE->getEqualPredicate(StrideSCEV, CT));
  auto *Expr = PSE.getSCEV(Ptr);

  LLVM_DEBUG(dbgs() << "LAA: Replacing SCEV: " << *OrigSCEV
                    << " by: " << *Expr << "\n");
  return Expr;
}
```
It looks like the value of the stride is never actually checked: a predicate that it equals 1 is added for any stride that passes the map lookup and the `SCEVUnknown` check, even if it does not equal 1. When a stride that is not 1 is wrongly replaced, it can corrupt the strided stores/loads, and that is the error I hit in my case.
@Yunzezhu94 Adding a predicate to PSE means generating a run-time check for the predicate.
> @Yunzezhu94 Adding a predicate to PSE means generating a run-time check for the predicate.
Thanks! I tried a similar case on recent LLVM, and the vector part is skipped by the run-time check when %1 is wrongly replaced by 1, so such a case does not produce a wrong result at run time. However, the wrong vector body is still present in the compiled program even though it is skipped. Skipping the whole vector loop can cost performance when the stride is wrongly replaced, since the original loop might have been vectorized correctly without versioning. I also doubt the run-time check holds in every situation; for example, when compiling with OpenMP, the run-time check might be skipped. So I still think an extra check should be added at compile time, so that loops have a better chance of being vectorized and less chance of producing a wrong result.
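For readers following the discussion, the versioning scheme being debated can be sketched as a tiny stand-alone model. This is illustrative C++, not LLVM's generated code; `runLoop`, `arr`, and the trip counts are made up. The body specialized for stride 1 runs only when the run-time guard proves the speculated stride, and every other stride takes the scalar fallback:

```c++
#include <cassert>

// Illustrative stand-alone model of loop versioning with a speculated
// stride (names and values are hypothetical, not LLVM's output).
long arr[64];

void runLoop(long stride, long trips) {
  if (stride == 1) {
    // Body specialized for stride == 1; only reachable when the
    // run-time predicate holds, so the assumption is safe here.
    for (long i = 0; i < trips; ++i)
      arr[i] = 0;
  } else {
    // Scalar fallback: correct for any stride, including the ones the
    // speculated body would mishandle.
    for (long t = 0; t < trips; ++t)
      arr[t * stride] = 0;
  }
}
```

When the predicate is always false, the specialized body is dead code: the program stays correct, but it carries an unused loop, which is the code-size and performance cost described above.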
> I might be missing something, but there should be a check added in the generated code to ensure `%1 == 1`. The predicate might always be false, but shouldn't result in incorrect code?
It looks like the run-time check is not generated correctly on LLVM 17.x and fails to prevent the incorrect code from running. I still think this issue needs to be repaired, or an extra check should be added at compile time, since the run-time check does not seem to be 100% safe.
Gentle ping.
Do you have an example that shows a miscompilation on current LLVM main (or at least LLVM 18)? If not, I don't think there is anything to do here.
> Do you have an example that shows a miscompilation on current LLVM main (or at least LLVM 18)? If not, I don't think there is anything to do here.
Yes, here is a reduced case:
```c++
extern long arr_350[];
void a(bool b) {
  for (;;)
    for (short c; c < 4; c += b + 4)
      arr_350[c] = 0;
}
```
When compiling this case on LLVM main with the flags `-std=c++11 -fPIC -w --target=riscv64-unknown-linux -march=rv64imafdcv -O3 -mllvm -enable-loop-distribute -mllvm -loop-distribute-non-if-convertible=true -mllvm -print-after=loop-vectorize`, we get this dump after the loop-vectorize pass:
```llvm
entry:
  %0 = select i1 %b, i64 5, i64 4
  br label %for.cond

for.cond.loopexit.loopexit:                       ; preds = %middle.block, %for.body
  %indvars.iv.next.lcssa = phi i64 [ %indvars.iv.next, %for.body ], [ %ind.end, %middle.block ]
  %1 = trunc nsw i64 %indvars.iv.next.lcssa to i16
  br label %for.cond.loopexit

for.cond.loopexit:                                ; preds = %for.cond.loopexit.loopexit, %for.cond
  %c.1.lcssa = phi i16 [ %c.0, %for.cond ], [ %1, %for.cond.loopexit.loopexit ]
  br label %for.cond, !llvm.loop !8

for.cond:                                         ; preds = %for.cond.loopexit, %entry
  %c.0 = phi i16 [ undef, %entry ], [ %c.1.lcssa, %for.cond.loopexit ]
  %cmp8 = icmp slt i16 %c.0, 4
  br i1 %cmp8, label %for.body.preheader, label %for.cond.loopexit

for.body.preheader:                               ; preds = %for.cond
  %2 = sext i16 %c.0 to i64
  %3 = add i64 %0, %2
  %smax = call i64 @llvm.smax.i64(i64 %3, i64 4)
  %4 = sub i64 %smax, %3
  %umin = call i64 @llvm.umin.i64(i64 %4, i64 1)
  %5 = add i64 %umin, 1
  %6 = sub i64 %smax, %umin
  %7 = sub i64 %6, %3
  %8 = udiv i64 %7, %0
  %9 = add i64 %5, %8
  %10 = call i64 @llvm.vscale.i64()
  %11 = mul i64 %10, 2
  %min.iters.check = icmp ult i64 %9, %11
  br i1 %min.iters.check, label %scalar.ph, label %vector.scevcheck

vector.scevcheck:                                 ; preds = %for.body.preheader
  br i1 true, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %vector.scevcheck
  %12 = call i64 @llvm.vscale.i64()
  %13 = mul i64 %12, 2
  %n.mod.vf = urem i64 %9, %13
  %n.vec = sub i64 %9, %n.mod.vf
  %14 = mul i64 %n.vec, %0
  %ind.end = add i64 %2, %14
  %15 = call i64 @llvm.vscale.i64()
  %16 = mul i64 %15, 2
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %offset.idx = add i64 %2, %index
  %17 = add i64 %offset.idx, 0
  %18 = getelementptr inbounds [0 x i64], ptr @arr_350, i64 0, i64 %17
  %19 = getelementptr inbounds i64, ptr %18, i32 0
  store <vscale x 2 x i64> zeroinitializer, ptr %19, align 8, !tbaa !10
  %20 = add nsw i64 %17, 1
  %index.next = add nuw i64 %index, %16
  %21 = icmp eq i64 %index.next, %n.vec
  br i1 %21, label %middle.block, label %vector.body, !llvm.loop !14

middle.block:                                     ; preds = %vector.body
  %cmp.n = icmp eq i64 %9, %n.vec
  br i1 %cmp.n, label %for.cond.loopexit.loopexit, label %scalar.ph

scalar.ph:                                        ; preds = %vector.scevcheck, %for.body.preheader, %middle.block
  %bc.resume.val = phi i64 [ %ind.end, %middle.block ], [ %2, %for.body.preheader ], [ %2, %vector.scevcheck ]
  br label %for.body

for.body:                                         ; preds = %scalar.ph, %for.body
  %indvars.iv = phi i64 [ %bc.resume.val, %scalar.ph ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds [0 x i64], ptr @arr_350, i64 0, i64 %indvars.iv
  store i64 0, ptr %arrayidx, align 8, !tbaa !10
  %indvars.iv.next = add nsw i64 %indvars.iv, %0
  %cmp = icmp slt i64 %indvars.iv.next, 4
  br i1 %cmp, label %for.body, label %for.cond.loopexit.loopexit, !llvm.loop !17
}
```
This case does not produce a wrong result when run, because execution never reaches the vector part. But in the vector body of the dump, the index increases by vscale x 2 on each vector iteration. Here the chosen VF is vscale x 2 and the original stride is `%0`, so I think the index in the vector body should increase by (vscale x 2) * `%0` per iteration.
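To illustrate the indexing point with a hypothetical model (not the emitted RISC-V code; `VF` stands in for vscale x 2 and `stride` for `%0`): a vector loop that covers `VF` scalar iterations per trip must advance the scalar index by `VF * stride`, so advancing it by `VF` alone only lines up with the scalar loop when the stride is 1.

```c++
// Model of the index math above: VF stands in for vscale x 2, and
// stride stands in for %0 (4 or 5 in this case).
constexpr long VF = 4;

// Index touched by scalar iteration `iter` of a strided loop.
long scalarIndex(long stride, long iter) { return iter * stride; }

// Starting index of vector iteration `k` when it covers VF scalar
// iterations per trip: it must advance by VF * stride, not by VF.
long vectorBaseIndex(long stride, long k) { return k * VF * stride; }
```

For stride 1 the two step rules coincide (`k * VF * 1 == k * VF`), which is why the mismatch is only observable when the speculated stride does not hold.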
> Do you have an example that shows a miscompilation on current LLVM main (or at least LLVM 18)? If not, I don't think there is anything to do here.
Here is another reduced case that generates a wrong result. `A.ii`:

```c++
extern "C" void printf(...);
unsigned long long a;
void b(unsigned long long *a, long c) {
  *a ^= c + 2654435769 + (*a << 6) + (*a >> 2);
}
bool var_157;
int var_158;
char var_159;
bool var_166;
int d[4][24];
short e[1][24];
signed char f[1][24][24][24];
unsigned char g[1][24];
short h[1][24][24][24];
unsigned long long i[1][24][24];
signed char j[1][24][24][24];
unsigned long long k[1][24][24][24];
long arr_350[24][24][576];
long l[331776];
void test(int, bool, int[][24], short[][24], signed char[][24][24][24],
          unsigned char[][24], short[][24][24][24],
          unsigned long long[][24][24], signed char[][24][24][24],
          unsigned long long[][24][24][24]);
int main() {
  for (long m = 0; m < 24; ++m)
    for (long n = 0; n < 4; ++n)
      d[m][n] = -1234087353;
  for (long m = 0; m < 24; ++m)
    for (long n = 0; n < 24; ++n)
      for (long o = 0; o < 24; ++o)
        for (long p = 0; p < 24; ++p)
          arr_350[m][n][o * 24 + p] = 2245753719890332927;
  test(78403368, 1, d, e, f, g, h, i, j, k);
  for (long m = 0; m < 24; ++m)
    for (long n = 0; n < 24; ++n)
      for (long o = 0; o < 24; ++o)
        for (long p = 0; p < 24; ++p)
          b(&a, arr_350[m][n][o * 24 + p]);
  printf("%llu\n", a);
}
```
`B.ii`:

```c++
extern bool var_157;
extern int var_158;
extern char var_159;
extern int a;
extern bool var_166;
extern long arr_350[][24][24][24];
short b;
long q;
short c;
int s;
void test(int d, bool e, int f[][24], short g[][24],
          signed char h[][24][24][24], unsigned char i[][24],
          short j[][24][24][24], unsigned long long k[][24][24],
          signed char l[][24][24][24], unsigned long long m[][24][24][24]) {
  for (char n = 1; n < (char)d; n += 4)
    for (char o = 0; o < 24; o++)
      for (bool p = 0; p < 1; p = 2)
#pragma clang loop vectorize_predicate(enable)
        for (short r = 0; r < 24; r += e + 4) {
          arr_350[n + 1][o][p][r] = ~f[n][1];
          if (h[1][n][p][1] != h[p][p][p][1])
            b = k[2][p][1];
          var_157 = var_158 = var_159 = q = m[1][1][1][n] / l[1][1][1][2];
          a = c = i[n][1];
          var_166 = s = j[n][3][2][2] ? g[3][8] : 0;
        }
}
```
When compiled with the command `clang++ -std=c++11 --target=riscv64-unknown-linux -march=rv64imafdcv -O0 -o a.out A.ii B.ii`, it prints 14404730799330978103 at run time.
However, when compiled with `clang++ -std=c++11 -fPIC -w --target=riscv64-unknown-linux -march=rv64imafdcv -O3 -mllvm -enable-loop-distribute -mllvm -loop-distribute-non-if-convertible=true -o a.out A.ii B.ii`, it prints 13460671840389111898, which is incorrect.
Recently, while trying loop vectorization on the LLVM 17.x branch, I found a case where a value was sometimes stored into the wrong array position. Here is a case reduced from the one where I met the problem:
When compiling with the flags `-std=c++11 --target=riscv64-unknown-linux -march=rv64imafdcv -fPIC -w -fopenmp-simd -O3`, `%1` is the stride, equal to `(int)f + 7`. When the debug flag is turned on, there is this debug log:
```
LAA: Replacing SCEV: {@b,+,%1}<nw><%for.body> by: {@b,+,1}<nw><%for.body>
```
and in the VPlan there is `vp<%4> = SCALAR-STEPS vp<%2>, ir<1>`. That is, the stride is wrongly replaced by 1. In my opinion, such versioning should be applied only to the strides of symbolically striding memory accesses. So I checked the source code and found that the symbolic stride is collected in `getStrideFromPointer`:
The value `V` is regarded as a symbolic value when `dyn_cast` yields a `SCEVUnknown` for it. However, in the case I posted the stride has a pattern which is not supported by `createSCEV` on the 17.x branch, so `createSCEV` returns `getUnknown` when processing `%1`, which makes `%1` be regarded as a symbolic stride value and replaced by 1 later.
I checked the source code on the main branch and found that the ashr pattern is supported there, so the case I posted no longer reports an error. But any `V` for which the `dyn_cast` yields a `SCEVUnknown` is still returned as a symbolic value, and this might be a risk. I think getting a `SCEVUnknown` back is not the same as `V` being a genuinely symbolic value, because unsupported patterns in `createSCEV` also produce a `SCEVUnknown`; in my opinion there should be an extra check to stop these unsupported-pattern values from being regarded as symbolic strides.