Open llvmbot opened 7 years ago
Third time's the charm: the difference isn't in the phis, it's the comparison operator. The following example exhibits the "loops remain after -O3" problem (it uses < instead of !=):
```
#include <array>
using namespace std;
typedef long long i64;

i64 add_to_mat2d_elt(i64 x) {
  // Ugly initialization with all 0.
  array<array<i64, 2>, 2> ab;
  {
    array<i64, 2> tmp;
    for (auto *p = &tmp[0]; p < &tmp[2]; p++)
      *p = 0;
    for (auto *p = &ab[0]; p < &ab[2]; p++)
      *p = tmp;
  }
  return ab[0][0] + x;
}
```
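For reference, since the loops zero ab completely before ab[0][0] is read, the whole function should fold down to the equivalent of the following (my hand-folded paraphrase of the expected -O3 result, not actual compiler output):

```cpp
typedef long long i64;

// Hypothetical hand-folded version: ab[0][0] is always 0 after the
// initialization loops, so the function should reduce to returning x.
i64 add_to_mat2d_elt_expected(i64 x) { return x; }
```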
Unoptimized IR for add_to_mat2d_elt. My bad, I forgot that the unoptimized IR in my case already had phis for the loops, so the comparison is not exactly fair.
I am attaching the unoptimized IR for the testcase that should optimize the same as the following C++ program:
```
#include <array>
using namespace std;
typedef long long i64;

i64 add_to_mat2d_elt(i64 x) {
  // Ugly initialization with all 0.
  array<array<i64, 2>, 2> ab;
  {
    array<i64, 2> tmp;
    for (auto *p = &tmp[0]; p != &tmp[2]; p++)
      *p = 0;
    for (auto *p = &ab[0]; p != &ab[2]; p++)
      *p = tmp;
  }
  return ab[0][0] + x;
}
```
Here is what I see with a current clang at O1/O2/O3 for this example:
```
%"struct.std::__1::array" = type { [2 x %"struct.std::__1::array.0"] }
%"struct.std::__1::array.0" = type { [2 x i32] }

; Function Attrs: ssp uwtable
define void @_Z3foov() local_unnamed_addr #0 {
entry:
  %ab = alloca %"struct.std::__1::array", align 4
  %0 = bitcast %"struct.std::__1::array"* %ab to i8*
  call void @llvm.lifetime.start(i64 16, i8* %0) #3
  call void @llvm.memset.p0i8.i64(i8* %0, i8 0, i64 16, i32 4, i1 false)
  call void @_Z3barRNSt3__15arrayINS0_IiLm2EEELm2EEE(%"struct.std::__1::array"* nonnull dereferenceable(16) %ab)
  call void @llvm.lifetime.end(i64 16, i8* %0) #3
  ret void
}
```
I found an even worse case, with unoptimized branches on constants (i1 true) and a whole barely-touched loop. I believe (though I couldn't verify yet) that this C++ snippet reproduces it:
```
#include <array>
using namespace std;

void bar(array<array<int, 2>, 2>&);

void foo() {
  array<array<int, 2>, 2> ab;
  for (auto p = &ab[0]; p != &ab[2]; p++)
    for (auto q = &(*p)[0]; q != &(*p)[2]; q++)
      *q = 0;
  bar(ab);
}
```
My first inclination is to say that the SCEV fix itself is correct, but some parts of LLVM (including SCEV itself) need to get smarter to cope with r273079. I'll take a closer look tonight.
bisection points to: [SCEV] Fix incorrect trip count computation
The way we elide max expressions when computing trip counts is incorrect
-- it breaks cases like this:
```
static int wrapping_add(int a, int b) {
  return (int)((unsigned)a + (unsigned)b);
}

void test() {
  volatile int end_buf = 2147483548; // INT_MIN - 100
  int end = end_buf;
  unsigned counter = 0;
  for (int start = wrapping_add(end, 200); start < end; start++)
    counter++;
  print(counter);
}
```
Sanjoy, can you please take a look?
Extended Description
The following function is optimized out in clang 3.8, but not 3.9:
```
void test() {
  double a[4];
  for (double *p = a, *end = a + 4; p < end; p++)
    *p = 0.0;
}
```
The optimized IR on 3.9 is more or less:
```
define void @test() {
entry-block:
  %a = alloca [4 x double], align 8
  %a6 = ptrtoint [4 x double]* %a to i64
  %0 = bitcast [4 x double]* %a to i8*
  call void @llvm.lifetime.start(i64 32, i8* %0)
  %scevgep = getelementptr [4 x double], [4 x double]* %a, i64 1, i64 0
  %scevgep1 = ptrtoint double* %scevgep to i64
  %scevgep2 = getelementptr [4 x double], [4 x double]* %a, i64 0, i64 1
  %scevgep23 = ptrtoint double* %scevgep2 to i64
  %1 = icmp ugt i64 %scevgep1, %scevgep23
  %umax = select i1 %1, i64 %scevgep1, i64 %scevgep23
  %umax45 = inttoptr i64 %umax to i8*
  %2 = xor i64 %a6, -1
  %uglygep = getelementptr i8, i8* %umax45, i64 %2
  %uglygep7 = ptrtoint i8* %uglygep to i64
  %3 = add i64 %uglygep7, 8
  %4 = and i64 %3, -8
  call void @llvm.memset.p0i8.i64(i8* %0, i8 0, i64 %4, i32 8, i1 false)
  call void @llvm.lifetime.end(i64 32, i8* %0)
  ret void
}
```
If the llvm.lifetime.{start,end} calls are removed, running opt again will reduce this to just "ret void".
Any other use of %a will likewise prevent the memset and everything before it from being removed, even if the entire array is statically overwritten.