
Avoid reloads after vectorizer's runtime alias check #42456

Open · davidbolvansky opened 5 years ago

davidbolvansky commented 5 years ago
Bugzilla Link 43111
Version trunk
OS Linux
CC @topperc, @fhahn, @preames, @RKSimon, @rotateright

Extended Description

Consider this loop:

int foo(int *p, int *q) {
    int s = 0;
    for (int i = 0; i < 256; ++i) {
        int k = p[i];
        q[i] = 2*k;
        s += k;
        int o = p[i];
        s += o;
    }
    return s;
}

Flags: clang -fno-unroll-loops -O3 loop.c -mavx2
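
For reference, the IR dumps below can be reproduced by emitting LLVM IR with the standard -S -emit-llvm flags (assuming the source above is saved as loop.c):

clang -fno-unroll-loops -O3 -mavx2 -S -emit-llvm loop.c -o loop.ll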

In the entry block there is a runtime check for whether the two pointer ranges overlap:

entry:
  %scevgep = getelementptr i32, i32* %q, i64 256
  %scevgep23 = getelementptr i32, i32* %p, i64 256
  %bound0 = icmp ugt i32* %scevgep23, %q
  %bound1 = icmp ugt i32* %scevgep, %p
  %found.conflict = and i1 %bound0, %bound1
  br i1 %found.conflict, label %for.body, label %vector.body

Then, in the loop body, we know the accesses do not alias.
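
At the source level, the entry-block check computes roughly the following (a sketch; the variable names are illustrative, with found_conflict mirroring %found.conflict above):

/* The 256-element ranges conflict iff each starts below the other's end. */
int *p_end = p + 256;   /* %scevgep23 */
int *q_end = q + 256;   /* %scevgep   */
int found_conflict = (p_end > q) && (q_end > p);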

vector.body:                                      ; preds = %entry, %vector.body
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %entry ]
  %vec.phi = phi <8 x i32> [ %7, %vector.body ], [ zeroinitializer, %entry ]
  %0 = getelementptr inbounds i32, i32* %p, i64 %index
  %1 = bitcast i32* %0 to <8 x i32>*
  %wide.load = load <8 x i32>, <8 x i32>* %1, align 4, !tbaa !2, !alias.scope !6
  %2 = shl nsw <8 x i32> %wide.load, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %3 = getelementptr inbounds i32, i32* %q, i64 %index
  %4 = bitcast i32* %3 to <8 x i32>*
  store <8 x i32> %2, <8 x i32>* %4, align 4, !tbaa !2, !alias.scope !9, !noalias !6
  %5 = add nsw <8 x i32> %wide.load, %vec.phi
  %6 = bitcast i32* %0 to <8 x i32>*
  %wide.load25 = load <8 x i32>, <8 x i32>* %6, align 4, !tbaa !2, !alias.scope !6
  ...

But we still reload p[i] as '%wide.load25'.
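
In source terms, the expected transformation folds the second load of p[i] into the first, roughly like this (a sketch of the loop body after redundant-load elimination):

int k = p[i];   /* single wide load per vector iteration */
q[i] = 2*k;
s += k;
s += k;         /* reuses k instead of reloading p[i] */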

Expected IR:

vector.body:                                      ; preds = %entry, %vector.body
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %entry ]
  %vec.phi = phi <8 x i32> [ %5, %vector.body ], [ zeroinitializer, %entry ]
  %0 = getelementptr inbounds i32, i32* %p, i64 %index
  %1 = bitcast i32* %0 to <8 x i32>*
  %wide.load = load <8 x i32>, <8 x i32>* %1, align 4, !tbaa !2, !alias.scope !6
  %2 = shl <8 x i32> %wide.load, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %3 = getelementptr inbounds i32, i32* %q, i64 %index
  %4 = bitcast i32* %3 to <8 x i32>*
  store <8 x i32> %2, <8 x i32>* %4, align 4, !tbaa !2, !alias.scope !9, !noalias !6
  ...

davidbolvansky commented 5 years ago

With a plain Clang -O3 (no -fno-unroll-loops, so the vector loop is interleaved), the problem is still there; see '%wide.load28'.

vector.body:                                      ; preds = %entry, %vector.body
  %index = phi i64 [ %index.next.1, %vector.body ], [ 0, %entry ]
  %vec.phi = phi <4 x i32> [ %30, %vector.body ], [ zeroinitializer, %entry ]
  %vec.phi26 = phi <4 x i32> [ %31, %vector.body ], [ zeroinitializer, %entry ]
  %0 = getelementptr inbounds i32, i32* %p, i64 %index
  %1 = bitcast i32* %0 to <4 x i32>*
  %wide.load = load <4 x i32>, <4 x i32>* %1, align 4, !tbaa !2, !alias.scope !6
  %2 = getelementptr inbounds i32, i32* %0, i64 4
  %3 = bitcast i32* %2 to <4 x i32>*
  %wide.load27 = load <4 x i32>, <4 x i32>* %3, align 4, !tbaa !2, !alias.scope !6
  %4 = shl nsw <4 x i32> %wide.load, <i32 1, i32 1, i32 1, i32 1>
  %5 = shl nsw <4 x i32> %wide.load27, <i32 1, i32 1, i32 1, i32 1>
  %6 = getelementptr inbounds i32, i32* %q, i64 %index
  %7 = bitcast i32* %6 to <4 x i32>*
  store <4 x i32> %4, <4 x i32>* %7, align 4, !tbaa !2, !alias.scope !9, !noalias !6
  %8 = getelementptr inbounds i32, i32* %6, i64 4
  %9 = bitcast i32* %8 to <4 x i32>*
  store <4 x i32> %5, <4 x i32>* %9, align 4, !tbaa !2, !alias.scope !9, !noalias !6
  %10 = add nsw <4 x i32> %wide.load, %vec.phi
  %11 = add nsw <4 x i32> %wide.load27, %vec.phi26
  %12 = bitcast i32* %0 to <4 x i32>*
  %wide.load28 = load <4 x i32>, <4 x i32>* %12, align 4, !tbaa !2, !alias.scope !6

davidbolvansky commented 5 years ago

Strangely, with '#pragma clang loop unroll(disable)' the loads look optimal.

vector.body:                                      ; preds = %entry, %vector.body
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %entry ]
  %vec.phi = phi <8 x i32> [ %5, %vector.body ], [ zeroinitializer, %entry ]
  %0 = getelementptr inbounds i32, i32* %p, i64 %index
  %1 = bitcast i32* %0 to <8 x i32>*
  %wide.load = load <8 x i32>, <8 x i32>* %1, align 4, !tbaa !2, !alias.scope !6
  %2 = shl <8 x i32> %wide.load, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %3 = getelementptr inbounds i32, i32* %q, i64 %index
  %4 = bitcast i32* %3 to <8 x i32>*
  store <8 x i32> %2, <8 x i32>* %4, align 4, !tbaa !2, !alias.scope !9, !noalias !6
  %5 = add <8 x i32> %2, %vec.phi
  %index.next = add i64 %index, 8
  %6 = icmp eq i64 %index.next, 256
  br i1 %6, label %middle.block, label %vector.body, !llvm.loop !11
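
For reference, the pragma was applied directly to the loop, roughly like this (a sketch of the modified reproducer):

int foo(int *p, int *q) {
    int s = 0;
#pragma clang loop unroll(disable)   /* the pragma mentioned above */
    for (int i = 0; i < 256; ++i) {
        int k = p[i];
        q[i] = 2*k;
        s += k;
        int o = p[i];
        s += o;
    }
    return s;
}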

fhahn commented 2 years ago

It looks like InstCombine removes the extra load in the -fno-unroll-loops case, but not when the vector loop is interleaved.
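
A quick way to check that is to capture the IR after the vectorizer and feed it to InstCombine in isolation, e.g. (the file name is illustrative):

opt -passes=instcombine -S vectorized.ll -o -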