Open davidbolvansky opened 5 years ago
With typical Clang -O3, the problem should still be there, see '%wide.load28'.
vector.body:                                      ; preds = %entry, %vector.body
  %index = phi i64 [ %index.next.1, %vector.body ], [ 0, %entry ]
  %vec.phi = phi <4 x i32> [ %30, %vector.body ], [ zeroinitializer, %entry ]
  %vec.phi26 = phi <4 x i32> [ %31, %vector.body ], [ zeroinitializer, %entry ]
  %0 = getelementptr inbounds i32, i32* %p, i64 %index
  %1 = bitcast i32* %0 to <4 x i32>*
  %wide.load = load <4 x i32>, <4 x i32>* %1, align 4, !tbaa !2, !alias.scope !6
  %2 = getelementptr inbounds i32, i32* %0, i64 4
  %3 = bitcast i32* %2 to <4 x i32>*
  %wide.load27 = load <4 x i32>, <4 x i32>* %3, align 4, !tbaa !2, !alias.scope !6
  %4 = shl nsw <4 x i32> %wide.load, <i32 1, i32 1, i32 1, i32 1>
  %5 = shl nsw <4 x i32> %wide.load27, <i32 1, i32 1, i32 1, i32 1>
  %6 = getelementptr inbounds i32, i32* %q, i64 %index
  %7 = bitcast i32* %6 to <4 x i32>*
  store <4 x i32> %4, <4 x i32>* %7, align 4, !tbaa !2, !alias.scope !9, !noalias !6
  %8 = getelementptr inbounds i32, i32* %6, i64 4
  %9 = bitcast i32* %8 to <4 x i32>*
  store <4 x i32> %5, <4 x i32>* %9, align 4, !tbaa !2, !alias.scope !9, !noalias !6
  %10 = add nsw <4 x i32> %wide.load, %vec.phi
  %11 = add nsw <4 x i32> %wide.load27, %vec.phi26
  %12 = bitcast i32* %0 to <4 x i32>*
  %wide.load28 = load <4 x i32>, <4 x i32>* %12, align 4, !tbaa !2, !alias.scope !6
Strange, with '#pragma clang loop unroll(disable)' loads look optimal.
vector.body:                                      ; preds = %entry, %vector.body
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %entry ]
  %vec.phi = phi <8 x i32> [ %5, %vector.body ], [ zeroinitializer, %entry ]
  %0 = getelementptr inbounds i32, i32* %p, i64 %index
  %1 = bitcast i32* %0 to <8 x i32>*
  %wide.load = load <8 x i32>, <8 x i32>* %1, align 4, !tbaa !2, !alias.scope !6
  %2 = shl <8 x i32> %wide.load, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %3 = getelementptr inbounds i32, i32* %q, i64 %index
  %4 = bitcast i32* %3 to <8 x i32>*
  store <8 x i32> %2, <8 x i32>* %4, align 4, !tbaa !2, !alias.scope !9, !noalias !6
  %5 = add <8 x i32> %2, %vec.phi
  %index.next = add i64 %index, 8
  %6 = icmp eq i64 %index.next, 256
  br i1 %6, label %middle.block, label %vector.body, !llvm.loop !11
Looks like InstCombine removes the extra load in the -fno-unroll-loops case, but not when the vector loop is interleaved.
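For readers unfamiliar with the term, "interleaved" here refers to the vectorizer keeping several independent copies of the loop body per iteration, which is why the -O3 IR above carries two accumulators (%vec.phi and %vec.phi26). A rough scalar analogy, assuming an interleave factor of two, might look like the sketch below. The function name `sum_interleaved2` is made up for illustration and is not part of the report.

```c
#include <stddef.h>

/* Scalar analogy of interleaving by two (hypothetical helper, not from the
   report): s0 and s1 play the role of %vec.phi and %vec.phi26 and are only
   combined once the loop has finished. */
int sum_interleaved2(const int *p, size_t n) {
    int s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += p[i];        /* first interleaved copy of the body */
        s1 += p[i + 1];    /* second interleaved copy of the body */
    }
    if (i < n)             /* leftover element when n is odd */
        s0 += p[i];
    return s0 + s1;        /* partial sums are merged after the loop */
}
```

Each interleaved copy has its own load/store sequence, so a redundant load has to be recognized per copy.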
Extended Description
Consider loop:
Flags: clang -fno-unroll-loops -O3 loop.c -mavx2
In the entry block there is a runtime check that the two pointer ranges do not overlap:
entry:
  %scevgep = getelementptr i32, i32* %q, i64 256
  %scevgep23 = getelementptr i32, i32* %p, i64 256
  %bound0 = icmp ugt i32* %scevgep23, %q
  %bound1 = icmp ugt i32* %scevgep, %p
  %found.conflict = and i1 %bound0, %bound1
  br i1 %found.conflict, label %for.body, label %vector.body
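The entry-block check can be read as an interval-overlap test on [p, p+256) and [q, q+256). A minimal C sketch of the same logic follows. The function name `ranges_conflict` and the generic length `n` are assumptions for illustration; the IR hard-codes 256.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when [p, p+n) and [q, q+n) overlap, mirroring
   %bound0 / %bound1 / %found.conflict. Addresses are compared as integers
   because relational comparison of pointers into different objects is
   undefined in C. Hypothetical helper, not taken from the report. */
bool ranges_conflict(const int *p, const int *q, size_t n) {
    uintptr_t p_begin = (uintptr_t)p;
    uintptr_t p_end   = (uintptr_t)(p + n);   /* %scevgep23 */
    uintptr_t q_begin = (uintptr_t)q;
    uintptr_t q_end   = (uintptr_t)(q + n);   /* %scevgep */

    bool bound0 = p_end > q_begin;            /* icmp ugt %scevgep23, %q */
    bool bound1 = q_end > p_begin;            /* icmp ugt %scevgep, %p */
    return bound0 && bound1;                  /* %found.conflict */
}
```

When the result is false, control reaches %vector.body, so inside that block a store through q cannot change anything read through p.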
Then, in the vector loop body, we know there is no aliasing.
vector.body:                                      ; preds = %entry, %vector.body
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %entry ]
  %vec.phi = phi <8 x i32> [ %7, %vector.body ], [ zeroinitializer, %entry ]
  %0 = getelementptr inbounds i32, i32* %p, i64 %index
  %1 = bitcast i32* %0 to <8 x i32>*
  %wide.load = load <8 x i32>, <8 x i32>* %1, align 4, !tbaa !2, !alias.scope !6
  %2 = shl nsw <8 x i32> %wide.load, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %3 = getelementptr inbounds i32, i32* %q, i64 %index
  %4 = bitcast i32* %3 to <8 x i32>*
  store <8 x i32> %2, <8 x i32>* %4, align 4, !tbaa !2, !alias.scope !9, !noalias !6
  %5 = add nsw <8 x i32> %wide.load, %vec.phi
  %6 = bitcast i32* %0 to <8 x i32>*
  %wide.load25 = load <8 x i32>, <8 x i32>* %6, align 4, !tbaa !2, !alias.scope !6
  ...
But we still reload p[i] ('%wide.load25').
Expected IR:
vector.body:                                      ; preds = %entry, %vector.body
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %entry ]
  %vec.phi = phi <8 x i32> [ %5, %vector.body ], [ zeroinitializer, %entry ]
  %0 = getelementptr inbounds i32, i32* %p, i64 %index
  %1 = bitcast i32* %0 to <8 x i32>*
  %wide.load = load <8 x i32>, <8 x i32>* %1, align 4, !tbaa !2, !alias.scope !6
  %2 = shl <8 x i32> %wide.load, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %3 = getelementptr inbounds i32, i32* %q, i64 %index
  %4 = bitcast i32* %3 to <8 x i32>*
  store <8 x i32> %2, <8 x i32>* %4, align 4, !tbaa !2, !alias.scope !9, !noalias !6
  ...
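The difference between the actual and expected IR can be shown at the source level. The sketch below is a hypothetical reconstruction inferred from the IR and is not the loop from the report (which is not reproduced here). The names `sum_reload` and `sum_reuse` are invented. The first function reads p[i] again after the store to q[i]; the second loads once and reuses the value, which gives the same result whenever p and q do not overlap.

```c
/* Hypothetical reconstruction: p[i] is read again after the store to q[i],
   which corresponds to the extra %wide.load25. */
int sum_reload(int *p, int *q) {
    int s = 0;
    for (int i = 0; i < 256; i++) {
        s += p[i];
        q[i] = p[i] * 2;
        s += p[i];      /* reload: only needed if q[i] could alias p[i] */
    }
    return s;
}

/* Equivalent when p and q do not overlap: one load per element, and the
   doubled value feeds both the store and the sum (add %2, %vec.phi). */
int sum_reuse(int *p, int *q) {
    int s = 0;
    for (int i = 0; i < 256; i++) {
        int doubled = p[i] * 2;
        q[i] = doubled;
        s += doubled;
    }
    return s;
}
```

The two functions are interchangeable only under the no-overlap condition that the runtime check establishes before entering the vector loop.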