Open davidbolvansky opened 4 years ago
mentioned in issue llvm/llvm-bugzilla-archive#50256
Looks like another variation of bug 30787: when 'i' is 0 on the first iteration of the loop, we lose the add op (`in[0] + 0` folds away), so SLP only recognizes the later ops for vectorization:
define dso_local void @_Z5main1j(i32 %0) local_unnamed_addr #0 {
  %2 = add i32 %0, 1
  %3 = add i32 %0, 2
  %4 = load <4 x i32>, <4 x i32>* bitcast ([16 x i32]* @in to <4 x i32>*), align 16, !tbaa !2
  %5 = add i32 %0, 3
  %6 = insertelement <4 x i32> undef, i32 %0, i32 0
  %7 = insertelement <4 x i32> %6, i32 %2, i32 1
  %8 = insertelement <4 x i32> %7, i32 %3, i32 2
  %9 = insertelement <4 x i32> %8, i32 %5, i32 3
  %10 = add <4 x i32> %4, %9
  store <4 x i32> %10, <4 x i32>* bitcast ([16 x i32]* @out to <4 x i32>*), align 16, !tbaa !2
  %11 = load <4 x i32>, <4 x i32>* bitcast (i32* getelementptr inbounds ([16 x i32], [16 x i32]* @in, i64 0, i64 4) to <4 x i32>*), align 16, !tbaa !2
  %12 = shufflevector <4 x i32> %6, <4 x i32> undef, <4 x i32> zeroinitializer
  %13 = add <4 x i32> %12, <i32 4, i32 5, i32 6, i32 7>
  %14 = add <4 x i32> %13, %11
  store <4 x i32> %14, <4 x i32>* bitcast (i32* getelementptr inbounds ([16 x i32], [16 x i32]* @out, i64 0, i64 4) to <4 x i32>*), align 16, !tbaa !2
  %15 = load <4 x i32>, <4 x i32>* bitcast (i32* getelementptr inbounds ([16 x i32], [16 x i32]* @in, i64 0, i64 8) to <4 x i32>*), align 16, !tbaa !2
  %16 = add <4 x i32> %12, <i32 8, i32 9, i32 10, i32 11>
  %17 = add <4 x i32> %16, %15
  store <4 x i32> %17, <4 x i32>* bitcast (i32* getelementptr inbounds ([16 x i32], [16 x i32]* @out, i64 0, i64 8) to <4 x i32>*), align 16, !tbaa !2
  %18 = load <4 x i32>, <4 x i32>* bitcast (i32* getelementptr inbounds ([16 x i32], [16 x i32]* @in, i64 0, i64 12) to <4 x i32>*), align 16, !tbaa !2
  %19 = add <4 x i32> %12, <i32 12, i32 13, i32 14, i32 15>
  %20 = add <4 x i32> %19, %18
  store <4 x i32> %20, <4 x i32>* bitcast (i32* getelementptr inbounds ([16 x i32], [16 x i32]* @out, i64 0, i64 12) to <4 x i32>*), align 16, !tbaa !2
  ret void
}
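To make the asymmetry in the IR above easier to see, here is a rough C model of the first group of four lanes. It is only a sketch: the function names (`main1_reference`, `first_group_folded`, `first_group_uniform`, `groups_agree`) and the test data are made up for illustration and are not part of the report or of LLVM. `first_group_folded` mirrors what the scalar code presumably looks like after unrolling and constant folding, where lane 0 has lost its `+ 0`; `first_group_uniform` mirrors the shape the later groups get (splat of `x`, plus a constant vector, plus the load).

```c
#define N 16

unsigned int in[N];

/* Reference: the loop from the report, writing into a caller-provided buffer. */
void main1_reference(unsigned int x, unsigned int *dst) {
    for (int i = 0; i < N; ++i)
        dst[i] = (in[i] + i) + x;
}

/* Hypothetical model of lanes 0..3 after unrolling and folding:
   `in[0] + 0` becomes `in[0]`, so lane 0 has one add while lanes 1..3 have two. */
void first_group_folded(unsigned int x, unsigned int *dst) {
    dst[0] = in[0] + x;
    dst[1] = (in[1] + 1u) + x;
    dst[2] = (in[2] + 2u) + x;
    dst[3] = (in[3] + 3u) + x;
}

/* Hypothetical model of the uniform shape seen for the later groups:
   every lane is (x + k) + in[k], i.e. splat(x) + <0,1,2,3> + load. */
void first_group_uniform(unsigned int x, unsigned int *dst) {
    for (int k = 0; k < 4; ++k)
        dst[k] = (x + (unsigned int)k) + in[k];
}

/* Returns 1 if both models match the reference loop on lanes 0..3, else 0.
   The input values are arbitrary test data. */
int groups_agree(unsigned int x) {
    unsigned int ref[N], folded[4], uniform[4];
    for (int i = 0; i < N; ++i)
        in[i] = 1000u * (unsigned int)i + 17u;
    main1_reference(x, ref);
    first_group_folded(x, folded);
    first_group_uniform(x, uniform);
    for (int k = 0; k < 4; ++k)
        if (ref[k] != folded[k] || ref[k] != uniform[k])
            return 0;
    return 1;
}
```

Both models compute the same values (unsigned arithmetic wraps, so reassociation is safe here); the difference is only in expression shape, which is what the SLP matcher sees.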
Extended Description
#define N 16

unsigned int out[N];
unsigned int in[N];

void main1 (unsigned int x) {
  for (int i = 0; i < N; ++i)
    out[i] = (in[i] + i) + x;
}
ICC: Dispatch Width: 6 uOps Per Cycle: 5.87 IPC: 4.70 Block RThroughput: 5.0
Clang: Dispatch Width: 6 uOps Per Cycle: 4.41 IPC: 3.06 Block RThroughput: 8.0
https://godbolt.org/z/vcEjEa