llvm / llvm-project


Suboptimal partial vectorization #46898

Open davidbolvansky opened 4 years ago

davidbolvansky commented 4 years ago
Bugzilla Link 47554
Version trunk
OS Linux
CC @fhahn,@RKSimon,@rotateright

Extended Description

#define N 16

unsigned int out[N];
unsigned int in[N];

void main1 (unsigned int x) {
  for (int i = 0; i < N; ++i)
    out[i] = (in[i] + i) + x;
}

ICC: Dispatch Width: 6 uOps Per Cycle: 5.87 IPC: 4.70 Block RThroughput: 5.0

Clang: Dispatch Width: 6 uOps Per Cycle: 4.41 IPC: 3.06 Block RThroughput: 8.0

https://godbolt.org/z/vcEjEa

davidbolvansky commented 3 years ago

mentioned in issue llvm/llvm-bugzilla-archive#50256

rotateright commented 4 years ago

Looks like another variation of bug 30787 - when 'i' is 0 on the first iteration of the loop, we lose the add op, so SLP only recognizes the later ops for vectorization:

define dso_local void @_Z5main1j(i32 %0) local_unnamed_addr #0 {
  %2 = add i32 %0, 1
  %3 = add i32 %0, 2
  %4 = load <4 x i32>, <4 x i32>* bitcast ([16 x i32]* @in to <4 x i32>*), align 16, !tbaa !2
  %5 = add i32 %0, 3
  %6 = insertelement <4 x i32> undef, i32 %0, i32 0
  %7 = insertelement <4 x i32> %6, i32 %2, i32 1
  %8 = insertelement <4 x i32> %7, i32 %3, i32 2
  %9 = insertelement <4 x i32> %8, i32 %5, i32 3
  %10 = add <4 x i32> %4, %9
  store <4 x i32> %10, <4 x i32>* bitcast ([16 x i32]* @out to <4 x i32>*), align 16, !tbaa !2
  %11 = load <4 x i32>, <4 x i32>* bitcast (i32* getelementptr inbounds ([16 x i32], [16 x i32]* @in, i64 0, i64 4) to <4 x i32>*), align 16, !tbaa !2
  %12 = shufflevector <4 x i32> %6, <4 x i32> undef, <4 x i32> zeroinitializer
  %13 = add <4 x i32> %12, <i32 4, i32 5, i32 6, i32 7>
  %14 = add <4 x i32> %13, %11
  store <4 x i32> %14, <4 x i32>* bitcast (i32* getelementptr inbounds ([16 x i32], [16 x i32]* @out, i64 0, i64 4) to <4 x i32>*), align 16, !tbaa !2
  %15 = load <4 x i32>, <4 x i32>* bitcast (i32* getelementptr inbounds ([16 x i32], [16 x i32]* @in, i64 0, i64 8) to <4 x i32>*), align 16, !tbaa !2
  %16 = add <4 x i32> %12, <i32 8, i32 9, i32 10, i32 11>
  %17 = add <4 x i32> %16, %15
  store <4 x i32> %17, <4 x i32>* bitcast (i32* getelementptr inbounds ([16 x i32], [16 x i32]* @out, i64 0, i64 8) to <4 x i32>*), align 16, !tbaa !2
  %18 = load <4 x i32>, <4 x i32>* bitcast (i32* getelementptr inbounds ([16 x i32], [16 x i32]* @in, i64 0, i64 12) to <4 x i32>*), align 16, !tbaa !2
  %19 = add <4 x i32> %12, <i32 12, i32 13, i32 14, i32 15>
  %20 = add <4 x i32> %19, %18
  store <4 x i32> %20, <4 x i32>* bitcast (i32* getelementptr inbounds ([16 x i32], [16 x i32]* @out, i64 0, i64 12) to <4 x i32>*), align 16, !tbaa !2
  ret void
}
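To make the asymmetry in the IR easier to see, here is a rough C sketch (not taken from the report). It compares the original loop against a hand-written version in which every group of four lanes, including the first, is computed as splat(x) + {base, ..., base+3} + in[base..base+3]. That is the shape lanes 4 through 15 already have in the IR. The names `main1_reference`, `main1_uniform_groups`, `outputs_match`, `out_ref` and `out_grp` are made up for illustration, and the inner scalar loop only stands in for a 4-wide vector add. It says nothing about how SLP should be changed.

```c
#include <string.h>

#define N 16

unsigned int in[N];
unsigned int out_ref[N];  /* results of the original loop */
unsigned int out_grp[N];  /* results of the uniform four-lane form */

/* The loop from the report, unchanged apart from the output array name. */
static void main1_reference(unsigned int x) {
    for (int i = 0; i < N; ++i)
        out_ref[i] = (in[i] + i) + x;
}

/* Hypothetical hand-written form: all four groups use the same pattern,
   so the first group adds the offsets {0, 1, 2, 3} instead of dropping
   the "+ 0" in lane 0. Unsigned arithmetic wraps, so reassociating the
   adds does not change the result. */
static void main1_uniform_groups(unsigned int x) {
    for (int base = 0; base < N; base += 4) {
        unsigned int splat[4] = { x, x, x, x };
        unsigned int offs[4] = {
            (unsigned int)base,     (unsigned int)base + 1u,
            (unsigned int)base + 2u, (unsigned int)base + 3u
        };
        for (int lane = 0; lane < 4; ++lane)
            out_grp[base + lane] = (splat[lane] + offs[lane]) + in[base + lane];
    }
}

/* Returns 1 if both forms produce identical output for the given x. */
static int outputs_match(unsigned int x) {
    for (int i = 0; i < N; ++i)
        in[i] = 100u * (unsigned int)i + 3u;  /* arbitrary test data */
    main1_reference(x);
    main1_uniform_groups(x);
    return memcmp(out_ref, out_grp, sizeof out_ref) == 0;
}
```

Calling `outputs_match` with a few values of `x` checks only that the two forms agree. It does not measure whether the uniform form would actually be faster.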