llvm / llvm-project


Missing optimisation: Type conversion not vectorised in simple additive reduction #31425

Open dc81c6b5-3a5b-438e-b826-9e7edb3cf487 opened 7 years ago

dc81c6b5-3a5b-438e-b826-9e7edb3cf487 commented 7 years ago
Bugzilla Link 32077
Version trunk
OS Linux
CC @lesshaste,@RKSimon,@rotateright

Extended Description

Consider:

    double f(double x[]) {
        float p = 1.0;
        for (int i = 0; i < 16; i++)
            p += x[i];
        return p;
    }

clang/llvm with -O3 -march=core-avx2 -ffast-math gives:

    .LCPI0_0:
            .quad 4607182418800017408 # double 1
    f: # @f
            vmovsd xmm0, qword ptr [rdi] # xmm0 = mem[0],zero
            vaddsd xmm0, xmm0, qword ptr [rip + .LCPI0_0]
            vcvtsd2ss xmm0, xmm0, xmm0
            vcvtss2sd xmm0, xmm0, xmm0
            vaddsd xmm0, xmm0, qword ptr [rdi + 8]
            vcvtsd2ss xmm0, xmm0, xmm0
            vcvtss2sd xmm0, xmm0, xmm0
            vaddsd xmm0, xmm0, qword ptr [rdi + 16]
            vcvtsd2ss xmm0, xmm0, xmm0
            vcvtss2sd xmm0, xmm0, xmm0
            vaddsd xmm0, xmm0, qword ptr [rdi + 24]
            vcvtsd2ss xmm0, xmm0, xmm0
            vcvtss2sd xmm0, xmm0, xmm0
            vaddsd xmm0, xmm0, qword ptr [rdi + 32]
            vcvtsd2ss xmm0, xmm0, xmm0
            vcvtss2sd xmm0, xmm0, xmm0
            vaddsd xmm0, xmm0, qword ptr [rdi + 40]
            vcvtsd2ss xmm0, xmm0, xmm0
            vcvtss2sd xmm0, xmm0, xmm0
            vaddsd xmm0, xmm0, qword ptr [rdi + 48]
            vcvtsd2ss xmm0, xmm0, xmm0
            vcvtss2sd xmm0, xmm0, xmm0
            vaddsd xmm0, xmm0, qword ptr [rdi + 56]
            vcvtsd2ss xmm0, xmm0, xmm0
            vcvtss2sd xmm0, xmm0, xmm0
            vaddsd xmm0, xmm0, qword ptr [rdi + 64]
            vcvtsd2ss xmm0, xmm0, xmm0
            vcvtss2sd xmm0, xmm0, xmm0
            vaddsd xmm0, xmm0, qword ptr [rdi + 72]
            vcvtsd2ss xmm0, xmm0, xmm0
            vcvtss2sd xmm0, xmm0, xmm0
            vaddsd xmm0, xmm0, qword ptr [rdi + 80]
            vcvtsd2ss xmm0, xmm0, xmm0
            vcvtss2sd xmm0, xmm0, xmm0
            vaddsd xmm0, xmm0, qword ptr [rdi + 88]
            vcvtsd2ss xmm0, xmm0, xmm0
            vcvtss2sd xmm0, xmm0, xmm0
            vaddsd xmm0, xmm0, qword ptr [rdi + 96]
            vcvtsd2ss xmm0, xmm0, xmm0
            vcvtss2sd xmm0, xmm0, xmm0
            vaddsd xmm0, xmm0, qword ptr [rdi + 104]
            vcvtsd2ss xmm0, xmm0, xmm0
            vcvtss2sd xmm0, xmm0, xmm0
            vaddsd xmm0, xmm0, qword ptr [rdi + 112]
            vcvtsd2ss xmm0, xmm0, xmm0
            vcvtss2sd xmm0, xmm0, xmm0
            vaddsd xmm0, xmm0, qword ptr [rdi + 120]
            vcvtsd2ss xmm0, xmm0, xmm0
            vcvtss2sd xmm0, xmm0, xmm0
            ret

However, the following would be more efficient:

    f:
            vcvtpd2ps xmm0, YMMWORD PTR [rdi] #4.5
            vcvtpd2ps xmm1, YMMWORD PTR [32+rdi] #4.5
            vcvtpd2ps xmm2, YMMWORD PTR [64+rdi] #4.5
            vcvtpd2ps xmm3, YMMWORD PTR [96+rdi] #4.5
            vaddps xmm4, xmm0, xmm1 #2.11
            vaddps xmm5, xmm2, xmm3 #2.11
            vaddps xmm6, xmm4, xmm5 #2.11
            vmovhlps xmm7, xmm6, xmm6 #2.11
            vaddps xmm8, xmm6, xmm7 #2.11
            vshufps xmm9, xmm8, xmm8, 245 #2.11
            vaddss xmm10, xmm8, xmm9 #2.11
            vaddss xmm0, xmm10, DWORD PTR .L_2il0floatpacket.0[rip] #2.11
            vcvtss2sd xmm0, xmm0, xmm0 #5.10
            vzeroupper #5.10
            ret #5.10
    .L_2il0floatpacket.0:
            .long 0x3f800000
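To make the difference between the two listings easier to follow, here is a plain C sketch of what each one computes. The names `f_scalar` and `f_lanes` are hypothetical and not part of the report. `f_scalar` spells out the implicit conversions in `p += x[i]`, and `f_lanes` follows the add order of the suggested assembly as I read it. The two forms round differently in general, so they are only interchangeable under `-ffast-math`.

```c
/* Scalar form from the report, with the implicit conversions written out.
 * `p += x[i]` widens p to double, adds in double, then rounds back to float,
 * which matches the vaddsd / vcvtsd2ss / vcvtss2sd triples in the clang output. */
double f_scalar(const double x[16]) {
    float p = 1.0f;
    for (int i = 0; i < 16; i++)
        p = (float)((double)p + x[i]);
    return p;
}

/* Hypothetical lane-wise form mirroring the suggested assembly. */
double f_lanes(const double x[16]) {
    float v[4][4];                      /* four vcvtpd2ps: 4 doubles -> 4 floats each */
    for (int k = 0; k < 4; k++)
        for (int j = 0; j < 4; j++)
            v[k][j] = (float)x[4 * k + j];

    float s[4];                         /* three vaddps, lane by lane */
    for (int j = 0; j < 4; j++)
        s[j] = (v[0][j] + v[1][j]) + (v[2][j] + v[3][j]);

    float lo = s[0] + s[2];             /* vmovhlps + vaddps, lane 0 */
    float hi = s[1] + s[3];             /* vmovhlps + vaddps, lane 1 */
    float sum = lo + hi;                /* vshufps + vaddss */
    return (double)(sum + 1.0f);        /* add the initial 1.0, then vcvtss2sd */
}
```

With small integer inputs every intermediate value is exactly representable in float, so both functions return the same result. With arbitrary inputs the results can differ in the last bits because the additions are reassociated and each element is narrowed before it is added.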

RKSimon commented 5 years ago

Current Codegen: https://gcc.godbolt.org/z/wPOaz_

rotateright commented 7 years ago

As written (low loop count), this seems like a missing reduction pattern for the SLP vectorizer.

If we bump up the loop counter, the loop vectorizer says:

    $ ./clang -O2 -ffast-math 32077.c -mavx2 -S -Rpass-analysis=loop-vectorize
    32077.c:3:3: remark: loop not vectorized: value that could not be identified as reduction is used outside the loop [-Rpass-analysis=loop-vectorize]
      for (int i = 0; i < 1600; i++)
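For comparison, a source-level variant that narrows each element before the add keeps the whole accumulation chain in `float`, with no widening and rounding of `p` inside the loop. This is only a sketch: the function name `f_float_only` and the parameter `n` are hypothetical, and nothing in this thread confirms that the loop vectorizer accepts this form for any particular clang version.

```c
/* Hypothetical variant: the accumulator never leaves float inside the loop. */
double f_float_only(const double x[], int n) {
    float p = 1.0f;
    for (int i = 0; i < n; i++)
        p += (float)x[i];   /* narrow the element first, so the add is float + float */
    return p;
}
```

Note that this is not numerically identical to the original loop without `-ffast-math`, because each `x[i]` is rounded to `float` before the addition rather than after it.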