Open dc81c6b5-3a5b-438e-b826-9e7edb3cf487 opened 7 years ago
Current Codegen: https://gcc.godbolt.org/z/wPOaz_
As written (low loop count), this seems like a missing reduction pattern for the SLP vectorizer.
If we bump up the loop counter, the loop vectorizer says:
    $ ./clang -O2 -ffast-math 32077.c -mavx2 -S -Rpass-analysis=loop-vectorize
    32077.c:3:3: remark: loop not vectorized: value that could not be identified as reduction is used outside the loop [-Rpass-analysis=loop-vectorize]
      for (int i = 0; i < 1600; i++)
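The contents of 32077.c are not shown in this report. A plausible reconstruction, assuming it is the function from the extended description with only the trip count changed, might look like the sketch below. The name `f_1600` and the exact layout are assumptions made for illustration. Because the accumulator is a `float` and the elements are `double`, C evaluates each `p += x[i]` as `p = (float)((double)p + x[i])`, so the loop-carried value passes through an extend and a truncate on every iteration. That may be why the value is not recognized as a reduction.

```c
/* Hypothetical reconstruction of 32077.c; only the trip count differs
   from the function in the extended description. In the original file
   the loop would sit at 3:3, the location named in the remark. */
double f_1600(double x[]) {
  float p = 1.0;
  for (int i = 0; i < 1600; i++)
    p += x[i]; /* implicit: p = (float)((double)p + x[i]) */
  return p;
}
```

Compiling a file like this with the command line shown above should be enough to check whether the remark still appears.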
Extended Description
Consider:
    double f(double x[]) {
      float p = 1.0;
      for (int i = 0; i < 16; i++)
        p += x[i];
      return p;
    }
clang/llvm with -O3 -march=core-avx2 -ffast-math gives:
    .LCPI0_0:
            .quad   4607182418800017408     # double 1
    f:                                      # @f
            vmovsd  xmm0, qword ptr [rdi]   # xmm0 = mem[0],zero
            vaddsd  xmm0, xmm0, qword ptr [rip + .LCPI0_0]
            vcvtsd2ss       xmm0, xmm0, xmm0
            vcvtss2sd       xmm0, xmm0, xmm0
            vaddsd  xmm0, xmm0, qword ptr [rdi + 8]
            vcvtsd2ss       xmm0, xmm0, xmm0
            vcvtss2sd       xmm0, xmm0, xmm0
            vaddsd  xmm0, xmm0, qword ptr [rdi + 16]
            vcvtsd2ss       xmm0, xmm0, xmm0
            vcvtss2sd       xmm0, xmm0, xmm0
            vaddsd  xmm0, xmm0, qword ptr [rdi + 24]
            vcvtsd2ss       xmm0, xmm0, xmm0
            vcvtss2sd       xmm0, xmm0, xmm0
            vaddsd  xmm0, xmm0, qword ptr [rdi + 32]
            vcvtsd2ss       xmm0, xmm0, xmm0
            vcvtss2sd       xmm0, xmm0, xmm0
            vaddsd  xmm0, xmm0, qword ptr [rdi + 40]
            vcvtsd2ss       xmm0, xmm0, xmm0
            vcvtss2sd       xmm0, xmm0, xmm0
            vaddsd  xmm0, xmm0, qword ptr [rdi + 48]
            vcvtsd2ss       xmm0, xmm0, xmm0
            vcvtss2sd       xmm0, xmm0, xmm0
            vaddsd  xmm0, xmm0, qword ptr [rdi + 56]
            vcvtsd2ss       xmm0, xmm0, xmm0
            vcvtss2sd       xmm0, xmm0, xmm0
            vaddsd  xmm0, xmm0, qword ptr [rdi + 64]
            vcvtsd2ss       xmm0, xmm0, xmm0
            vcvtss2sd       xmm0, xmm0, xmm0
            vaddsd  xmm0, xmm0, qword ptr [rdi + 72]
            vcvtsd2ss       xmm0, xmm0, xmm0
            vcvtss2sd       xmm0, xmm0, xmm0
            vaddsd  xmm0, xmm0, qword ptr [rdi + 80]
            vcvtsd2ss       xmm0, xmm0, xmm0
            vcvtss2sd       xmm0, xmm0, xmm0
            vaddsd  xmm0, xmm0, qword ptr [rdi + 88]
            vcvtsd2ss       xmm0, xmm0, xmm0
            vcvtss2sd       xmm0, xmm0, xmm0
            vaddsd  xmm0, xmm0, qword ptr [rdi + 96]
            vcvtsd2ss       xmm0, xmm0, xmm0
            vcvtss2sd       xmm0, xmm0, xmm0
            vaddsd  xmm0, xmm0, qword ptr [rdi + 104]
            vcvtsd2ss       xmm0, xmm0, xmm0
            vcvtss2sd       xmm0, xmm0, xmm0
            vaddsd  xmm0, xmm0, qword ptr [rdi + 112]
            vcvtsd2ss       xmm0, xmm0, xmm0
            vcvtss2sd       xmm0, xmm0, xmm0
            vaddsd  xmm0, xmm0, qword ptr [rdi + 120]
            vcvtsd2ss       xmm0, xmm0, xmm0
            vcvtss2sd       xmm0, xmm0, xmm0
            ret
However, the following would be more efficient:
    f:
            vcvtpd2ps xmm0, YMMWORD PTR [rdi]                       #4.5
            vcvtpd2ps xmm1, YMMWORD PTR [32+rdi]                    #4.5
            vcvtpd2ps xmm2, YMMWORD PTR [64+rdi]                    #4.5
            vcvtpd2ps xmm3, YMMWORD PTR [96+rdi]                    #4.5
            vaddps    xmm4, xmm0, xmm1                              #2.11
            vaddps    xmm5, xmm2, xmm3                              #2.11
            vaddps    xmm6, xmm4, xmm5                              #2.11
            vmovhlps  xmm7, xmm6, xmm6                              #2.11
            vaddps    xmm8, xmm6, xmm7                              #2.11
            vshufps   xmm9, xmm8, xmm8, 245                         #2.11
            vaddss    xmm10, xmm8, xmm9                             #2.11
            vaddss    xmm0, xmm10, DWORD PTR .L_2il0floatpacket.0[rip] #2.11
            vcvtss2sd xmm0, xmm0, xmm0                              #5.10
            vzeroupper                                              #5.10
            ret                                                     #5.10
    .L_2il0floatpacket.0:
            .long     0x3f800000
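To make the difference between the two listings easier to follow, here is a rough scalar C sketch of what each one computes. The names `f_scalar` and `f_reassoc` are hypothetical and exist only for this illustration. The second function narrows every element to `float` before adding and reassociates the sum, so it is only a legal rewrite of the first under `-ffast-math` and can round differently in general.

```c
/* What the current codegen does: widen the float accumulator, add one
   double element, narrow back to float, 16 times in sequence. */
double f_scalar(const double x[16]) {
  float p = 1.0f;
  for (int i = 0; i < 16; i++)
    p = (float)((double)p + x[i]); /* vaddsd, then vcvtsd2ss + vcvtss2sd */
  return (double)p;
}

/* Shape of the suggested codegen, written out lane by lane. */
double f_reassoc(const double x[16]) {
  float v[4][4];
  for (int g = 0; g < 4; g++)
    for (int j = 0; j < 4; j++)
      v[g][j] = (float)x[4 * g + j]; /* four vcvtpd2ps, 4 doubles each */

  float s[4];
  for (int j = 0; j < 4; j++)
    s[j] = (v[0][j] + v[1][j]) + (v[2][j] + v[3][j]); /* three vaddps */

  float lo = s[0] + s[2];      /* vmovhlps + vaddps, lane 0 */
  float hi = s[1] + s[3];      /* vmovhlps + vaddps, lane 1 */
  float sum = lo + hi;         /* vshufps 245 + vaddss */
  return (double)(sum + 1.0f); /* vaddss with 1.0f, then vcvtss2sd */
}
```

For inputs whose partial sums are all exactly representable in `float` (for example `x[i] = i`), both functions return the same value; for other inputs the results can differ in the low bits, which is the trade-off `-ffast-math` permits.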