davidbolvansky opened 6 years ago
Another interesting case is when we change the initializer to `double m = NAN;`.
> So in this case, is GCC wrong, since it ignores NaN?

It doesn't produce `minsd` in the example that Gordon posted (even if you add -ffast-math to that compilation): https://godbolt.org/g/Ab67Lc
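To make the `m = NAN` case concrete, here is a minimal standalone sketch (my own illustration, not code from the report) of how the source semantics and a raw minsd diverge:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double m = NAN;
    double arr[] = { 3.0, 1.0, 2.0 };

    for (int i = 0; i < 3; ++i) {
        double a = arr[i];
        /* Source semantics: "a <= NaN" is always false, so m is never
           updated and stays NaN. */
        if (a <= m)
            m = a;
    }
    /* A raw minsd accumulation would instead return its second source
       operand whenever a NaN is involved, silently dropping the NaN. */
    printf("%f\n", m); /* prints "nan" under strict FP semantics */
    return 0;
}
```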
"Select" issue.. I can help and test your patches if any, even on SPEC2006.
So in this case, is GCC wrong, since it ignores NaN?
> We should also check whether it is possible to safely move some optimizations out of "-ffast-math".
> This case seems like the right candidate because it helps quite a lot (as we can see from the time measurements), and there is nothing that could break after this change, or am I wrong?
> Experts in this area are needed here :)
You're wrong. :)
-ffast-math (or more accurately -ffinite-math-only...although I'm not sure if that will just uncover other bugs) changes the semantics of the compare:
```llvm
%12 = fcmp fast ugt double %11, %9
%13 = select i1 %12, double %9, double %11
```
This is "minsd" only if we can ignore NAN values. By default, you can't, so we're forced to produce the 'blendv' (or its pre-AVX equivalent) when we're operating with strict FP semantics. See the Intel manual for exactly what is possible with x86 instructions.
I think you'd be better off investigating whether we should limit the select creation in SimplifyCFG (or reverse it in some later pass). That's the bigger issue.
llvm-mca results:

1. code compiled by gcc: https://pastebin.com/raw/VebrMbCF
2. by clang with -ffast-math: https://pastebin.com/raw/u7FSndj9
3. by clang without fast-math: https://pastebin.com/raw/kPm3R6fm

Almost 2x more cycles when comparing (1) with (3).
With -ffast-math:
```
xbolva00@xbolva00:~/opt$ time ./a.out 5

real	0m4,258s
user	0m4,251s
sys	0m0,000s
xbolva00@xbolva00:~/opt$ time ./a.out 5

real	0m4,393s
user	0m4,384s
sys	0m0,004s
```
Still, we are missing the branchy version here, which helps a lot.
We should also check whether it is possible to safely move some optimizations out of "-ffast-math".
This case seems like the right candidate because it helps quite a lot (as we can see from the time measurements), and there is nothing that could break after this change, or am I wrong?
Experts in this area are needed here :)
Note that if you enable -ffast-math the clang code becomes close to the GCC output.
I suspect they may be enabling it at -O3 in GCC?
Just another angle to look at.
Yes, SimplifyCFG looks like the right place to handle this branch + phi case.
> When you have 2 or more selects that use the same condition value, should we convert them into branch + phi?
You might also invert that question: which pass thought it was a good idea to form 2 selects from the branch + phis?
Most likely answer: SimplifyCFG.
Tested on:
```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               60
Model name:          Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz
Stepping:            3
CPU MHz:             1404.472
CPU max MHz:         3600,0000
CPU min MHz:         800,0000
BogoMIPS:            5188.11
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            6144K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti retpoline spec_ctrl tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts
```
That looks compelling so far, but you must specify exactly which CPU you are testing on too.
For your test case, there's a more general possibility:
```llvm
%12 = fcmp ugt double %11, %9
%13 = select i1 %12, double %9, double %11
%14 = select i1 %12, i32 %8, i32 0
```
When you have 2 or more selects that use the same condition value, should we convert them into branch + phi?
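In C terms, the two shapes under discussion look roughly like this (a sketch; the names reuse the variables from the test program later in the thread):

```c
/* Select form (what the IR above encodes): both results are computed
   branchlessly from one shared condition. */
void select_form(double a, double *m, int *flag) {
    int cond = (a <= *m);
    *m    = cond ? a : *m;
    *flag = cond ? 0 : *flag;
}

/* Branch + phi form: one (often well-predicted) branch guards both
   updates. */
void branch_form(double a, double *m, int *flag) {
    if (a <= *m) {
        *m = a;
        *flag = 0;
    }
}
```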
And the results for the `if (a < m)` case...
```
xbolva00@xbolva00:~/opt$ clang -O3 i.c
xbolva00@xbolva00:~/opt$ time ./a.out 5

real	0m4,576s
user	0m4,576s
sys	0m0,000s
xbolva00@xbolva00:~/opt$ gcc -O3 i.c
xbolva00@xbolva00:~/opt$ time ./a.out 5

real	0m2,442s
user	0m2,442s
sys	0m0,000s
```
With AVX:
```
xbolva00@xbolva00:~/opt$ clang -Xclang -disable-O0-optnone -S -emit-llvm i.c
xbolva00@xbolva00:~/opt$ opt -O3 -mattr=avx i.ll -S > opt.ll
xbolva00@xbolva00:~/opt$ llc -O3 -mattr=avx opt.ll
xbolva00@xbolva00:~/opt$ clang opt.s -o dr
xbolva00@xbolva00:~/opt$ time ./dr 5

real	0m6,576s
user	0m6,572s
sys	0m0,004s
```
OK, so here is a simple test program...
```c
#include <math.h>
#include <stdlib.h>
#include <time.h>

double test(double *arr, int size, double p) {
    double m = INFINITY;
    int flag = 1;
    int i;
    double a;

    /* Track the minimum; flag records whether any element matched. */
    for (i = 0; i < size; ++i) {
        a = arr[i];
        if (a <= m) {
            m = a;
            flag = 0;
        }
    }
    if (flag) {
        return 0.0;
    }
    return p + m;
}

int main(int argc, char **argv) {
    size_t len = 50000;
    double arr[50000] = {0.0,};
    srand(time(NULL));
    for (unsigned i = 0; i < len; ++i)
        arr[i] = rand();
    double res = 1;
    for (unsigned i = 0; i < 80000 * argc; ++i)
        res = test(arr, len, res);
    return res; /* keeps res live so the loop isn't optimized away */
}
```
Results:
```
xbolva00@xbolva00:~/opt$ gcc -O3 i.c
xbolva00@xbolva00:~/opt$ time ./a.out 5

real	0m2,466s
user	0m2,465s
sys	0m0,001s
xbolva00@xbolva00:~/opt$ time ./a.out 5

real	0m2,422s
user	0m2,418s
sys	0m0,004s
xbolva00@xbolva00:~/opt$ clang -O3 i.c
xbolva00@xbolva00:~/opt$ time ./a.out 5

real	0m7,865s
user	0m7,861s
sys	0m0,004s
xbolva00@xbolva00:~/opt$ time ./a.out 5

real	0m7,871s
user	0m7,866s
sys	0m0,004s
```
This is an unfortunate consequence of the x86 ISA. You have to choose which of these you like better: https://godbolt.org/g/oBLyEz
If you trust your branch predictor, then obviously the version with a branch is better (and in this benchmark the new-minimum branch quickly becomes almost never taken, so it predicts very well).
But what if you add "-mattr=avx" to that compilation? Now you have "vblendvpd" to make the select less ugly (though it might have worse latency depending on the uarch).
But we're not done yet... change the comparison predicate to `if (a < m)`:
Now, LLVM produces "minsd". Is that better than gcc?
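For reference, here is why the `<` predicate can lower to minsd (a small sketch of the instruction's documented behavior, not compiler code):

```c
/* x86 minsd computes (src1 < src2) ? src1 : src2; any comparison
   involving NaN is false, so the second operand wins on NaN -- which
   is exactly what "(a < m) ? a : m" requests, with a as src1 and m
   as src2. */
double minsd_semantics(double a, double m) {
    return (a < m) ? a : m;
}
```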
In summary: a bad architecture and uarch implementation make the compiler's job hard. You'll have to measure the actual perf for all the options on different CPUs and with different data sets to know which version of the code is better.
Extended Description
Hello,
```c
double test(double *arr, int size) {
    double m = INFINITY;
    int flag = 1;
    int i;
    double a;
    /* ... (rest of the function as in the godbolt link below) ... */
}
```
ASM outputs: https://godbolt.org/g/iSjSGW
It seems like Clang generates worse code because LLVM misses some patterns.