
[x86] scalar FP code runs ~15% slower on Haswell when compiled with -mavx #35528

Open rotateright opened 6 years ago

rotateright commented 6 years ago
Bugzilla Link 36180
Version trunk
OS All
Attachments himeno.c source file
CC @aelovikov-intel, @topperc, @echristo, @delena, @RKSimon, @ZviRackover

Extended Description

I have a Haswell perf mystery that I can't explain. The himeno program (see attachment) is an FP and memory benchmark that plows through large multi-dimensional arrays doing 32-bit fadd/fsub/fmul.

To eliminate potentially questionable transforms and variation from the vectorizers, build it as scalar-ops only like this:

$ ./clang -O2 himeno.c -fno-vectorize -fno-slp-vectorize -o himeno_novec_sse
$ ./clang -O2 himeno.c -fno-vectorize -fno-slp-vectorize -mavx -o himeno_novec_avx

And I'm testing on a 4.0GHz Haswell iMac running macOS 10.13.3:

$ ./himeno_novec_sse
mimax = 257 mjmax = 129 mkmax = 129
imax = 256 jmax = 128 kmax =128
cpu : 13.244777 sec.
Loop executed for 500 times
Gosa : 9.897132e-04
MFLOPS measured : 5175.818966
Score based on MMX Pentium 200MHz : 160.391043

$ ./himeno_novec_avx
mimax = 257 mjmax = 129 mkmax = 129
imax = 256 jmax = 128 kmax =128
cpu : 15.533612 sec.
Loop executed for 500 times
Gosa : 9.897132e-04
MFLOPS measured : 4413.176279
Score based on MMX Pentium 200MHz : 136.757864

There's an unfortunate amount of noise (~5%) in the perf on this system with this benchmark, but these results are reproducible. I'm consistently seeing ~15% better perf with the non-AVX build.

If we look at the inner loop asm, they are virtually identical in terms of operations. The SSE code just has a few extra instructions needed to copy values because of the destructive ops, but the loads, stores, and math are the same.
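
To make that concrete, here's a schematic pair (my own illustration, not lifted from the dumps below; the registers and offset are arbitrary) of what those extra copies look like:

movaps %xmm11, %xmm8                   # SSE: preserve %xmm11 before a destructive op
mulss  -0x4(%rcx,%r9), %xmm8           # 2-operand form: %xmm8 *= memory, overwriting the copy

vmulss -0x4(%rcx,%r9), %xmm11, %xmm8   # AVX: 3-operand form writes a separate destination, so no copy is needed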

An IACA analysis of these loops says they should have virtually the same throughput on HSW:

Block Throughput: 20.89 Cycles       Throughput Bottleneck: Backend
Loop Count:  22
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles | 13.0     0.0  | 21.0  | 12.0    12.0  | 12.0    11.0  |  1.0  |  2.0  |  2.0  |  0.0  |
llvmbot commented 6 years ago

On Tue, Apr 10, 2018 at 11:26:43PM +0000, bugzilla-daemon@llvm.org wrote:

--- Comment #9 from Craig Topper craig.topper@gmail.com --- Yeah, I thought the placement in the manual was odd. I think SKL is the same as Haswell here. I think prior to Haswell, having an index always caused an unlamination. Now it's dependent on the number of sources needed.

Yes, HSW and SKL are the same in my testing, and it matches what Intel has finally gotten around to documenting. :)

Except their terminology sucks: they could have said a 3-operand max instead of a 3-source max, because they're including a write-only destination as a source! (Of course, if they were good at terminology, they would have called it delamination. But apparently using normal English words was un-possible.)

But anyway, total number of separate operands is a simple rule that fits everything I tested.

I hadn't noticed the un-lamination for cmp reg,[base+idx] in my testing, though. That's subtle. add reg,[base+idx] stays fused because reg is a read-write operand, even though it does also write flags. (I tested just now on SKL, and that's real: cmp un-laminates, add stays fused.)
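
A minimal test loop for reproducing that observation (AT&T syntax; the label, registers, and array are my own scaffolding, not from this thread). Counting fused-domain uops per iteration with the uops_issued.any event under Linux perf shows one extra issue slot for the form that un-laminates:

# assumes %rsi points to an array of at least %rcx readable dwords
.Lfuse_test:
    cmp (%rsi,%rcx,4), %eax     # indexed cmp: un-laminates on HSW/SKL -> 2 fused-domain uops at issue
    add (%rsi,%rcx,4), %edx     # indexed add: stays micro-fused, because %edx is a read-write operand
    dec %rcx
    jnz .Lfuse_test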


In the SKL/HSW/SnB section of Intel's optimization manual, stuff mentioned for one uarch applies to later ones, too, unless overridden by something in the section for a later uarch.

It's unfortunate that the decoders can't recognize the special case of destination = first source as a single read-write operand to enable macro-fusion for VEX/EVEX in cases like vmulps xmm1,xmm1,[b+idx]

But maybe that's also how the uop format differentiates legacy-SSE from VEX/EVEX that zero-extend into the full vector reg.

topperc commented 6 years ago

Yeah, I thought the placement in the manual was odd. I think SKL is the same as Haswell here. I think prior to Haswell, having an index always caused an unlamination. Now it's dependent on the number of sources needed.

rotateright commented 6 years ago

The April 2018 version of the Intel Optimization Manual now has the information you are looking for in Section 2.3.5

Thanks! But is this saying that unlamination is only a factor on Haswell uarch? That contradicts Peter's statement in comment 4 about Skylake uarch.

topperc commented 6 years ago

The April 2018 version of the Intel Optimization Manual now has the information you are looking for in Section 2.3.5

llvmbot commented 6 years ago

Solving this bug should start with Intel fixing the docs and IACA.

Yes, that would be nice :)

I'm not going to take uarch-based shots-in-the-dark to try to solve this with compiler hacks. If someone else wants to take that on, feel free.

It might be a good idea to treat indexed addressing modes for non-mov/movzx/movsx instructions as more expensive than normal if tuning specifically for snb/ivb. (Pure load uops include broadcast-loads, though, and don't need to avoid indexed addressing modes.)

And the same treatment would make sense, at least for VEX ALU instructions, on later members of the SnB family; that would be a reasonable approximation of when indexed addressing is more expensive, even if it misses cases like PABSB.

Note that stores have separate store-data/store-address uops that can micro-fuse (except in some cases like vextracti128, or some ALU+store instructions on some CPUs). And only non-indexed stores can use port7, so there can be advantages to non-indexed stores even on Haswell (which can keep them micro-fused).
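
For example (the indexed store is taken from the AVX dump below; the non-indexed variant is a hypothetical rewrite):

vmovss %xmm0, 0x1060c(%rax,%r12)   # indexed store: the store-address uop can only run on port 2 or 3
vmovss %xmm0, 0x1060c(%r12)        # non-indexed store: the store-address uop can also use port 7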

In a copy+modify loop that can't / shouldn't fold the loads into an ALU instruction (because the same data is needed multiple times), it can make sense to address the src data relative to the dst, so the loads use indexed addressing modes with (src_base - dst_base) in one register and current_dst in another register to produce current_src. And current_dst is used directly for stores. Then the loop overhead is only one pointer-increment + cmp / jb against an end-pointer.
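
A minimal sketch of that loop shape (AT&T syntax; the registers, label, and vector width are arbitrary illustrations, not from the benchmark):

# %rdi = current_dst, %rsi = src_base - dst_base (constant), %rdx = end_dst
.Lcopy_modify:
    vmovups (%rdi,%rsi), %ymm0      # indexed load of current_src = current_dst + (src_base - dst_base)
    vaddps  %ymm1, %ymm0, %ymm2     # the loaded data is reused, so folding the load wouldn't help
    vmulps  %ymm3, %ymm0, %ymm4
    vaddps  %ymm4, %ymm2, %ymm2
    vmovups %ymm2, (%rdi)           # non-indexed store: its store-address uop can use port 7
    addq    $32, %rdi               # one pointer increment advances both src and dst
    cmpq    %rdx, %rdi
    jb      .Lcopy_modify           # loop overhead: just the increment plus cmp/jb against an end pointer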

Anyway, indexed addressing modes are valuable tools for saving total uops, especially when not unrolling, so we don't want to just increase their "cost". To make optimal code, we're going to need a model that knows about which instructions can use indexed addressing modes cheaply and which can't. If LLVM can't currently do that, then we can start thinking about how to implement that now, while waiting for Intel to document it better. (Or for me to clean up my SO answer posting more of my test results for more instructions...)

rotateright commented 6 years ago

And BTW, nobody has documented this anywhere else, AFAIK. Intel's optimization manual only mentions the SnB rules for un-lamination, without mentioning the HSW improvements. Agner Fog's guides don't even mention un-lamination at all.

Thanks, Peter!

Solving this bug should start with Intel fixing the docs and IACA. I'm not going to take uarch-based shots-in-the-dark to try to solve this with compiler hacks. If someone else wants to take that on, feel free.

llvmbot commented 6 years ago

All of the above applies to Skylake as well, BTW. I haven't found any micro-fusion differences between SKL and HSW, just from SnB to HSW.

And BTW, nobody has documented this anywhere else, AFAIK. Intel's optimization manual only mentions the SnB rules for un-lamination, without mentioning the HSW improvements. Agner Fog's guides don't even mention un-lamination at all.

llvmbot commented 6 years ago

From a quick look over the asm, I think the issue is that Haswell can micro-fuse an indexed addressing mode with SSE MULSS (2-operand destructive destination), but not with AVX VMULSS.

IACA doesn't know this, and applies the Sandybridge micro-fusion / un-lamination rules, so its output is wrong for Haswell.

See my answer on https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes for the full details (although I have an unfinished edit that I should really post after discovering more patterns in what fuses and what doesn't).

The TL;DR is that Haswell introduced support for 3-input uops so FMA can be single-uop. Micro-fusion takes advantage of this, allowing some ALU + load uops with an indexed addressing mode to stay micro-fused instead of un-laminating at issue.

But the only instructions that can stay micro-fused are instructions like add or paddd or mulps which have 2 operands and a read-write destination.

2-operand instructions with a write-only destination, like sqrtps (but not sqrtss) or pabsb (but not paddb), will un-laminate, and so will 3-operand instructions, even when the dest is the same as the middle source operand.
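
Restating those rules against concrete instruction forms (AT&T syntax; the mulss, addss, and vmulss lines come from the loop dumps below, while sqrtps and pabsb are illustrative):

mulss  0x41830(%r13,%rax,4), %xmm4         # SSE 2-operand, read-write dest: stays micro-fused on HSW/SKL
addss  0x4(%rax,%r15), %xmm5               # same rule: load + add issue as one fused-domain uop
sqrtps (%rax,%r15), %xmm0                  # 2-operand but write-only dest: un-laminates
pabsb  (%rax,%r15), %xmm0                  # write-only dest as well: un-laminates
vmulss 0x41830(%r13,%rax,4), %xmm7, %xmm7  # VEX 3-operand: un-laminates even with dest == first source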

rotateright commented 6 years ago

Here's the inner loop dump as SSE and AVX (built today via clang r323872):

SSE:

0000000100001750 mulss 0x41830(%r13,%rax,4), %xmm4
000000010000175a mulss 0x41834(%r13,%rax,4), %xmm1
0000000100001764 movaps %xmm5, %xmm9
0000000100001768 movaps %xmm11, %xmm8
000000010000176c movaps %xmm10, %xmm13
0000000100001770 addss %xmm4, %xmm1
0000000100001774 movss 0x10610(%rax,%r15), %xmm10
000000010000177e movss 0x41838(%r13,%rax,4), %xmm4
0000000100001788 mulss %xmm10, %xmm4
000000010000178d movss 0x20c14(%rax,%r15), %xmm5
0000000100001797 subss 0x2080c(%rax,%r15), %xmm5
00000001000017a1 subss 0x40c(%rax,%r15), %xmm5
00000001000017ab addss 0x4(%rax,%r15), %xmm5
00000001000017b2 addss %xmm1, %xmm4
00000001000017b6 mulss -0x8(%rdx,%r9), %xmm5
00000001000017bd movss 0x10814(%rax,%r15), %xmm1
00000001000017c7 movss 0x1040c(%rax,%r15), %xmm11
00000001000017d1 movaps %xmm1, %xmm3
00000001000017d4 subss %xmm11, %xmm3
00000001000017d9 subss 0x1080c(%rax,%r15), %xmm3
00000001000017e3 incq %r11
00000001000017e6 addss %xmm6, %xmm3
00000001000017ea mulss -0x4(%rdx,%r9), %xmm3
00000001000017f1 addss %xmm4, %xmm5
00000001000017f5 addss %xmm5, %xmm3
00000001000017f9 movss 0x20a14(%rax,%r15), %xmm4
0000000100001803 movss 0x20c(%rax,%r15), %xmm5
000000010000180d movaps %xmm4, %xmm6
0000000100001810 subss %xmm5, %xmm6
0000000100001814 subss 0x20a0c(%rax,%r15), %xmm6
000000010000181e addss %xmm7, %xmm6
0000000100001822 mulss (%rdx,%r9), %xmm6
0000000100001828 addss %xmm3, %xmm6
000000010000182c movss -0x8(%rcx,%r9), %xmm3
0000000100001833 mulss %xmm9, %xmm3
0000000100001838 addss %xmm6, %xmm3
000000010000183c movss -0x4(%rcx,%r9), %xmm6
0000000100001843 mulss %xmm8, %xmm6
0000000100001848 addss %xmm3, %xmm6
000000010000184c mulss (%rcx,%r9), %xmm2
0000000100001852 addss %xmm6, %xmm2
0000000100001856 addss 0x1060c(%rax,%rdi), %xmm2
000000010000185f mulss 0x4183c(%r13,%rax,4), %xmm2
0000000100001869 subss %xmm13, %xmm2
000000010000186e mulss 0x1060c(%rax,%r14), %xmm2
0000000100001878 movaps %xmm12, %xmm3
000000010000187c mulss %xmm2, %xmm3
0000000100001880 mulss %xmm2, %xmm2
0000000100001884 addss %xmm13, %xmm3
0000000100001889 movss %xmm3, 0x1060c(%rax,%r12)
0000000100001893 addss %xmm2, %xmm0
0000000100001897 addq $0x4, %rax
000000010000189b addq $0xc, %r9
000000010000189f movaps %xmm13, %xmm2
00000001000018a3 movaps %xmm9, %xmm7
00000001000018a7 movaps %xmm8, %xmm6
00000001000018ab cmpq %rbx, %r11
00000001000018ae jl 0x100001750

AVX:

0000000100001770 vmulss 0x41830(%r13,%rax,4), %xmm7, %xmm7
000000010000177a vmulss (%rcx,%r9), %xmm6, %xmm6
0000000100001780 vmovaps %xmm4, %xmm3
0000000100001784 vmovaps %xmm11, %xmm1
0000000100001788 vmovaps %xmm12, %xmm2
000000010000178c vmulss 0x41834(%r13,%rax,4), %xmm10, %xmm0
0000000100001796 vaddss %xmm0, %xmm7, %xmm4
000000010000179a vmovss 0x10610(%rax,%r15), %xmm12
00000001000017a4 vmulss 0x41838(%r13,%rax,4), %xmm12, %xmm5
00000001000017ae vmovss 0x20c14(%rax,%r15), %xmm7
00000001000017b8 vsubss 0x2080c(%rax,%r15), %xmm7, %xmm7
00000001000017c2 vsubss 0x40c(%rax,%r15), %xmm7, %xmm7
00000001000017cc vaddss 0x4(%rax,%r15), %xmm7, %xmm7
00000001000017d3 vmulss -0x8(%rdx,%r9), %xmm7, %xmm7
00000001000017da vaddss %xmm5, %xmm4, %xmm4
00000001000017de vaddss %xmm7, %xmm4, %xmm4
00000001000017e2 vmovss 0x10814(%rax,%r15), %xmm10
00000001000017ec vmovss 0x1040c(%rax,%r15), %xmm11
00000001000017f6 vsubss %xmm11, %xmm10, %xmm7
00000001000017fb vsubss 0x1080c(%rax,%r15), %xmm7, %xmm7
0000000100001805 vaddss %xmm8, %xmm7, %xmm7
000000010000180a vmulss -0x4(%rdx,%r9), %xmm7, %xmm7
0000000100001811 vaddss %xmm7, %xmm4, %xmm5
0000000100001815 vmovss 0x20a14(%rax,%r15), %xmm7
000000010000181f vmovss 0x20c(%rax,%r15), %xmm4
0000000100001829 vsubss %xmm4, %xmm7, %xmm0
000000010000182d vsubss 0x20a0c(%rax,%r15), %xmm0, %xmm0
0000000100001837 incq %r11
000000010000183a vaddss %xmm9, %xmm0, %xmm0
000000010000183f vmulss (%rdx,%r9), %xmm0, %xmm0
0000000100001845 vaddss %xmm0, %xmm5, %xmm0
0000000100001849 vmulss -0x8(%rcx,%r9), %xmm3, %xmm5
0000000100001850 vaddss %xmm5, %xmm0, %xmm0
0000000100001854 vmulss -0x4(%rcx,%r9), %xmm1, %xmm5
000000010000185b vaddss %xmm5, %xmm0, %xmm0
000000010000185f vaddss %xmm6, %xmm0, %xmm0
0000000100001863 vaddss 0x1060c(%rax,%rdi), %xmm0, %xmm0
000000010000186c vmulss 0x4183c(%r13,%rax,4), %xmm0, %xmm0
0000000100001876 vsubss %xmm2, %xmm0, %xmm0
000000010000187a vmulss 0x1060c(%rax,%r14), %xmm0, %xmm0
0000000100001884 vmulss %xmm0, %xmm0, %xmm5
0000000100001888 vmulss %xmm0, %xmm13, %xmm0
000000010000188c vaddss %xmm0, %xmm2, %xmm0
0000000100001890 vmovss %xmm0, 0x1060c(%rax,%r12)
000000010000189a vaddss %xmm5, %xmm14, %xmm14
000000010000189e addq $0x4, %rax
00000001000018a2 addq $0xc, %r9
00000001000018a6 vmovaps %xmm2, %xmm6
00000001000018aa vmovaps %xmm3, %xmm9
00000001000018ae vmovaps %xmm1, %xmm8
00000001000018b2 cmpq %rbx, %r11
00000001000018b5 jl 0x100001770

rotateright commented 6 years ago

My first guess was that code alignment was a factor, but I tried aligning the AVX loop all the way up to 128 bytes, and I don't see any perf difference.

Any other explanations/experiments/ideas?