llvmbot opened 11 years ago
Which Core i7, penryn or sandybridge?
AFAIK the AMD K8-based microarchitecture might have a stronger address generator than Intel's. Not sure about Bulldozer.
This is a Sandy Bridge.
What does gcc do?
It seems that gcc 4.7.2 (Ubuntu 64-bit) is also producing the instruction with a memory operand:
g++ -O3 -std=c++11 -march=native dotps.cpp
movaps (%rdx), %xmm0
dpps   $0xff, 0x4(%rdi,%rcx,4), %xmm0
I wonder if this slowdown is limited only to dpps - or potentially to other SSE instructions. I also wonder if this happens on AMD processors. Unfortunately I do not have access to any AMD processors that I can test on.
It's taken 11 years, but I've accidentally found the reason for this: on Sandy Bridge, folded dpps instructions (but not dppd) put an extra uop on Port 5, so the perf regression is almost certainly due to heavy Port 5 pressure.
I'm going to update the Sandy Bridge model to match this, but I'm not sure whether we want to bother preventing the folding at this point.
I've updated all the Intel models that have the extra port usage for folded DPPS instructions. The next step will probably involve #86669, so we have a standard way to unfold instructions based on spare registers and scheduler modeling (and avoid yet another tuning feature flag...).
Extended Description
OS used: Mac OS X Mountain Lion
Clang/LLVM used: 3.2-r167157
Processor: Intel Core i7
I was recently working on some SSE dot product implementations when I noticed a severe performance drop after using _mm_dp_ps. This didn't make sense, as the x86 instruction latency tables show the dpps instruction at only 12 cycles of latency.
Further investigation showed that dpps with a memory source operand (e.g. dpps $15, (%rax,%rdx), %xmm0) is about 55% slower than issuing a movaps followed by a dpps with register source/destination!
I've created and attached a simple test case that does dpps using intrinsics, then using hand-written assembly with register operands, then using hand-written assembly with memory source operand. The intrinsics version issued instructions with memory operands.
Here is how I compiled the test:
clang++ -march=native -std=c++11 -stdlib=libc++ -O3 dotps.cpp
Here are the results I got (running each version 100000000 times):
DotPsIntrin: 249.332 ms
DotPsFast: 159.916 ms
DotPsSlow: 249.076 ms
I also tried the intrinsics version with Visual Studio 2012 and the Intel compiler. Both generated the efficient movaps/dpps version even when specifying "Optimize for Space".