llvmbot opened 11 years ago
Which Core i7, penryn or sandybridge?
AFAIK the AMD K8-based microarchitecture might have a stronger address generator than Intel's. Not sure about Bulldozer.
This is a Sandy Bridge.
What does gcc do?
It seems that gcc 4.7.2 (Ubuntu 64-bit) is also producing the instruction with a memory operand:
g++ -O3 -std=c++11 -march=native dotps.cpp
movaps (%rdx), %xmm0
dpps   $0xff, 0x4(%rdi,%rcx,4), %xmm0
I wonder if this slowdown is limited only to dpps - or potentially to other SSE instructions. I also wonder if this happens on AMD processors. Unfortunately I do not have access to any AMD processors that I can test on.
It's taken 11 years, but I've accidentally found the reason for this: on Sandy Bridge, folded dpps instructions (but not dppd) put an extra uop on Port 5, so the perf regression is almost certainly due to heavy Port 5 pressure.
I'm going to update the Sandy Bridge model to match this, but I'm not sure whether we want to bother preventing the folding at this point.
I've updated all the Intel models that have the extra port usage for folded DPPS instructions. The next step will probably involve #86669, so we have a standard way to unfold instructions based on spare registers and scheduler modeling (and avoid yet another tuning feature flag...).
Extended Description
OS used: Mac OS X Mountain Lion
Clang/LLVM used: 3.2-r167157
Processor: Intel Core i7
I was recently working on some SSE dot product implementations when I noticed a severe performance drop after using _mm_dp_ps. This didn't make sense, as the x86 instruction latency tables show the dpps instruction at only 12 cycles of latency.
Further investigation showed that dpps with a memory source operand (e.g. dpps $15, (%rax,%rdx), %xmm0) is about 55% slower than issuing a movaps followed by a dpps with register source/destination!
I've created and attached a simple test case that does dpps using intrinsics, then using hand-written assembly with register operands, then using hand-written assembly with memory source operand. The intrinsics version issued instructions with memory operands.
Here is how I compiled the test:
clang++ -march=native -std=c++11 -stdlib=libc++ -O3 dotps.cpp
Here are the results I got (running each version 100000000 times):
DotPsIntrin: 249.332 ms
DotPsFast: 159.916 ms
DotPsSlow: 249.076 ms
I also tried the intrinsics version with Visual Studio 2012 and the Intel compiler. Both generated the efficient movaps/dpps version even when specifying "Optimize for Space".