Open hfinkel opened 12 years ago
From my "quick look", it looks like something is off with Reductions-flt, at least icc with vectorizer disabled gets similar performance on Reductions-{flt,dbl}. I don't see anything fishy with Symbolics.
It's probably worth investigating the Reductions-flt performance issue further, so I will leave this open. My main interest here, though, was just getting the benchmarks to stop being noisy on Sandy Bridge; they have stabilized, so I'm not personally planning any further investigation for the time being.
Improved here: http://llvm.org/viewvc/llvm-project?view=revision&revision=178968
Here is an LNT run for the change: http://llvm.org/perf/db_default/v4/nts/10308
As you can see, the tests got significantly faster just from changing the data layout, which shows that these tests are extremely sensitive to address layout, at least on Intel hardware.
MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl 4.1955 -36.76%
MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt 2.9747 -11.50%
MultiSource/Benchmarks/TSVC/LoopRestructuring-dbl/LoopRestructuring-dbl 4.2053 -48.97%
MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt 3.8445 -8.54%
I will take a quick look at Reductions and Symbolics to see if I notice anything fishy.
This isn't actually a codegen problem. The benchmarks are written in such a way that the exact address layout of their global variables triggers a specific x86 issue: on Intel this is known as the 4K-aliasing problem, see http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/amplifierxe/win/ug_docs/GUID-C801145A-A066-4C1A-B744-2B51AD89EFF6.htm
Proposing a patch on the mailing list to address this and otherwise make the benchmarks a bit more predictable across platforms.
What machine are you running this on? Do you have SSE2 enabled?
-Chris
The numbers I quoted are from the Darwin x86_64 builder: http://llvm.org/perf/db_default/v4/nts/4826 (the build options for that builder are just -O3, but I'm pretty sure that enables SSE).
I've seen very similar relative numbers on my x86_64 linux machines, and on those, SSE is certainly enabled.
Extended Description
On the general assumption that running the TSVC loops with floats should be faster than running them with doubles, I suspect we have suboptimal code generation for the following tests:
[from: http://llvm.org/perf/db_default/v4/nts/4826]
MultiSource/Benchmarks/TSVC/ControlLoops-dbl/ControlLoops-dbl 5.3408
MultiSource/Benchmarks/TSVC/ControlLoops-flt/ControlLoops-flt 6.6713
MultiSource/Benchmarks/TSVC/Reductions-dbl/Reductions-dbl 5.4195
MultiSource/Benchmarks/TSVC/Reductions-flt/Reductions-flt 8.6970
Also, while the double-precision code should be somewhat slower even without vectorization (doubles mean twice the memory traffic), can it legitimately be 2x slower? If not, these also indicate a problem:
MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl 7.1848
MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt 4.3780
MultiSource/Benchmarks/TSVC/LoopRestructuring-dbl/LoopRestructuring-dbl 8.3210
MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt 4.2134
These also seem questionable:
MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl 8.5494
MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt 5.5331
MultiSource/Benchmarks/TSVC/Symbolics-dbl/Symbolics-dbl 4.9180
MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt 3.5667
When these tests are run on my POWER7 (powerpc64) machine, the timings are as I would expect: the float and double versions take approximately the same amount of time to execute, with the double-precision version generally taking slightly longer. As a result, I suspect that these problems are specific to x86 codegen.