X86 code quality problems highlighted by the TSVC loop benchmarks

hfinkel commented 12 years ago


Bugzilla Link	14076
Version	trunk
OS	All
CC	@lattner,@darkbuck

Extended Description

On the general assumption that running the TSVC loops with floats should be faster than running them with doubles, I suspect we have suboptimal code generation for the following tests:

[from: http://llvm.org/perf/db_default/v4/nts/4826]

MultiSource/Benchmarks/TSVC/ControlLoops-dbl/ControlLoops-dbl 5.3408
MultiSource/Benchmarks/TSVC/ControlLoops-flt/ControlLoops-flt 6.6713

MultiSource/Benchmarks/TSVC/Reductions-dbl/Reductions-dbl 5.4195
MultiSource/Benchmarks/TSVC/Reductions-flt/Reductions-flt 8.6970

Also, while the double-precision code should be slower, as we're not vectorizing, can it legitimately be 2x slower? If not, these also indicate a problem:

MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl 7.1848
MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt 4.3780

MultiSource/Benchmarks/TSVC/LoopRestructuring-dbl/LoopRestructuring-dbl 8.3210
MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt 4.2134

These also seem questionable:

MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl 8.5494
MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt 5.5331

MultiSource/Benchmarks/TSVC/Symbolics-dbl/Symbolics-dbl 4.9180
MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt 3.5667

When these tests are run on my POWER7 (powerpc64) machine, the timing on the tests is as I would expect: the float and double versions take approximately the same amount of time to execute, with the double-precision version generally taking slightly more time. As a result, I suspect that these problems are specific to x86 codegen.
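
For reference, the dbl/flt ratios implied by the timings quoted above can be tabulated with a short script. This is only a sketch for reading the numbers: the `classify` helper and its 1.3 cutoff are hypothetical choices made for illustration, not part of LNT or of the original report.

```python
# Timings copied from the LNT run quoted above, as (dbl, flt) per benchmark.
TIMINGS = {
    "ControlLoops": (5.3408, 6.6713),
    "Reductions": (5.4195, 8.6970),
    "Expansion": (7.1848, 4.3780),
    "LoopRestructuring": (8.3210, 4.2134),
    "NodeSplitting": (8.5494, 5.5331),
    "Symbolics": (4.9180, 3.5667),
}


def dbl_over_flt(timings):
    """Return the dbl/flt execution-time ratio for each benchmark."""
    return {name: dbl / flt for name, (dbl, flt) in timings.items()}


def classify(ratio, high=1.3):
    """Label a ratio. The `high` cutoff is an arbitrary illustrative assumption."""
    if ratio < 1.0:
        return "flt slower than dbl"
    if ratio >= high:
        return "dbl much slower than flt"
    return "roughly as expected"


if __name__ == "__main__":
    for name, ratio in sorted(dbl_over_flt(TIMINGS).items()):
        print(f"{name:18s} dbl/flt = {ratio:.2f}  ({classify(ratio)})")
```

A ratio below 1 corresponds to the first group (float slower than double), and a ratio well above 1 to the second and third groups.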

llvmbot commented 11 years ago

From a quick look, something does seem off with Reductions-flt: icc with the vectorizer disabled gets similar performance on Reductions-{flt,dbl}. I don't see anything fishy with Symbolics.

It's probably worth investigating the Reductions-flt performance issue more, so I will leave this open. My main interest here, though, was just in getting the benchmarks to not be noisy on Sandy Bridge. They have stabilized, so I'm not personally planning on doing any more investigation for the time being.

llvmbot commented 11 years ago

Improved here: http://llvm.org/viewvc/llvm-project?view=revision&revision=178968

Here is an LNT run for the change: http://llvm.org/perf/db_default/v4/nts/10308

As you can see, the tests got significantly faster just from changing the data layout, which shows that these tests are incredibly susceptible to address layout, at least on Intel.

The following benchmarks are now more in line w.r.t. the dbl/flt ratio (this can be seen from the bigger % improvement on the dbl benchmark):

MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl 4.1955 -36.76%
MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt 2.9747 -11.50%

MultiSource/Benchmarks/TSVC/LoopRestructuring-dbl/LoopRestructuring-dbl 4.2053 -48.97%
MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt 3.8445 -8.54%

MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl 4.5808 -46.36%
MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt 4.2371 -23.84%

There is one test remaining which still seems out of line:

MultiSource/Benchmarks/TSVC/Reductions-dbl/Reductions-dbl 4.8740 -9.98%
MultiSource/Benchmarks/TSVC/Reductions-flt/Reductions-flt 8.1706 -6.02%

And Symbolics still has a high dbl/flt ratio:

MultiSource/Benchmarks/TSVC/Symbolics-dbl/Symbolics-dbl 4.5023 -
MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt 2.9224 -1.30%

I will take a quick look at Reductions and Symbolics to see if I notice anything fishy.

llvmbot commented 11 years ago

This isn't actually a codegen problem. The benchmarks are written in such a way that they hit a specific x86 problem caused by the exact address layout of the global variables. This is known as the 4K-aliasing problem on Intel; see http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/amplifierxe/win/ug_docs/GUID-C801145A-A066-4C1A-B744-2B51AD89EFF6.htm

Proposing a patch on the ML to address this and otherwise make the benchmarks a bit more predictable across platforms.
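
To illustrate the 4K-aliasing condition: on the affected Intel cores, a load can be falsely treated as dependent on an earlier in-flight store when the two addresses differ but share the same low 12 bits, i.e. the same offset within a 4 KiB page. The sketch below is a simplified model of that check only. It ignores access width and timing, and the base addresses, array size, and 64-byte padding are hypothetical values chosen for illustration, not the actual TSVC layout or the contents of the patch.

```python
PAGE_OFFSET_MASK = 0xFFF  # low 12 bits: offset within a 4 KiB page


def may_4k_alias(store_addr, load_addr):
    """True if two distinct addresses share the same low 12 bits (simplified model)."""
    return (store_addr != load_addr
            and (store_addr & PAGE_OFFSET_MASK) == (load_addr & PAGE_OFFSET_MASK))


def count_aliasing_iterations(a_base, b_base, n, elem_size):
    """Count i in range(n) for which a store to a[i] and a load from b[i] may alias."""
    return sum(
        may_4k_alias(a_base + i * elem_size, b_base + i * elem_size)
        for i in range(n)
    )


# Hypothetical layout: two global float arrays placed back to back, each
# occupying an exact multiple of 4096 bytes (32768 * 4 = 131072 = 32 * 4096).
ELEM_SIZE = 4
ARRAY_BYTES = 32768 * ELEM_SIZE
A_BASE = 0x10000000                  # hypothetical, page-aligned
B_BASE_PACKED = A_BASE + ARRAY_BYTES # b starts right after a
B_BASE_PADDED = B_BASE_PACKED + 64   # same, with 64 bytes of padding inserted

if __name__ == "__main__":
    for label, b_base in (("packed", B_BASE_PACKED), ("padded", B_BASE_PADDED)):
        hits = count_aliasing_iterations(A_BASE, b_base, 1000, ELEM_SIZE)
        print(f"{label}: {hits} of 1000 iterations may 4K-alias")
```

In this model, every iteration of a loop that writes `a[i]` and reads `b[i]` meets the condition in the packed layout, while a small offset between the arrays removes it. That is consistent with the observation that changing only the data layout changes the timings.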

hfinkel commented 12 years ago

> What machine are you running this on? Do you have SSE2 enabled?
>
> -Chris

The numbers I quoted are from the Darwin x86_64 builder: http://llvm.org/perf/db_default/v4/nts/4826 (the build options for that builder are just -O3, but I'm pretty sure that enables SSE).

I've seen very similar relative numbers on my x86_64 linux machines, and on those, SSE is certainly enabled.

lattner commented 12 years ago

What machine are you running this on? Do you have SSE2 enabled?

-Chris
