llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

tail duplication changes negatively impacted Clang runtime performance #10575

Open llvmbot opened 13 years ago

llvmbot commented 13 years ago
Bugzilla Link 10203
Version trunk
OS All
Reporter LLVM Bugzilla Contributor

Extended Description

I happened to notice that the code generation changes in: http://llvm.org/viewvc/llvm-project?view=rev&revision=133682 had a negative impact on the runtime performance of a self-hosted Clang executable, by a fairly significant margin (1-2%).

I don't have more detailed information on the exact piece of code where the regression is yet, but here is the data I have on the regression:

My test scenario uses flops-8.c from the nightly test suite. I suspect the exact input isn't very important; it is just the one where we noticed the regression: http://llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/Benchmarks/Misc/flops-8.c?revision=47963&content-type=text%2Fplain

My test script uses a simple timing tool 'runN' and runs the frontend on the same input many times to decrease sampling error. You can replace 'runN 10' with 'time':

ddunbar@smoosh-17:9670555$ cat test.sh

#!/bin/sh

runN 10 $1 \
  "-cc1" "-triple" "x86_64-apple-macosx10.6.7" "-emit-obj" "-mrelax-all" "-disable-free" "-main-file-name" "flops-8.c" "-pic-level" "1" \
  "-mdisable-fp-elim" "-masm-verbose" "-munwind-tables" "-target-cpu" "core2" "-target-linker-version" "123.4" "-g" \
  "-coverage-file" "flops-8.o" "-resource-dir" "/tmp" \
  "-O0" "-ferror-limit" "19" "-fmessage-length" "90" "-stack-protector" "1" "-fblocks" "-fdiagnostics-show-option" "-o" "flops-8.o" "-x" "c" \
  flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c \
  flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c \
  flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c \
  flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c \
  flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c \
  flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c \
  flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c \
  flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c \
  flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c \
  flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c flops-8.c
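For anyone without 'runN', a rough stand-in can be sketched in Python. This is not the actual 'runN' tool, whose implementation is not shown in this report; the names `summarize`, `run_n` and `main` are hypothetical, and the use of the population standard deviation for the SD column is an assumption. It relies on the Unix-only `resource` module to read the user and sys time of child processes.

```python
import resource
import statistics
import subprocess
import sys
import time

COLUMNS = ["avg", "min", "med", "max", "SD", "total"]


def summarize(samples):
    """Return the summary statistics for one list of timings (seconds)."""
    return {
        "avg": statistics.mean(samples),
        "min": min(samples),
        "med": statistics.median(samples),
        "max": max(samples),
        # Assumption: population SD; the real tool may use the sample SD.
        "SD": statistics.pstdev(samples),
        "total": sum(samples),
    }


def run_n(n, argv):
    """Run argv n times and summarize user, sys and wall time per run."""
    times = {"user": [], "sys": [], "wall": []}
    for _ in range(n):
        before = resource.getrusage(resource.RUSAGE_CHILDREN)
        start = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        wall = time.perf_counter() - start
        after = resource.getrusage(resource.RUSAGE_CHILDREN)
        # CPU time of waited-for children accumulates in RUSAGE_CHILDREN,
        # so the difference is the cost of this one run.
        times["user"].append(after.ru_utime - before.ru_utime)
        times["sys"].append(after.ru_stime - before.ru_stime)
        times["wall"].append(wall)
    return {name: summarize(vals) for name, vals in times.items()}


def main(args):
    """Usage: run_n.py COUNT COMMAND [ARG...]"""
    if len(args) < 2 or not args[0].isdigit():
        print(main.__doc__)
        return
    stats = run_n(int(args[0]), args[1:])
    print("name " + " ".join(f"{c:>8}" for c in COLUMNS))
    for name, row in stats.items():
        print(f"{name:<4} " + " ".join(f"{row[c]:8.4f}" for c in COLUMNS))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved under a hypothetical name such as run_n.py, it would take the place of 'runN 10' in the script above as 'python3 run_n.py 10'. Comparing the min column rather than avg is usually the more robust choice, because background activity can only make a run slower.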

The following test samples show the regression. Here I am using a fixed LLVM/Clang revision (r133682) built and installed in the normal configure/make/make install fashion using the following configure arguments:

~/llvm.ref/src/configure \
  --prefix=$idir \
  --disable-bindings --with-built-clang --with-llvmcc=clang \
  --without-llvmgcc --without-llvmgxx \
  --enable-optimized \
  CC=$HOSTDIR/usr/bin/clang \
  CXX=$HOSTDIR/usr/bin/clang++ &> $dir/configure.log

and using "Apple Clang" compilers as the host compilers, from revisions r133679 and r133682. The Apple Clang compilers differ only in the exact way they are built; they should be functionally identical to configure/make style compilers at the same revisions.

Here is the timing data on a MacPro 5,1:

ddunbar@smoosh-17:9670555$ ./test.sh custom_builds/clang-r133682__apple-clang-x86_64-darwin10-R__clang-r133679-t20110622_190133-b5349.install/bin/clang
name  avg     min     med     max     SD      total
user  0.8674  0.8667  0.8675  0.8680  0.0004  8.6739
sys   0.0555  0.0544  0.0551  0.0546  0.0011  0.5550
wall  0.9298  0.9331  0.9290  0.9295  0.0016  9.2978

ddunbar@smoosh-17:9670555$ ./test.sh custom_builds/clang-r133682__apple-clang-x86_64-darwin10-R__clang-r133682-t20110622_204833-b5350.install/bin/clang
name  avg     min     med     max     SD      total
user  0.8874  0.8865  0.8875  0.8888  0.0007  8.8741
sys   0.0555  0.0562  0.0552  0.0541  0.0008  0.5550
wall  0.9706  0.9483  0.9491  0.9506  0.0542  9.7059

Compare min user times...

ddunbar@smoosh-17:9670555$ python -c 'print .8865 / .8667'
1.02284527518
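The same comparison can be written out a little more fully. The sketch below uses only the user-time figures reported in the two runs above; the helper name `slowdown_percent` is made up for illustration. It also relates the gap between the averages to the larger of the two reported standard deviations, as a rough check that the difference is well outside the run-to-run noise.

```python
# User-time figures (seconds) copied from the two runs above.
baseline = {"min": 0.8667, "avg": 0.8674, "SD": 0.0004}   # host built at r133679
regressed = {"min": 0.8865, "avg": 0.8874, "SD": 0.0007}  # host built at r133682


def slowdown_percent(before, after):
    """Relative slowdown of `after` versus `before`, in percent."""
    return (after / before - 1.0) * 100.0


for key in ("min", "avg"):
    pct = slowdown_percent(baseline[key], regressed[key])
    print(f"user {key}: {pct:.2f}% slower")

# Gap between the averages, expressed in units of the larger SD.
gap = regressed["avg"] - baseline["avg"]
noise = max(baseline["SD"], regressed["SD"])
print(f"avg gap is {gap / noise:.1f}x the larger SD")
```

On these numbers the gap in average user time is more than an order of magnitude larger than either standard deviation, so sampling error alone is an unlikely explanation on this machine.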

llvmbot commented 13 years ago

test patch

I tried this on an old Mac Pro 1,1 and I still cannot reproduce this.

There have been some changes in this area. Can you check if this bug is still present? The attached patch will disable the logic added in 133682. Let me know if it gives you a performance improvement.

llvmbot commented 13 years ago

Sorry, I am having a hard time reproducing this :-(

I first thought I had it on Linux, but there was a lot of noise in the runs on my workstation.

I have just tried it on my home desktop by building clang 133682 and 133681. I verified that I had the correct revisions by checking that 133682 fixed the time regression on matrix.cpp. I then compiled clang again twice, once with 133682 and once with 133681. Both of these perform the same on flops-8.c.

I even added more to the command line so that it takes more than one second, but the results are still the same.

The machine is an iMac11,1 4 cores 2.8 GHz.

I did all the tests on 64 bits. Are you using 32?