clang compiled c-ray 1.1 benchmarks up to 43% slower than gcc compiled versions

llvmbot commented 9 years ago


Bugzilla Link	22657
Version	trunk
OS	All
Reporter	LLVM Bugzilla Contributor
CC	@chandlerc,@davidbolvansky,@vns-mn,@fhahn,@hfinkel,@LebedevRI,@RKSimon,@slacka,@rotateright,@yuanfang-chen

Extended Description

The c-ray 1.1 benchmark, from http://www.phoronix-test-suite.com/benchmark-files/c-ray-1.1.tar.gz, exhibits up to 43% slower performance when compiled under clang compared to when compiled with gcc. Using "-O3 -march=native" on a Nehalem processor, the following is observed...

clang 3.5.1
% ./RUN.full
c-ray-mt v1.1 Rendering took: 0 seconds (50 milliseconds)
c-ray-mt v1.1 Rendering took: 0 seconds (865 milliseconds)
c-ray-mt v1.1 Rendering took: 11 seconds (11254 milliseconds)
c-ray-mt v1.1 Rendering took: 2 seconds (2660 milliseconds)

clang 3.6-rc4
% ./RUN.full
c-ray-mt v1.1 Rendering took: 0 seconds (52 milliseconds)
c-ray-mt v1.1 Rendering took: 1 seconds (1054 milliseconds)
c-ray-mt v1.1 Rendering took: 13 seconds (13556 milliseconds)
c-ray-mt v1.1 Rendering took: 2 seconds (2850 milliseconds)

clang 3.7svn (r230135)
% ./RUN.full
c-ray-mt v1.1 Rendering took: 0 seconds (51 milliseconds)
c-ray-mt v1.1 Rendering took: 1 seconds (1051 milliseconds)
c-ray-mt v1.1 Rendering took: 13 seconds (13783 milliseconds)
c-ray-mt v1.1 Rendering took: 2 seconds (2863 milliseconds)

gcc 4.9.2
% ./RUN.full
c-ray-mt v1.1 Rendering took: 0 seconds (43 milliseconds)
c-ray-mt v1.1 Rendering took: 0 seconds (689 milliseconds)
c-ray-mt v1.1 Rendering took: 8 seconds (8786 milliseconds)
c-ray-mt v1.1 Rendering took: 2 seconds (2478 milliseconds)

gcc 5.0svn (r220888)
% ./RUN.full
c-ray-mt v1.1 Rendering took: 0 seconds (45 milliseconds)
c-ray-mt v1.1 Rendering took: 0 seconds (667 milliseconds)
c-ray-mt v1.1 Rendering took: 8 seconds (8727 milliseconds)
c-ray-mt v1.1 Rendering took: 2 seconds (2390 milliseconds)

davidbolvansky commented 2 years ago

mentioned in issue llvm/llvm-bugzilla-archive#42968

RKSimon commented 2 years ago

mentioned in issue llvm/llvm-bugzilla-archive#31455

david-xl commented 3 years ago

ray_sphere does get inlined into shade but not trace.

LLVM tracks the SROA benefit and the byval argument cost, but is still unable to bring the cost below the threshold for 'trace'.

The latest LLVM with PGO should be able to handle it with the better benefit/cost analysis.
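To make the call pattern under discussion concrete, here is a rough C sketch of a trace-style loop that calls ray_sphere with a by-value struct and an out-parameter. Everything in it is an assumption made for illustration: the struct layouts are simplified stand-ins, nearest_sphere and dot are hypothetical helpers, and the body records the ray's closest approach to the sphere centre rather than the exact surface hit, so it is not the c-ray source.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the c-ray types. */
struct vec3 { double x, y, z; };
struct ray { struct vec3 orig, dir; };
struct sphere { struct vec3 pos; double rad; };
struct spoint { struct vec3 pos; double dist; };

static double dot(struct vec3 a, struct vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* `ray` is passed by value and `sp` is an out-parameter. The body stores the
 * point of closest approach instead of the true surface hit, which keeps the
 * sketch free of sqrt() and libm. */
static int ray_sphere(const struct sphere *sph, struct ray ray, struct spoint *sp)
{
    struct vec3 oc = { sph->pos.x - ray.orig.x,
                       sph->pos.y - ray.orig.y,
                       sph->pos.z - ray.orig.z };
    double dd = dot(ray.dir, ray.dir);
    if (dd == 0.0) return 0;                 /* degenerate direction */
    double t = dot(oc, ray.dir) / dd;
    if (t <= 0.0) return 0;                  /* sphere is not in front of the ray */
    struct vec3 p = { ray.orig.x + t * ray.dir.x,
                      ray.orig.y + t * ray.dir.y,
                      ray.orig.z + t * ray.dir.z };
    struct vec3 d = { p.x - sph->pos.x, p.y - sph->pos.y, p.z - sph->pos.z };
    if (dot(d, d) > sph->rad * sph->rad) return 0;   /* ray passes outside */
    sp->pos = p;
    sp->dist = t;
    return 1;
}

/* Hypothetical caller: index of the nearest sphere along the ray, or -1. */
static int nearest_sphere(const struct sphere *objs, size_t n,
                          struct ray ray, struct spoint *nearest)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        struct spoint sp;   /* only reachable through the call unless it is inlined */
        if (ray_sphere(&objs[i], ray, &sp) &&
            (best < 0 || sp.dist < nearest->dist)) {
            *nearest = sp;
            best = (int)i;
        }
    }
    return best;
}
```

If the call is inlined, the compiler can in principle split the local `sp` into scalars and drop the by-value copy of `ray`; when the call stays out of line, both remain in memory on every loop iteration.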

davidbolvansky commented 3 years ago

Hmm, a lot.

LLVM does not yet analyze benefits that inlining creates in the caller context; it only looks at savings in the callee with the call context propagated. This is an area to be improved upon (noted by Xinliang David Li in D93838).

f1376df8-34bc-4756-9be6-f8bc6a69b887 commented 3 years ago

In c-ray 1.1, gcc 11 is now 94% faster than clang 12. https://www.phoronix.com/scan.php?page=article&item=clang12-gcc11-icelake&num=4

davidbolvansky commented 4 years ago

Probably the only way to improve c-ray is to make the inliner smarter.

llvmbot commented 4 years ago

Phoronix benchmarked GCC 9.2.1 and LLVM Clang 9.0 on Icelake; the c-ray benchmark and the total Linux kernel compilation time are the only sore spots.

https://www.phoronix.com/scan.php?page=news_item&px=GCC-LLVM-Clang-Icelake-Tests

davidbolvansky commented 5 years ago

Tested on Intel Haswell.

'make' - GCC 9
./c-ray-mt -t 8 -s 800x400 -r 1 -i sphfract -o output.ppm
c-ray-mt v1.1 Rendering took: 0 seconds (518 milliseconds)

'make' - Clang 9
./c-ray-mt -t 8 -s 800x400 -r 1 -i sphfract -o output.ppm
c-ray-mt v1.1 Rendering took: 0 seconds (609 milliseconds)

Added __attribute__((always_inline)) to ray_sphere:

GCC 9:
./c-ray-mt -t 8 -s 800x400 -r 1 -i sphfract -o output.ppm
c-ray-mt v1.1 Rendering took: 0 seconds (514 milliseconds)

Clang 9:
./c-ray-mt -t 8 -s 800x400 -r 1 -i sphfract -o output.ppm
c-ray-mt v1.1 Rendering took: 0 seconds (413 milliseconds)

As we can see, a significant win: 609 -> 413 milliseconds

There is also some Haswell codegen issue, since with -march=haswell:

Clang 9
./c-ray-mt -t 8 -s 800x400 -r 1 -i sphfract -o output.ppm
c-ray-mt v1.1 Rendering took: 0 seconds (403 milliseconds)

GCC 9
./c-ray-mt -t 8 -s 800x400 -r 1 -i sphfract -o output.ppm
c-ray-mt v1.1 Rendering took: 0 seconds (371 milliseconds)

ICC with march=haswell and always_inline
./c-ray-mt -t 8 -s 800x400 -r 1 -i sphfract -o output.ppm
c-ray-mt v1.1 Rendering took: 0 seconds (344 milliseconds)

ICC without always_inline and without march=haswell
./c-ray-mt -t 8 -s 800x400 -r 1 -i sphfract -o output.ppm
c-ray-mt v1.1 Rendering took: 0 seconds (478 milliseconds)

Compiled with "-Rpass-missed=inline -mllvm -inline-threshold=400":

c-ray-mt.c:361:9: remark: shade not inlined into render_scanline because too costly to inline (cost=1000, threshold=1000) [-Rpass-missed=inline]
        col = shade(nearest_obj, &nearest_sp, depth);
        ^
c-ray-mt.c:317:28: remark: get_primary_ray not inlined into render_scanline because too costly to inline (cost=455, threshold=400) [-Rpass-missed=inline]

BTW, this seems like a bug? Why "(cost=1000, threshold=1000)"? I expected "(cost=1000, threshold=400)".

davidbolvansky commented 5 years ago

Bug llvm/llvm-bugzilla-archive#42968 has been marked as a duplicate of this bug.

llvmbot commented 5 years ago

The work-around of passing -mllvm -inline-threshold=500 no longer recovers the missing in-lining in current clang 3.9svn. The workaround works fine with the llvm/clang 3.8.0 release.

Adding __attribute__((always_inline)) to:

int ray_sphere(const struct sphere *sph, struct ray ray, struct spoint *sp);

fixes the issue with LLVM 4.0, but using -fprofile-generate/-fprofile-use doesn't. ray_sphere is called from the trace and shade functions, with the shade function being the hottest.
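As a rough sketch of the annotation described above, the attribute can be placed on the function definition as shown below. Only the name ray_sphere and the struct names come from this thread; the struct layouts, the pointer parameters, the dot helper, and the simplified closest-approach body are assumptions made for illustration and do not reproduce the c-ray source.

```c
/* Hypothetical, simplified stand-ins for the c-ray types. */
struct vec3 { double x, y, z; };
struct ray { struct vec3 orig, dir; };
struct sphere { struct vec3 pos; double rad; };
struct spoint { struct vec3 pos; double dist; };

static inline double dot(struct vec3 a, struct vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* always_inline asks GCC and Clang to inline this function at its call
 * sites without consulting the inliner's cost threshold.
 * The body is a stand-in: it records the ray's closest approach to the
 * sphere centre rather than the exact surface hit (no sqrt, no libm). */
static inline __attribute__((always_inline))
int ray_sphere(const struct sphere *sph, struct ray ray, struct spoint *sp)
{
    struct vec3 oc = { sph->pos.x - ray.orig.x,
                       sph->pos.y - ray.orig.y,
                       sph->pos.z - ray.orig.z };
    double dd = dot(ray.dir, ray.dir);
    if (dd == 0.0) return 0;                 /* degenerate direction */
    double t = dot(oc, ray.dir) / dd;
    if (t <= 0.0) return 0;                  /* sphere is not in front of the ray */
    struct vec3 p = { ray.orig.x + t * ray.dir.x,
                      ray.orig.y + t * ray.dir.y,
                      ray.orig.z + t * ray.dir.z };
    struct vec3 d = { p.x - sph->pos.x, p.y - sph->pos.y, p.z - sph->pos.z };
    if (dot(d, d) > sph->rad * sph->rad) return 0;   /* ray passes outside */
    sp->pos = p;
    sp->dist = t;
    return 1;
}
```

Because the attribute is a per-function override, it leaves the inlining decisions for every other function in the file untouched, unlike raising -inline-threshold globally.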

The problem is shown again here: https://www.phoronix.com/scan.php?page=article&item=gcc9-clang8-aarch64&num=3

The script for the benchmark runs there doesn't use profile data: https://openbenchmarking.org/innhold/91db3ffff901d12dabd732bc568a44d02e5c6387

llvmbot commented 8 years ago

The work-around of passing -mllvm -inline-threshold=500 no longer recovers the missing in-lining in current clang 3.9svn. The workaround works fine with the llvm/clang 3.8.0 release.

llvmbot commented 8 years ago

Clang 3.8 branch still doesn't show any improvement on this issue.

Clang 3.8svn with -O3 -march=native

$ ./RUN.full
c-ray-mt v1.1 Rendering took: 0 seconds (58 milliseconds)
c-ray-mt v1.1 Rendering took: 1 seconds (1099 milliseconds)
c-ray-mt v1.1 Rendering took: 14 seconds (14815 milliseconds)
c-ray-mt v1.1 Rendering took: 2 seconds (2917 milliseconds)

Clang 3.8svn with -O3 -march=native -mllvm -inline-threshold=500

$ ./RUN.full
c-ray-mt v1.1 Rendering took: 0 seconds (52 milliseconds)
c-ray-mt v1.1 Rendering took: 0 seconds (737 milliseconds)
c-ray-mt v1.1 Rendering took: 9 seconds (9976 milliseconds)
c-ray-mt v1.1 Rendering took: 2 seconds (2510 milliseconds)

GCC 5.3 with -O3 -march=native

$ ./RUN.full
c-ray-mt v1.1 Rendering took: 0 seconds (46 milliseconds)
c-ray-mt v1.1 Rendering took: 0 seconds (679 milliseconds)
c-ray-mt v1.1 Rendering took: 9 seconds (9300 milliseconds)
c-ray-mt v1.1 Rendering took: 2 seconds (2455 milliseconds)

llvmbot commented 9 years ago

FYI, it appears that FSF gcc got all of their inlining improvements for the c-ray benchmark in place by 4.7...

http://www.phoronix.com/scan.php?page=article&item=gcc_49_pentium&num=3

That should have been by 4.8.

llvmbot commented 9 years ago

FYI, it appears that FSF gcc got all of their inlining improvements for the c-ray benchmark in place by 4.7...

http://www.phoronix.com/scan.php?page=article&item=gcc_49_pentium&num=3

hfinkel commented 9 years ago

Chandler, you might want to look at this.

llvmbot commented 9 years ago

It should be said that c-ray is mostly a benchmark for the inliner, yanking up the inlining threshold or making the inline cost computation smarter should make it much faster.

Certainly appears to be the case...

% make
clang-3.7 -O3 -march=native -mllvm -inline-threshold=500 -c -o c-ray-mt.o c-ray-mt.c
clang-3.7 -o c-ray-mt c-ray-mt.o -lm -lpthread
% ./RUN.full
c-ray-mt v1.1 Rendering took: 0 seconds (44 milliseconds)
c-ray-mt v1.1 Rendering took: 0 seconds (691 milliseconds)
c-ray-mt v1.1 Rendering took: 8 seconds (8970 milliseconds)
c-ray-mt v1.1 Rendering took: 2 seconds (2294 milliseconds)

llvmbot commented 9 years ago

preprocessed source for c-ray-mt.c compiled with clang 3.7svn using -O3 -march=native

llvmbot commented 9 years ago

preprocessed source for c-ray-mt.c compiled with gcc 5.0svn using -O3 -march=native

llvmbot commented 9 years ago

assembly for c-ray-mt.c compiled with gcc 5.0svn using -O3 -march=native

llvmbot commented 9 years ago

assembly for c-ray-mt.c compiled with clang 3.7svn using -O3 -march=native

d0k commented 9 years ago

It should be said that c-ray is mostly a benchmark for the inliner, yanking up the inlining threshold or making the inline cost computation smarter should make it much faster.

llvmbot commented 9 years ago

The two c-ray-mt runs in the RUN.full benchmark script which exhibit the large 43% performance gap, for the clang compiled binaries compared to the gcc compiled ones, are...

cat sphfract | ./c-ray-mt -t 32 > foo.ppm
cat sphfract | ./c-ray-mt -t 32 -s 1024x768 -r 8 > foo.ppm

llvm / llvm-project

clang compiled c-ray 1.1 benchmarks up to 43% slower than gcc compiled versions #23031
