
Performance disparity between clang/LLVM and GCC when using libjpeg-turbo #16407

Open llvmbot opened 11 years ago

llvmbot commented 11 years ago
Bugzilla Link 16035
Version 3.2
OS MacOS X
Depends On llvm/llvm-bugzilla-archive#21760
Attachments libjpeg-turbo performance results, Clang/LLVM vs. GCC, OS X 10.8
Reporter LLVM Bugzilla Contributor
CC @lattner

Extended Description

I maintain libjpeg-turbo, a heavily accelerated fork of libjpeg for x86/x86-64 and ARM systems. A large part of our speedup comes from assembly code, but our Huffman codec relies heavily on C compiler optimizations to achieve peak performance. After upgrading to OS X 10.8, which uses Clang/LLVM as the default compiler rather than GCC, I observed a slowdown of 15-20% when compressing images using libjpeg-turbo, and it seems to be due to the compiler having trouble optimizing said Huffman codec (jchuff.c in the libjpeg-turbo source). I'll walk you through the steps to reproduce the issue:

NOTE: this is probably reproducible on other platforms, such as Linux, as well. I haven't tested it.

Prerequisites:
-- Xcode 4.5.x installed under /Applications/Xcode.app
-- nasm, automake, autoconf, and apple-gcc42 from MacPorts installed under /opt/local
-- artificial.ppm from http://www.imagecompression.info/test_images/rgb8bit.zip

xcrun svn co svn://svn.code.sf.net/p/libjpeg-turbo/code/trunk libjpeg-turbo
cd libjpeg-turbo
/opt/local/bin/autoreconf -fiv

mkdir osx.64.clang
cd osx.64.clang
sh ../configure --host x86_64-apple-darwin NASM=/opt/local/bin/nasm CC='xcrun clang' CFLAGS=-O4
make
./tjbench {path_to}/artificial.ppm 95 -rgb -quiet
cd ..

mkdir osx.64.llvmgcc
cd osx.64.llvmgcc
sh ../configure --host x86_64-apple-darwin NASM=/opt/local/bin/nasm CC='xcrun gcc' CFLAGS=-O3
make
./tjbench {path_to}/artificial.ppm 95 -rgb -quiet
cd ..

mkdir osx.64.gcc42
cd osx.64.gcc42
sh ../configure --host x86_64-apple-darwin NASM=/opt/local/bin/nasm CC=/opt/local/bin/gcc-apple-4.2 CFLAGS=-O3
make
./tjbench {path_to}/artificial.ppm 95 -rgb -quiet
cd ..

mkdir osx.32.clang
cd osx.32.clang
sh ../configure --host i686-apple-darwin NASM=/opt/local/bin/nasm CC='xcrun clang' CFLAGS='-m32 -O4' LDFLAGS=-m32
make
./tjbench {path_to}/artificial.ppm 95 -rgb -quiet
cd ..

mkdir osx.32.llvmgcc
cd osx.32.llvmgcc
sh ../configure --host i686-apple-darwin NASM=/opt/local/bin/nasm CC='xcrun gcc' CFLAGS='-m32 -O3' LDFLAGS=-m32
make
./tjbench {path_to}/artificial.ppm 95 -rgb -quiet
cd ..

mkdir osx.32.gcc42
cd osx.32.gcc42
sh ../configure --host i686-apple-darwin NASM=/opt/local/bin/nasm CC=/opt/local/bin/gcc-apple-4.2 CFLAGS='-O3 -m32' LDFLAGS=-m32
make
./tjbench {path_to}/artificial.ppm 95 -rgb -quiet

A spreadsheet of my results and the test image is attached. Note that decompression performance is generally better across the board with Clang/LLVM, but compression performance is generally worse. Note also that, when using the GCC front end to LLVM, the performance is somewhere in the middle, so it seems that part of the issue may be in Clang and part of it may be in LLVM.

If there are things I can do within the inner loops of jchuff.c to make it perform better under Clang/LLVM, I am definitely open to that.
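For context, those inner loops center on a Huffman bit accumulator. Below is a simplified sketch for orientation only, not the actual jchuff.c code (the real version unrolls the byte flush and guards it with bit-count checks, as quoted later in this thread); the names mirror the macro expansion quoted below:

#include <stddef.h>

typedef unsigned char JOCTET;

/* Simplified sketch: append 'size' bits of 'code' (assumed to fit in
   'size' bits) to a bit accumulator, then flush whole bytes to the
   output, stuffing a 0x00 after any 0xFF byte as JPEG requires. */
static JOCTET *put_code(JOCTET *buffer, size_t *put_buffer, int *put_bits,
                        unsigned code, int size)
{
  *put_buffer = (*put_buffer << size) | code;
  *put_bits += size;
  while (*put_bits >= 8) {
    JOCTET c = (JOCTET)(*put_buffer >> (*put_bits - 8));
    *put_bits -= 8;
    *buffer++ = c;
    if (c == 0xFF)
      *buffer++ = 0;  /* JPEG byte stuffing */
  }
  return buffer;
}

The compiler's job is to keep put_buffer and put_bits in registers and fold the shifts and branches efficiently, which appears to be exactly where the GCC/clang disparity shows up.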

llvmbot commented 2 years ago

mentioned in issue llvm/llvm-bugzilla-archive#21760

llvmbot commented 9 years ago

Confirmed that the issue still exists with Xcode 6.3.2.

llvmbot commented 9 years ago

To clarify the situation, as it stands now:

The original claim -- that GCC is doing a better job at optimizing jchuff.c than clang or LLVM -- is still valid. Nothing has invalidated that claim or the performance numbers reported in the above spreadsheet.

The comments above regarding the SIMD code were a red herring. What was happening was that we were building libjpeg-turbo without SIMD and noticing that the clang-built version was faster than the GCC-built version in that case. However, this was because clang was doing such a good job of optimizing the other algorithms (DCT, color conversion, quantization, etc.) that the speedup from those algorithms masked the slowdown in Huffman encoding. When built with SIMD, the Huffman codec is basically the only portion of libjpeg-turbo whose performance depends on compiler optimization, since the rest runs as hand-written assembly.

David has broken down the optimization issue into a smaller example that does not require libjpeg-turbo in order to reproduce, so I recommend moving any discussion to that new bug report and closing this one once the new one is resolved. I have added the new bug as a dependency of this one.

llvmbot commented 9 years ago

I apologize for my last misleading comment on SIMD. I missed that the test program runs for a constant time, not a constant number of iterations, and so when the time dropped I misconstrued the reason.

In any case, this will be my last post on this bug. I created a new bug, bug 21760, which uses a single extracted file to show specifically, at the assembly level, why the clang-generated code is slower.

llvmbot commented 9 years ago

GOOD NEWS AND BAD NEWS

The past two posts on the encode C function were a red herring. While it is true that it's the method that consumes most of the CPU during the test, it turns out not to be the problem (misreading percentage rather than pure time misled me).

THE BAD

I got a GCC-produced binary of tjbench, then used Instruments to Time Profile it and the clang binary. What is totally confounding is that the slowdown is NOT in any C functions, but in the SIMD routines, all of which are produced by NASM. The disassembled SIMD code looks the same in both apps, and is 32-byte aligned in both. If anyone has any suggestions on what might cause this, I'm all ears.

THE GOOD

When SIMD is disabled, the clang-produced binary (-O3) runs faster than gcc's! Way to go, team! (I'm going to try -Ofast and -flto once we resolve the SIMD issue.)

llvmbot commented 9 years ago

I had to use the debug config to get anything that made sense. Here is one of those small code blocks - there is at least one register spill:

Ltmp11:
        movl    -92(%rbp), %eax
        subl    $8, %eax
        movl    %eax, -92(%rbp)
        .loc    1 71 0                  ## /Volumes/Data/git/libjpeg-turbo-builder/tjbench/tester.c:71:0
        movq    -88(%rbp), %rcx
        movl    -92(%rbp), %eax
        movl    %eax, %edx
        movq    %rcx, -120(%rbp)        ## 8-byte Spill
        movq    %rdx, %rcx
        ## kill: CL RCX
        movq    -120(%rbp), %rdx        ## 8-byte Reload
        shrq    %cl, %rdx
        movb    %dl, %cl
        movb    %cl, -101(%rbp)
        .loc    1 72 0                  ## /Volumes/Data/git/libjpeg-turbo-builder/tjbench/tester.c:72:0
        movb    -101(%rbp), %cl
        movq    -80(%rbp), %rdx
        movq    %rdx, %rsi
        addq    $1, %rsi
        movq    %rsi, -80(%rbp)
        movb    %cl, (%rdx)
        .loc    1 73 0                  ## /Volumes/Data/git/libjpeg-turbo-builder/tjbench/tester.c:73:0
Ltmp12:
        movzbl  -101(%rbp), %eax
        cmpl    $255, %eax
        jne     LBB0_7
## BB#6:                                ## in Loop: Header=BB0_3 Depth=1
        movq    -80(%rbp), %rax
        movq    %rax, %rcx
        addq    $1, %rcx
        movq    %rcx, -80(%rbp)
        movb    $0, (%rax)
Ltmp13:
LBB0_7:                                 ## in Loop: Header=BB0_3 Depth=1
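
For readers following the listing, here is a C reconstruction of what tester.c:71-73 appear to implement. This is inferred from the disassembly above, not copied from the actual tester.c; the stack slots from the listing are noted in the comments:

#include <stddef.h>

typedef unsigned char JOCTET;

/* Inferred reconstruction of tester.c:71-73 from the listing above. */
static JOCTET *flush_byte(JOCTET *buffer,     /* -80(%rbp)  */
                          size_t put_buffer,  /* -88(%rbp)  */
                          int *put_bits)      /* -92(%rbp)  */
{
  JOCTET c;                                   /* -101(%rbp)        */
  *put_bits -= 8;                             /* subl  $8, %eax    */
  c = (JOCTET)(put_buffer >> *put_bits);      /* shrq  %cl, %rdx   */
  *buffer++ = c;                              /* movb  %cl, (%rdx) */
  if (c == 0xFF)                              /* cmpl  $255, %eax  */
    *buffer++ = 0;                            /* movb  $0, (%rax)  */
  return buffer;
}

The "8-byte Spill" and "Reload" annotations show put_buffer bouncing through -120(%rbp) instead of staying in a register, which is expected at -O0 but is the kind of memory traffic that would hurt if it survived into optimized code.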

llvmbot commented 9 years ago

It irks me that gcc is faster than clang, so I poked around in the reporter's code. I was able to create an Xcode project containing both the library source and his benchmarking app. During image compression, the vast majority of the time is spent in one routine, and that routine contains a macro that expands into the code below, 63 times.

In that macro (expanded below), you can see the 10 code blocks that each declare a local 'c' variable. I wonder if it would be better to make c a short (or unsigned short), so the comparison could be done without promotion. Do you think the use of blocks, each having its own local 'c', is hurting performance? A possible refactor is sketched after the expansion below.

I tried to assemble a short file with this code in it, but Xcode is not cooperating right now.

typedef short JCOEF;
typedef unsigned char JOCTET;
typedef JCOEF *JCOEFPTR;

{
  int temp, temp2, temp3;
  int nbits;
  int r, code, size;
  JOCTET *buffer;
  size_t put_buffer;
  int put_bits;
  int code_0xf0 = actbl->ehufco[0xf0], size_0xf0 = actbl->ehufsi[0xf0];

  if ((temp = block[1]) == 0) {
    r++;
  } else {
    temp2 = temp;
    temp3 = temp >> (8 * sizeof(int) - 1);
    temp ^= temp3;
    temp -= temp3;
    temp2 += temp3;
    nbits = (jpeg_nbits_table[temp]);
    while (r > 15) {
      {
        {
          if (put_bits > 47) {
            { JOCTET c; put_bits -= 8; c = (JOCTET)(put_buffer >> put_bits); *buffer++ = c; if (c == 0xFF) *buffer++ = 0; }
            { JOCTET c; put_bits -= 8; c = (JOCTET)(put_buffer >> put_bits); *buffer++ = c; if (c == 0xFF) *buffer++ = 0; }
            { JOCTET c; put_bits -= 8; c = (JOCTET)(put_buffer >> put_bits); *buffer++ = c; if (c == 0xFF) *buffer++ = 0; }
            { JOCTET c; put_bits -= 8; c = (JOCTET)(put_buffer >> put_bits); *buffer++ = c; if (c == 0xFF) *buffer++ = 0; }
            { JOCTET c; put_bits -= 8; c = (JOCTET)(put_buffer >> put_bits); *buffer++ = c; if (c == 0xFF) *buffer++ = 0; }
            { JOCTET c; put_bits -= 8; c = (JOCTET)(put_buffer >> put_bits); *buffer++ = c; if (c == 0xFF) *buffer++ = 0; }
          }
        }
        { put_bits += size_0xf0; put_buffer = (put_buffer << size_0xf0) | code_0xf0; }
      }
      r -= 16;
    }
    temp3 = (r << 4) + nbits;
    code = actbl->ehufco[temp3];
    size = actbl->ehufsi[temp3];
    {
      temp2 &= (((INT32)1) << nbits) - 1;
      {
        if (put_bits > 31) {
          { JOCTET c; put_bits -= 8; c = (JOCTET)(put_buffer >> put_bits); *buffer++ = c; if (c == 0xFF) *buffer++ = 0; }
          { JOCTET c; put_bits -= 8; c = (JOCTET)(put_buffer >> put_bits); *buffer++ = c; if (c == 0xFF) *buffer++ = 0; }
          { JOCTET c; put_bits -= 8; c = (JOCTET)(put_buffer >> put_bits); *buffer++ = c; if (c == 0xFF) *buffer++ = 0; }
          { JOCTET c; put_bits -= 8; c = (JOCTET)(put_buffer >> put_bits); *buffer++ = c; if (c == 0xFF) *buffer++ = 0; }
        }
      }
      { put_bits += size; put_buffer = (put_buffer << size) | code; }
      { put_bits += nbits; put_buffer = (put_buffer << nbits) | temp2; }
    }
    r = 0;
  }
}
return 1;
}

llvmbot commented 10 years ago

I can confirm that this is still an issue in Xcode 6 beta 4. For the libjpeg-turbo compression code, Xcode 6 is still running at about an 8-13% deficit for 32-bit and an 11-17% deficit for 64-bit when compared to GCC 4.2.1. Relative to Xcode 4.6.x, 64-bit code has improved a few percent, but 32-bit code has regressed a few percent.

As previously indicated, the disparity is likely due to how well or how poorly the compiler can optimize the Huffman encoder in jchuff.c. If there is something I can do to those inner loops to make them run faster with clang's optimizer, I'm happy to do so, but I wouldn't know where to start.

llvmbot commented 10 years ago

libjpeg-turbo performance results, Clang/LLVM vs. GCC, OS X 10.8

llvmbot commented 10 years ago

Any updates on this?

llvmbot commented 10 years ago

I would be very interested in progress on this bug; the problem only becomes bigger as Apple phases gcc out further.

I can confirm that there seem to be performance problems with clang and libjpeg-turbo. Comparing a libjpeg-turbo build created with Xcode 4.5's arm-apple-darwin10-llvm-gcc-4.2 against one created with Xcode 5.0's clang reveals a 20% performance loss when switching from arm-apple-darwin10-llvm-gcc-4.2 to clang. I tried, to the best of my knowledge, to use the same or similar flags. They are:

arm-apple-darwin10-llvm-gcc-4.2 -DHAVE_CONFIG_H -I. -I.. -Wall -mfloat-abi=softfp -isysroot /Volumes/Xcode//Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS6.0.sdk -O3 -march=armv7s -mcpu=swift -mtune=swift -mfpu=neon -MT jcparam.lo -MD -MP -MF .deps/jcparam.Tpo -c -o jcparam.lo ../jcparam.c

libtool: compile: /Volumes/Xcode//Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/usr/bin/arm-apple-darwin10-llvm-gcc-4.2 -DHAVE_CONFIG_H -I. -I.. -Wall -mfloat-abi=softfp -isysroot /Volumes/Xcode//Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS6.0.sdk -O3 -march=armv7s -mcpu=swift -mtune=swift -mfpu=neon -MT jcparam.lo -MD -MP -MF .deps/jcparam.Tpo -c ../jcparam.c -o jcparam.o

clang -x c -I external/jpeg-turbo -isystem /Users/frankfa/repo/nativesdk/build/core/include -c -arch armv7s -miphoneos-version-min=5.0 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS7.0.sdk -I/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include -DIOS -UDEBUG -DNDEBUG -fomit-frame-pointer -fno-strict-aliasing -DJPEG_LIB_VERSION=62 -mcpu=swift -mtune=swift -mfpu=neon -mfloat-abi=softfp -D__ARM_NEON__ -O3 -fstrict-aliasing -fPIC -DPIC -MD -MF out/target/ios-armv7s/intermediate/libyahoo_jpegturbo/jcparam.d -o out/target/ios-armv7s/intermediate/libyahoo_jpegturbo/jcparam.o external/jpeg-turbo/jcparam.c

Does this look like a further possible regression, or might I have some mistake or misunderstanding in the way I call clang vs. the llvm-gcc front end?

lattner commented 11 years ago

Ok, thanks!

llvmbot commented 11 years ago

Actually, the results in the spreadsheet were from Xcode 4.6.1. Sorry for the confusion. I added a new version of the spreadsheet with Xcode 4.5.2 results as well. In fact, the results from Xcode 4.5.2 are generally better, so this appears to be at least partly a regression.

llvmbot commented 11 years ago

libjpeg-turbo performance results, Clang/LLVM vs. GCC, OS X 10.8 (added Xcode 4.5.2 results)

lattner commented 11 years ago

Random question: have you tried Xcode 4.6? The compiler delta between 4.5 and 4.6 was pretty big.

llvmbot commented 11 years ago

xref rdar://13912904

llvmbot commented 11 years ago

Image was too big to attach. You can grab it here: http://www.libjpeg-turbo.org/artificial.ppm.bz2