Open cmtice opened 9 years ago
Sorry. Please ignore the comment and attachment above; I mistakenly posted them to the wrong bug.
testcase 1.c: Got another, smaller testcase.
gperf record -e instructions -c 500000 -o 1.data ~/workarea/llvm-r243659/build/bin/clang -O2 -S pr1.c
InlineSpiller::propagateSiblingValue is the function executing the most instructions.
gperf report -n -i 1.data
  8.78%  248  clang-3.8  libLLVMCodeGen.so.3.8.0svn  [.] (anonymous namespace)::InlineSpiller::propagateSiblingValue(llvm::DenseMapIterator<llvm::VNInfo, (anon
  5.24%  148  clang-3.8  libLLVMSupport.so.3.8.0svn  [.] llvm::BranchProbability::scale(unsigned long) const
  4.57%  129  clang-3.8  ld-2.19.so                  [.] do_lookup_x
  3.47%   98  clang      ld-2.19.so                  [.] do_lookup_x
  2.98%   84  clang-3.8  libLLVMCodeGen.so.3.8.0svn  [.] llvm::MachineBlockFrequencyInfo::getBlockFreq(llvm::MachineBasicBlock const) const
  2.83%   80  clang-3.8  libLLVMSupport.so.3.8.0svn  [.] llvm::SmallPtrSetImplBase::FindBucketFor(void const*) const
  2.02%   57  clang-3.8  libLLVMCodeGen.so.3.8.0svn  [.] llvm::LiveRange::extendInBlock(llvm::SlotIndex, llvm::SlotIndex)
On further examination, we are not convinced that the assembly here is that bad; however, a surprisingly large amount of time is still being spent in this function and in the Block Frequency functions. There seems to be some inefficiency here that needs further investigation.
More data: here are the highest perf entries for compiling the file. Three of the top four seem to be related to Branch Probabilities or Block Frequencies (~37% of the compilation time, not counting the number for propagateSiblingValue):
Extended Description
While recently examining a performance problem in clang (8x slower than GCC, see https://llvm.org/bugs/show_bug.cgi?id=24618), we looked at the results of running 'perf' on clang and saw that in this case the hottest function was llvm::BranchProbability::scale (20.69% of the entire compilation time was being spent in this function).
Looking more closely at the function's assembly, annotated with perf results, we saw:
  0.08 │      xor    %edx,%edx
  0.15 │      imul   %rax,%rdi
  2.51 │      shr    $0x20,%rcx
  0.00 │      imul   %rax,%rcx
  0.93 │      mov    %rdi,%rsi
  0.45 │      mov    %rcx,%rax
  0.86 │      shr    $0x20,%rsi
  0.69 │      shr    $0x20,%rax
  1.01 │      add    %esi,%ecx
  0.41 │      mov    $0xffffffffffffffff,%rsi
  0.26 │      setb   %dl
  0.55 │      add    %edx,%eax
  0.85 │      cmp    %eax,%r8d
       │    ↓ ja     50
       │49:   mov    %rsi,%rax
  1.33 │    ← retq
       │      nop
  0.93 │50:   shl    $0x20,%rax
  0.33 │      mov    %ecx,%ecx
       │      xor    %edx,%edx
  0.05 │      or     %rcx,%rax
  1.00 │      mov    $0xffffffff,%r9d
  0.27 │      div    %r8
 32.45 │      cmp    %r9,%rax
  1.14 │      mov    %rax,%rcx
  0.74 │    ↑ ja     49
  0.98 │      mov    %rdx,%rax
  0.08 │      mov    %edi,%edi
  0.03 │      xor    %edx,%edx
  0.40 │      shl    $0x20,%rax
  0.94 │      shl    $0x20,%rcx
  0.03 │      or     %rdi,%rax
  0.50 │      div    %r8
 43.53 │      add    %rcx,%rax
  1.25 │      cmovae %rax,%rsi
  2.61 │    ↑ jmp    49
It appears that roughly 76% of the time in this function (32.45% + 43.53%, which perf attributes to the instruction immediately following each 'div') is being spent on the two 'div' ops. This assembly looks very inefficient: the two divs ought to be done together, thus possibly halving the time spent in this function.
(This is on Intel x86_64, BTW, in case it's not obvious from the assembly.)
This is with ToT Clang/LLVM, built with:
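For readers who prefer source to assembly, here is a rough C++ sketch of what the annotated listing appears to compute: a 64-bit value multiplied by a 32-bit numerator and divided by a 32-bit denominator, saturating at UINT64_MAX. It is reconstructed from the listing, not copied from the LLVM sources; the name scaleSketch and its parameter names are hypothetical. In this sketch the second division consumes the remainder of the first, which may limit how far the two 'div' ops can be merged.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not the LLVM implementation): returns Num * N / D,
// saturating at UINT64_MAX, using only 64-bit arithmetic.
uint64_t scaleSketch(uint64_t Num, uint32_t N, uint32_t D) {
  assert(D != 0 && "division by zero");

  // 64x32 -> 96-bit product, kept as three 32-bit digits.
  uint64_t ProductHigh = (Num >> 32) * N;
  uint64_t ProductLow = (Num & UINT32_MAX) * N;
  uint32_t Upper32 = static_cast<uint32_t>(ProductHigh >> 32);
  uint32_t Lower32 = static_cast<uint32_t>(ProductLow);
  uint32_t Mid32Partial = static_cast<uint32_t>(ProductHigh);
  uint32_t Mid32 = Mid32Partial + static_cast<uint32_t>(ProductLow >> 32);
  Upper32 += Mid32 < Mid32Partial; // carry out of the middle digit

  // The quotient cannot fit in 64 bits if the top digit is >= D.
  if (Upper32 >= D)
    return UINT64_MAX;

  // First division: top two digits (the first 'div' in the listing).
  uint64_t Rem = (static_cast<uint64_t>(Upper32) << 32) | Mid32;
  uint64_t UpperQ = Rem / D;
  if (UpperQ > UINT32_MAX) // saturation check mirrored from the listing
    return UINT64_MAX;

  // Second division: remainder of the first, followed by the low digit.
  Rem = ((Rem % D) << 32) | Lower32;
  uint64_t LowerQ = Rem / D;
  uint64_t Q = (UpperQ << 32) + LowerQ;
  return Q < LowerQ ? UINT64_MAX : Q; // saturate if the final add wrapped
}
```

For example, scaleSketch(7, 3, 2) yields 10 and scaleSketch(UINT64_MAX, 2, 1) saturates to UINT64_MAX.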
$ cmake -G "Unix Makefiles" -DCMAKE_INSTALL_PREFIX=/tmp/llvm-install.opt -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=On
$ make all
$ make install
Attached is a gzip'd version of the .ii file we used. The clang command to compile this file is:
/usr/local/google2/cmtice/llvm-work/llvm-install.opt/bin/clang++ -c -fno-exceptions -Wno-multichar -m64 -Wa,--noexecstack -fPIC -no-canonical-prefixes -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -fstack-protector -DSTDC_FORMAT_MACROS -DSTDC_CONSTANT_MACROS -DANDROID -fmessage-length=0 -W -Wall -Wno-unused -Winit-self -Wpointer-arith -g -fno-strict-aliasing -DNDEBUG -UDEBUG -D__compiler_offsetof=__builtin_offsetof -Werror=int-conversion -Wno-reserved-id-macro -Wno-format-pedantic -Wno-unused-command-line-argument -target x86_64-linux-gnu -DANDROID -fmessage-length=0 -W -Wall -Wno-unused -Winit-self -Wpointer-arith -Wsign-promo -DNDEBUG -UDEBUG -Wno-inconsistent-missing-override -target x86_64-linux-gnu -DBUILDING_LIBART=1 -Wthread-safety -Wthread-safety-negative -Wimplicit-fallthrough -Wfloat-equal -Wint-to-void-pointer-cast -Wused-but-marked-unused -Wdeprecated -Wunreachable-code-break -Wunreachable-code-return -Wmissing-noreturn -fno-omit-frame-pointer -fno-rtti -std=gnu++11 -ggdb3 -Wall -Werror -Wextra -Wstrict-aliasing -fstrict-aliasing -Wunreachable-code -Wredundant-decls -Wshadow -Wunused -fvisibility=protected -DART_DEFAULT_GC_TYPE_IS_CMS -DIMT_SIZE=64 -DART_BASE_ADDRESS=0x60000000 -DART_DEFAULT_INSTRUCTION_SET_FEATURES=default -DART_BASE_ADDRESS_MIN_DELTA=-0x1000000 -DART_BASE_ADDRESS_MAX_DELTA=0x1000000 -DART_DEFAULT_INSTRUCTION_SET_FEATURES="default" -O3 -Wframe-larger-than=2700 -fPIC -D_USING_LIBCXX -std=gnu++14 -nostdinc++ -Werror=int-to-pointer-cast -Werror=pointer-to-int-cast -Werror=address-of-temporary -Werror=null-dereference -Werror=return-type -o interpreter_goto_table_impl.o ./interpreter_goto_table_impl.ii