LLVM 15 times slower than gcc compiling the erlang VM-core

llvmbot commented 10 years ago


Bugzilla Link	21185
Version	trunk
OS	Linux
Attachments	Preprocessed input, Input bitcode
Reporter	LLVM Bugzilla Contributor
CC	@hfinkel,@rnk

Extended Description

[This bug report is on LLVM and Clang 3.5 from the git mirror, (LLVM: df820bfd87d7d70d8f1a739333496e191631a3e2; Clang: 870571e52b265a9a471196b1cda93445c9262541;) release-compiled without assertions]

Compiling the core interpreter module of the Erlang VM [1] triggers non-linear behaviour of LLVM's code generator.

The C-module in question is erts/emulator/beam/beam_emu.c in [1], for convenience i have attached the preprocessed input to this bug-report.

Compiling beam_emu.c using clang -O3 takes roughly 45s, compiling it using gcc -O3 takes 3s. Running:

clang -O3 -emit-llvm beam_emu-clang.i -o beam_emu.bc takes 2s

llc -O3 llc -O3 beam_emu-clang.bc takes 43s [bitcode attached]

Which indicates that the slow-down is in the common code generation code.

Using a profile-enabled llc I can see that (60%) of the runtime is spent inside LiveVariables::isLiveOut() as called from PHIElimination::runOnMachineFunction().

For analyzing this bug it may help to know that beam_emu.c implements a directly threaded interpreter. For each VM instruction there is a first-class label followed by code implementing the instruction (much of it automatically generated). Each instruction ends with a goto* which branches to the next VM instruction.

As part of the BEAMJIT-project[2] we extend the VM by adding additional entry points. The resulting three-fold increase in the number of basic blocks increases the run-time to 30min, which may serve as a hint to what is wrong.

[1] http://www.erlang.org/download/otp_src_17.3.tar.gz erts/emulator/beam/beam_emu.c

[2] http://llvm.org/devmtg/2014-04/PDFs/Talks/drejhammar.pdf

hfinkel commented 10 years ago

Please make sure that you send patches to the llvm-commits list, or use reviews.llvm.org (see http://llvm.org/docs/Phabricator.html), for review. Patches posted to bug reports will likely be ignored. When you post for review, please write a description of the problem and the proposed solution (don't just say, for example, "Fix for llvm/llvm-project#21559 "). Please see http://llvm.org/docs/DeveloperPolicy.html for more information. Thanks!

llvmbot commented 10 years ago

LiveVariables::isLiveOut: Avoid expensive sort if possible Fix typo in patch description

llvmbot commented 10 years ago

s/NewPC/PC

llvmbot commented 10 years ago

LiveVariables::isLiveOut: Avoid expensive sort if possible For this example, the attached patch reduces the fraction of the runtime spent in PHI-elimination from 66% to ~20% by optimizing the case in which we have a large number of successor basic blocks, but the kill-set is empty.

For ease of access, I copy the patch description below:

If a variable is defined in a block and last used by a PHI node in its successor the variable will not be in the kill-set nor in the alive set. (See the comment in LiveVariables.h)

Consider an example where we have a large number of basic blocks, for example in a directly threaded byte-code interpreter:

instr0: / Do work / NewPC = PC + 1; goto* code[PC];

instr1: / Do work / NewPC = PC + 1; goto* code[PC];

As code[] can be anything we will get a CFG where each basic block is the successor and predecessor of all basic blocks including itself. The Value representing 'NewPC' variables will not be in the kill-set nor in the alive set. In this case we will waste a lot of time sorting the successor basic blocks. Therefore give up early when we detect that the kill set is empty.

rnk commented 10 years ago

Although the register allocator is a problem, profiling while running this example the dominant part of the execution time (66%) goes to PHI-elimination, 13% to the coalescer and only 4% to register allocation (unless you consider all those as register-allocation). Also the profile does not look at all like the profiles in bugs http://llvm.org/bugs/show_bug.cgi?id=12652 and http://llvm.org/bugs/show_bug.cgi?id=17409

I will attach a .dot-file with the annotated call graph (produced by gprof2dot[1]) for running this input.

[1] https://code.google.com/p/jrfonseca/wiki/Gprof2Dot

Interesting, then this may fixable.

llvmbot commented 10 years ago

Annotated call graph with profiling result Produced by:

gprof llc > dump gprof2dot -fprof dump -o dump.dot

gprof2dot is available at https://code.google.com/p/jrfonseca/wiki/Gprof2Dot

llvmbot commented 10 years ago

Although the register allocator is a problem, profiling while running this example the dominant part of the execution time (66%) goes to PHI-elimination, 13% to the coalescer and only 4% to register allocation (unless you consider all those as register-allocation). Also the profile does not look at all like the profiles in bugs http://llvm.org/bugs/show_bug.cgi?id=12652 and http://llvm.org/bugs/show_bug.cgi?id=17409

I will attach a .dot-file with the annotated call graph (produced by gprof2dot[1]) for running this input.

[1] https://code.google.com/p/jrfonseca/wiki/Gprof2Dot

rnk commented 10 years ago

It is a well-known problem that the register allocator has quadratic complexity w.r.t. the number of basic blocks: http://llvm.org/bugs/show_bug.cgi?id=12652 http://llvm.org/bugs/show_bug.cgi?id=17409

Unfortunately, Jakob, the primary author of the register allocator, isn't working on LLVM these days, so it's unlikely that this will be addressed any time soon. :(

llvmbot commented 10 years ago

[Fix typos in commands to replicate the bug]

clang -O3 -emit-llvm beam_emu-clang.i -o beam_emu-clang.bc

llc -O3 beam_emu-clang.bc

llvm / llvm-project

LLVM 15 times slower than gcc compiling the erlang VM-core #21559

Extended Description