Open llvmbot opened 10 years ago
Please make sure that you send patches to the llvm-commits list, or use reviews.llvm.org (see http://llvm.org/docs/Phabricator.html), for review. Patches posted to bug reports will likely be ignored. When you post for review, please write a description of the problem and the proposed solution (don't just say, for example, "Fix for llvm/llvm-project#21559 "). Please see http://llvm.org/docs/DeveloperPolicy.html for more information. Thanks!
LiveVariables::isLiveOut: Avoid expensive sort if possible Fix typo in patch description
s/NewPC/PC
LiveVariables::isLiveOut: Avoid expensive sort if possible For this example, the attached patch reduces the fraction of the runtime spent in PHI-elimination from 66% to ~20% by optimizing the case in which we have a large number of successor basic blocks, but the kill-set is empty.
For ease of access, I copy the patch description below:
If a variable is defined in a block and last used by a PHI node in its successor the variable will not be in the kill-set nor in the alive set. (See the comment in LiveVariables.h)
Consider an example where we have a large number of basic blocks, for example in a directly threaded byte-code interpreter:
instr0: / Do work / NewPC = PC + 1; goto* code[PC];
instr1: / Do work / NewPC = PC + 1; goto* code[PC];
As code[] can be anything we will get a CFG where each basic block is the successor and predecessor of all basic blocks including itself. The Value representing 'NewPC' variables will not be in the kill-set nor in the alive set. In this case we will waste a lot of time sorting the successor basic blocks. Therefore give up early when we detect that the kill set is empty.
Although the register allocator is a problem, profiling while running this example the dominant part of the execution time (66%) goes to PHI-elimination, 13% to the coalescer and only 4% to register allocation (unless you consider all those as register-allocation). Also the profile does not look at all like the profiles in bugs http://llvm.org/bugs/show_bug.cgi?id=12652 and http://llvm.org/bugs/show_bug.cgi?id=17409
I will attach a .dot-file with the annotated call graph (produced by gprof2dot[1]) for running this input.
Interesting, then this may fixable.
Annotated call graph with profiling result Produced by:
gprof llc > dump gprof2dot -fprof dump -o dump.dot
gprof2dot is available at https://code.google.com/p/jrfonseca/wiki/Gprof2Dot
Although the register allocator is a problem, profiling while running this example the dominant part of the execution time (66%) goes to PHI-elimination, 13% to the coalescer and only 4% to register allocation (unless you consider all those as register-allocation). Also the profile does not look at all like the profiles in bugs http://llvm.org/bugs/show_bug.cgi?id=12652 and http://llvm.org/bugs/show_bug.cgi?id=17409
I will attach a .dot-file with the annotated call graph (produced by gprof2dot[1]) for running this input.
It is a well-known problem that the register allocator has quadratic complexity w.r.t. the number of basic blocks: http://llvm.org/bugs/show_bug.cgi?id=12652 http://llvm.org/bugs/show_bug.cgi?id=17409
Unfortunately, Jakob, the primary author of the register allocator, isn't working on LLVM these days, so it's unlikely that this will be addressed any time soon. :(
[Fix typos in commands to replicate the bug]
clang -O3 -emit-llvm beam_emu-clang.i -o beam_emu-clang.bc
llc -O3 beam_emu-clang.bc
Extended Description
[This bug report is on LLVM and Clang 3.5 from the git mirror, (LLVM: df820bfd87d7d70d8f1a739333496e191631a3e2; Clang: 870571e52b265a9a471196b1cda93445c9262541;) release-compiled without assertions]
Compiling the core interpreter module of the Erlang VM [1] triggers non-linear behaviour of LLVM's code generator.
The C-module in question is erts/emulator/beam/beam_emu.c in [1], for convenience i have attached the preprocessed input to this bug-report.
Compiling beam_emu.c using clang -O3 takes roughly 45s, compiling it using gcc -O3 takes 3s. Running:
clang -O3 -emit-llvm beam_emu-clang.i -o beam_emu.bc takes 2s
llc -O3 llc -O3 beam_emu-clang.bc takes 43s [bitcode attached]
Which indicates that the slow-down is in the common code generation code.
Using a profile-enabled llc I can see that (60%) of the runtime is spent inside LiveVariables::isLiveOut() as called from PHIElimination::runOnMachineFunction().
For analyzing this bug it may help to know that beam_emu.c implements a directly threaded interpreter. For each VM instruction there is a first-class label followed by code implementing the instruction (much of it automatically generated). Each instruction ends with a goto* which branches to the next VM instruction.
As part of the BEAMJIT-project[2] we extend the VM by adding additional entry points. The resulting three-fold increase in the number of basic blocks increases the run-time to 30min, which may serve as a hint to what is wrong.
[1] http://www.erlang.org/download/otp_src_17.3.tar.gz erts/emulator/beam/beam_emu.c
[2] http://llvm.org/devmtg/2014-04/PDFs/Talks/drejhammar.pdf