Closed JonPsson closed 4 years ago
Handled with https://reviews.llvm.org/D73152 (committed as 96ea377).
The reason that the SystemZ backend is slower seems to be because it uses nearly twice as many virtual registers / instructions than when X-compiling to x86 (both on SystemZ). This is due to a lot of IMPLICIT_DEFs and INSERT_SUBREGs, that the SystemZ backend emits (in a seemingly sensible way). Even in the partially reduced test case (tc_creduce_partial.i), this function has ~67000 registers in ~1900 blocks and 875 blocks created during critical edge splitting (SystemZ backend).
PHIElimination suffers from a severe slow down with this huge function. The reason is apparently that when SplitCriticalEdge() updates LiveVariables by calling
addNewBlock(), all the registers are iterated over every single time.
The only way to fully remedy this seems to be to pre-compute the live-in registers for each MBB in an efficient way in PHIElimination.cpp and then pass that information along to LiveVariabless::addNewBlock(). I have made a patch for this: https://reviews.llvm.org/D73152.
(The original idea of using a BitVector in addNewBlock() only removed part of the regression, but was still a good speedup. Since this is reached by other passes I would assume that this improvement would still be desired, or? That would be a separate patch...)
If BitVector is used instead of DenseSet
Not sure why BitVector isn't used already, or if it would be a good idea - does anyone know?
Still not sure why there still is such a slowdown for SystemZ (x10). Comparing to X86, the functions seem to be very similar - the same number of basic blocks and PHI instructions. The structure of the PHIs are also the same - there are no PHI users of PHI definitions for either target...
reduced testcases in compressed tar file PHI nodes elimination seems to be a common cause of compilation slow-down, at least with csmith generated program on SystemZ. I took another such program that takes 15-20s to compile with gcc -O3, but 70-120s with clang -O3 (at least 4 times slower).
I found that 40-45% of the compile time was spent in "Eliminate PHI nodes for register allocation". Interestingly enough, I found that this seems to be specific to the SystemZ backend:
original csmith program:
time clang -O3 -march=z10 crash0.i -c (use -ftime-report to see pass timing) user 1m19.478s
time clang -O3 -march=arch13 crash0.i -c user 1m19.120s
time clang -O3 -target x86_64 crash0.i -c user 0m3.743s
time clang -O3 -target powerpc64 crash0.i -c user 0m6.290s
reduced with creduce (this takes about the same time with SystemZ/X86, but only with SystemZ is PHIElim the main time consumer (25%): clang -O3 -march=z10 tc_creduce_partial.i -c user 0m4.526s
clang -O3 -target x86_64 tc_creduce_partial.i -c user 0m4.416s
Further reduced with bugpoing (incomplete - very slow progress): time llc -O3 -mcpu=z15 tc_bugpoint_partially.bc (use -time-passes to see pass timing) user 0m1,879s Elim PHI: 50%
It seems that we are not consuming compile time in the backend per se but somehow creating many phi nodes in a way that causes a slow-down in common code.
tarball with testcase Sorry, here it is...
Compressed due to the size limit here.
Please attach tc_ctime_elimphi_gvn.i.
assigned to @JonPsson
Extended Description
clang -O3 -march=z13 tc_ctime_elimphi_gvn.i -o a.out -w -ftime-report
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name --- 230.3186 ( 51.8%) 0.1120 ( 16.9%) 230.4306 ( 51.7%) 232.3138 ( 51.7%) Eliminate PHI nodes for register allocation 101.7915 ( 22.9%) 0.1046 ( 15.8%) 101.8961 ( 22.9%) 102.6262 ( 22.8%) Global Value Numbering 17.3571 ( 3.9%) 0.0178 ( 2.7%) 17.3749 ( 3.9%) 17.5102 ( 3.9%) Memory SSA #2 8.7322 ( 2.0%) 0.0129 ( 2.0%) 8.7451 ( 2.0%) 8.8148 ( 2.0%) Greedy Register Allocator 8.4621 ( 1.9%) 0.0887 ( 13.4%) 8.5508 ( 1.9%) 8.6025 ( 1.9%) SystemZ DAG->DAG Pattern Instruction Selection
gcc -O3: 74s clang -O3: 446s
(csmith)