llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.62k stars 11.83k forks source link

[Compile Time] 50% of time spent in Elim phi nodes #43594

Closed JonPsson closed 4 years ago

JonPsson commented 4 years ago
Bugzilla Link 44249
Resolution FIXED
Resolved on Feb 05, 2020 15:17
Version trunk
OS Linux
CC @davidbolvansky,@JonPsson,@qcolombet,@uweigand,@yuanfang-chen

Extended Description

clang -O3 -march=z13 tc_ctime_elimphi_gvn.i -o a.out -w -ftime-report

---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name --- 230.3186 ( 51.8%) 0.1120 ( 16.9%) 230.4306 ( 51.7%) 232.3138 ( 51.7%) Eliminate PHI nodes for register allocation 101.7915 ( 22.9%) 0.1046 ( 15.8%) 101.8961 ( 22.9%) 102.6262 ( 22.8%) Global Value Numbering 17.3571 ( 3.9%) 0.0178 ( 2.7%) 17.3749 ( 3.9%) 17.5102 ( 3.9%) Memory SSA #​2 8.7322 ( 2.0%) 0.0129 ( 2.0%) 8.7451 ( 2.0%) 8.8148 ( 2.0%) Greedy Register Allocator 8.4621 ( 1.9%) 0.0887 ( 13.4%) 8.5508 ( 1.9%) 8.6025 ( 1.9%) SystemZ DAG->DAG Pattern Instruction Selection

gcc -O3: 74s clang -O3: 446s

(csmith)

JonPsson commented 4 years ago

Handled with https://reviews.llvm.org/D73152 (committed as 96ea377).

JonPsson commented 4 years ago

The reason that the SystemZ backend is slower seems to be because it uses nearly twice as many virtual registers / instructions than when X-compiling to x86 (both on SystemZ). This is due to a lot of IMPLICIT_DEFs and INSERT_SUBREGs, that the SystemZ backend emits (in a seemingly sensible way). Even in the partially reduced test case (tc_creduce_partial.i), this function has ~67000 registers in ~1900 blocks and 875 blocks created during critical edge splitting (SystemZ backend).

PHIElimination suffers from a severe slow down with this huge function. The reason is apparently that when SplitCriticalEdge() updates LiveVariables by calling
addNewBlock(), all the registers are iterated over every single time.

The only way to fully remedy this seems to be to pre-compute the live-in registers for each MBB in an efficient way in PHIElimination.cpp and then pass that information along to LiveVariabless::addNewBlock(). I have made a patch for this: https://reviews.llvm.org/D73152.

(The original idea of using a BitVector in addNewBlock() only removed part of the regression, but was still a good speedup. Since this is reached by other passes I would assume that this improvement would still be desired, or? That would be a separate patch...)

JonPsson commented 4 years ago

If BitVector is used instead of DenseSet for the Defs and Kills sets in LiveVariables::addNewBlock(), I see unchanged compile time on X86 and PPC, while on SystemZ improves from 80 seconds to 56 seconds.

Not sure why BitVector isn't used already, or if it would be a good idea - does anyone know?

Still not sure why there still is such a slowdown for SystemZ (x10). Comparing to X86, the functions seem to be very similar - the same number of basic blocks and PHI instructions. The structure of the PHIs are also the same - there are no PHI users of PHI definitions for either target...

JonPsson commented 4 years ago

reduced testcases in compressed tar file PHI nodes elimination seems to be a common cause of compilation slow-down, at least with csmith generated program on SystemZ. I took another such program that takes 15-20s to compile with gcc -O3, but 70-120s with clang -O3 (at least 4 times slower).

I found that 40-45% of the compile time was spent in "Eliminate PHI nodes for register allocation". Interestingly enough, I found that this seems to be specific to the SystemZ backend:

original csmith program:

time clang -O3 -march=z10 crash0.i -c (use -ftime-report to see pass timing) user 1m19.478s

time clang -O3 -march=arch13 crash0.i -c user 1m19.120s

time clang -O3 -target x86_64 crash0.i -c user 0m3.743s

time clang -O3 -target powerpc64 crash0.i -c user 0m6.290s

reduced with creduce (this takes about the same time with SystemZ/X86, but only with SystemZ is PHIElim the main time consumer (25%): clang -O3 -march=z10 tc_creduce_partial.i -c user 0m4.526s

clang -O3 -target x86_64 tc_creduce_partial.i -c user 0m4.416s

Further reduced with bugpoing (incomplete - very slow progress): time llc -O3 -mcpu=z15 tc_bugpoint_partially.bc (use -time-passes to see pass timing) user 0m1,879s Elim PHI: 50%

It seems that we are not consuming compile time in the backend per se but somehow creating many phi nodes in a way that causes a slow-down in common code.

JonPsson commented 4 years ago

tarball with testcase Sorry, here it is...

Compressed due to the size limit here.

davidbolvansky commented 4 years ago

Please attach tc_ctime_elimphi_gvn.i.

JonPsson commented 4 years ago

assigned to @JonPsson