llvm / llvm-project


Runaway inlining with new pass manager. #57006

Open rschatz opened 2 years ago

rschatz commented 2 years ago

I'm looking at the code generated for yaml_parser_fetch_more_tokens in https://github.com/oracle/truffleruby/blob/4054469bbd985db7dc5c86299004ff86a2835baf/src/main/c/psych/yaml/scanner.c (to reproduce, you just need this file, yaml.h and yaml_private.h next to it).

I'm compiling the file with clang -S -emit-llvm -I. -O1 scanner.c (the -O level doesn't really matter, as long as it's not -O0).

Looking at the resulting bitcode of yaml_parser_fetch_more_tokens, it's huge, ~3500 bitcode instructions. Also, the first basic block starts with 60 allocas, followed by a mix of 700 bitcast and getelementptr instructions, many of which are only used much later, all over the rest of the function. As far as I can tell, the bitcode is correct, just a lot bigger than expected.

Compiling the same function with -flegacy-pass-manager -O1, it looks totally fine, just ~40 bitcode instructions. Even with -flegacy-pass-manager -O3 it's only slightly bigger, ~70 bitcode instructions.

Running with -mllvm --print-changed=quiet -mllvm --filter-print-funcs=yaml_parser_fetch_more_tokens shows that it is the inliner that's blowing up (thanks @aeubanks for pointing out these flags on discourse).

Even if this is just the known "doing some inlining at -O1" difference in the new pass manager, I think this is way too aggressive, even for -O3, let alone -O1. And even if the code size is considered acceptable, I don't think it makes sense to have that many GEPs in the first block, because that just leads to a lot of spilling.

I did my tests with LLVM 14.0.3, plus various other LLVM revisions. I actually bisected the issue to the commit that enabled the new pass manager by default, so the behavior has been in the new pass manager for at least as long as it has been enabled by default. Everything was on linux/x86_64, in case it makes a difference.

aeubanks commented 2 years ago

I looked into this; as far as I can tell, the inlining is expected.

clang -Xclang -disable-llvm-passes gives the unoptimized IR. Running opt -O1 -S /tmp/a.ll --filter-print-funcs=yaml_parser_fetch_more_tokens --print-before=inline -print-module-scope gives the IR right before we run inlining on that function.

Running opt -passes=inline -debug with the inliner modified to only touch yaml_parser_fetch_more_tokens shows a lot of very negative costs, around the cost of the "last call to static" bonus. That means all of these calls are to static functions that have no other callers, so there's basically no reason not to inline. Once they're inlined, the callees are deleted. The total number of lines in the module right after inlining is just a bit smaller than right before, which supports this theory.
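For illustration, a hypothetical C sketch (not taken from scanner.c) of the pattern that bonus rewards: a static function with exactly one call site, where inlining lets the out-of-line copy be deleted, so total code size barely changes.

  /* Hypothetical example, not from scanner.c: a static helper with a single
   * call site. The inliner applies the large "last call to static" bonus
   * because, once the only call is inlined, the out-of-line definition of
   * helper() is dead and gets removed, so the module does not grow much even
   * if the helper is big. */
  static int helper(const int *data, int n) {
      int sum = 0;
      for (int i = 0; i < n; i++)
          sum += data[i] * data[i];
      return sum;
  }

  int only_caller(const int *data, int n) {
      /* Sole call site: after inlining, helper() disappears from the module. */
      return helper(data, n);
  }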

pogo59 commented 2 years ago

so there's basically no reason not to inline

Well, that's kind of overstating things. Inlining can blow up stack usage, either directly (the callees' locals all end up in the caller's frame) or because it causes huge register pressure; it can make the caller so huge that it takes forever to compile; ... there are reasons not to inline a static callee into its only caller.
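A hypothetical C sketch of that concern (not from the scanner.c source): two large single-caller statics whose frames and live values all end up in one caller after inlining.

  /* Hypothetical illustration of the stack/register-pressure concern, not
   * from scanner.c. Each helper has a large local buffer; once both are
   * inlined, caller()'s frame can hold both buffers at the same time (unless
   * the backend proves their lifetimes disjoint and merges the slots), and
   * all intermediate values compete for registers in one huge function. */
  static int fill_a(void) {
      char buf[4096];
      int s = 0;
      for (int i = 0; i < (int)sizeof buf; i++) { buf[i] = (char)i; s += buf[i]; }
      return s;
  }

  static int fill_b(void) {
      char buf[4096];
      int s = 0;
      for (int i = 0; i < (int)sizeof buf; i++) { buf[i] = (char)(i ^ 0x5a); s += buf[i]; }
      return s;
  }

  int caller(void) {
      /* After both single-caller statics are inlined, this function carries
       * the combined stack usage and register pressure of both helpers. */
      return fill_a() + fill_b();
  }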

hiraditya commented 2 years ago

If possible, you can add __attribute__((noinline)) to a function to prevent it from being inlined.
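For concreteness, a minimal hypothetical example of where the attribute goes (rschatz applies the same attribute to yaml_parser_fetch_next_token further down in the thread):

  /* Minimal hypothetical example: __attribute__((noinline)) goes on the
   * function definition and tells clang/GCC to keep the call out of line. */
  __attribute__((noinline))
  static int expensive_helper(int x) {
      return x * x + 1;
  }

  int use_it(int x) {
      /* Stays a real call even at -O3, because of the attribute. */
      return expensive_helper(x);
  }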

dwblaikie commented 2 years ago

@rschatz other than the instruction count for the function, are there more end-result metrics (later compile time, total execution time, stack size, etc) that cause this function/optimization choice to be problematic for your situation? (how'd you end up investigating this particular situation/what brought it to your attention?)

rschatz commented 2 years ago

Yes, the main thing I'm concerned about is compile time.

The reason why I started looking into this is a warmup regression in GraalVM, caused by the update from LLVM 12 to LLVM 14 some time ago. For context: what we're doing in GraalVM is compiling C code to LLVM bitcode ahead of time, and then using that LLVM bitcode as the starting point for JIT compilation at runtime. The idea is that our compiler shouldn't have much left to do, at least for the part of the code that is LLVM bitcode (there is other code, in this example Ruby, where the compiler actually has to do more). But the yaml_parser_fetch_more_tokens function alone still has a compile time of ~8 seconds.

I'm also seeing a similar warmup regression in the stockfish benchmark. I haven't dug very deep into it, since the codebase is much bigger than the yaml example, but I'm assuming it's the same thing, in particular since it's also "fixed" by using the legacy pass manager.

To be honest, I'm not sure the inliner is really the root cause of this problem. Even with everything inlined, the function looks too big. What I tried just now is to put __attribute__((noinline)) on the yaml_parser_fetch_next_token function. This changes quite a lot. Obviously yaml_parser_fetch_next_token is no longer inlined, which makes yaml_parser_fetch_more_tokens reasonable. But even the yaml_parser_fetch_next_token function itself looks kind of reasonable: it has only ~1600 bitcode instructions, only 21 allocas at the beginning, and no huge pile of GEPs and bitcasts.

I tried all of that with -O1 and -O3, and no big difference there.

So I guess what I don't get is: why does inlining a single call to a ~1600-instruction function into an ~85-instruction function result in ~3500 instructions?

I guess with __attribute__((noinline)) I've at least found a reasonable workaround for the yaml problem; we can do that since the code is in our repo. But that doesn't solve my problem with the stockfish benchmark (or, in general, with arbitrary code from our users).

dwblaikie commented 2 years ago

If there's a bunch of single-caller functions inlined into one function, the best comparison might be between all those functions kept separate and the one monolithic function - do you have data about that?

rschatz commented 2 years ago

Ok, so I did some more analysis on the code.

What I did was generate two call graphs, one with inlining, the other without:

  clang -S -O1 -I. -emit-llvm scanner.c
  opt -disable-output --passes=print-callgraph scanner.ll > callgraph-withinline
  clang -S -O1 -I. -fno-inline-functions -emit-llvm scanner.c
  opt -disable-output --passes=print-callgraph scanner.ll > callgraph-noinline

I then wrote a little python script to walk the callgraph-noinline starting at the yaml_parser_fetch_more_tokens function, using callgraph-withinline to decide what was inlined into it and what not. The result is this dot file: https://gist.github.com/rschatz/c811b51b0524c24047877ca618e543db#file-inlined-functions-dot (link to graphviz online: https://gist.github.com/rschatz/c811b51b0524c24047877ca618e543db#file-graphviz-md)

The subgraph box contains everything that's inlined into yaml_parser_fetch_more_tokens. Everything outside the box is a function that is still there after inlining. The sizes are just estimates, counting lines in the .ll files. The totalSize is a naive estimate of the size of the function plus its inlined callees. Looking at how much is actually inlined, the size doesn't seem so surprising anymore.

One thing to note is that this is definitely not just inlining of single-caller functions. The total sum of "insts" over all inlined-and-removed functions is only 3858, while the total size estimate that accounts for multiple call sites is 5050.
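A hypothetical C sketch of why multiple call sites inflate the total: each inlined call site gets its own copy of the callee's body, so the post-inlining function can end up larger than the sum of the original function sizes suggests.

  /* Hypothetical illustration, not from scanner.c: a static helper that is
   * called from several places. Inlining duplicates its body once per call
   * site, so the caller grows by roughly (number of call sites) x (callee
   * size), even though the helper itself is deleted afterwards. */
  static long scale(long v, long factor) {
      /* imagine a body that is a few hundred instructions long */
      return v * factor + (v >> 3) - (factor & 7);
  }

  long combine(long a, long b, long c) {
      /* Three call sites -> up to three inlined copies of scale()'s body. */
      return scale(a, 2) + scale(b, 3) + scale(c, 5);
  }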

All of that still doesn't explain the first basic block, which produces 700 SSA values that are not immediately used and end up spilled to the stack.

One other thing I noticed after staring at the code long enough is that there is a lot of redundancy in that first basic block containing all these GEPs and bitcasts.

Looking at this bitcode: https://gist.github.com/rschatz/c811b51b0524c24047877ca618e543db#file-yaml_parser_fetch_more_tokens_o0 (produced with clang -S -O1 -I. -emit-llvm scanner.c).

For example, getelementptr inbounds %struct.yaml_parser_s, %struct.yaml_parser_s* %0, i64 0, i32 24 appears 11 times, all of them in the first basic block with no actual memory accesses in between. There are also quite a few redundant bitcasts. For example, a random snippet out of that file:

  %612 = bitcast %struct.yaml_mark_s* %10 to i8*
  %613 = bitcast %struct.yaml_mark_s* %8 to i8*
  %614 = bitcast %struct.yaml_mark_s* %8 to i8*
  %615 = getelementptr inbounds [52 x i8], [52 x i8]* %12, i64 0, i64 0
  %616 = getelementptr inbounds [52 x i8], [52 x i8]* %9, i64 0, i64 0
  %617 = bitcast %struct.yaml_mark_s* %10 to i8*

Could it be that there is some phase ordering problem? I would expect -O1 to be able to deal with this easily. Just to double check, I also tried with -O3 (https://gist.github.com/rschatz/c811b51b0524c24047877ca618e543db#file-yaml_parser_fetch_more_tokens_o3). In a way, it looks even worse:

  %190 = bitcast %struct.yaml_mark_s* %53 to <2 x i64>*
  %191 = bitcast i64* %70 to <2 x i64>*
  %192 = bitcast i64* %70 to <2 x i64>*
  %193 = bitcast %struct.yaml_mark_s* %53 to <2 x i64>*
  %194 = bitcast %struct.yaml_mark_s* %53 to <2 x i64>*
  %195 = bitcast %struct.yaml_mark_s* %53 to <2 x i64>*
  %196 = bitcast %struct.yaml_mark_s* %53 to <2 x i64>*
  %197 = bitcast %struct.yaml_mark_s* %53 to <2 x i64>*
  %198 = bitcast %struct.yaml_mark_s* %53 to <2 x i64>*
  %199 = bitcast %struct.yaml_mark_s* %53 to <2 x i64>*
  %200 = bitcast %struct.yaml_mark_s* %53 to <2 x i64>*
  %201 = bitcast %struct.yaml_mark_s* %53 to <2 x i64>*
  %202 = bitcast %struct.yaml_mark_s* %53 to <2 x i64>*
  %203 = bitcast %struct.yaml_mark_s* %53 to <2 x i64>*
  %204 = bitcast %struct.yaml_mark_s* %53 to <2 x i64>*
  %205 = bitcast %struct.yaml_mark_s* %53 to <2 x i64>*
  %206 = bitcast %struct.yaml_mark_s* %53 to <2 x i64>*
  %207 = bitcast %struct.yaml_mark_s* %53 to <2 x i64>*
  %208 = bitcast %struct.yaml_mark_s* %53 to <2 x i64>*
  %209 = bitcast i64* %70 to <2 x i64>*

EDIT: included link to online graphviz viewer for convenience