llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[CompileTime] GVN takes 60+% of -O3 compile time (was: JumpThreading takes 29% of the wall O3 compile time) #17130

Open llvmbot opened 11 years ago

llvmbot commented 11 years ago
Bugzilla Link 16756
Version 3.2
OS MacOS X
Attachments Source file which causes slow compilation performance
Reporter LLVM Bugzilla Contributor
CC @abadams,@chandlerc,@dexonsmith,@efriedma-quic,@fhahn,@joker-eph,@darkbuck,@rnk,@yuanfang-chen

Extended Description

Halide is a not-especially-complicated C++ project (which happens to use LLVM internally, but that's not the subject of this bug):

https://github.com/halide/Halide/

Its build process is simple. On most platforms, the GCC and MSVC toolchains compile the C++ source quickly. On OS X (10.8), the stock g++ (llvm-gcc-4.2-based) compiles all but one source file quickly. But on CodeGen_ARM.cpp (https://github.com/halide/Halide/blob/master/src/CodeGen_ARM.cpp), it is pathologically slow (>7 minutes on a 2.8 GHz Core2 Xeon Mac Pro, ~5 minutes on a Sandy Bridge MacBook Air). The same file compiles in the expected second or three on any Homebrew GCC version, using the full GNU toolchain.

Clearly, this tickles something serious in the Apple/LLVM toolchain. The process which chugs for minutes during this is named "clang". The only obvious potential standout here is the relatively complex stack-allocated array "patterns" in CodeGen_ARM::visit(const Cast *op).
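For readers unfamiliar with the code, the following is a minimal sketch of the general shape being described: a stack-allocated array whose elements have non-trivial constructors and destructors. It is not the Halide source. `Expr`, `Pattern`, `count_patterns`, and the table entries are hypothetical names chosen for illustration, and whether this shape is what actually triggers the slowdown is an assumption, not something established in this report.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for an expression handle with a non-trivial destructor.
struct Expr {
    std::string text;
    Expr(const char *s) : text(s) {}
};

// Hypothetical pattern record: an intrinsic name plus the expression it matches.
struct Pattern {
    std::string intrin;
    Expr pattern;
    int kind;
};

// Returns the number of table entries whose kind equals `wanted`.
std::size_t count_patterns(int wanted) {
    // Each row constructs objects that must be destroyed again if a later
    // constructor throws, so every row adds cleanup paths to this function.
    Pattern patterns[] = {
        {"vaddhn", Expr("(a + b) >> 8"), 0},
        {"vsubhn", Expr("(a - b) >> 8"), 0},
        {"vrhadd", Expr("(a + b + 1) >> 1"), 1},
        {"vhadd", Expr("(a + b) >> 1"), 1},
    };
    std::size_t n = 0;
    for (const Pattern &p : patterns) {
        if (p.kind == wanted) ++n;
    }
    return n;
}
```

With exceptions enabled, a table like this with many rows produces one function with a large number of basic blocks and cleanup edges, which is the kind of input where per-block or per-value costs in an optimizer pass become visible.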

joker-eph commented 3 years ago

mentioned in issue llvm/llvm-bugzilla-archive#17855

fhahn commented 5 years ago

codegen_arm.bc, reproducer for compile time issue

fhahn commented 5 years ago

The majority of the compile time is now spent in GVN. Updating the title to reflect that. Also attached codegen_arm.bc.

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 10.1732 seconds (10.1731 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   6.4170 ( 64.2%)   0.0243 ( 14.0%)   6.4413 ( 63.3%)   6.4422 ( 63.3%)  Global Value Numbering
   0.6697 (  6.7%)   0.0046 (  2.7%)   0.6743 (  6.6%)   0.6743 (  6.6%)  Value Propagation
   0.6127 (  6.1%)   0.0064 (  3.7%)   0.6191 (  6.1%)   0.6191 (  6.1%)  Jump Threading
   0.2233 (  2.2%)   0.0035 (  2.0%)   0.2268 (  2.2%)   0.2268 (  2.2%)  Function Integration/Inlining
   0.1531 (  1.5%)   0.0018 (  1.0%)   0.1550 (  1.5%)   0.1549 (  1.5%)  Jump Threading #2

joker-eph commented 9 years ago

Oh I see, yes that was my idea when I did the previous measurements. Do you think I should have closed this bug as fixed and opened a new one for JumpThreading?

rnk commented 9 years ago

> As of r253350, nothing changed since my last update, still ~30% of the total clang invocation is spent in JumpThreading. Did you get different measurements? I measured on OS X.

No, but the original report was that the file took 7m to compile. You said it takes 12s now, despite the fact that 30% of the time is in JumpThreading. If you want to leave it open to track future improvements to JumpThreading or GVN, go for it.

joker-eph commented 9 years ago

As of r253350, nothing changed since my last update, still ~30% of the total clang invocation is spent in JumpThreading. Did you get different measurements? I measured on OS X.

rnk commented 9 years ago

Mehdi, sounds like this is fixed?

joker-eph commented 9 years ago

r245820 fixes the SROA issue and improves -O3 compile-time from 113s to 12s on my machine.

The top 5 are now:

Running Time       Self (ms)  Symbol Name
3445.0ms  29.4%    5.0        (anonymous namespace)::JumpThreading::runOnFunction(llvm::Function&)
1483.0ms  12.6%    2.0        (anonymous namespace)::GVN::runOnFunction(llvm::Function&)
1063.0ms   9.0%    17.0       (anonymous namespace)::CorrelatedValuePropagation::runOnFunction(llvm::Function&)
 697.0ms   5.9%    0.0        (anonymous namespace)::X86DAGToDAGISel::runOnMachineFunction(llvm::MachineFunction&)
 589.0ms   5.0%    4.0        (anonymous namespace)::RegisterCoalescer::runOnMachineFunction(llvm::MachineFunction&)

3f18db19-85d0-42b5-b58f-dbfbd8cbce51 commented 10 years ago

Just profiled this at r206481.

The SROA slowdown is the same as http://llvm.org/bugs/show_bug.cgi?id=17855. Bottleneck is SSAUpdater.

Running Time        Self      Symbol Name
64684.0ms  43.5%    0.0       (anonymous namespace)::SROA::runOnFunction(llvm::Function&)
64677.0ms  43.5%    1.0       llvm::LoadAndStorePromoter::run(llvm::SmallVectorImpl<llvm::Instruction*> const&) const
64676.0ms  43.5%    0.0       llvm::SSAUpdater::GetValueInMiddleOfBlock(llvm::BasicBlock*)
64676.0ms  43.5%    0.0       llvm::SSAUpdater::GetValueAtEndOfBlockInternal(llvm::BasicBlock*)
64672.0ms  43.5%    9.0       llvm::SSAUpdaterImpl::GetValue(llvm::BasicBlock*)
64379.0ms  43.3%    63469.0   llvm::SSAUpdaterImpl::FindAvailableVals(llvm::SmallVectorImpl<llvm::SSAUpdaterImpl::BBInfo*>*)

However, even more time is spent in CorrelatedValuePropagation. The bottleneck there is LVIValueHandle::deleted().

Running Time        Self      Symbol Name
68279.0ms  46.0%    30.0      (anonymous namespace)::CorrelatedValuePropagation::runOnFunction(llvm::Function&)
65876.0ms  44.3%    10.0      llvm::Value::replaceAllUsesWith(llvm::Value*)
65856.0ms  44.3%    12.0      llvm::ValueHandleBase::ValueIsRAUWd(llvm::Value*, llvm::Value*)
65818.0ms  44.3%    65474.0   (anonymous namespace)::LVIValueHandle::deleted()

I was going to close llvm/llvm-bugzilla-archive#17855 as a dup, but now I'm thinking this PR should track the CorrelatedValuePropagation bottleneck, while llvm/llvm-bugzilla-archive#17855 tracks the SROA slowdown.
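The profile shows that almost all of the time is self time in LVIValueHandle::deleted(), but it does not say why that callback is expensive. One plausible mechanism, stated here as an assumption rather than a description of LLVM's code, is a callback that scans an entire cache every time a single value goes away. The sketch below uses hypothetical types (`FlatCache`, `IndexedCache`) and a hypothetical helper (`compare_erase_cost`) to compare that pattern against a cache indexed by value, counting how many entries each approach examines.

```cpp
#include <cstddef>
#include <set>
#include <unordered_map>
#include <utility>
#include <vector>

using BlockId = int;
using ValueId = int;

// Hypothetical cache of (block, value) facts stored as one flat set.
struct FlatCache {
    std::set<std::pair<BlockId, ValueId>> entries;
    std::size_t visited = 0;  // entries examined across all erase calls

    // Erasing one value walks every entry: O(cache size) per call.
    void erase_value(ValueId v) {
        for (auto it = entries.begin(); it != entries.end();) {
            ++visited;
            if (it->second == v) it = entries.erase(it);
            else ++it;
        }
    }
};

// Same facts, indexed by value, so an erase touches only that value's blocks.
struct IndexedCache {
    std::unordered_map<ValueId, std::vector<BlockId>> by_value;
    std::size_t visited = 0;

    void erase_value(ValueId v) {
        auto it = by_value.find(v);
        if (it == by_value.end()) return;
        visited += it->second.size();
        by_value.erase(it);
    }
};

// Fills both caches with `values` values in `blocks` blocks each, erases every
// value, and returns {flat visits, indexed visits}.
std::pair<std::size_t, std::size_t> compare_erase_cost(int values, int blocks) {
    FlatCache flat;
    IndexedCache indexed;
    for (ValueId v = 0; v < values; ++v) {
        for (BlockId b = 0; b < blocks; ++b) {
            flat.entries.insert({b, v});
            indexed.by_value[v].push_back(b);
        }
    }
    for (ValueId v = 0; v < values; ++v) {
        flat.erase_value(v);
        indexed.erase_value(v);
    }
    return {flat.visited, indexed.visited};
}
```

In the flat layout the total work grows roughly with the square of the number of cached values, while the indexed layout stays proportional to the number of entries. If the real bottleneck has this shape, a function with a very large number of values and blocks would show exactly this kind of self-time concentration in the callback.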

efriedma-quic commented 11 years ago

-ftime-report points at SROA. Probably unhappy because the CFG is extremely complex.

llvmbot commented 11 years ago

Full preprocessed source here: https://gist.github.com/jrk/6117757

Compiled on OS X with: c++ -O3 CodeGen_ARM.ii. (Without -O3, performance is nominal.)

llvmbot commented 11 years ago

CodeGen_ARM.cpp:1:10: fatal error: 'CodeGen_ARM.h' file not found
#include "CodeGen_ARM.h"
         ^

Could you preprocess it and upload the .ii?