[MC] Compiler performance regression in Clang 19 with -mbranches-within-32B-boundaries

vient commented 2 months ago

I'm building the same code with Clang 18 and 19, and noticed that build times for some targets are disproportionately affected by switching to the new compiler: in general Clang 19 is 5-10% slower, but an LTO build of one particular target slowed down by 2.5x.

Tried --time-trace but don't know what to make of it, other than that OptModule got some long tails in Clang 19. The first worker under the main thread is building the same module in both images, so they can be compared directly: OptModule time increased from 1m20s to 5m24s, a 4x slowdown.

Clang 18:
913.621213 Total OptModule
856.716409 Total OptFunction
856.192565 Total RunPass
556.340514 Total PassManager<Function>
512.635569 Total ModuleInlinerWrapperPass
510.885891 Total ModuleToPostOrderCGSCCPassAdaptor
509.09462 Total DevirtSCCRepeatedPass
507.621024 Total PassManager<LazyCallGraph::SCC, CGSCCAnalysisManager, LazyCallGraph &, CGSCCUpdateResult &>
434.932548 Total CGSCCToFunctionPassAdaptor
142.495075 Total ExecuteLinker
142.421367 Total Link
141.506523 Total LTO
132.923099 Total InstCombinePass
124.003487 Total ModuleToFunctionPassAdaptor

Clang 19:
3237.53794 Total OptModule
845.04484 Total OptFunction
844.38391 Total RunPass
552.922664 Total PassManager<Function>
497.867448 Total ModuleInlinerWrapperPass
495.840083 Total ModuleToPostOrderCGSCCPassAdaptor
493.816647 Total DevirtSCCRepeatedPass
492.195245 Total PassManager<LazyCallGraph::SCC, CGSCCAnalysisManager, LazyCallGraph &, CGSCCUpdateResult &>
417.747014 Total CGSCCToFunctionPassAdaptor
385.505297 Total ExecuteLinker
385.437975 Total Link
384.301031 Total LTO
141.092082 Total InstCombinePass
137.907089 Total ModuleToFunctionPassAdaptor
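To see which entries actually moved between the two summaries, a small script can compute per-entry ratios. This is only a sketch: it assumes the first list is the Clang 18 build and the second is Clang 19 (consistent with OptModule being the larger number in Clang 19), and it copies just a handful of the entries above. The helper names `parse_totals` and `ratios` are made up for illustration.

```python
# Totals copied from the two time-trace summaries above (subset of entries).
# Assumption: the first summary is Clang 18, the second is Clang 19.
CLANG18 = """\
913.621213 Total OptModule
856.716409 Total OptFunction
142.495075 Total ExecuteLinker
141.506523 Total LTO
132.923099 Total InstCombinePass
"""
CLANG19 = """\
3237.53794 Total OptModule
845.04484 Total OptFunction
385.505297 Total ExecuteLinker
384.301031 Total LTO
141.092082 Total InstCombinePass
"""


def parse_totals(text):
    """Map each '<value> Total <name>' line to name -> float(value)."""
    totals = {}
    for line in text.splitlines():
        value, _, name = line.partition(" Total ")
        if name:
            totals[name] = float(value)
    return totals


def ratios(old, new):
    """Return (name, new/old) pairs for shared entries, largest ratio first."""
    common = old.keys() & new.keys()
    pairs = ((name, new[name] / old[name]) for name in common)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


for name, ratio in ratios(parse_totals(CLANG18), parse_totals(CLANG19)):
    print(f"{name:20s} x{ratio:.2f}")
```

On these numbers the optimization passes (OptFunction, InstCombinePass) barely change, while OptModule, ExecuteLinker and LTO grow by roughly 2.7x to 3.5x, which suggests the extra time is not in the optimizer passes themselves.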

perf trace and manual breaking in gdb show that a lot of time is spent around

llvm::MCAssembler::layout() ()
llvm::MCObjectStreamer::finishImpl() ()
llvm::MCELFStreamer::finishImpl() ()
llvm::AsmPrinter::doFinalization(llvm::Module&) ()
llvm::FPPassManager::doFinalization(llvm::Module&) ()
llvm::legacy::PassManagerImpl::run(llvm::Module&) ()

and also llvm::MCExpr::evaluateAsRelocatableImpl. My current build is stripped though; I'll come back with trace results with debug symbols later.

vient commented 2 months ago

@MaskRay you have recent commits in evaluateAsRelocatable. Do you have an idea what changes in LLVM 19 could cause such a regression?

vient commented 2 months ago

Top functions: the machine code (MC) part became a lot slower in LLVM 19; there are no MC functions near the top in LLVM 18.

Don't know why perf does not show inlined functions. Here are the hottest instructions of the first three functions:

llvm::ELFObjectWriter::isSymbolRefDifferenceFullyResolvedImpl(llvm::MCAssembler const&, llvm::MCSymbol const&, llvm::MCFragment const&, bool, bool) const at llvm/lib/MC/ELFObjectWriter.cpp:1447:29
 (inlined by) llvm::MCObjectWriter::isSymbolRefDifferenceFullyResolved(llvm::MCAssembler const&, llvm::MCSymbolRefExpr const*, llvm::MCSymbolRefExpr const*, bool) const at llvm/lib/MC/MCObjectWriter.cpp:45:10
 (inlined by) AttemptToFoldSymbolOffsetDifference(llvm::MCAssembler const*, llvm::DenseMap<llvm::MCSection const*, unsigned long, llvm::DenseMapInfo<llvm::MCSection const*, void>, llvm::detail::DenseMapPair<llvm::MCSection const*, unsigned long>> const*, bool, llvm::MCSymbolRefExpr const*&, llvm::MCSymbolRefExpr const*&, long&) at llvm/lib/MC/MCExpr.cpp:601:25
 (inlined by) evaluateSymbolicAdd(llvm::MCAssembler const*, llvm::DenseMap<llvm::MCSection const*, unsigned long, llvm::DenseMapInfo<llvm::MCSection const*, void>, llvm::detail::DenseMapPair<llvm::MCSection const*, unsigned long>> const*, bool, llvm::MCValue const&, llvm::MCValue const&, llvm::MCValue&) at llvm/lib/MC/MCExpr.cpp:768:5
 (inlined by) llvm::MCExpr::evaluateAsRelocatableImpl(llvm::MCValue&, llvm::MCAssembler const*, llvm::MCFixup const*, llvm::DenseMap<llvm::MCSection const*, unsigned long, llvm::DenseMapInfo<llvm::MCSection const*, void>, llvm::detail::DenseMapPair<llvm::MCSection const*, unsigned long>> const*, bool) const at llvm/lib/MC/MCExpr.cpp:950:16

llvm::MCExpr::evaluateAsRelocatableImpl(llvm::MCValue&, llvm::MCAssembler const*, llvm::MCFixup const*, llvm::DenseMap<llvm::MCSection const*, unsigned long, llvm::DenseMapInfo<llvm::MCSection const*, void>, llvm::detail::DenseMapPair<llvm::MCSection const*, unsigned long>> const*, bool) const at llvm/lib/MC/MCExpr.cpp:819:3

evaluateSymbolicAdd(llvm::MCAssembler const*, llvm::DenseMap<llvm::MCSection const*, unsigned long, llvm::DenseMapInfo<llvm::MCSection const*, void>, llvm::detail::DenseMapPair<llvm::MCSection const*, unsigned long>> const*, bool, llvm::MCValue const&, llvm::MCValue const&, llvm::MCValue&) at llvm/lib/MC/MCExpr.cpp:755:7
 (inlined by) llvm::MCExpr::evaluateAsRelocatableImpl(llvm::MCValue&, llvm::MCAssembler const*, llvm::MCFixup const*, llvm::DenseMap<llvm::MCSection const*, unsigned long, llvm::DenseMapInfo<llvm::MCSection const*, void>, llvm::detail::DenseMapPair<llvm::MCSection const*, unsigned long>> const*, bool) const at llvm/lib/MC/MCExpr.cpp:950:16

llvm::MCAssembler::relaxFragment(llvm::MCFragment&) at llvm/lib/MC/MCAssembler.cpp:1285:3
 (inlined by) llvm::MCAssembler::layoutOnce() at llvm/lib/MC/MCAssembler.cpp:1315:11
 (inlined by) llvm::MCAssembler::layout() at llvm/lib/MC/MCAssembler.cpp:941:10

llvm::MCAssembler::relaxBoundaryAlign(llvm::MCBoundaryAlignFragment&) at llvm/lib/MC/MCAssembler.cpp:1189:8
 (inlined by) llvm::MCAssembler::relaxFragment(llvm::MCFragment&) at llvm/lib/MC/MCAssembler.cpp:1299:12
 (inlined by) llvm::MCAssembler::layoutOnce() at llvm/lib/MC/MCAssembler.cpp:1315:11
 (inlined by) llvm::MCAssembler::layout() at llvm/lib/MC/MCAssembler.cpp:941:10

llvm::SmallVectorBase<unsigned long>::size() const at llvm/include/llvm/ADT/SmallVector.h:92:32
 (inlined by) llvm::MCAssembler::computeFragmentSize(llvm::MCFragment const&) const at llvm/lib/MC/MCAssembler.cpp:0:0
 (inlined by) llvm::MCAssembler::relaxBoundaryAlign(llvm::MCBoundaryAlignFragment&) at llvm/lib/MC/MCAssembler.cpp:1195:20
 (inlined by) llvm::MCAssembler::relaxFragment(llvm::MCFragment&) at llvm/lib/MC/MCAssembler.cpp:1299:12
 (inlined by) llvm::MCAssembler::layoutOnce() at llvm/lib/MC/MCAssembler.cpp:1315:11
 (inlined by) llvm::MCAssembler::layout() at llvm/lib/MC/MCAssembler.cpp:941:10

llvm::MCAssembler::computeFragmentSize(llvm::MCFragment const&) const at llvm/lib/MC/MCAssembler.cpp:251:3
 (inlined by) llvm::MCAssembler::ensureValid(llvm::MCSection&) const at llvm/lib/MC/MCAssembler.cpp:447:15

llvm::MCAssembler::isBundlingEnabled() const at llvm/include/llvm/MC/MCAssembler.h:208:59
 (inlined by) llvm::MCAssembler::ensureValid(llvm::MCSection&) const at llvm/lib/MC/MCAssembler.cpp:443:9

llvm::MCBoundaryAlignFragment::getSize() const at llvm/include/llvm/MC/MCFragment.h:580:37
 (inlined by) llvm::MCAssembler::computeFragmentSize(llvm::MCFragment const&) const at llvm/lib/MC/MCAssembler.cpp:281:45
 (inlined by) llvm::MCAssembler::ensureValid(llvm::MCSection&) const at llvm/lib/MC/MCAssembler.cpp:447:15

vient commented 2 months ago

Don't know how I missed this post: https://maskray.me/blog/2024-06-30-integrated-assembler-improvements-in-llvm-19. @aengelke, do you know if this slowdown is expected? From the post I understood that the mentioned code parts were supposed to become faster in LLVM 19.

aengelke commented 2 months ago

Which architecture? Is this NaCl? (NaCl regressions might be caused by #94950, where I removed MCCompactEncodedInstFragment.) Other than NaCl, this looks like a regression. MaskRay was working on the layout code.

vient commented 2 months ago

x86_64, not NaCl. I think I'm onto something: the difference went away when I removed these options:

-Wall
-Wextra
-Werror
-pedantic
-Wold-style-cast
-fvisibility=hidden
-fvisibility-inlines-hidden
-Wconversion
-Wsign-conversion
-Wunreachable-code
-Wno-missing-braces
-Wframe-larger-than=2500000
-ffile-prefix-map=/home/rlozko/git/twix=.
-fveclib=libmvec
-fdiagnostics-absolute-paths
-Wno-error=deprecated-declarations
-mbranches-within-32B-boundaries
-Wno-gnu-zero-variadic-macro-arguments
-Wno-enum-constexpr-conversion
-Wno-deprecated-declarations
-fcolor-diagnostics

I'll post later exactly which options affect this. The process is slow, each run takes 20-40 minutes :)

vient commented 2 months ago

Got it: the slowdown goes away when -mbranches-within-32B-boundaries is removed. In my case that speeds up linking by more than 2x. I can't find any recent commits related to this flag, but it sounds directly related to code layout.
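With runs taking 20-40 minutes each, one way to narrow down a flag list like the one above is to bisect it rather than drop flags one at a time. The following is a minimal sketch under two assumptions: exactly one flag is responsible, and the slowdown reproduces deterministically. `find_culprit` and the `is_slow` callback are hypothetical names; in practice `is_slow` would run the build with the given flags and compare the time against a threshold, whereas here it is faked. `FLAGS` is a subset of the options listed earlier.

```python
# Subset of the flags listed earlier in the thread, for illustration only.
FLAGS = [
    "-Wall",
    "-Wextra",
    "-Werror",
    "-pedantic",
    "-fvisibility=hidden",
    "-fveclib=libmvec",
    "-mbranches-within-32B-boundaries",
    "-fcolor-diagnostics",
]


def find_culprit(flags, is_slow):
    """Bisect `flags` for the single flag that makes the build slow.

    Assumes exactly one flag is responsible. `is_slow(subset)` is a
    hypothetical callback that builds with `subset` and reports whether
    the slowdown is present.
    """
    candidates = list(flags)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if is_slow(half):
            # The culprit is in the half we just built with.
            candidates = half
        else:
            # Otherwise it must be among the remaining candidates.
            candidates = candidates[len(half):]
    return candidates[0]


builds = []


def fake_is_slow(subset):
    """Stand-in for a real build: records the attempt, fakes the timing."""
    builds.append(list(subset))
    return "-mbranches-within-32B-boundaries" in subset


print(find_culprit(FLAGS, fake_is_slow), "found after", len(builds), "builds")
```

The number of builds grows with the logarithm of the flag count instead of linearly. If several flags interact, the single-culprit assumption breaks and the result is not reliable.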

aengelke commented 2 months ago

Thanks for investigating! This makes some sense: with this option, every instruction gets a new, separate fragment so that relaxations can be applied later. The code path isn't optimized, as the option is rarely used. Not sure what's causing the regression compared to LLVM 18, though.
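As a rough illustration of why this option makes layout more expensive, here is a toy model of the padding decision. It is not LLVM's implementation, just a sketch assuming the rule that a branch must neither cross nor end at a 32-byte boundary. The function names are invented for this example.

```python
BOUNDARY = 32  # bytes, the boundary named in the flag


def crosses_boundary(start, size, boundary=BOUNDARY):
    """True if bytes [start, start + size) span two boundary-sized blocks."""
    return start // boundary != (start + size - 1) // boundary


def ends_at_boundary(start, size, boundary=BOUNDARY):
    """True if the instruction's last byte is the last byte of a block."""
    return (start + size) % boundary == 0


def padding_needed(start, size, boundary=BOUNDARY):
    """Padding that moves the instruction to the next block start, or 0."""
    if crosses_boundary(start, size, boundary) or ends_at_boundary(start, size, boundary):
        return -start % boundary
    return 0


def layout(instrs):
    """Lay out (size, is_branch) pairs; return (offset, padding) for each.

    Toy model: only branches are padded, and every offset depends on all
    padding decisions made before it.
    """
    placed, offset = [], 0
    for size, is_branch in instrs:
        pad = padding_needed(offset, size) if is_branch else 0
        offset += pad
        placed.append((offset, pad))
        offset += size
    return placed


# A 6-byte branch at offset 28 would cross the boundary at 32, so it is
# pushed to offset 32 and everything after it shifts as well.
print(layout([(28, False), (6, True), (4, False)]))
```

Because each padding decision shifts every later offset, the assembler has to keep instructions in separately sized pieces and may need to revisit them, which fits the MCBoundaryAlignFragment and relaxBoundaryAlign frames in the profiles above.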

MaskRay commented 2 months ago

> Don't know how I missed this post maskray.me/blog/2024-06-30-integrated-assembler-improvements-in-llvm-19

The way we relax MCFragments might be related. It's possible that uncommon configurations like -mbranches-within-32B-boundaries are regressed while normal code paths get faster. Complex expression evaluation, primarily used by the Linux kernel, imposes relaxation schemes we could apply (#100283). I believe it's challenging to ensure that every use case is fast. The current way that optimizes the normal code path and penalizes uncommon -mbranches-within-32B-boundaries is likely favorable.

vient commented 2 months ago

We use this option because some of our hosts are Skylake-based, and some workloads are affected by the JCC erratum (don't know why the other workloads are not). As a workaround, I've put -mbranches-within-32B-boundaries under if(ARCH MATCHES "^(skylake|cascadelake)"). It turned out that, strangely, the same workloads that benefit from this option on Skylake (~5% improvement) are negatively affected by it on other platforms (~2% slowdown).

Overall, I can't say that this issue affects us in a serious way. If I understand correctly that this issue gets a WONTFIX from you, it can be closed.

llvm / llvm-project

[MC] Compiler performance regression in Clang 19 with -mbranches-within-32B-boundaries #107754