Closed dabek closed 2 years ago
Labeling as a regression because :rofl: something is going on:
rustc | time cargo build --release |
---|---|
1.0 - 1.18 | \<1 sec |
1.19 | 7.5 sec |
1.20 | 30 sec |
1.21 - 1.23 | 4 sec |
1.24 | 2 sec |
1.25 - 1.41-nightly | 30 minutes |
I got bored after 40 seconds (I am not a patient man) -- this reproduces at opt-level=3
, but not opt-level=2
. Hopefully that helps narrow it down.
On rustc 1.39.0 I see it hangs when running following command :
rustc -C opt-level=3 --emit link main.rs
however
rustc -C opt-level=3 --emit obj main.rs
generates .a file quickly , so it seems that linker is not happy with something generate by optimization.
-Ztime-passes is showing:
time: 0.226; rss: 134MB codegen
time: 0.002; rss: 134MB llvm function passes [slow.7rcbfp3g-cgu.4]
time: 0.001; rss: 134MB llvm function passes [slow.7rcbfp3g-cgu.2]
time: 0.004; rss: 134MB llvm module passes [slow.7rcbfp3g-cgu.2]
time: 0.014; rss: 136MB llvm module passes [slow.7rcbfp3g-cgu.4]
time: 0.039; rss: 136MB llvm module passes [slow.7rcbfp3g-cgu.3]
time: 0.050; rss: 136MB llvm function passes [slow.7rcbfp3g-cgu.0]
time: 24.948; rss: 267MB llvm module passes [slow.7rcbfp3g-cgu.0]
>> time: 4923.626; rss: 297MB llvm module passes [slow.7rcbfp3g-cgu.1]
time: 0.004; rss: 267MB LTO passes
time: 0.004; rss: 267MB LTO passes
time: 0.010; rss: 268MB codegen passes [slow.7rcbfp3g-cgu.2]
time: 0.010; rss: 268MB codegen passes [slow.7rcbfp3g-cgu.4]
time: 0.056; rss: 269MB LTO passes
time: 0.082; rss: 270MB codegen passes [slow.7rcbfp3g-cgu.3]
time: 25.502; rss: 270MB LTO passes
time: 3.664; rss: 279MB codegen passes [slow.7rcbfp3g-cgu.0]
It hasn't completed yet. I'm also running it under perf
. So far the top hot functions seem to be llvm::isSafeToSpeculativelyExecute
and llvm::SmallPtrSetImplBase::FindBucketFor
.
at nightly (rustc 1.43.0-nightly (8aa9d2014 2020-02-21)) around 2sec the big regression was due to before one huge function was generated for core::ptr::drop_in_place but now multiple smaller functions is generated and that makes llvm mush happier :) I do not know when this was changed only noticed it now.
Marking needstest to add a test to https://github.com/rust-lang-nursery/rustc-perf/
Failed to reproduce the fix at 8aa9d20 2020-02-21. Some attempts that I've tried:
~/src/issue66617 (master*) [11:26:27] 0
xtt$ rustc +8aa9d2014f4e5258f83b907e8431c59a33acdae7 -vV
rustc 1.43.0-nightly (8aa9d2014 2020-02-21)
binary: rustc
commit-hash: 8aa9d2014f4e5258f83b907e8431c59a33acdae7
commit-date: 2020-02-21
host: x86_64-apple-darwin
release: 1.43.0-nightly
LLVM version: 9.0
~/src/issue66617 (master*) [11:30:07] 1
xtt$ rustc +8aa9d2014f4e5258f83b907e8431c59a33acdae7 -C opt-level=3 -Ztime-passes ./src/main.rs
time: 0.006 parse_crate
time: 0.000 attributes_injection
time: 0.000 recursion_limit
time: 0.000 plugin_loading
time: 0.000 plugin_registration
time: 0.000 pre_AST_expansion_lint_checks
time: 0.000 crate_injection
time: 0.014 expand_crate
time: 0.000 check_unused_macros
time: 0.014 macro_expand_crate
time: 0.000 maybe_building_test_harness
time: 0.001 AST_validation
time: 0.000 maybe_create_a_macro_crate
time: 0.000 complete_gated_feature_checking
time: 0.017 configure_and_expand
time: 0.000 prepare_outputs
time: 0.002 hir_lowering
time: 0.000 early_lint_checks
time: 0.001 validate_HIR_map
time: 0.000 setup_global_ctxt
time: 0.000 dep_graph_tcx_init
time: 0.003 create_global_ctxt
time: 0.000 looking_for_entry_point
time: 0.000 looking_for_plugin_registrar
time: 0.000 looking_for_derive_registrar
time: 0.001 misc_checking_1
time: 0.006 type_collecting
time: 0.000 impl_wf_inference
time: 0.000 unsafety_checking
time: 0.000 orphan_checking
time: 0.002 coherence_checking
time: 0.007 wf_checking
time: 0.021 item_types_checking
time: 0.012 item_bodies_checking
time: 0.047 type_check_crate
time: 0.000 match_checking
time: 0.001 liveness_and_intrinsic_checking
time: 0.001 misc_checking_2
time: 0.005 MIR_borrow_checking
time: 0.000 dumping_chalk_like_clauses
time: 0.000 MIR_effect_checking
time: 0.000 layout_testing
time: 0.001 death_checking
time: 0.000 unused_lib_feature_checking
time: 0.005 crate_lints
time: 0.002 module_lints
time: 0.007 lint_checking
time: 0.002 privacy_checking_modules
time: 0.011 misc_checking_3
time: 0.000 monomorphization_collector_root_collections
time: 0.031 monomorphization_collector_graph_walk
time: 0.005 partition_and_assert_distinct_symbols
time: 0.000 write_allocator_module
time: 0.000 find_cgu_reuse
time: 0.031 LLVM_module_optimize_function_passes(main.7rcbfp3g-cgu.4)
time: 0.112 codegen_to_LLVM_IR
time: 0.000 assert_dep_graph
time: 0.000 serialize_dep_graph
time: 0.150 codegen_crate
time: 0.002 LLVM_module_optimize_function_passes(main.7rcbfp3g-cgu.1)
time: 0.001 LLVM_module_optimize_function_passes(main.7rcbfp3g-cgu.0)
time: 0.001 LLVM_module_optimize_function_passes(main.7rcbfp3g-cgu.3)
time: 0.001 LLVM_module_optimize_module_passes(main.7rcbfp3g-cgu.3)
time: 0.006 free_global_ctxt
time: 0.004 LLVM_module_optimize_module_passes(main.7rcbfp3g-cgu.0)
time: 0.011 LLVM_module_optimize_module_passes(main.7rcbfp3g-cgu.1)
time: 0.035 LLVM_module_optimize_function_passes(main.7rcbfp3g-cgu.2)
time: 10.943 LLVM_module_optimize_module_passes(main.7rcbfp3g-cgu.2)
It hasn't completed but I have waited for about 2 mins.
Here's a part of a perf report (I re-ran it on a linux machine):
Samples: 11K of event 'cpu-clock:pppH', Event count (approx.): 2875250000
Overhead Command Shared Object Symbol
32.88% rustc libLLVM-9-rust-1.43.0-nightly.so [.] llvm::isSafeToSpeculativelyExecute
27.12% rustc libLLVM-9-rust-1.43.0-nightly.so [.] llvm::SmallPtrSetImplBase::insert_imp_big
16.16% rustc libLLVM-9-rust-1.43.0-nightly.so [.] llvm::SmallPtrSetImplBase::Grow
10.98% rustc libLLVM-9-rust-1.43.0-nightly.so [.] appendSpeculatableOperands
4.83% rustc libLLVM-9-rust-1.43.0-nightly.so [.] completeEphemeralValues
1.43% rustc libc-2.27.so [.] __memset_avx2_erms
1.31% rustc libLLVM-9-rust-1.43.0-nightly.so [.] canTrapImpl
1.18% rustc libLLVM-9-rust-1.43.0-nightly.so [.] llvm::CodeMetrics::collectEphemeralValues
0.57% rustc libLLVM-9-rust-1.43.0-nightly.so [.] (anonymous namespace)::CallAnalyzer::analyzeCall
...
The newest master seems to have the same issue. I doubt it was fixed.
Assigning P-medium
as discussed as part of the Prioritization Working Group process and removing I-prioritize
.
Just in case, wanted to point out that this is still an issue using rustc 1.45.0-nightly (a74d1862d 2020-05-14)
.
I got some time to make progress on this today. Here's what's up:
Some observations:
-Z new-llvm-pass-manager
. This doesn't mean just wait for the new pass manager to become default and then close this issue, since it could be a problem that gets reintroduced as more stuff is ported to new pass manager. We wouldn't want to revert the workarounds in HHVM only for it to disruptively regress in another LLVM upgrade.-C codegen-units=1
. Presumably the function experiencing the catastrophic performance is erased by DCE or similar before the problematic pass.Progress:
rustc main.rs -C opt-level=3 -C codegen-units=2
and grabbed the LLVM IR prior to LLVM module passes. That IR is here: https://gist.githubusercontent.com/dtolnay/697d7ffdccaec1ecf6c2dd2d74ed2c36/raw/7d67205618280642faca37ad7a1206a991a58330/main.llrust/build/x86_64-unknown-linux-gnu/llvm/bin/opt -O3 main.ll -disable-output -time-passes
to break down the compile time by LLVM pass. The total time at O2 is <1 second, at O3 is ~100 seconds (I put fewer enum variants in my reduced test case than in the originally reported code which took 30 minutes, but add as many as you want). All but one pass takes negligible time:===-------------------------------------------------------------------------===
... Pass execution timing report ...
===-------------------------------------------------------------------------===
Total Execution Time: 99.8300 seconds (99.9532 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
98.6132 ( 98.9%) 0.0245 ( 34.9%) 98.6378 ( 98.8%) 98.7605 ( 98.8%) Function Integration/Inlining
0.1416 ( 0.1%) 0.0046 ( 6.5%) 0.1462 ( 0.1%) 0.1463 ( 0.1%) Combine redundant instructions #2
0.0660 ( 0.1%) 0.0021 ( 3.0%) 0.0681 ( 0.1%) 0.0682 ( 0.1%) Global Value Numbering
0.0576 ( 0.1%) 0.0020 ( 2.9%) 0.0596 ( 0.1%) 0.0596 ( 0.1%) Combine redundant instructions #5
0.0580 ( 0.1%) 0.0016 ( 2.3%) 0.0596 ( 0.1%) 0.0596 ( 0.1%) Combine redundant instructions #3
0.0577 ( 0.1%) 0.0013 ( 1.8%) 0.0590 ( 0.1%) 0.0589 ( 0.1%) Combine redundant instructions #4
0.0483 ( 0.0%) 0.0011 ( 1.6%) 0.0494 ( 0.0%) 0.0494 ( 0.0%) Jump Threading
0.0428 ( 0.0%) 0.0013 ( 1.8%) 0.0441 ( 0.0%) 0.0441 ( 0.0%) Early CSE w/ MemorySSA
0.0398 ( 0.0%) 0.0010 ( 1.4%) 0.0408 ( 0.0%) 0.0408 ( 0.0%) Memory SSA
...
I will take a basic look in the function integration/inlining pass later this week with profiling tools and then hand over to my company's LLVM team if needed.
Looks like all the time is spent inside CodeMetrics::collectEphemeralValues()
. There's also this helpful comment: https://github.com/llvm/llvm-project/blob/01ddb2a7b044f697a15043e47acdb93e2825809a/llvm/lib/Analysis/InlineCost.cpp#L2076-L2080
The behavior looks exactly cubic in the number of enum variants as measured by time rustc main.rs -C opt-level=3 -C codegen-units=2
on the reproducer in https://github.com/rust-lang/rust/issues/66617#issuecomment-648639873. Compile time for x variants is accurately predicted by some constant times x^3.
Update:
The cubic behavior is a legacy pass manager limitation which is a consequence of not having call site's BFI (block frequency) in the legacy inliner. This is a known advantage for inlining with the new pass manager (#74705) vs legacyPM.
Here is the place legacy inliner is lacking BFI under legacyPM (note nullptr for GetBFI): https://github.com/llvm/llvm-project/blob/3c2299612945caf75d5c3678ced0693ebd291819/llvm/lib/Transforms/IPO/InlineSimple.cpp#L71-L73
At this point we are going to investigate transitioning our affected codebase to newPM.
For anyone needing to work around this on legacyPM, one effective legacyPM workaround we found is running opt
with -inline-last-call-to-static-bonus=0
(would need to look up what the LLVM C API equivalent of this is), but it comes at the cost of possible code size regressions.
for this specific case where a lot of vec's is used in the enums I was able to get the times down by adding inline to RawVec::ptr this was one of the two functions that was defined in cgu0 but used in cgu1 and when inline was added it is defined in cgu1 also.
I do not understand why the llvm inline pass time is affected by if the function is defined or declared in one cgu but will try to look in to it some more tomorrow.
Is a PR with with the inline hint something to do? even as it do not solve the complete problem a lot of functions is marked with inline in RawVec but not this one liner function.
Can't reproduce in current stable (1.62.0
).
Thank you, @ljedrz and everyone else who has gone to the effort of reproducing or investigating this bug. Considering the current status and dtolnay tracing it directly to the pass manager,it looks like we can close this now.
Put this in your main.rs file: https://gist.github.com/dabek/3474504ab3dae4d264b78504a4d2ee21 Run:
See it take half an hour (at least it did for me the only time I had the patience to wait for it).