rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
97.97k stars 12.69k forks source link

Single file takes 30 minutes to compile at opt-level=3, regression from <1 sec #66617

Closed dabek closed 2 years ago

dabek commented 4 years ago

Put this in your main.rs file: https://gist.github.com/dabek/3474504ab3dae4d264b78504a4d2ee21 Run:

cargo run --release

See it take half an hour (at least it did for me the only time I had the patience to wait for it).

dtolnay commented 4 years ago

Labeling as a regression because :rofl: something is going on:

rustc time cargo build --release
1.0 - 1.18 \<1 sec
1.19 7.5 sec
1.20 30 sec
1.21 - 1.23 4 sec
1.24 2 sec
1.25 - 1.41-nightly 30 minutes
alex commented 4 years ago

I got bored after 40 seconds (I am not a patient man) -- this reproduces at opt-level=3, but not opt-level=2. Hopefully that helps narrow it down.

vivekvpandya commented 4 years ago

On rustc 1.39.0 I see it hangs when running following command : rustc -C opt-level=3 --emit link main.rs

however rustc -C opt-level=3 --emit obj main.rs generates .a file quickly , so it seems that linker is not happy with something generate by optimization.

jsgf commented 4 years ago

-Ztime-passes is showing:

  time: 0.226; rss: 134MB       codegen
    time: 0.002; rss: 134MB     llvm function passes [slow.7rcbfp3g-cgu.4]
    time: 0.001; rss: 134MB     llvm function passes [slow.7rcbfp3g-cgu.2]
    time: 0.004; rss: 134MB     llvm module passes [slow.7rcbfp3g-cgu.2]
    time: 0.014; rss: 136MB     llvm module passes [slow.7rcbfp3g-cgu.4]
    time: 0.039; rss: 136MB     llvm module passes [slow.7rcbfp3g-cgu.3]
    time: 0.050; rss: 136MB     llvm function passes [slow.7rcbfp3g-cgu.0]
    time: 24.948; rss: 267MB    llvm module passes [slow.7rcbfp3g-cgu.0]
>>  time: 4923.626; rss: 297MB  llvm module passes [slow.7rcbfp3g-cgu.1]
    time: 0.004; rss: 267MB     LTO passes
    time: 0.004; rss: 267MB     LTO passes
    time: 0.010; rss: 268MB     codegen passes [slow.7rcbfp3g-cgu.2]
    time: 0.010; rss: 268MB     codegen passes [slow.7rcbfp3g-cgu.4]
    time: 0.056; rss: 269MB     LTO passes
    time: 0.082; rss: 270MB     codegen passes [slow.7rcbfp3g-cgu.3]
    time: 25.502; rss: 270MB    LTO passes
    time: 3.664; rss: 279MB     codegen passes [slow.7rcbfp3g-cgu.0]

It hasn't completed yet. I'm also running it under perf. So far the top hot functions seem to be llvm::isSafeToSpeculativelyExecute and llvm::SmallPtrSetImplBase::FindBucketFor.

andjo403 commented 4 years ago

at nightly (rustc 1.43.0-nightly (8aa9d2014 2020-02-21)) around 2sec the big regression was due to before one huge function was generated for core::ptr::drop_in_place but now multiple smaller functions is generated and that makes llvm mush happier :) I do not know when this was changed only noticed it now.

jonas-schievink commented 4 years ago

Marking needstest to add a test to https://github.com/rust-lang-nursery/rustc-perf/

xttjsn commented 4 years ago

Failed to reproduce the fix at 8aa9d20 2020-02-21. Some attempts that I've tried:

~/src/issue66617 (master*) [11:26:27] 0
xtt$ rustc +8aa9d2014f4e5258f83b907e8431c59a33acdae7 -vV
rustc 1.43.0-nightly (8aa9d2014 2020-02-21)
binary: rustc
commit-hash: 8aa9d2014f4e5258f83b907e8431c59a33acdae7
commit-date: 2020-02-21
host: x86_64-apple-darwin
release: 1.43.0-nightly
LLVM version: 9.0
~/src/issue66617 (master*) [11:30:07] 1
xtt$ rustc +8aa9d2014f4e5258f83b907e8431c59a33acdae7 -C opt-level=3 -Ztime-passes ./src/main.rs
time: 0.006     parse_crate
time: 0.000     attributes_injection
time: 0.000     recursion_limit
time: 0.000     plugin_loading
time: 0.000     plugin_registration
time: 0.000     pre_AST_expansion_lint_checks
time: 0.000     crate_injection
time: 0.014     expand_crate
time: 0.000     check_unused_macros
time: 0.014     macro_expand_crate
time: 0.000     maybe_building_test_harness
time: 0.001     AST_validation
time: 0.000     maybe_create_a_macro_crate
time: 0.000     complete_gated_feature_checking
time: 0.017     configure_and_expand
time: 0.000     prepare_outputs
time: 0.002     hir_lowering
time: 0.000     early_lint_checks
time: 0.001     validate_HIR_map
time: 0.000     setup_global_ctxt
time: 0.000     dep_graph_tcx_init
time: 0.003     create_global_ctxt
time: 0.000     looking_for_entry_point
time: 0.000     looking_for_plugin_registrar
time: 0.000     looking_for_derive_registrar
time: 0.001     misc_checking_1
time: 0.006     type_collecting
time: 0.000     impl_wf_inference
time: 0.000     unsafety_checking
time: 0.000     orphan_checking
time: 0.002     coherence_checking
time: 0.007     wf_checking
time: 0.021     item_types_checking
time: 0.012     item_bodies_checking
time: 0.047     type_check_crate
time: 0.000     match_checking
time: 0.001     liveness_and_intrinsic_checking
time: 0.001     misc_checking_2
time: 0.005     MIR_borrow_checking
time: 0.000     dumping_chalk_like_clauses
time: 0.000     MIR_effect_checking
time: 0.000     layout_testing
time: 0.001     death_checking
time: 0.000     unused_lib_feature_checking
time: 0.005     crate_lints
time: 0.002     module_lints
time: 0.007     lint_checking
time: 0.002     privacy_checking_modules
time: 0.011     misc_checking_3
time: 0.000     monomorphization_collector_root_collections
time: 0.031     monomorphization_collector_graph_walk
time: 0.005     partition_and_assert_distinct_symbols
time: 0.000     write_allocator_module
time: 0.000     find_cgu_reuse
time: 0.031     LLVM_module_optimize_function_passes(main.7rcbfp3g-cgu.4)
time: 0.112     codegen_to_LLVM_IR
time: 0.000     assert_dep_graph
time: 0.000     serialize_dep_graph
time: 0.150     codegen_crate
time: 0.002     LLVM_module_optimize_function_passes(main.7rcbfp3g-cgu.1)
time: 0.001     LLVM_module_optimize_function_passes(main.7rcbfp3g-cgu.0)
time: 0.001     LLVM_module_optimize_function_passes(main.7rcbfp3g-cgu.3)
time: 0.001     LLVM_module_optimize_module_passes(main.7rcbfp3g-cgu.3)
time: 0.006     free_global_ctxt
time: 0.004     LLVM_module_optimize_module_passes(main.7rcbfp3g-cgu.0)
time: 0.011     LLVM_module_optimize_module_passes(main.7rcbfp3g-cgu.1)
time: 0.035     LLVM_module_optimize_function_passes(main.7rcbfp3g-cgu.2)
time: 10.943    LLVM_module_optimize_module_passes(main.7rcbfp3g-cgu.2)

It hasn't completed but I have waited for about 2 mins.

Here's a part of a perf report (I re-ran it on a linux machine):

Samples: 11K of event 'cpu-clock:pppH', Event count (approx.): 2875250000
Overhead  Command  Shared Object                     Symbol
  32.88%  rustc    libLLVM-9-rust-1.43.0-nightly.so  [.] llvm::isSafeToSpeculativelyExecute
  27.12%  rustc    libLLVM-9-rust-1.43.0-nightly.so  [.] llvm::SmallPtrSetImplBase::insert_imp_big
  16.16%  rustc    libLLVM-9-rust-1.43.0-nightly.so  [.] llvm::SmallPtrSetImplBase::Grow
  10.98%  rustc    libLLVM-9-rust-1.43.0-nightly.so  [.] appendSpeculatableOperands
   4.83%  rustc    libLLVM-9-rust-1.43.0-nightly.so  [.] completeEphemeralValues
   1.43%  rustc    libc-2.27.so                      [.] __memset_avx2_erms
   1.31%  rustc    libLLVM-9-rust-1.43.0-nightly.so  [.] canTrapImpl
   1.18%  rustc    libLLVM-9-rust-1.43.0-nightly.so  [.] llvm::CodeMetrics::collectEphemeralValues
   0.57%  rustc    libLLVM-9-rust-1.43.0-nightly.so  [.] (anonymous namespace)::CallAnalyzer::analyzeCall
...

The newest master seems to have the same issue. I doubt it was fixed.

spastorino commented 4 years ago

Assigning P-medium as discussed as part of the Prioritization Working Group process and removing I-prioritize.

spastorino commented 4 years ago

Just in case, wanted to point out that this is still an issue using rustc 1.45.0-nightly (a74d1862d 2020-05-14).

dtolnay commented 4 years ago

I got some time to make progress on this today. Here's what's up:

Some observations:

Progress:

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 99.8300 seconds (99.9532 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  98.6132 ( 98.9%)   0.0245 ( 34.9%)  98.6378 ( 98.8%)  98.7605 ( 98.8%)  Function Integration/Inlining
   0.1416 (  0.1%)   0.0046 (  6.5%)   0.1462 (  0.1%)   0.1463 (  0.1%)  Combine redundant instructions #2
   0.0660 (  0.1%)   0.0021 (  3.0%)   0.0681 (  0.1%)   0.0682 (  0.1%)  Global Value Numbering
   0.0576 (  0.1%)   0.0020 (  2.9%)   0.0596 (  0.1%)   0.0596 (  0.1%)  Combine redundant instructions #5
   0.0580 (  0.1%)   0.0016 (  2.3%)   0.0596 (  0.1%)   0.0596 (  0.1%)  Combine redundant instructions #3
   0.0577 (  0.1%)   0.0013 (  1.8%)   0.0590 (  0.1%)   0.0589 (  0.1%)  Combine redundant instructions #4
   0.0483 (  0.0%)   0.0011 (  1.6%)   0.0494 (  0.0%)   0.0494 (  0.0%)  Jump Threading
   0.0428 (  0.0%)   0.0013 (  1.8%)   0.0441 (  0.0%)   0.0441 (  0.0%)  Early CSE w/ MemorySSA
   0.0398 (  0.0%)   0.0010 (  1.4%)   0.0408 (  0.0%)   0.0408 (  0.0%)  Memory SSA
...

I will take a basic look in the function integration/inlining pass later this week with profiling tools and then hand over to my company's LLVM team if needed.

nikic commented 4 years ago

Looks like all the time is spent inside CodeMetrics::collectEphemeralValues(). There's also this helpful comment: https://github.com/llvm/llvm-project/blob/01ddb2a7b044f697a15043e47acdb93e2825809a/llvm/lib/Analysis/InlineCost.cpp#L2076-L2080

dtolnay commented 4 years ago

The behavior looks exactly cubic in the number of enum variants as measured by time rustc main.rs -C opt-level=3 -C codegen-units=2 on the reproducer in https://github.com/rust-lang/rust/issues/66617#issuecomment-648639873. Compile time for x variants is accurately predicted by some constant times x^3.


slow

dtolnay commented 4 years ago

Update:

The cubic behavior is a legacy pass manager limitation which is a consequence of not having call site's BFI (block frequency) in the legacy inliner. This is a known advantage for inlining with the new pass manager (#74705) vs legacyPM.

Here is the place legacy inliner is lacking BFI under legacyPM (note nullptr for GetBFI): https://github.com/llvm/llvm-project/blob/3c2299612945caf75d5c3678ced0693ebd291819/llvm/lib/Transforms/IPO/InlineSimple.cpp#L71-L73

At this point we are going to investigate transitioning our affected codebase to newPM.

For anyone needing to work around this on legacyPM, one effective legacyPM workaround we found is running opt with -inline-last-call-to-static-bonus=0 (would need to look up what the LLVM C API equivalent of this is), but it comes at the cost of possible code size regressions.

andjo403 commented 4 years ago

for this specific case where a lot of vec's is used in the enums I was able to get the times down by adding inline to RawVec::ptr this was one of the two functions that was defined in cgu0 but used in cgu1 and when inline was added it is defined in cgu1 also.

I do not understand why the llvm inline pass time is affected by if the function is defined or declared in one cgu but will try to look in to it some more tomorrow.

Is a PR with with the inline hint something to do? even as it do not solve the complete problem a lot of functions is marked with inline in RawVec but not this one liner function.

ljedrz commented 2 years ago

Can't reproduce in current stable (1.62.0).

workingjubilee commented 2 years ago

Thank you, @ljedrz and everyone else who has gone to the effort of reproducing or investigating this bug. Considering the current status and dtolnay tracing it directly to the pass manager,it looks like we can close this now.