rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
98.56k stars 12.74k forks source link

Long Compile Time (with Optimization) with Large Arrays (2-3 hours) #49330

Open telboon opened 6 years ago

telboon commented 6 years ago

Summary

I have gotten very long compile time (2 hrs plus) when I tried to compile a code that has a large array inbuilt in the code when I turn optimization on. When the code is compiled without optimization, it has a acceptable compile time (10+ seconds).

The array contains 130,000 + strs.

Details

I tried this code:

$ cargo build
   Compiling alliteration-gen v0.1.0 (file:///<REDACTED>)
    Finished dev [unoptimized + debuginfo] target(s) in 11.23 secs
$ cargo build --release
   Compiling alliteration-gen v0.1.0 (file:///<REDACTED>)
^C

I expected to see this happen: With the non-optimized code running in 11 seconds, the optimized one shouldn't take more than a few minutes.

Instead, this happened: When i waited for it to compile, it took 10039.81 secs (2 hrs 47 mins).

I replicated the problem again on a different Cargo project path and different machines and faced the same issue (didn't fully compile them as it would take very long. I waited for 5 minutes for each machine).

Environment

Machines tested:

  1. Ubuntu 16.04 (Dell XPS 9360 Intel i5-7200U; 8GB RAM)
  2. Kali Linux 2018.01 (Old Laptop with i5)
  3. Windows 10 with msvc (Desktop Intel i5 4570 3.2GHz; 8GB RAM)

Cargo project files attached. nolib.zip

Meta

rustc --version --verbose: rustc 1.22.1 binary: rustc commit-hash: unknown commit-date: unknown host: x86_64-unknown-linux-gnu release: 1.22.1 LLVM version: 4.0

rustc 1.24.1 binary: rustc commit-hash: unknown commit-date: unknown host: x86_64-unknown-linux-gnu release: 1.24.1 LLVM version: 4.0

rustc 1.24.1 (d3ae9a9e0 2018-02-27) binary: rustc commit-hash: d3ae9a9e08edf12de0ed82af57ba2a56c26496ea commit-date: 2018-02-27 host: x86_64-pc-windows-msvc release: 1.24.1 LLVM version: 4.0

matthiaskrgr commented 6 years ago

Seems llvm is bottlenecking here

  21,28%  librustc_trans-llvm.so                   [.] llvm::ValueHandleBase::AddToUseList
  12,09%  librustc_trans-llvm.so                   [.] llvm::Value::getContext
  11,45%  librustc_trans-llvm.so                   [.] llvm::ValueHandleBase::RemoveFromUseList
  10,25%  librustc_trans-llvm.so                   [.] llvm::ValueHandleBase::AddToExistingUseList
   9,40%  librustc_trans-llvm.so                   [.] llvm::Use::getImpliedUser
   3,45%  librustc_trans-llvm.so                   [.] llvm::InstCombiner::visitAllocSite
   1,12%  librustc_trans-llvm.so                   [.] llvm::SmallVectorTemplateBase<llvm::WeakTrackingVH, false>::grow
   0,98%  librustc_trans-llvm.so                   [.] llvm::GetElementPtrInst::hasAllZeroIndices
   0,96%  [kernel]                                 [k] clear_page_erms
   0,68%  libc-2.26.so                             [.] __memmove_sse2_unaligned_erms
   0,54%  librustc_trans-llvm.so                   [.] llvm::Use::getUser
   0,54%  [kernel]                                 [.] native_irq_return_iret
   0,44%  [i915]                                   [k] fw_domains_get

(can repro with rustc 1.26.0-nightly (c08480fce 2018-03-23) and doing release build)

matthiaskrgr commented 6 years ago

Ok, it only needed 3.5 hours to compiler on a single core!

timing information

   Compiling alliteration-gen v0.1.0 (file:///tmp/test/nolib)
  time: 0.311; rss: 101MB       parsing
  time: 0.000; rss: 101MB       recursion limit
  time: 0.000; rss: 101MB       crate injection
  time: 0.000; rss: 101MB       plugin loading
  time: 0.000; rss: 101MB       plugin registration
  time: 0.044; rss: 124MB       expansion
  time: 0.000; rss: 124MB       maybe building test harness
  time: 0.001; rss: 124MB       maybe creating a macro crate
  time: 0.006; rss: 124MB       creating allocators
  time: 0.004; rss: 124MB       AST validation
  time: 0.006; rss: 128MB       name resolution
  time: 0.001; rss: 128MB       complete gated feature checking
  time: 0.017; rss: 146MB       lowering ast -> hir
  time: 0.018; rss: 146MB       early lint checks
warning: The output of `-Z time-llvm-passes` will only reflect timings of re-translated modules when used with incremental compilation
  time: 0.057; rss: 162MB       indexing hir
  time: 0.000; rss: 124MB       load query result cache
  time: 0.000; rss: 124MB       looking for entry point
  time: 0.000; rss: 124MB       looking for plugin registrar
  time: 0.001; rss: 124MB       loop checking
  time: 0.000; rss: 124MB       attribute checking
  time: 0.002; rss: 124MB       stability checking
  time: 0.019; rss: 139MB       type collecting
  time: 0.000; rss: 139MB       outlives testing
  time: 0.000; rss: 139MB       impl wf inference
  time: 0.000; rss: 139MB       coherence checking
  time: 0.000; rss: 139MB       variance testing
  time: 0.012; rss: 143MB       wf checking
  time: 0.000; rss: 143MB       item-types checking
  time: 1.599; rss: 204MB       item-bodies checking
  time: 0.105; rss: 211MB       rvalue promotion
  time: 0.034; rss: 211MB       privacy checking
  time: 0.002; rss: 211MB       intrinsic checking
  time: 0.002; rss: 211MB       match checking
  time: 0.003; rss: 211MB       liveness checking
  time: 1.246; rss: 423MB       borrow checking
  time: 0.000; rss: 423MB       MIR borrow checking
  time: 0.000; rss: 423MB       MIR effect checking
  time: 0.003; rss: 423MB       death checking
  time: 0.000; rss: 423MB       unused lib feature checking
  time: 0.098; rss: 423MB       lint checking
  time: 0.000; rss: 423MB       dumping chalk-like clauses
  time: 0.000; rss: 423MB       resolving dependency formats
    time: 0.000; rss: 423MB     write metadata
    time: 0.450; rss: 450MB     translation item collection
    time: 0.000; rss: 450MB     codegen unit partitioning
    time: 0.000; rss: 450MB     write allocator module
    time: 1.286; rss: 622MB     translate to LLVM IR
    time: 0.000; rss: 622MB     assert dep graph
    time: 0.000; rss: 622MB     serialize dep graph
  time: 1.739; rss: 622MB       translation
    time: 0.833; rss: 384MB     llvm function passes [alliteration_gen0]
    time: 9852.241; rss: 546MB  llvm module passes [alliteration_gen0]
    time: 1685.726; rss: 463MB  codegen passes [alliteration_gen0]
  time: 11539.432; rss: 163MB   LLVM passes
===-------------------------------------------------------------------------===
                              Register Allocation
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0205 seconds (0.0205 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.0173 ( 99.8%)   0.0032 (100.0%)   0.0205 ( 99.8%)   0.0205 ( 99.8%)  Seed Live Regs
   0.0000 (  0.2%)   0.0000 (  0.0%)   0.0000 (  0.2%)   0.0000 (  0.2%)  Evict
   0.0174 (100.0%)   0.0032 (100.0%)   0.0205 (100.0%)   0.0205 (100.0%)  Total
===-------------------------------------------------------------------------===
                      Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
  Total Execution Time: 1535.6036 seconds (1536.6184 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  1529.1632 ( 99.6%)   0.2696 ( 73.8%)  1529.4328 ( 99.6%)  1530.4475 ( 99.6%)  DAG Combining 1
   2.3814 (  0.2%)   0.0135 (  3.7%)   2.3949 (  0.2%)   2.3950 (  0.2%)  DAG Combining 2
   1.9965 (  0.1%)   0.0300 (  8.2%)   2.0265 (  0.1%)   2.0265 (  0.1%)  Instruction Selection
   0.8134 (  0.1%)   0.0065 (  1.8%)   0.8200 (  0.1%)   0.8200 (  0.1%)  DAG Legalization
   0.4507 (  0.0%)   0.0232 (  6.3%)   0.4739 (  0.0%)   0.4739 (  0.0%)  Instruction Scheduling
   0.2092 (  0.0%)   0.0000 (  0.0%)   0.2092 (  0.0%)   0.2092 (  0.0%)  Type Legalization
   0.1781 (  0.0%)   0.0168 (  4.6%)   0.1949 (  0.0%)   0.1949 (  0.0%)  Instruction Creation
   0.0343 (  0.0%)   0.0000 (  0.0%)   0.0343 (  0.0%)   0.0343 (  0.0%)  Vector Legalization
   0.0111 (  0.0%)   0.0056 (  1.5%)   0.0167 (  0.0%)   0.0167 (  0.0%)  Instruction Scheduling Cleanup
   0.0004 (  0.0%)   0.0000 (  0.0%)   0.0004 (  0.0%)   0.0004 (  0.0%)  DAG Combining after legalize types
  1535.2383 (100.0%)   0.3653 (100.0%)  1535.6036 (100.0%)  1536.6184 (100.0%)  Total
===-------------------------------------------------------------------------===
                                 DWARF Emission
===-------------------------------------------------------------------------===
  Total Execution Time: 0.3086 seconds (0.3233 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.1348 ( 54.3%)   0.0351 ( 58.2%)   0.1699 ( 55.1%)   0.1843 ( 57.0%)  Debug Info Emission
   0.1115 ( 44.9%)   0.0247 ( 40.9%)   0.1362 ( 44.2%)   0.1366 ( 42.2%)  DWARF Exception Writer
   0.0019 (  0.8%)   0.0005 (  0.9%)   0.0024 (  0.8%)   0.0024 (  0.7%)  DWARF Debug Writer
   0.2482 (100.0%)   0.0603 (100.0%)   0.3086 (100.0%)   0.3233 (100.0%)  Total
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 11517.6105 seconds (11536.0309 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  9049.1177 ( 84.2%)  764.1362 ( 99.4%)  9813.2539 ( 85.2%)  9830.6488 ( 85.2%)  Combine redundant instructions
  1536.5964 ( 14.3%)   0.4161 (  0.1%)  1537.0125 ( 13.3%)  1538.0273 ( 13.3%)  Demanded bits analysis
  138.7323 (  1.3%)   3.3021 (  0.4%)  142.0344 (  1.2%)  142.1022 (  1.2%)  Machine Instruction Scheduler
  10.6652 (  0.1%)   0.0102 (  0.0%)  10.6754 (  0.1%)  10.6762 (  0.1%)  Dead Store Elimination
   2.7106 (  0.0%)   0.0500 (  0.0%)   2.7606 (  0.0%)   2.7606 (  0.0%)  SLP Vectorizer
   0.7723 (  0.0%)   0.0235 (  0.0%)   0.7958 (  0.0%)   0.7958 (  0.0%)  Combine redundant instructions
   0.7857 (  0.0%)   0.0069 (  0.0%)   0.7927 (  0.0%)   0.7927 (  0.0%)  Combine redundant instructions
   0.7482 (  0.0%)   0.0240 (  0.0%)   0.7723 (  0.0%)   0.7724 (  0.0%)  Combine redundant instructions
   0.7228 (  0.0%)   0.0233 (  0.0%)   0.7461 (  0.0%)   0.7461 (  0.0%)  Combine redundant instructions
   0.7324 (  0.0%)   0.0135 (  0.0%)   0.7459 (  0.0%)   0.7459 (  0.0%)  Combine redundant instructions
   0.5270 (  0.0%)   0.0265 (  0.0%)   0.5536 (  0.0%)   0.5536 (  0.0%)  X86 Assembly Printer
   0.4229 (  0.0%)   0.0252 (  0.0%)   0.4481 (  0.0%)   0.4481 (  0.0%)  Early CSE
   0.3849 (  0.0%)   0.0412 (  0.0%)   0.4260 (  0.0%)   0.4260 (  0.0%)  Global Value Numbering
   0.3855 (  0.0%)   0.0135 (  0.0%)   0.3991 (  0.0%)   0.3991 (  0.0%)  Combine redundant instructions
   0.3750 (  0.0%)   0.0133 (  0.0%)   0.3883 (  0.0%)   0.3883 (  0.0%)  Combine redundant instructions
   0.3343 (  0.0%)   0.0434 (  0.0%)   0.3776 (  0.0%)   0.3776 (  0.0%)  Called Value Propagation
   0.3229 (  0.0%)   0.0170 (  0.0%)   0.3399 (  0.0%)   0.3399 (  0.0%)  Early CSE w/ MemorySSA
   0.2652 (  0.0%)   0.0236 (  0.0%)   0.2887 (  0.0%)   0.2887 (  0.0%)  Value Propagation
....
    Finished release [optimized] target(s) in 11580.25 secs
bluss commented 6 years ago

If you come across this issue and just wonder how to work around it, one can use slices instead of arrays for this code.

The diff for the reproducing code is like this, and then it compiles reasonably quickly (20 seconds) in release mode:

diff --git src/main.rs src/main.rs
index 6a6d2a3..69be593 100644
--- src/main.rs
+++ src/main.rs
@@ -8,7 +8,7 @@ fn main() {
 }

 fn gen_adj(alpha:&String) -> String {
-    let list_of_adj = [
+    let list_of_adj: &[&str] = &[
         "a-ok",
         "a-okay",
         "a-one",
@@ -21268,7 +21268,7 @@ fn gen_adj(alpha:&String) -> String {
 }

 fn gen_noun(alpha:&String) -> String {
-    let list_of_noun= [
+    let list_of_noun: &[&str] = &[
         "a'man",
         "a-bomb",
         "a-horizon",

The original code was using very large arrays, while using a static literal slice works just as well for the use case.

oli-obk commented 6 years ago

The array was not const promoted I presume, while the slice is. Thus LLVM is trying to cope with a gigantic stack allocation.

steveklabnik commented 4 years ago

An update two and a half years later. This includes building dependencies:

❯ rustc +nightly --version -v
rustc 1.48.0-nightly (e2be5f568 2020-09-09)
binary: rustc
commit-hash: e2be5f568d1f60365b825530f5b5cb722460591b
commit-date: 2020-09-09
host: x86_64-pc-windows-msvc
release: 1.48.0-nightly
LLVM version: 11.0
❯ cargo +nightly build
<snip>
    Finished dev [unoptimized + debuginfo] target(s) in 8m 42s
❯ cargo +nightly build --release
<snip>

... so actually I think that this got worse. I left it running for the past 8 hours, overnight, and it shows no signs of stopping. THis is on my pretty beefy desktop too.