rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Investigate memory usage of compiling the packed_simd crate #57829

Closed hsivonen closed 5 years ago

hsivonen commented 5 years ago

Steps to reproduce

  1. Create a new crate with cargo.
  2. Add packed_simd = '0.3.1' to Cargo.toml of the new crate.
  3. Build the new crate.

Actual results

While compiling packed_simd, rustc takes more than 2 GB of RAM.

Expected results

Lower RAM usage.

Additional info

Maybe it's just the nature of packed_simd that it takes a lot of RAM to compile, and there's no bug. However, if RAM usage reached 3 GB in the future, the crate would become unbuildable on 32-bit systems. It might be worthwhile to investigate if building packed_simd has to take this much RAM or if there is an opportunity to use less RAM without adversely affecting compilation speed on systems that have plenty of RAM.

gnzlbg commented 5 years ago

cc @mw @nnethercote

matthiaskrgr commented 5 years ago

Looks like nll needs a lot of memory here

   Compiling packed_simd v0.3.1
  time: 0.054; rss: 57MB    parsing
  time: 0.000; rss: 58MB    attributes injection
  time: 0.000; rss: 58MB    recursion limit
  time: 0.000; rss: 58MB    crate injection
  time: 0.000; rss: 58MB    plugin loading
  time: 0.000; rss: 58MB    plugin registration
  time: 0.005; rss: 58MB    pre ast expansion lint checks
    time: 2.550; rss: 369MB expand crate
    time: 0.000; rss: 369MB check unused macros
  time: 2.550; rss: 369MB   expansion
  time: 0.000; rss: 369MB   maybe building test harness
  time: 0.012; rss: 369MB   maybe creating a macro crate
  time: 0.048; rss: 370MB   creating allocators
  time: 0.036; rss: 370MB   AST validation
  time: 0.497; rss: 412MB   name resolution
  time: 0.075; rss: 412MB   complete gated feature checking
  time: 0.321; rss: 481MB   lowering ast -> hir
  time: 0.081; rss: 482MB   early lint checks
    time: 0.052; rss: 504MB validate hir map
  time: 0.353; rss: 504MB   indexing hir
  time: 0.000; rss: 504MB   load query result cache
  time: 0.000; rss: 504MB   looking for entry point
  time: 0.000; rss: 504MB   dep graph tcx init
  time: 0.001; rss: 504MB   looking for plugin registrar
  time: 0.001; rss: 504MB   looking for derive registrar
  time: 0.019; rss: 504MB   loop checking
  time: 0.024; rss: 504MB   attribute checking
    time: 0.000; rss: 515MB solve_nll_region_constraints(DefId(0/1:2171 ~ packed_simd[a932]::v64[0]::f32x2[0]::{{constant}}[0]))
*snip*
    time: 0.000; rss: 527MB solve_nll_region_constraints(DefId(0/1:4611 ~ packed_simd[a932]::vSize[0]::{{impl}}[587]::from[0]::U[0]::array[0]::{{constant}}[0]))
  time: 0.636; rss: 527MB   stability checking
  time: 0.124; rss: 527MB   type collecting
  time: 0.003; rss: 527MB   outlives testing
  time: 0.019; rss: 527MB   impl wf inference
    time: 0.000; rss: 1113MB    solve_nll_region_constraints(DefId(0/1:224 ~ packed_simd[a932]::codegen[0]::shuffle[0]::{{impl}}[0]::{{constant}}[0]))
*snip*
    time: 0.000; rss: 1246MB    solve_nll_region_constraints(DefId(0/1:4867 ~ packed_simd[a932]::vPtr[0]::{{impl}}[104]::{{constant}}[0]))
  time: 9.972; rss: 1408MB  coherence checking
  time: 0.002; rss: 1408MB  variance testing
    time: 0.000; rss: 1605MB    solve_nll_region_constraints(DefId(0/1:366 ~ packed_simd[a932]::codegen[0]::v16[0]::{{impl}}[0]::NT[0]::{{constant}}[0]))
*snip*
    time: 0.000; rss: 2013MB    solve_nll_region_constraints(DefId(0/0:4027 ~ packed_simd[a932]::codegen[0]::reductions[0]::mask[0]::{{impl}}[7]::any[0]))
    time: 0.000; rss: 2013MB    solve_nll_region_constraints(DefId(0/0:4053 ~ packed_simd[a932]::codegen[0]::reductions[0]::mask[0]::{{impl}}[17]::any[0]))
  time: 5.040; rss: 2013MB  MIR borrow checking
  time: 0.000; rss: 2013MB  dumping chalk-like clauses
  time: 0.005; rss: 2013MB  MIR effect checking
  time: 0.072; rss: 2018MB  death checking
  time: 0.021; rss: 2018MB  unused lib feature checking
  time: 0.176; rss: 2019MB  lint checking
  time: 0.000; rss: 2019MB  resolving dependency formats
    time: 0.890; rss: 2055MB    write metadata
      time: 0.010; rss: 2055MB  collecting roots
      time: 0.186; rss: 2056MB  collecting mono items
    time: 0.196; rss: 2056MB    monomorphization collection
    time: 0.001; rss: 2056MB    codegen unit partitioning
    time: 0.122; rss: 2060MB    codegen to LLVM IR
    time: 0.000; rss: 2060MB    assert dep graph
    time: 0.000; rss: 2060MB    serialize dep graph
  time: 1.215; rss: 2060MB  codegen
    time: 0.056; rss: 2063MB    llvm function passes [packed_simd.smey8184-cgu.0]
    time: 0.777; rss: 2071MB    llvm module passes [packed_simd.smey8184-cgu.0]
    time: 0.798; rss: 2079MB    codegen passes [packed_simd.smey8184-cgu.0]
  time: 1.703; rss: 1539MB  LLVM passes
  time: 0.000; rss: 1540MB  serialize work products
  time: 0.017; rss: 1540MB  linking
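
To find where the RSS grows in a log like the one above, the lines can be parsed and the differences between consecutive samples sorted. The following is a minimal sketch, not part of rustc or cargo; `PassSample`, `parse_line`, and `largest_jumps` are names made up for illustration, and the embedded sample is abbreviated from the log above.

```rust
/// One parsed log line: the pass label and the resident set size in MB.
#[derive(Debug, PartialEq)]
struct PassSample {
    label: String,
    rss_mb: u64,
}

/// Parses a line such as `  time: 9.972; rss: 1408MB  coherence checking`.
/// Returns `None` for lines that do not match (for example `*snip*`).
fn parse_line(line: &str) -> Option<PassSample> {
    let rest = line.trim().strip_prefix("time:")?;
    let (_time, rest) = rest.split_once("; rss:")?;
    let (rss, label) = rest.trim_start().split_once("MB")?;
    Some(PassSample {
        label: label.trim().to_string(),
        rss_mb: rss.trim().parse::<u64>().ok()?,
    })
}

/// Returns (label, growth in MB) for every line whose RSS is higher than
/// the previous line's, largest growth first.
fn largest_jumps(log: &str) -> Vec<(String, u64)> {
    let samples: Vec<PassSample> = log.lines().filter_map(parse_line).collect();
    let mut jumps: Vec<(String, u64)> = samples
        .windows(2)
        .filter(|w| w[1].rss_mb > w[0].rss_mb)
        .map(|w| (w[1].label.clone(), w[1].rss_mb - w[0].rss_mb))
        .collect();
    jumps.sort_by(|a, b| b.1.cmp(&a.1));
    jumps
}

fn main() {
    // Abbreviated sample; the DefId argument is elided here for brevity.
    let log = "\
  time: 0.019; rss: 527MB   impl wf inference
    time: 0.000; rss: 1113MB    solve_nll_region_constraints(...)
  time: 9.972; rss: 1408MB  coherence checking
  time: 5.040; rss: 2013MB  MIR borrow checking";
    for (label, mb) in largest_jumps(log) {
        println!("+{} MB first visible at: {}", mb, label);
    }
}
```

A jump is attributed to the line where it first becomes visible. Nested entries are printed before their parent pass, so growth that shows up at a `solve_nll_region_constraints` line may have been allocated earlier within the enclosing pass.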

gnzlbg commented 5 years ago

Coherence checking also takes a good chunk of memory:

time: 0.000; rss: 1246MB    solve_nll_region_constraints(DefId(0/1:4867 ~ packed_simd[a932]::vPtr[0]::{{impl}}[104]::{{constant}}[0]))
  time: 9.972; rss: 1408MB  coherence checking

although NLL is the first suspect here. I wonder why NLL uses this much memory: packed_simd is full of methods, but the great majority of them are essentially one-liners.

memoryruins commented 5 years ago

Reported the following spike of memory usage in #57432, which occurred after #56723

[Image: packed-simd-memory]

mati865 commented 5 years ago

This one could be closed as a duplicate of https://github.com/rust-lang/rust/issues/57432, I guess.

gnzlbg commented 5 years ago

EDIT: @mati865 you are right, these are duplicates. I thought that was a different issue, which apparently never got filed, so forget this.


original comment:

@mati865 while they are related, they are two different issues:

nnethercote commented 5 years ago

I did a DHAT run. The "At t-gmax" measurement is the relevant one; it's short for "time of global max". It shows that the interning of constants within TypeFolder accounts for over 54% of the global peak:

AP 1.1.1.1.1/2 (2 children) {
  Total:     912,261,120 bytes (12.02%, 7,312.63/Minstr) in 6 blocks (0%, 0/Minstr), avg size 152,043,520 bytes, avg lifetime 103,155,024,513.33 instrs (82.69% of program duration)
  At t-gmax: 912,261,120 bytes (54.74%) in 6 blocks (0%), avg size 152,043,520 bytes
  At t-end:  0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
  Reads:     1,827,458,569 bytes (4.97%, 14,648.81/Minstr), 2/byte
  Writes:    844,260,160 bytes (9.59%, 6,767.54/Minstr), 0.93/byte
  Allocated at {
    #1: 0xB66BCCB: alloc (alloc.rs:72)
    #2: 0xB66BCCB: alloc (alloc.rs:148)
    #3: 0xB66BCCB: allocate_in<u8,alloc::alloc::Global> (raw_vec.rs:96)
    #4: 0xB66BCCB: with_capacity<u8> (raw_vec.rs:140)
    #5: 0xB66BCCB: new<u8> (lib.rs:66)
    #6: 0xB66BCCB: arena::DroplessArena::grow (lib.rs:346)
    #7: 0x8C1BB25: alloc_raw (lib.rs:362)
    #8: 0x8C1BB25: alloc<rustc::ty::sty::LazyConst> (lib.rs:378)
    #9: 0x8C1BB25: alloc<rustc::ty::sty::LazyConst> (lib.rs:465)
    #10: 0x8C1BB25: intern_lazy_const (context.rs:1123)
    #11: 0x8C1BB25: <rustc::traits::project::AssociatedTypeNormalizer<'a, 'b, 'gcx, 'tcx> as rustc::ty::fold::TypeFolder<'gcx, 'tcx>>::fold_const (project.rs:423)
    #12: 0x8C1B235: fold_with<rustc::traits::project::AssociatedTypeNormalizer> (structural_impls.rs:1049)
    #13: 0x8C1B235: super_fold_with<rustc::traits::project::AssociatedTypeNormalizer> (structural_impls.rs:719)
    #14: 0x8C1B235: <rustc::traits::project::AssociatedTypeNormalizer<'a, 'b, 'gcx, 'tcx> as rustc::ty::fold::TypeFolder<'gcx, 'tcx>>::fold_ty (project.rs:337)
    #15: 0x890C0D0: fold_with<rustc::traits::project::AssociatedTypeNormalizer> (structural_impls.rs:769)
    #16: 0x890C0D0: super_fold_with<rustc::traits::project::AssociatedTypeNormalizer> (subst.rs:135)
    #17: 0x890C0D0: fold_with<rustc::ty::subst::Kind,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #18: 0x890C0D0: {{closure}}<rustc::traits::project::AssociatedTypeNormalizer> (subst.rs:328)
    #19: 0x890C0D0: call_once<(&rustc::ty::subst::Kind),closure> (function.rs:279)
    #20: 0x890C0D0: map<&rustc::ty::subst::Kind,rustc::ty::subst::Kind,&mut closure> (option.rs:414)
    #21: 0x890C0D0: next<rustc::ty::subst::Kind,core::slice::Iter<rustc::ty::subst::Kind>,closure> (mod.rs:567)
    #22: 0x890C0D0: <smallvec::SmallVec<A> as core::iter::traits::collect::Extend<<A as smallvec::Array>::Item>>::extend (lib.rs:1349)
    #23: 0x8EF9787: from_iter<[rustc::ty::subst::Kind; 8],core::iter::adapters::Map<core::slice::Iter<rustc::ty::subst::Kind>, closure>> (lib.rs:1333)
    #24: 0x8EF9787: collect<core::iter::adapters::Map<core::slice::Iter<rustc::ty::subst::Kind>, closure>,smallvec::SmallVec<[rustc::ty::subst::Kind; 8]>> (iterator.rs:1466)
    #25: 0x8EF9787: rustc::ty::subst::<impl rustc::ty::fold::TypeFoldable<'tcx> for &'tcx rustc::ty::List<rustc::ty::subst::Kind<'tcx>>>::super_fold_with (subst.rs:328)
    #26: 0x8C1B183: fold_with<&rustc::ty::List<rustc::ty::subst::Kind>,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #27: 0x8C1B183: super_fold_with<rustc::traits::project::AssociatedTypeNormalizer> (structural_impls.rs:721)
    #28: 0x8C1B183: <rustc::traits::project::AssociatedTypeNormalizer<'a, 'b, 'gcx, 'tcx> as rustc::ty::fold::TypeFolder<'gcx, 'tcx>>::fold_ty (project.rs:337)
    #29: 0x890C0D0: fold_with<rustc::traits::project::AssociatedTypeNormalizer> (structural_impls.rs:769)
    #30: 0x890C0D0: super_fold_with<rustc::traits::project::AssociatedTypeNormalizer> (subst.rs:135)
    #31: 0x890C0D0: fold_with<rustc::ty::subst::Kind,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #32: 0x890C0D0: {{closure}}<rustc::traits::project::AssociatedTypeNormalizer> (subst.rs:328)
    #33: 0x890C0D0: call_once<(&rustc::ty::subst::Kind),closure> (function.rs:279)
    #34: 0x890C0D0: map<&rustc::ty::subst::Kind,rustc::ty::subst::Kind,&mut closure> (option.rs:414)
    #35: 0x890C0D0: next<rustc::ty::subst::Kind,core::slice::Iter<rustc::ty::subst::Kind>,closure> (mod.rs:567)
    #36: 0x890C0D0: <smallvec::SmallVec<A> as core::iter::traits::collect::Extend<<A as smallvec::Array>::Item>>::extend (lib.rs:1349)
    #37: 0x8EF9787: from_iter<[rustc::ty::subst::Kind; 8],core::iter::adapters::Map<core::slice::Iter<rustc::ty::subst::Kind>, closure>> (lib.rs:1333)
    #38: 0x8EF9787: collect<core::iter::adapters::Map<core::slice::Iter<rustc::ty::subst::Kind>, closure>,smallvec::SmallVec<[rustc::ty::subst::Kind; 8]>> (iterator.rs:1466)
    #39: 0x8EF9787: rustc::ty::subst::<impl rustc::ty::fold::TypeFoldable<'tcx> for &'tcx rustc::ty::List<rustc::ty::subst::Kind<'tcx>>>::super_fold_with (subst.rs:328)
    #40: 0x8BFE173: fold_with<&rustc::ty::List<rustc::ty::subst::Kind>,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #41: 0x8BFE173: super_fold_with<rustc::traits::project::AssociatedTypeNormalizer> (macros.rs:344)
    #42: 0x8BFE173: fold_with<rustc::ty::sty::TraitRef,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #43: 0x8BFE173: super_fold_with<rustc::ty::sty::TraitRef,rustc::traits::project::AssociatedTypeNormalizer> (macros.rs:397)
    #44: 0x8BFE173: fold_with<core::option::Option<rustc::ty::sty::TraitRef>,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #45: 0x8BFE173: super_fold_with<rustc::traits::project::AssociatedTypeNormalizer> (macros.rs:344)
    #46: 0x8BFE173: fold_with<rustc::ty::ImplHeader,rustc::traits::project::AssociatedTypeNormalizer> (fold.rs:47)
    #47: 0x8BFE173: fold<rustc::ty::ImplHeader> (project.rs:315)
    #48: 0x8BFE173: normalize_with_depth<rustc::ty::ImplHeader> (project.rs:274)
    #49: 0x8BFE173: normalize<rustc::ty::ImplHeader> (project.rs:258)
    #50: 0x8BFE173: rustc::traits::coherence::with_fresh_ty_vars (coherence.rs:107)
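
As a quick sanity check on how to read the report above, the "At t-gmax" line gives both the bytes held by this allocation point at the peak and its share of the peak, so the size of the peak itself can be derived. This is only a sketch of that arithmetic using the figures quoted above; `implied_peak_bytes` is a helper name invented here.

```rust
/// Global heap peak implied by one allocation point's bytes at the peak
/// and its percentage share of that peak.
fn implied_peak_bytes(bytes_at_gmax: u64, share_percent: f64) -> f64 {
    bytes_at_gmax as f64 / (share_percent / 100.0)
}

fn main() {
    let bytes_at_gmax: u64 = 912_261_120; // "At t-gmax" bytes from the report
    let blocks: u64 = 6; // "in 6 blocks"
    let share_percent = 54.74; // "(54.74%)"

    // Average block size, which should match the "avg size" in the report.
    let avg_block = bytes_at_gmax / blocks;
    let peak = implied_peak_bytes(bytes_at_gmax, share_percent);

    println!("average block: {} bytes", avg_block);
    println!("implied global peak: about {:.0} MB", peak / 1_000_000.0);
}
```

The derived peak covers heap allocations only, so it need not match the RSS figures in the pass log earlier in this thread.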

nnethercote commented 5 years ago

@eddyb @oli-obk @RalfJung Any thoughts on how to improve intern_lazy_const?

RalfJung commented 5 years ago

Cc @eddyb

nnethercote commented 5 years ago

Any thoughts on how to improve intern_lazy_const?

There is an obvious problem: intern_lazy_const doesn't intern the value! And the values passed are exceedingly repetitive. Here's a histogram of the top 10, which account for 97.6% of the calls:

17886042 counts:
(  1)  5253160 (29.4%, 29.4%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 2 }) })
(  2)  5192895 (29.0%, 58.4%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 4 }) })
(  3)  3928986 (22.0%, 80.4%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 8 }) })
(  4)  1600916 ( 9.0%, 89.3%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 16 }) })
(  5)   719785 ( 4.0%, 93.3%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 32 }) })
(  6)   299507 ( 1.7%, 95.0%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 1 }) })
(  7)   271847 ( 1.5%, 96.5%): Evaluated(Const { ty: usize, val: Scalar(Bits { size: 8, bits: 64 }) })
(  8)    61636 ( 0.3%, 96.9%): Unevaluated(DefId(0/1:4735 ~ packed_simd[3c0f]::vPtr[0]::mptrx4[0]::{{constant}}[0]), [])
(  9)    61636 ( 0.3%, 97.2%): Unevaluated(DefId(0/1:4823 ~ packed_simd[3c0f]::vPtr[0]::mptrx8[0]::{{constant}}[0]), [])
( 10)    61636 ( 0.3%, 97.6%): Unevaluated(DefId(0/1:4653 ~ packed_simd[3c0f]::vPtr[0]::mptrx2[0]::{{constant}}[0]), [])

Fixing this should drastically reduce the memory usage.

I tried doing the obvious thing by introducing GlobalCtxt::lazy_const_interner, heavily inspired by GlobalCtxt::layout_interner, but I couldn't get the lifetimes to work. I will try again tomorrow if nobody else beats me to it.
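
To illustrate the difference between allocating on every call and actually interning, here is a toy sketch. It is not rustc code: `ToyConst`, `NaiveAllocator`, and `Interner` are hypothetical names, and it uses `Rc` with a `HashSet` purely to keep the example self-contained, which sidesteps the arena and lifetime questions mentioned above.

```rust
use std::collections::HashSet;
use std::rc::Rc;

/// Toy stand-in for the constants in the histogram above (not rustc's type).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum ToyConst {
    Evaluated { bits: u128 },
    Unevaluated { def_index: u32 },
}

/// Allocates a fresh copy on every call, with no deduplication.
#[derive(Default)]
struct NaiveAllocator {
    allocations: usize,
}

impl NaiveAllocator {
    fn alloc(&mut self, c: ToyConst) -> Rc<ToyConst> {
        self.allocations += 1;
        Rc::new(c)
    }
}

/// Stores each distinct value once and hands out shared handles to it.
#[derive(Default)]
struct Interner {
    set: HashSet<Rc<ToyConst>>,
}

impl Interner {
    fn intern(&mut self, c: ToyConst) -> Rc<ToyConst> {
        // Reuse the existing allocation if an equal value was seen before.
        if let Some(existing) = self.set.get(&c) {
            return Rc::clone(existing);
        }
        let fresh = Rc::new(c);
        self.set.insert(Rc::clone(&fresh));
        fresh
    }

    /// Number of distinct values stored, i.e. allocations actually made.
    fn allocations(&self) -> usize {
        self.set.len()
    }
}

fn main() {
    let mut naive = NaiveAllocator::default();
    let mut interner = Interner::default();

    // A highly repetitive request stream, loosely modelled on the histogram.
    let lanes = [2u128, 4, 8, 16];
    for i in 0..1000usize {
        let c = if i % 100 == 0 {
            ToyConst::Unevaluated { def_index: 4735 }
        } else {
            ToyConst::Evaluated { bits: lanes[i % lanes.len()] }
        };
        naive.alloc(c.clone());
        interner.intern(c);
    }

    println!("naive allocations:    {}", naive.allocations);
    println!("interned allocations: {}", interner.allocations());
}
```

With a request stream as repetitive as the one in the histogram, the number of stored values depends on how many distinct constants there are rather than on how many calls are made.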

hsivonen commented 5 years ago

FWIW, without the in-flight fix here, a relatively small tweak to packed_simd made packed_simd uncompilable on an ARMv7 system whose /proc/meminfo says there's 3624684 kB of RAM plus some swap. (And a Chrome OS kernel; I don't know what kind of swap use policy Chrome OS applies.)

I'll test again once the fix for this issue is in nightly.

RalfJung commented 5 years ago

This just brought down my whole system -- 16GB of RAM used to be enough to compile two rustc in parallel (with 8 jobs each), but with the current RAM consumption that does not seem to be the case any more.

oli-obk commented 5 years ago

Can you try again with today's nightly?

hsivonen commented 5 years ago

Much better memory usage now. Thank you!

It seems it would be worthwhile to nominate this for uplift to beta, but I'm not permitted to add the tag myself.