rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
98.39k stars 12.73k forks source link

Performance of helloworld5000 could be improved #50994

Closed nnethercote closed 6 months ago

nnethercote commented 6 years ago

helloworld5000 is the name I've given to the benchmark that is helloworld with the println!("Hello world"); repeated 5,000 times. It's an interesting stress test for the compiler.

On my machine, a debug build takes 4.5 seconds and an opt build takes 62(!) seconds.

In the debug build, execution time is dominated by take_and_reset_data. Cachegrind measures these instruction counts:

28,047,781,500  PROGRAM TOTALS

7,501,239,976  /home/njn/moz/rust0/src/libcore/slice/mod.rs:rustc::infer::region_constraints::RegionConstraintCollector::take_and_reset_data
5,625,275,003  /home/njn/.cargo/registry/src/github.com-1ecc6299db9ec823/ena-0.9.3/src/snapshot_vec.rs:rustc::infer::region_constraints::RegionConstraintCollector::take_and_reset_data

The reset_unifications call within take_and_reset_data is the expensive part. It all boils down to set_all within the ena crate:

            .      pub fn set_all(&mut self, mut new_elems: impl FnMut(usize) -> D::Value) {
      185,003          if !self.in_snapshot() {
            .              for (slot, index) in self.values.iter_mut().zip(0..) {
5,625,090,000                  *slot = new_elems(index);
            .              }
            .          } else {
            .              for i in 0..self.values.len() {
            .                  self.set(i, new_elems(i));
            .              }
            .          }
            .      }

and iterator code (called from set_all):

            .      fn next(&mut self) -> Option<$elem> {
            .          // could be implemented with slices, but this avoids bounds checks
            .          unsafe {
            .              if mem::size_of::<T>() != 0 {
    1,067,186                  assume(!self.ptr.is_null());
       20,013                  assume(!self.end.is_null());
            .              }
7,621,400,804              if self.ptr == self.end {
            .                  None
            .              } else {
            .                  Some($mkref!(self.ptr.post_inc()))
            .              }
            .          }
            .      }

I did some measurement and found that, in the vast majority of cases, reset_unification is a no-op -- it overwrites the the unification table with the same values that it already has. I wonder if we could do better somehow. It's a shame we have to keep the unbound variables around rather than just clearing them like we do with the other data in take_and_reset_data. I know that this is an extreme example, but profiles indicate that reset_unifications is somewhat hot on more normal programs too. @nikomatsakis, any ideas?

In the opt builds, these are the top functions according to Cachegrind:

555,228,191,235  PROGRAM TOTALS

87,515,336,354  /home/njn/moz/rust0/src/llvm/include/llvm/ADT/SmallPtrSet.h:llvm::PointerMayBeCaptured(llvm::Value const*, llvm::CaptureTracker*)
41,641,142,772  /home/njn/moz/rust0/src/llvm/lib/Analysis/CaptureTracking.cpp:llvm::PointerMayBeCaptured(llvm::Value const*, llvm::CaptureTracker*) [/home/njn/moz/rust0/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/codegen-backends/librustc_codegen_llvm-llvm.so]
35,396,524,596  /home/njn/moz/rust0/src/llvm/include/llvm/ADT/DenseMap.h:llvm::SmallDenseMap<llvm::Instruction const*, unsigned int, 32u, llvm::DenseMapInfo<llvm::Instruction const*>, llvm::detail::DenseMapPair<llvm::Instruction const*, unsigned int> >::grow(unsigned int) [/home/njn/moz/rust0/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/codegen-backends/librustc_codegen_llvm-llvm.so]
33,096,763,980  /home/njn/moz/rust0/src/llvm/lib/IR/Attributes.cpp:llvm::AttributeList::getAttributes(unsigned int) const [/home/njn/moz/rust0/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/codegen-backends/librustc_codegen_llvm-llvm.so]
30,008,031,294  /home/njn/moz/rust0/src/llvm/include/llvm/ADT/DenseMap.h:llvm::OrderedBasicBlock::comesBefore(llvm::Instruction const*, llvm::Instruction const*)
29,931,802,152  /home/njn/moz/rust0/src/llvm/lib/IR/Attributes.cpp:llvm::AttributeList::hasAttribute(unsigned int, llvm::Attribute::AttrKind) const [/home/njn/moz/rust0/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/codegen-backends/librustc_codegen_llvm-llvm.so]

That's a lot of time in PointerMayBeCaptured.

nikic commented 6 years ago

I've opened an issue for the LLVM part upstream: https://bugs.llvm.org/show_bug.cgi?id=37588

nnethercote commented 6 years ago

Thank you, @nikic! That's extremely helpful.

dotdash commented 5 years ago

This seems to benefit a lot from ~#57315~ #57351

A non-incremental build of rustc with LTO enabled takes 20s for a debug build, while a rustc with the mentioned PR that was built in incremental mode takes "only" 9.4s. Building a LTO'd version of that rustc now...

dotdash commented 5 years ago

LTO'd rustc with ~#57315~ #57351 takes 3.1s for me to produce a debug build.

jonas-schievink commented 5 years ago

@dotdash Are you sure that's the right PR? It's a Cargo update.

dotdash commented 5 years ago

oops, typo'd, it's #57351

dotdash commented 5 years ago

Thinking about it, given that the test case originally only took 4.5s, I suspect that the const_eval part might have been a regression that was introduced in the meantime.

workingjubilee commented 6 months ago

Either my machine is that much of a beast, or this was simply fixed.

$ time rustc helloworld5000.rs

real    0m0.423s
user    0m0.384s
sys     0m0.039s
$ time rustc -Copt-level=3 helloworld5000.rs

real    0m1.724s
user    0m1.579s
sys     0m0.144s