rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
97.78k stars 12.65k forks source link

Severe perf regression in optimized debug builds due to extra UB checks #121245

Closed RalfJung closed 6 months ago

RalfJung commented 8 months ago

In Miri built with optimized and debug assertions, we have recently seen a 50% increase in CI times. Bisecting points at https://github.com/rust-lang/rust/pull/120594, and @saethlin confirmed that disabling one of these UB checks gives a speedup of the same ballpark as that slowdown.

Several things could be done here:

Update (2024-02-25): CI times are back to normal, now only one specific benchmark is affected.

apiraino commented 8 months ago

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-high

clarfonthey commented 7 months ago

Just want to inject a few thoughts here as I work on #117494, which (as mentioned in that PR), will probably exacerbate this issue.

First, I noticed that there seems to be a few issues with double-checks getting optimized out for checked methods, as demonstrated in my analysis in this comment. This could be a potential factor in the performance issue, either by making the optimiser work harder to remove them (increasing compile times) or by them being run twice because the optimiser failed to remove them. To clarify, I mean cases like this not being optimised properly:

fn checked_thing(...) {
    if condition(...) {
        unsafe { unchecked_thing(...) }
    } else {
        // ...
    }
}

fn unchecked_thing(...) {
    debug_assert_nounwind!(condition(...), ...);
    // ...
}

Note that in some cases, the two conditions might be logically equivalent but not exactly the same code. (In the example of unchecked_shX, the two conditions were rhs < Self::BITS and rhs & (Self::BITS - 1) < Self::BITS.)


Second, I want to voice support for just making these unsafe asserts into their own dedicated flag that can be disabled, like debug_assertions, but technically separate to it. If we additionally offered this to downstream crate authors, it could be a useful tool to enable for debugging when something goes wrong, but to be disabled if it causes too much stress on testing (like here). Additionally, making it a compiler flag means you can enable it sometimes and disable it sometimes, e.g. ensure it's on when unit testing methods that contain unsafe code but not when testing the safe methods that call them.

RalfJung commented 7 months ago

This could be a potential factor in the performance issue, either by making the optimiser work harder to remove them (increasing compile times) or by them being run twice because the optimiser failed to remove them.

This issue is about the second point: performance of rustc-generated code, not compile times.

Additionally, making it a compiler flag means you can enable it sometimes and disable it sometimes, e.g. ensure it's on when unit testing methods that contain unsafe code but not when testing the safe methods that call them.

How do you set compiler flags on a per-test basis...?

clarfonthey commented 7 months ago

Additionally, making it a compiler flag means you can enable it sometimes and disable it sometimes, e.g. ensure it's on when unit testing methods that contain unsafe code but not when testing the safe methods that call them.

How do you set compiler flags on a per-test basis...?

I mostly meant in the sense of like, having them enabled when testing libraries and not on the larger compiler tests, since generally those have separate test suites. But also, the compiler tests themselves tend to set compiler flags when they're running, so you could configure them there.

Basically, not within a single crate, but for multi-crate projects like the compiler and for compiler tests specifically.

Noratrieb commented 7 months ago

121114 will fix or at least significantly help with this. The checks are currently wayyy slower than they need to be because they are outlined for compile time reasons.

RalfJung commented 7 months ago

@Nilstrieb you linked this issue to itself? I assume you mean https://github.com/rust-lang/rust/pull/121114?

Noratrieb commented 7 months ago

yes, i don't think the issue will fix itself

RalfJung commented 7 months ago

With https://github.com/rust-lang/rust/pull/121114, CI times are indeed back to normal. :tada:

However, some individual benchmarks still show a severe regression. Specifically, CARGO_EXTRA_FLAGS="--profile=dev" ./miri bench slice-get-unchecked:

before (Miri 5f25162abf547fa3d306d2bdd7277f7a034fce5f):

Benchmark 1: cargo miri run --profile=dev --manifest-path "bench-cargo-miri/slice-get-unchecked/Cargo.toml"
  Time (mean ± σ):      3.810 s ±  0.030 s    [User: 3.758 s, System: 0.048 s]
  Range (min … max):    3.785 s …  3.858 s    5 runs

today:

Benchmark 1: cargo miri run --profile=dev --manifest-path "bench-cargo-miri/slice-get-unchecked/Cargo.toml"
  Time (mean ± σ):      5.766 s ±  0.006 s    [User: 5.723 s, System: 0.040 s]
  Range (min … max):    5.757 s …  5.773 s    5 runs

That's a 50% slowdown. Would be interesting to know if there's one particular check causing this or if it's everywhere. Interestingly, two other benchmarks I tried are completely unaffected, so it seems likely that this is one particular check.

saethlin commented 7 months ago

CARGO_EXTRA_FLAGS="--profile=dev" ./miri bench slice-get-unchecked:

slice-get-unchecked is a Stack cache stress-test. This benchmark scenario with debug assertions enabled now spends 88% of its runtime validating the Stacked Borrows cache. And using perf to jump into that function, there are no hot preconditions checks. In addition, I've been able to induce swings between -1 and +3 seconds of runtime by deleting parts of that function.

The hot part of verify_cache_consistency (95% of samples in that function) is this loop: https://github.com/rust-lang/rust/blob/0ecbd0605770f45c9151715e66ba2b3cae367fcb/src/tools/miri/src/borrow_tracker/stacked_borrows/stack.rs#L160-L169 which even with debug assertions is compiled to what looks to me like the ideal sequence of instructions:

 16.00 │100:┌─→inc    %r10
 15.93 │    │  add    $0xfffffffffffffff8,%r9
 30.41 │    │↓ je     17d
 16.90 │109:├──testb  $0x60,0x7(%r8,%r10,8)
 15.96 │    └──jne    100

I cannot explain why adding checks around this loop is making the program faster or slower. I tried reducing the number of checks by doing a let borrows = self.borrows.as_slice() then replacing all mention of self.borrows with borrows to deduplicate the Vec::deref checks, but that made the benchmark a tiny bit faster, except that I missed the use in the formatting of the assert! and when I replaced that use of self.borrows, the benchmark got slower. I tried deleting the helpful formatting in the assert! of the hot loop, and that also made the benchmark slower.

My only guess is that this benchmark is exceedingly sensitive to code alignment. Or something like that. In any case, the fact that ~84% of the program's cycles are in a perfectly-optimized loop that doesn't contain any checks makes me doubt the causal link between the checks and the slowdown.

RalfJung commented 7 months ago

~84% of the program's cycles are in a perfectly-optimized loop that doesn't contain any checks

How can that be true when the program gets 33% faster by removing the checks? Is that 84% of the cycles in the slow or the fast version of the program?

RalfJung commented 7 months ago

I tried deleting the helpful formatting in the assert! of the hot loop, and that also made the benchmark slower.

I can't reproduce this. I applied this diff on old Miri (before your checks were added)

diff --git a/src/borrow_tracker/stacked_borrows/stack.rs b/src/borrow_tracker/stacked_borrows/stack.rs
index 712c26a9..d107bd9a 100644
--- a/src/borrow_tracker/stacked_borrows/stack.rs
+++ b/src/borrow_tracker/stacked_borrows/stack.rs
@@ -161,9 +161,7 @@ impl<'tcx> Stack {
             if item.perm() == Permission::Unique {
                 assert!(
                     self.unique_range.contains(&idx),
-                    "{:?} {:?}",
-                    self.unique_range,
-                    self.borrows
+                    "bad cache",
                 );
             }
         }

Performance is the same with and without this.

saethlin commented 7 months ago

I ran

git checkout 5f25162abf547fa3d306d2bdd7277f7a034fce5f
./miri toolchain
CARGO_EXTRA_FLAGS="--profile=dev" ./miri bench slice-get-unchecked

And I get:

Benchmark 1: cargo miri run --profile=dev --manifest-path "bench-cargo-miri/slice-get-unchecked/Cargo.toml"
  Time (mean ± σ):      6.261 s ±  0.019 s    [User: 6.193 s, System: 0.058 s]
  Range (min … max):    6.239 s …  6.281 s    5 runs

With the same profile structure: ~84% of cycles are in the verify_cache_consistency loop that walks the entire stack.

But that's on my x86_64 desktop. On an M1 I see 6.8 seconds now and 5.6 seconds on commit 5f25162abf547fa3d306d2bdd7277f7a034fce5f. Still not as dramatic as the change you're seeing... What kind of machine are you on?

RalfJung commented 7 months ago

And I get:

And then when you apply just the diff I showed above it gets slower? Or what was your diff for "deleting the helpful formatting in the assert! of the hot loop"?

What kind of machine are you on?

A Lenovo laptop with an i7-12700H.

saethlin commented 7 months ago

Or what was your diff for "deleting the helpful formatting in the assert! of the hot loop"?

diff --git a/src/borrow_tracker/stacked_borrows/stack.rs b/src/borrow_tracker/stacked_borrows/stack.rs
index 712c26a9..d107bd9a 100644
--- a/src/borrow_tracker/stacked_borrows/stack.rs
+++ b/src/borrow_tracker/stacked_borrows/stack.rs
@@ -161,9 +161,7 @@ impl<'tcx> Stack {
             if item.perm() == Permission::Unique {
                 assert!(
                     self.unique_range.contains(&idx),
-                    "{:?} {:?}",
-                    self.unique_range,
-                    self.borrows
                 );
             }
         }
RalfJung commented 7 months ago

Yeah that diff makes no perf difference for me. If anything I am seeing a slight speedup (without: 3.806 s ± 0.014 s, with patch: 3.785 s ± 0.006 s).

apiraino commented 7 months ago

assigning to Nils who is authoring #121114 to signal that this issue is being taken care of. Thanks!

@rustbot assign @Nilstrieb

RalfJung commented 6 months ago

Fixed by https://github.com/rust-lang/rust/pull/123411 -- now one can disable the UB checks if they are too expensive. At least on nightly. Tracking issue: https://github.com/rust-lang/rust/issues/123499.