dantheman3333 opened 10 months ago
The PR in question does have a perf signoff after weighing the trade-off, but I am raising the issue in case further investigation could lead to more optimizations.
Those perf reports are about how fast the compiler runs, which often has little to do with code generation quality, so perf reports and perf triage are not going to be relevant here. There are now some runtime benchmarks, but very few so far.
I'm looking into your benchmark.
I think the cause of the regression is inlining of calls to `std::io::Error::new`. If this were my code, I would resolve the perf regression by not using `std::io::Error` the way you are. It's a very expensive type to construct and to drop. For this codebase, nearly all parsing errors can be reported just as accurately by an enum with unit variants, so I'd be trying to apply that strategy where possible.
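As a minimal sketch of the suggestion above (the names `ParseError` and `parse_u32` are illustrative, not from the frost codebase), a unit-variant enum makes the error path essentially free:

```rust
// Hypothetical sketch: a zero-cost error enum for hot parsing paths.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[allow(dead_code)]
enum ParseError {
    UnexpectedEof,
    BadMagic,
}

// Unit variants are Copy and allocation-free, so the error path costs almost
// nothing to construct or drop, unlike std::io::Error::new, which boxes its
// payload on the heap.
fn parse_u32(buf: &[u8]) -> Result<u32, ParseError> {
    match buf.get(..4) {
        Some(b) => Ok(u32::from_le_bytes(b.try_into().unwrap())),
        None => Err(ParseError::UnexpectedEof),
    }
}

fn main() {
    assert_eq!(parse_u32(&[7, 0, 0, 0]), Ok(7));
    assert_eq!(parse_u32(&[1, 2]), Err(ParseError::UnexpectedEof));
}
```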
This issue surfaces here because that patch made the MIR inliner slightly more aggressive, not because actually looking at the size of locals was a useful heuristic. `-Zmir-inline-threshold=40` improves performance on this benchmark (the default is 50), and with it we no longer inline `std::io::Error::new` in the final binary.

Perhaps this indicates it is appropriate to put `#[inline(never)]` on `std::io::Error::new`.
Demo of the problematic MIR inlining choice: https://godbolt.org/z/5vseT5oqb
I'm tagging @rust-lang/wg-mir-opt: Does this indicate a need for a new MIR inlining heuristic?
And libs, but really just @thomcc: Do you think it is a good idea to add `#[inline(never)]` to `std::io::Error::new`, or do you think that is too heavy-handed?

I think putting it on `Error::_new` would be more reasonable, but I could be convinced about `Error::new`.
@saethlin thank you!
Adding `#[inline(never)]` to `Error::_new` would only protect one of the allocation codepaths from being inlined. Why do you think putting it on that function is better?
Because it's intentionally outlined for that purpose. My concern with making `Error::new` `#[inline(never)]` is that it could be monomorphized several times, and in many cases we'd want those copies to be inlined.
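The outlining pattern being discussed can be sketched in isolation (these names are hypothetical, not std's actual internals): a small generic front-end that each monomorphization can inline, delegating to one shared, non-generic back-end that holds the expensive code.

```rust
use std::fmt::Display;

// Cheap, per-type shim: convert the generic payload, then hand off to the
// shared path. Each monomorphization of this stays small and inlinable.
fn make_error<E: Display>(kind: &str, payload: E) -> String {
    make_error_inner(kind, payload.to_string())
}

// Exactly one copy of this exists in the binary, so #[inline(never)] here
// doesn't multiply outlined code across monomorphizations.
#[inline(never)]
fn make_error_inner(kind: &str, payload: String) -> String {
    format!("{kind}: {payload}")
}

fn main() {
    assert_eq!(make_error("invalid data", 42), "invalid data: 42");
    assert_eq!(make_error("not found", "index.bag"), "not found: index.bag");
}
```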
Hey @saethlin, I just got around to trying to replace all of the expensive errors in the hot loop, but I am still seeing significant differences. With `RUSTFLAGS="-Zinline-mir-threshold=40" cargo build --release` (has it been renamed from `mir-inline-threshold` in newer nightlies?), and varying the threshold, I am still not seeing any improvement:
```
Benchmark 1: /tmp/frost_170 info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     606.1 ms ±   2.3 ms    [User: 338.9 ms, System: 267.1 ms]
  Range (min … max):   602.9 ms … 608.9 ms    10 runs

Benchmark 2: /tmp/frost_174 info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     914.3 ms ±   7.8 ms    [User: 656.0 ms, System: 258.2 ms]
  Range (min … max):   906.9 ms … 926.0 ms    10 runs

Benchmark 3: /tmp/frost_nightly_mir_0 info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     905.8 ms ±   6.2 ms    [User: 641.9 ms, System: 263.8 ms]
  Range (min … max):   898.8 ms … 915.0 ms    10 runs

Summary
  '/tmp/frost_170 info frost/tests/fixtures/test_large.bag' ran
    1.49 ± 0.01 times faster than '/tmp/frost_nightly_mir_0 info frost/tests/fixtures/test_large.bag'
    1.51 ± 0.01 times faster than '/tmp/frost_174 info frost/tests/fixtures/test_large.bag'
```
Do you have any recommendations on how to proceed from here?
Your patch still has a huge amount of code associated with error reporting; you just moved it from constructing the error types themselves to `eprintln!` expansions. But that's just general advice, not particularly pertinent to this issue, because even after deleting all the `eprintln!`s I still see the perf difference.

But anyway, I locally deleted them and re-bisected with this `--script`:

```sh
RUSTFLAGS="-Zmir-opt-level=0" cargo b --release --bin frost -Zbuild-std --target=x86_64-unknown-linux-gnu
perf stat -r10 $CARGO_TARGET_DIR/x86_64-unknown-linux-gnu/release/frost info frost/tests/fixtures/test_large.bag 2>&1 | grep "6..\... msec task-clock"
```
And I found:
```
searched toolchains nightly-2023-04-01 through nightly-2023-12-22
********************************************************************************
Regression in nightly-2023-06-27
********************************************************************************
...
found 8 bors merge commits in the specified range
commit[0] 2023-06-25: Auto merge of #113037 - TaKO8Ki:rollup-pqfbxwk, r=TaKO8Ki
commit[1] 2023-06-25: Auto merge of #113038 - matthiaskrgr:rollup-sdcfkxa, r=matthiaskrgr
commit[2] 2023-06-26: Auto merge of #111850 - the8472:external-step-by, r=scottmcm
commit[3] 2023-06-26: Auto merge of #112884 - klensy:ri-drop-old-clap, r=albertlarsan68
commit[4] 2023-06-26: Auto merge of #113024 - Jerrody:master, r=thomcc
commit[5] 2023-06-26: Auto merge of #113049 - matthiaskrgr:rollup-41wo5w8, r=matthiaskrgr
commit[6] 2023-06-26: Auto merge of #112887 - WaffleLapkin:become_unuwuable_in_hir, r=compiler-errors,Nilstrieb
commit[7] 2023-06-26: Auto merge of #113057 - TaKO8Ki:rollup-071lc9g, r=TaKO8Ki
ERROR: no CI builds available between 8084f397c6710e4748994dd1488b1f399283d469 and 36fb58e433c782e27dd41034284e157cf86d587f within last 167 days
```
I can see by eye that the hot code that gets inlined into `FlatMap::next` is indeed different, though to my untrained eye it just looks like blocks of instructions have been moved around the function. There are 4 rollups, which is really not encouraging, but I'm going to try locally reverting https://github.com/rust-lang/rust/pull/111850 to see if that moves us back to the faster codegen. But considering my bisection points at code that's 6 months old, it's possible that the revert will be too challenging for me.
Hey, very much appreciate you doing another deep dive!
> Your patch still has a huge amount of code associated with error reporting; you just moved it from constructing the error types themselves to `eprintln!` expansions. But that's just general advice, not particularly pertinent to this issue, because even after deleting all the `eprintln!` I still see the perf difference.
I tried to balance creating enum variants with the log messages, as I was trying to avoid an explosion of variants just to log specific messages (some folks appreciate the error messages as they were working on their own bag writing tool, so I wanted to keep the messages). Do you think for high performance there's no way around this, and I just need to continue making more specific variants for each message that I want to keep and eventually print out with fmt::Display?
Thanks for the `perf stat` command, it's much simpler than the original bisection script I used. Let me know how the revert goes :crossed_fingers:
I think errors should nearly always be rendered late because that moves the code size to error reporting and not error detection, which often has to be inside a hot code path. The goal is to minimize the impact on the happy path.
I haven't been able to revert, but by checking out the commit directly before that PR I linked and running the benchmarks on that and the merge commit for that PR, I can see that it is the cause of the regression. Which is interesting, considering it was supposed to make things faster. I suppose at this point it would make sense to ask @the8472; do you have any idea why https://github.com/rust-lang/rust/pull/111850 would have made something slower?
I suspect the answer is no, and in that case the program here needs to be reduced to a minimal program that exhibits slower performance on 1.71 than on 1.70 by exercising the `StepBy` specialization. Then we could decide if the situation is resolvable in rustc or in LLVM.
> do you have any idea why https://github.com/rust-lang/rust/pull/111850 would have made something slower?
No.
The PR does several things:

- it changes `StepBy`'s `next()` implementation, which also simplifies the call tree
- it implements `TrustedLen` for integer ranges, and this affects other specializations

Other than how they might affect inlining, these should all be improvements.
At least that ticks off a possibility, thanks. Most likely LLVM is just tripping over its own feet here. I'll try reducing this at some point. I have an idea of how to start, but it'll be a slow process to confirm that a reduction still reproduces the important behavior.
Looking through the frost codebase I'm only seeing two uses of `step_by`. Both have this pattern:

```rust
slice
    .windows(N)
    .step_by(N)
    .flat_map(... -> Result<_, _>)
    .collect::<Vec<_>>()
```

The range specialization is not applicable here. And flat-mapping `Result` makes `TrustedLen` not applicable. So it's likely the changed default `next()` impl.
Previously it called the underlying `next()` on the first iteration and then `nth(N)` on each subsequent one. Now it always calls `nth()` with a computed value. And I only measured that specific change on loops that had a fixed iteration count.
It could be that the computed step size + less predictable loops make things more difficult for LLVM.
Replacing that with the old impl

```rust
if self.first_take {
    self.first_take = false;
    self.iter.next()
} else {
    self.iter.nth(self.step)
}
```

might be worth a try.
If that's not it then it's inlining.
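For reference, the old logic quoted above can be checked in isolation by wrapping it in a tiny standalone adapter (the `OldStepBy` name and struct are hypothetical, written only so the behavior is testable outside of std):

```rust
// Standalone sketch of the old StepBy next() logic.
struct OldStepBy<I> {
    iter: I,
    step: usize, // stored as N - 1, matching std's internal convention
    first_take: bool,
}

impl<I: Iterator> Iterator for OldStepBy<I> {
    type Item = I::Item;
    fn next(&mut self) -> Option<I::Item> {
        // First call yields the underlying next(); subsequent calls skip
        // `step` elements via nth().
        if self.first_take {
            self.first_take = false;
            self.iter.next()
        } else {
            self.iter.nth(self.step)
        }
    }
}

fn main() {
    // Equivalent of (0..10).step_by(3).
    let it = OldStepBy { iter: 0..10, step: 2, first_take: true };
    assert_eq!(it.collect::<Vec<_>>(), vec![0, 3, 6, 9]);
}
```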
@dantheman3333, btw, you should be able to replace

```rust
slice
    .windows(N)
    .step_by(N)
```

with

```rust
slice
    .chunks_exact(N)
```
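The two spellings visit the same non-overlapping chunks, since `windows(N).step_by(N)` only ever lands on window start offsets that are multiples of `N`, and both drop a short tail. A quick standalone check (not from the thread):

```rust
fn main() {
    let data: Vec<u8> = (0..20).collect();

    let via_windows: Vec<&[u8]> = data.windows(8).step_by(8).collect();
    let via_chunks: Vec<&[u8]> = data.chunks_exact(8).collect();

    // Both yield [0..8] and [8..16]; the trailing 4 bytes are dropped
    // by both iterators.
    assert_eq!(via_windows, via_chunks);
    assert_eq!(via_windows.len(), 2);
}
```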
```diff
--- a/frost/src/lib.rs
+++ b/frost/src/lib.rs
@@ -844,8 +844,7 @@ fn parse_chunk_info<R: Read + Seek>(
     let data = get_lengthed_bytes(reader)?;
     let chunk_info_data: Vec<ChunkInfoData> = data
-        .windows(8)
-        .step_by(8)
+        .chunks_exact(8)
         .flat_map(ChunkInfoData::from)
         .collect();
@@ -866,8 +865,7 @@ fn parse_index<R: Read + Seek>(
     let data = get_lengthed_bytes(reader)?;
     let index_data: Vec<IndexData> = data
-        .windows(12)
-        .step_by(12)
+        .chunks_exact(12)
         .flat_map(|buf| IndexData::from(buf, chunk_header_pos, index_data_header.connection_id))
         .collect();
```
```
$ hyperfine -w 5 "/tmp/frost_174 info frost/tests/fixtures/test_large.bag" "/tmp/frost_174_chunks info frost/tests/fixtures/test_large.bag"
Benchmark 1: /tmp/frost_174 info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     902.8 ms ±   7.5 ms    [User: 654.2 ms, System: 248.5 ms]
  Range (min … max):   893.5 ms … 908.9 ms    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: /tmp/frost_174_chunks info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     426.6 ms ±   0.4 ms    [User: 180.3 ms, System: 246.2 ms]
  Range (min … max):   426.0 ms … 427.1 ms    10 runs

Summary
  '/tmp/frost_174_chunks info frost/tests/fixtures/test_large.bag' ran
    2.12 ± 0.02 times faster than '/tmp/frost_174 info frost/tests/fixtures/test_large.bag'
```
Wow, just this change produced much more performant code, thank you @the8472!
Hi, on Linux x86, I'm seeing a 120+ms regression in wall time with 1.71, which is small but noticeable when running the CLI application, as it is already quite fast.
The project is pretty messy but I have steps on how to benchmark here https://github.com/dantheman3333/frost/issues/33.
A differential flamegraph shows `FlatMap`'s `next` has taken a hit. I am unsure if this is an aggregate view of all of the `next` calls or if this is a specific call site; is it possible to get line numbers from this?
I am unsure of how to proceed from here, but I would love to learn if anyone has tips.