dantheman3333 opened 10 months ago
The PR in question does have a perf signoff after weighing the trade-off, but I am raising the issue in case further investigation could lead to more optimizations.
Those perf reports are about how fast the compiler runs, which often has little to do with code generation quality, so perf reports and perf triage are not going to be relevant here. There are now some runtime benchmarks, but very few so far.
I'm looking into your benchmark.
I think the cause of the regression is inlining of calls to `std::io::Error::new`. If this were my code, I would resolve the perf regression by not using `std::io::Error` the way you are. It's a very expensive type to construct and to drop. For this codebase, nearly all parsing errors can be reported just as accurately by an enum with unit variants, so I'd be trying to apply that strategy where possible.
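As a minimal sketch of the suggestion above (the names `ParseError` and `parse_u32` are illustrative, not from the frost codebase), a unit-variant enum makes the error path essentially free:

```rust
// Hypothetical sketch: a zero-cost error enum for hot parsing paths.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[allow(dead_code)]
enum ParseError {
    UnexpectedEof,
    BadMagic,
}

// Unit variants are Copy and allocation-free, so the error path costs almost
// nothing to construct or drop, unlike std::io::Error::new, which boxes its
// payload on the heap.
fn parse_u32(buf: &[u8]) -> Result<u32, ParseError> {
    match buf.get(..4) {
        Some(b) => Ok(u32::from_le_bytes(b.try_into().unwrap())),
        None => Err(ParseError::UnexpectedEof),
    }
}

fn main() {
    assert_eq!(parse_u32(&[7, 0, 0, 0]), Ok(7));
    assert_eq!(parse_u32(&[1, 2]), Err(ParseError::UnexpectedEof));
}
```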
This issue surfaces here because that patch made the MIR inliner slightly more aggressive, not because actually looking at the size of locals was a useful heuristic. `-Zmir-inline-threshold=40` improves performance on this benchmark (the default is 50), and with it we no longer inline `std::io::Error::new` in the final binary.

Perhaps this indicates it is appropriate to put `#[inline(never)]` on `std::io::Error::new`.
Demo of the problematic MIR inlining choice: https://godbolt.org/z/5vseT5oqb
I'm tagging @rust-lang/wg-mir-opt: Does this indicate a need for a new MIR inlining heuristic?
And libs, but really just @thomcc: Do you think it is a good idea to add `#[inline(never)]` to `std::io::Error::new`, or do you think that is too heavy-handed?

I think putting it on `Error::_new` would be more reasonable, but I could be convinced about `Error::new`.
@saethlin thank you!
Adding `#[inline(never)]` to `Error::_new` would only protect one of the allocation codepaths from being inlined. Why do you think putting it on that function is better?
Because it's intentionally outlined for that purpose. My concern with making `Error::new` `#[inline(never)]` is that it could be monomorphized several times, and in many cases we'd want those copies to be inlined.
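The outlining pattern being discussed can be sketched in isolation (these names are hypothetical, not std's actual internals): a small generic front-end that each monomorphization can inline, delegating to one shared, non-generic back-end that holds the expensive code.

```rust
use std::fmt::Display;

// Cheap, per-type shim: convert the generic payload, then hand off to the
// shared path. Each monomorphization of this stays small and inlinable.
fn make_error<E: Display>(kind: &str, payload: E) -> String {
    make_error_inner(kind, payload.to_string())
}

// Exactly one copy of this exists in the binary, so #[inline(never)] here
// doesn't multiply outlined code across monomorphizations.
#[inline(never)]
fn make_error_inner(kind: &str, payload: String) -> String {
    format!("{kind}: {payload}")
}

fn main() {
    assert_eq!(make_error("invalid data", 42), "invalid data: 42");
    assert_eq!(make_error("not found", "index.bag"), "not found: index.bag");
}
```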
Hey @saethlin, I just got around to trying to replace all of the expensive errors in the hot loop, but I am still seeing significant differences. With `RUSTFLAGS="-Zinline-mir-threshold=40" cargo build --release` (has it been renamed from `mir-inline-threshold` in newer nightlies?), and varying the threshold, I am still not seeing any improvement:
```
Benchmark 1: /tmp/frost_170 info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     606.1 ms ±   2.3 ms    [User: 338.9 ms, System: 267.1 ms]
  Range (min … max):   602.9 ms … 608.9 ms    10 runs

Benchmark 2: /tmp/frost_174 info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     914.3 ms ±   7.8 ms    [User: 656.0 ms, System: 258.2 ms]
  Range (min … max):   906.9 ms … 926.0 ms    10 runs

Benchmark 3: /tmp/frost_nightly_mir_0 info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     905.8 ms ±   6.2 ms    [User: 641.9 ms, System: 263.8 ms]
  Range (min … max):   898.8 ms … 915.0 ms    10 runs

Summary
  '/tmp/frost_170 info frost/tests/fixtures/test_large.bag' ran
    1.49 ± 0.01 times faster than '/tmp/frost_nightly_mir_0 info frost/tests/fixtures/test_large.bag'
    1.51 ± 0.01 times faster than '/tmp/frost_174 info frost/tests/fixtures/test_large.bag'
```
Do you have any recommendations on how to proceed from here?
Your patch still has a huge amount of code associated with error reporting; you just moved it from constructing the error types themselves to `eprintln!` expansions. But that's just general advice, not particularly pertinent to this issue, because even after deleting all the `eprintln!`s I still see the perf difference.

But anyway, I locally deleted them and re-bisected with this `--script`:

```sh
RUSTFLAGS="-Zmir-opt-level=0" cargo b --release --bin frost -Zbuild-std --target=x86_64-unknown-linux-gnu
perf stat -r10 $CARGO_TARGET_DIR/x86_64-unknown-linux-gnu/release/frost info frost/tests/fixtures/test_large.bag 2>&1 | grep "6..\... msec task-clock"
```
And I found:
```
searched toolchains nightly-2023-04-01 through nightly-2023-12-22
********************************************************************************
Regression in nightly-2023-06-27
********************************************************************************
...
found 8 bors merge commits in the specified range
commit[0] 2023-06-25: Auto merge of #113037 - TaKO8Ki:rollup-pqfbxwk, r=TaKO8Ki
commit[1] 2023-06-25: Auto merge of #113038 - matthiaskrgr:rollup-sdcfkxa, r=matthiaskrgr
commit[2] 2023-06-26: Auto merge of #111850 - the8472:external-step-by, r=scottmcm
commit[3] 2023-06-26: Auto merge of #112884 - klensy:ri-drop-old-clap, r=albertlarsan68
commit[4] 2023-06-26: Auto merge of #113024 - Jerrody:master, r=thomcc
commit[5] 2023-06-26: Auto merge of #113049 - matthiaskrgr:rollup-41wo5w8, r=matthiaskrgr
commit[6] 2023-06-26: Auto merge of #112887 - WaffleLapkin:become_unuwuable_in_hir, r=compiler-errors,Nilstrieb
commit[7] 2023-06-26: Auto merge of #113057 - TaKO8Ki:rollup-071lc9g, r=TaKO8Ki
ERROR: no CI builds available between 8084f397c6710e4748994dd1488b1f399283d469 and 36fb58e433c782e27dd41034284e157cf86d587f within last 167 days
```
I can see by eye that the hot code that gets inlined into `FlatMap::next` is indeed different, though to my untrained eye it just looks like blocks of instructions have been moved around the function. There are 4 rollups, which is really not encouraging, but I'm going to try locally reverting https://github.com/rust-lang/rust/pull/111850 to see if that moves us back to the faster codegen. But considering my bisection points at code that's 6 months old, it's possible that the revert will be too challenging for me.
Hey, very much appreciate you doing another deep dive!
> Your patch still has a huge amount of code associated with error reporting; you just moved it from constructing the error types themselves to `eprintln!` expansions. But that's just general advice, not particularly pertinent to this issue, because even after deleting all the `eprintln!` I still see the perf difference.
I tried to balance creating enum variants with the log messages, as I was trying to avoid an explosion of variants just to log specific messages (some folks appreciate the error messages as they were working on their own bag writing tool, so I wanted to keep the messages). Do you think for high performance there's no way around this, and I just need to continue making more specific variants for each message that I want to keep and eventually print out with fmt::Display?
Thanks for the `perf stat` command, it's much simpler than the original bisection script I used. Let me know how the revert goes :crossed_fingers:
I think errors should nearly always be rendered late because that moves the code size to error reporting and not error detection, which often has to be inside a hot code path. The goal is to minimize the impact on the happy path.
I haven't been able to revert, but by checking out the commit directly before that PR I linked and running the benchmarks on that and the merge commit for that PR, I can see that it is the cause of the regression. Which is interesting, considering it was supposed to make things faster. I suppose at this point it would make sense to ask @the8472; do you have any idea why https://github.com/rust-lang/rust/pull/111850 would have made something slower?
I suspect the answer is no, and in that case the program here needs to be reduced to a minimal program that exhibits slower performance on 1.71 than on 1.70 by exercising the `StepBy` specialization. Then we could decide if the situation is resolvable in rustc or in LLVM.
> do you have any idea why https://github.com/rust-lang/rust/pull/111850 would have made something slower?
No.
The PR does several things:

- it changes `StepBy`'s `next()` implementation, which also simplifies the call tree
- it implements `TrustedLen` for integer ranges, and this affects other specializations

Other than how they might affect inlining, these should all be improvements.
At least that ticks off a possibility, thanks. Most likely LLVM is just tripping over its own feet here. I'll try reducing this at some point. I have an idea of how to start, but it'll be a slow process to confirm that a reduction still reproduces the important behavior.
Looking through the frost codebase I'm only seeing two uses of `step_by`. Both have this pattern:

```rust
slice
    .windows(N)
    .step_by(N)
    .flat_map(... -> Result<_, _>)
    .collect::<Vec<_>>()
```

The range specialization is not applicable here. And flat-mapping `Result` makes `TrustedLen` not applicable. So it's likely the changed default `next()` impl.
Previously it called the underlying `next()` on the first iteration and then `nth(N)` on each subsequent one. Now it always calls `nth()` with a computed value. And I only measured that specific change on loops that had a fixed iteration count.
It could be that the computed step size + less predictable loops make things more difficult for LLVM.
Replacing that with the old impl

```rust
if self.first_take {
    self.first_take = false;
    self.iter.next()
} else {
    self.iter.nth(self.step)
}
```

might be worth a try.
If that's not it then it's inlining.
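For reference, the old logic quoted above can be checked in isolation by wrapping it in a tiny standalone adapter (the `OldStepBy` name and struct are hypothetical, written only so the behavior is testable outside of std):

```rust
// Standalone sketch of the old StepBy next() logic.
struct OldStepBy<I> {
    iter: I,
    step: usize, // stored as N - 1, matching std's internal convention
    first_take: bool,
}

impl<I: Iterator> Iterator for OldStepBy<I> {
    type Item = I::Item;
    fn next(&mut self) -> Option<I::Item> {
        // First call yields the underlying next(); subsequent calls skip
        // `step` elements via nth().
        if self.first_take {
            self.first_take = false;
            self.iter.next()
        } else {
            self.iter.nth(self.step)
        }
    }
}

fn main() {
    // Equivalent of (0..10).step_by(3).
    let it = OldStepBy { iter: 0..10, step: 2, first_take: true };
    assert_eq!(it.collect::<Vec<_>>(), vec![0, 3, 6, 9]);
}
```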
@dantheman3333, btw, you should be able to replace

```rust
slice
    .windows(N)
    .step_by(N)
```

with

```rust
slice
    .chunks_exact(N)
```
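The two spellings visit the same non-overlapping chunks, since `windows(N).step_by(N)` only ever lands on window start offsets that are multiples of `N`, and both drop a short tail. A quick standalone check (not from the thread):

```rust
fn main() {
    let data: Vec<u8> = (0..20).collect();

    let via_windows: Vec<&[u8]> = data.windows(8).step_by(8).collect();
    let via_chunks: Vec<&[u8]> = data.chunks_exact(8).collect();

    // Both yield [0..8] and [8..16]; the trailing 4 bytes are dropped
    // by both iterators.
    assert_eq!(via_windows, via_chunks);
    assert_eq!(via_windows.len(), 2);
}
```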
```diff
--- a/frost/src/lib.rs
+++ b/frost/src/lib.rs
@@ -844,8 +844,7 @@ fn parse_chunk_info<R: Read + Seek>(
     let data = get_lengthed_bytes(reader)?;
     let chunk_info_data: Vec<ChunkInfoData> = data
-        .windows(8)
-        .step_by(8)
+        .chunks_exact(8)
         .flat_map(ChunkInfoData::from)
         .collect();
@@ -866,8 +865,7 @@ fn parse_index<R: Read + Seek>(
     let data = get_lengthed_bytes(reader)?;
     let index_data: Vec<IndexData> = data
-        .windows(12)
-        .step_by(12)
+        .chunks_exact(12)
         .flat_map(|buf| IndexData::from(buf, chunk_header_pos, index_data_header.connection_id))
         .collect();
```
```
$ hyperfine -w 5 "/tmp/frost_174 info frost/tests/fixtures/test_large.bag" "/tmp/frost_174_chunks info frost/tests/fixtures/test_large.bag"
Benchmark 1: /tmp/frost_174 info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     902.8 ms ±   7.5 ms    [User: 654.2 ms, System: 248.5 ms]
  Range (min … max):   893.5 ms … 908.9 ms    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: /tmp/frost_174_chunks info frost/tests/fixtures/test_large.bag
  Time (mean ± σ):     426.6 ms ±   0.4 ms    [User: 180.3 ms, System: 246.2 ms]
  Range (min … max):   426.0 ms … 427.1 ms    10 runs

Summary
  '/tmp/frost_174_chunks info frost/tests/fixtures/test_large.bag' ran
    2.12 ± 0.02 times faster than '/tmp/frost_174 info frost/tests/fixtures/test_large.bag'
```
Wow, just this change produced much more performant code, thank you @the8472!
Hi, on Linux x86, I'm seeing a 120+ms regression in wall time with 1.71, which is small but noticeable when running the CLI application, as it is already quite fast.
The project is pretty messy but I have steps on how to benchmark here https://github.com/dantheman3333/frost/issues/33.
A differential flamegraph shows `FlatMap`'s `next` has taken a hit. I am unsure if this is an aggregate view of all of the `next` calls or if this is a specific call site; is it possible to get line numbers from this?
I am unsure of how to proceed from here, but I would love to learn if anyone has tips.