Open Marwes opened 6 years ago
cc @michaelwoerister @arielb1 @nagisa
It is expected that incrementally compiled code performs worse at runtime. That's the trade-off you make. I would not use it for benchmarks.
The recent addition of ThinLTO is likely to also affect runtime performance of specific benchmarks.
What happens if you do RUSTFLAGS=-Ccodegen-units=1 CARGO_INCREMENTAL=0 cargo bench
?
cc @alexcrichton
(These are on a different machine so numbers won't be comparable to the ones above)
test http_requests_large ... bench: 247,219 ns/iter (+/- 68,700)
test http_requests_small ... bench: 51,941 ns/iter (+/- 9,127)
test http_requests_large ... bench: 325,950 ns/iter (+/- 48,330)
test http_requests_small ... bench: 62,812 ns/iter (+/- 12,001)
Without specifying Ccodegen-units=1
test http_requests_large ... bench: 420,282 ns/iter (+/- 75,396)
test http_requests_small ... bench: 85,946 ns/iter (+/- 13,067)
test http_requests_large ... bench: 373,739 ns/iter (+/- 71,143)
test http_requests_small ... bench: 78,019 ns/iter (+/- 14,368)
So a single codegen unit seems to help a bit but it is still slower than it could be. I thought about codegen units briefly when testing but never tested it as the cargo docs seem to imply that they are already set to 1 :/ https://doc.rust-lang.org/cargo/reference/manifest.html#the-profile-sections .
It is expected that incrementally compiled code performs worse at runtime. That's the trade-off you make. I would not use it for benchmarks.
Yeah that was just my bad, forgot I had it enabled when running the benchmarks... (would be nice if it wasn't a 2x slowdown still but I get that might not be possible to improve much)
The recent addition of ThinLTO is likely to also affect runtime performance of specific benchmarks.
Yeah I figured ThinLTO might be one of the culprits but I don't believe there is a way to use full LTO atm (https://github.com/rust-lang/rust/pull/47521). Depending on how much ThinLTO affects the runtimes though that might still be a problem however.
247 vs 325 microseconds is indeed quite the slowdown. I would be interesting to see where all the additional time goes. Since this is just one data point, it's hard to even guess. Running the code in a profiler would be most insightful, I think.
Also, @eddyb, hasn't there been some kind of regression in the benchmarking framework that could also play into this?
So a single codegen unit seems to help a bit but it is still slower than it could be. I thought about codegen units briefly when testing but never tested it as the cargo docs seem to imply that they are already set to 1 :/ https://doc.rust-lang.org/cargo/reference/manifest.html#the-profile-sections .
Yes, that is out of date since a few weeks ago, I think. cc @alexcrichton & @rust-lang/cargo
Yeah that was just my bad, forgot I had it enabled when running the benchmarks... (would be nice if it wasn't a 2x slowdown still but I get that might not be possible to improve much)
At some point we'll probably have incremental ThinLTO, at which point runtime performance should be quite a bit better. However, I'm not sure if that will ever be the default since it will cause longer compile times than regular incremental compilation. Incremental compilation strives to give you code that is "fast enough" and otherwise clearly prioritizes short compile times.
Since this is just one data point, it's hard to even guess. Running the code in a profiler would be most insightful, I think.
Tried gleaning something from profiler output I didn't spot anything as the runtime is spread out in a lot of different places and inlining tears through the code completely. I may give it another shot but I don't expect much.
Also, @eddyb, hasn't there been some kind of regression in the benchmarking framework that could also play into this?
It is unlikely to do with the benchmark framework. I am comparing these numbers to https://github.com/Geal/nom_benchmarks/blob/master/http/nom-http/src/main.rs which also uses it and that benchmark hasn't regressed. (I actually use https://github.com/bluss/bencher/ but switching to the builtin one shows no change in runtime).
Added another result to https://github.com/rust-lang/rust/issues/47561#issuecomment-358991665 . Forcing full lto helps a bit but is not the sole problem.
test http_requests_large ... bench: 373,739 ns/iter (+/- 71,143)
test http_requests_small ... bench: 78,019 ns/iter (+/- 14,368)
I've checked this from 1.13.0 to the current nightly and it is obvious that the regression is introduced in 1.19.0 and 1.20.0. The speed is recovering in more recent version (with CGU=1 and ThinLTO=off) but still haven't come back to the 1.19.0 level.
@kennytm
Cool. Could you try to bisect to the specific commit? I might then take a go at investigating it.
@arielb1 Probably not to a specific commit, but I'm now doing a date-based bisection which should point to a narrow-enough commit range.
From the timing of nightlies between 2017-04-24 to 2017-07-17:
nightly-2017-04-26
(+43µs): 63c77214c1d38789652b465694b254205d1886e0...2b4c911581099e247a68b3a8adc782d778f5190b (likely #40123 (LLVM 4.0 Upgrade))nightly-2017-05-14
(+26µs): e17a1227ac779a181b2839998e26a7e4e434c2a0...826d8f3850b37a23481dfcf4a899b5dfc82d22e3 (likely #41920 (remove the #[inline] attribute from drop_in_place))nightly-2017-06-21
(+36µs): 04145943a25c3b8c7e7d7fe8c2efb04f259c25fb...445077963c55297ef1e196a3525723090fe80b22 (???)nightly-2017-06-23
(+22µs): 622e7e6487b6fb7fdbb901720cd4214f9179ed67...ab5bec25530aac43dfd64384b405c909b6e405e3 (likely #42771 (mark calls in the unwind path as !noinline))nightly-2017-07-07
(-48µs): 3610a70ce488953c5b0379fece70f2baad30a825...696412de7e4e119f8536686c643621115b90c775 (likely #42899 (Switch to rust-lang-nursery/compiler-builtins) or #42816 (rustc: Implement stack probes for x86))Regression at nightly-2017-04-26 (+43µs): 63c7721...2b4c911 (likely #40123 (LLVM 4.0 Upgrade))
Not much we can do about that without a fairly deep investigation. We're about to update to LLVM 5.0 or 6.0, so let's hope that fixes the regression
Regression at nightly-2017-06-21 (+36µs): 0414594...4450779 (???) Improvement at nightly-2017-07-07 (-48µs): 3610a70...696412d (likely #42899 (Switch to rust-lang-nursery/compiler-builtins) or #42816 (rustc: Implement stack probes for x86))
This looks like #42313 and #42727 - the first try at allocator integration had some perf issues - missing inline
attribute, which were fixed by #42727.
Regression at nightly-2017-05-14 (+26µs): e17a122...826d8f3 (likely #41920 (remove the #[inline] attribute from drop_in_place)) Regression at nightly-2017-06-23 (+22µs): 622e7e6...ab5bec2 (likely #42771 (mark calls in the unwind path as !noinline))
This sort of makes sense, but is weird. I might look into it
@arielb1 A regression from #41920 seems consistent with my observations. I have another variant of the large
benchmark above where the error type does not need drop (the normal error contains a Vec
). The version that contains that benchmarks requires a recent version of rust so I am not able to compare the results but it does appear to have regressed less than one that needs drop.
triage: P-medium
Triage; It's been a while. @Marwes lots has changed since you've filed this issue; do you still track this stuff on combine?
Reran on the old branch so I could use 1.19
name 1.19 ns/iter 1.39 ns/iter diff ns/iter diff % speedup
http_requests_large 525,200 629,625 104,425 19.88% x 0.83
http_requests_small 104,955 123,803 18,848 17.96% x 0.85
Regression still seems to exist.
The regression still exists, but I think I've been able to root-cause this to some kind of change in LLVM's inlining (in combination with codegen-units?).
Poking through the benchmark executables with perf annotate
indicates that previously after LLVM was done we had about 3 functions left, and now for whatever reason we have a lot more than 3, and some look suspiciously small to me (i.e. they fit on a screen but are a lot more than a trivial getter/setter), but LTO has little effect so the problem is not that these functions are not eligible for inlining, the problem is that the heuristics are concluding they shouldn't be inlined.
If I compile now on nightly:
rustc 1.66.0-nightly (8796e7a9c 2022-10-08)
binary: rustc
commit-hash: 8796e7a9cfd4c5c4f1de15ec1c53994ddf288665
commit-date: 2022-10-08
host: x86_64-unknown-linux-gnu
release: 1.66.0-nightly
LLVM version: 15.0.2
With
RUSTFLAGS="-Ccodegen-units=1 -Cllvm-args=-inline-threshold=100000" cargo bench
I can beat the old benchmark times. All you need is a sufficiently large inline threshold, and the threshold above is high enough to inline the whole benchmark into one function.
This regression still exists, but I don't know what action we could take here. I don't think there is some particular pattern which isn't being optimized here which we could implement logic for in LLVM. It seems likely that tuning the inliner heuristic(s) in order to benefit these benchmarks would pessimize others.
@rustbot label +I-prioritize
I'm sifting through old issues. I think this situation should be discussed because I'm not sure there is anything for us to do here on account of this being an inlining heuristic issue, and also this issue's age. It's been 4 years and 9 or 10 LLVM releases since this issue was opened. What are the odds we manage to fix this regression without creating other issues?
@rustbot label -I-prioritize +I-compiler-nominated
"What are the odds we manage to fix this regression without creating other issues?"
What kind of rhetoric is that ?. It's only looking for excuses due to something that happens to be complex. '
On Mon, 10 Oct 2022 at 14:05, Ben Kimock @.***> wrote:
I'm sifting through old issues. I think this situation should be discussed because I'm not sure there is anything for us to do here on account of this being an inlining heuristic issue, and also this issue's age. It's been 4 years and 9 or 10 LLVM releases since this issue was opened. What are the odds we manage to fix this regression without creating other issues?
@rustbot https://github.com/rustbot label -I-prioritize +I-compiler-nominated
— Reply to this email directly, view it on GitHub https://github.com/rust-lang/rust/issues/47561#issuecomment-1273213165, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABCY4KAQYR7PYS7WQ3QJN3WCQA7BANCNFSM4EMPJ45A . You are receiving this because you are subscribed to this thread.Message ID: @.***>
issue discussed during last week's T-compiler meeting (notes). Follow-up on this Zulip topic
@rustbot label -I-compiler-nominated
Started on an update to https://github.com/Marwes/combine after it being dormant for a while. When I ran the benchmarks to check that my changes hadn't regressed the performance I noticed that performance had regressed by ~28% (~116% with incremental compilation!) since the last time ran benchmarks (somewhere around September 2016).
I ran the benchmarks again against an old version of the library to be able to compile it with older rustc's but the regression is the same in the master branch as well.
cargo bench --bench http
against https://github.com/Marwes/combine/tree/v2.3.2cargo-0.18.0-nightly (a73a665 2017-02-14)
rustc 1.19.0-nightly (554c685b0 2017-06-14)
rustc 1.22.0-nightly (368122087 2017-09-06)
rustc 1.25.0-nightly (6828cf901 2018-01-06) (CARGO_INCREMENTAL=0)
rustc 1.25.0-nightly (6828cf901 2018-01-06) (CARGO_INCREMENTAL=1)
I'd like to bisect this further but the two tools I found for this do not appear to work in this case, is there any other tool that can be used for this?
https://github.com/kamalmarhubi/rust-bisect (Outdated) https://github.com/Mark-Simulacrum/bisect-rust/tree/master/src (Only last 90 days)