[Open] GuillaumePressiat opened this issue 3 months ago
> is this feedback something you've encountered before?
Not sure what you're asking. Yes, we do get performance regression reports and where possible we do want to fix them. But your report is not detailed enough to tell if it's similar to any existing issue.
> Maybe it's just normal as 1.81 is a beta.

Beta builds are not expected to be slower than stable builds.
> If not, I can make a reproducible example I think.
Yes, that would be necessary to narrow down what's causing it.
I'm on it!
Thanks for your response.
Here is a repo with a comparison, in a reproducible manner (as a Rust project):
https://github.com/GuillaumePressiat/rust_fwf_polars_example
I hope it's fine like this.
Do you want an export of a samply profile?
Hey @GuillaumePressiat, thanks for reporting and building a sample of your code. It depends heavily on polars, so I wish there were a way to reduce the polars dependencies and compile a truly minimal piece of code to point us to the rustc regression.
For this we tend to use our cargo-bisect-rustc tool to isolate where the regression started, but in this case I think it's a bit of a long process.
@rustbot label +I-slow +E-needs-mcve
Hi @apiraino
Thanks for your answer.
Do you mean that I have to isolate where the regression started using cargo-bisect-rustc (and lighten the reproducible example to drop polars), or is that a task that you'll do? I just want to be clear on what I need to do to move forward.
Anyone having time and motivation :-) can attempt to reduce the code to an actionable MCVE, then use that code to bisect the compiler using cargo-bisect-rustc and try narrowing down where the regression started. So, yes, if you want you can help, that would be greatly appreciated!
I did a quick test with your code, and on my machine the difference between the nightly and stable compilers doesn't look that big (both are around ~100ms), so the bisection would need to reduce this noise. Maybe your bench tool can help?
Thanks,
If there's no difference on your computer, maybe it's just on (my) Mac Apple Silicon M2 Pro? Are rustc nightlies optimized for Apple chips, I mean?
bench is here (for Mac): https://formulae.brew.sh/formula/bench
Also, starting out with cargo-bisect-rustc, I'm not able to find a regression.
On 64-bit Linux, values are all over the place for both 1.79 and a recent nightly, but I don't see any significant difference. Same range, same mean value.
Thanks!
To test another platform: on my old Intel Mac (2012 MacBook Pro, i7), timings are around 145ms for both stable and nightly.
On my Apple Silicon Mac, I've installed a Rust 1.80 nightly and it runs in normal times:
# nightly-2024-06-07-aarch64-apple-darwin installed - rustc 1.80.0-nightly (98489f248 2024-06-06)
cargo +nightly-2024-06-07 build --release
% bench ./target/release/rust_fwf_polars
benchmarking ./target/release/rust_fwf_polars
time 42.37 ms (40.73 ms .. 43.81 ms)
0.996 R² (0.992 R² .. 0.998 R²)
mean 43.12 ms (41.77 ms .. 46.17 ms)
std dev 3.807 ms (1.810 ms .. 6.321 ms)
variance introduced by outliers: 34% (moderately inflated)
Compared to Rust 1.79 stable:
% bench ./target/release/rust_fwf_polars
benchmarking ./target/release/rust_fwf_polars
time 41.27 ms (39.92 ms .. 42.74 ms)
0.994 R² (0.985 R² .. 0.998 R²)
mean 41.06 ms (40.32 ms .. 42.00 ms)
std dev 1.724 ms (1.251 ms .. 2.311 ms)
variance introduced by outliers: 12% (moderately inflated)
I think I've found the date of the "regression": between the 2024-06-21 and 2024-06-22 nightlies, building and running this Rust project in release mode: https://github.com/GuillaumePressiat/rust_fwf_polars_example
# rustup install nightly-2024-06-22
cargo +nightly-2024-06-22 run --release
% bench ./target/release/rust_fwf_polars
benchmarking ./target/release/rust_fwf_polars
time 94.54 ms (82.58 ms .. 105.6 ms)
0.970 R² (0.916 R² .. 0.994 R²)
mean 100.5 ms (92.48 ms .. 113.0 ms)
std dev 16.04 ms (10.54 ms .. 21.82 ms)
variance introduced by outliers: 53% (severely inflated)
# rustup install nightly-2024-06-21
cargo +nightly-2024-06-21 run --release
% bench ./target/release/rust_fwf_polars
benchmarking ./target/release/rust_fwf_polars
time 42.77 ms (41.82 ms .. 43.96 ms)
0.997 R² (0.994 R² .. 0.999 R²)
mean 42.45 ms (41.52 ms .. 43.25 ms)
std dev 1.729 ms (1.277 ms .. 2.614 ms)
variance introduced by outliers: 13% (moderately inflated)
# rustup install nightly-2024-06-22
cargo +nightly-2024-06-22 run --release
% bench ./target/release/rust_fwf_polars
...
# rustup install nightly-2024-06-21
cargo +nightly-2024-06-22 run --release
% bench ./target/release/rust_fwf_polars
One of these `cargo +nightly-2024-06-22` is not right.
It was a typo, thanks. Corrected.
> Anyone having time and motivation :-) can attempt to reduce the code to an actionable MCVE, then use that code to bisect the compiler using cargo-bisect-rustc and try narrowing down where the regression started. So, yes, if you want you can help, that would be greatly appreciated!
@apiraino I think I've done a piece of that; the regression started around the 2024-06-22 nightly build, at least on my M2 Apple Silicon chip. But it's unclear whether the regression affects all platforms or not.
I'm not sure I'm able to reduce the reprex without the polars dependency; I think it requires a bit more Rust expertise than I have.
> At least on my m2 silicon chip. But it's unclear if regression is for all platform or not..

Can you reproduce the regression without the bench tool?
Yes of course.
I used the bench tool here just to produce more reliable statistics about run times.
I have access to an M1. On nightly-2024-06-21 I see
╰ ➤ hyperfine -w1 -r20 --shell=none ./target/release/rust_fwf_polars
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 48.3 ms ± 4.6 ms [User: 153.9 ms, System: 19.5 ms]
Range (min … max): 41.2 ms … 60.2 ms 20 runs
And on nightly-2024-06-22 I see
╰ ➤ hyperfine -w1 -r20 --shell=none ./target/release/rust_fwf_polars
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 113.3 ms ± 15.2 ms [User: 478.7 ms, System: 20.5 ms]
Range (min … max): 98.7 ms … 148.6 ms 20 runs
I'll see if I can bisect this.
Thanks!
I've tried and been unable to reproduce this on a c6g AWS instance, which suggests that the codegen issue is somehow specific to aarch64-apple-darwin. Wouldn't be the first.
searched nightlies: from nightly-2024-06-20 to nightly-2024-06-23
regressed nightly: nightly-2024-06-22
searched commit range: https://github.com/rust-lang/rust/compare/684b3553f70148ded97a80371c2405984d4f6aa7...c1b336cb6b491b3be02cd821774f03af4992f413
regressed commit: https://github.com/rust-lang/rust/commit/7a08f84627ff3035de4d66ff3209e5fc93165dcb
bisected with cargo-bisect-rustc v0.6.8
Host triple: aarch64-apple-darwin
Reproduce with: cargo bisect-rustc --start=2024-06-20 --end=2024-06-23 --prompt --script script
Beyond this bisection, I don't know what else I can do to help here. What I'd want next is an instruction-level profile, but I do not know how to gather such data on macOS. Ideally we'd have a small program; the existing reproducer of this regression is colossal. But I'm not sure how to minimize this program.
You can vendor everything with `cargo vendor`, then start trying to time internal bits with `Instant` or similar. This way you can easily modify not only the repro code but all the dependencies. One more thing to try is to poke at suspicious bits with `cargo-show-asm`. It should be able to show MIR as well.
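To make the "time internal bits with `Instant`" suggestion concrete, here is a minimal sketch. The `timed` helper and `parse_fixed_width` are hypothetical names invented for illustration; the latter stands in for whatever vendored polars internals you'd actually be measuring.

```rust
use std::time::Instant;

// Hypothetical helper: run a closure, print its elapsed time with a label,
// and return its result. Sprinkle calls like this through vendored code.
fn timed<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    println!("{label}: {:?}", start.elapsed());
    out
}

// Stand-in for the real work under test: slice each field by char offsets,
// roughly like fixed-width-file parsing does.
fn parse_fixed_width(line: &str, offsets: &[(usize, usize)]) -> Vec<String> {
    offsets
        .iter()
        .map(|&(start, len)| line.chars().skip(start).take(len).collect())
        .collect()
}

fn main() {
    let line = "ABC123XYZ";
    let fields = timed("parse_fixed_width", || {
        parse_fixed_width(line, &[(0, 3), (3, 3), (6, 3)])
    });
    assert_eq!(fields, vec!["ABC", "123", "XYZ"]);
}
```

After vendoring, the same pattern can wrap successively smaller regions of the dependency's code to narrow down which part regressed.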
These are based on cold builds with target/ and the lockfile deleted.
Intel Core i5-1135G7 results: 8 GB RAM.
Intel Core i5-3470 results: 32 GB RAM.
Using the latest stable, elapsed time compiling tauri: i5-1135G7: 1 minute; i5-3470: 3 minutes.
@x4exr Please open a separate issue, your comment seems entirely unrelated to this issue.
thanks @saethlin.
You have precisely identified the regression commit https://github.com/rust-lang/rust/commit/7a08f84627ff3035de4d66ff3209e5fc93165dcb which contains multiple edits, so naively I guess it's possible to build rustc dropping the edits one by one? I'm just asking out of curiosity (I have no expertise here).
I strongly advise against attempting to fix this regression by playing whack-a-mole with the contents of that PR. The change there is a change to the MIR inliner heuristics, which should not be capable of producing a 100% runtime regression. Most likely the actual bug here is in LLVM, and by dumb luck that MIR inliner change happens to not produce the IR pattern that a very important Apple Silicon architecture-specific optimization is looking for. Understanding that and fixing it in LLVM is almost certainly the solution.
The right way to investigate this and most bugs in the compiler is to bisect to the regressing commit then reduce the reproducer, that's why we have issue labels for needs-bisection and needs-mcve. This issue needs a minimal reproducer.
Very clear and interesting. Thanks for your explanation.
For the minimal reproducer, I understand, but I don't know at all how to do it. Would a way to do this be to extract and use minimal polars internals without the polars dependency? I don't know if that's feasible.
Here I've studied the polars expressions to find which one slows down in the regression, but all expressions seem to be involved.
> Here I've studied polars expression to find which one slow down in regression but all expressions seems to be involved.
That's good, pick the slowest and keep measuring its internals.
Tested on my Intel Lubuntu machine (Lubuntu 22.04.4 LTS x86_64, Intel i7-7700K (8) @ 4.50GHz) and I can also see a difference between 1.81-nightly and 1.79:
hyperfine --style basic --warmup 10 -m 100 target/release/rust_fwf_polars_179
Benchmark 1: target/release/rust_fwf_polars_179
Time (mean ± σ): 82.3 ms ± 7.5 ms [User: 355.5 ms, System: 45.7 ms]
Range (min … max): 72.1 ms … 110.0 ms 100 runs
hyperfine --style basic --warmup 10 -m 100 target/release/rust_fwf_polars_181-nightly
Benchmark 1: target/release/rust_fwf_polars_181-nightly
Time (mean ± σ): 93.3 ms ± 12.0 ms [User: 419.2 ms, System: 45.6 ms]
Range (min … max): 81.7 ms … 125.2 ms 100 runs
Based on the confidence intervals provided, those benchmark reports are not significantly different.
Indeed. To be clear for external readers of this issue: that is not the case on Apple Silicon Macs, where the difference is clear.
nightly 2024-06-22 (slower):
% hyperfine -w1 -r50 --shell=none ./target/release/rust_fwf_polars
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 94.7 ms ± 18.4 ms [User: 445.0 ms, System: 20.6 ms]
Range (min … max): 67.4 ms … 157.6 ms 50 runs
than nightly 2024-06-21:
% hyperfine -w1 -r50 --shell=none ./target/release/rust_fwf_polars
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 39.0 ms ± 4.0 ms [User: 149.5 ms, System: 20.1 ms]
Range (min … max): 31.7 ms … 51.0 ms 50 runs
Here, multiple branches at https://github.com/GuillaumePressiat/rust_fwf_polars_example show that the problem seems to be in the polars slice. I've tried to study the polars slice internals to make a minimal reprex, but it's difficult for me and I don't have much time to work on it.
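For readers unfamiliar with why a string slice could be hot at all: polars' `str.slice` counts offsets in characters (Unicode scalar values), not bytes, so each call must walk the UTF-8 sequence from the start. A rough plain-Rust sketch of the equivalent operation (`substr_by_chars` is a hypothetical name, not the polars API):

```rust
// Hypothetical stand-in for a `str.slice(offset, length)` by character count.
// Offsets are in code points, so byte boundaries must be found by walking
// the string; this is the kind of per-code-point loop the MCVE exercises.
fn substr_by_chars(s: &str, offset: usize, length: usize) -> &str {
    let start = s
        .char_indices()
        .nth(offset)
        .map(|(i, _)| i)
        .unwrap_or(s.len());
    let end = s[start..]
        .char_indices()
        .nth(length)
        .map(|(i, _)| start + i)
        .unwrap_or(s.len());
    &s[start..end]
}

fn main() {
    assert_eq!(substr_by_chars("hello world", 6, 5), "world");
    assert_eq!(substr_by_chars("héllo", 1, 2), "él");
    println!("ok");
}
```

Because the per-character loop is so tight, small codegen changes (like the branch layout inside UTF-8 decoding) can have outsized effects on it.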
No difference with the default branch.
Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
elakelaiset% lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
elakelaiset% cat nightly-2024-06-21
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 76.6 ms ± 9.0 ms [User: 340.6 ms, System: 53.2 ms]
Range (min … max): 61.1 ms … 99.9 ms 50 runs
elakelaiset% cat nightly-2024-06-22
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 73.1 ms ± 8.5 ms [User: 331.5 ms, System: 51.0 ms]
Range (min … max): 55.5 ms … 90.2 ms 50 runs
Yes, the problem is inside LLVM, for Mac M chips only.

> yes the problem is inside LLVM for Mac M chips only.

I misread your comment :)

> I misread your comment :)

Maybe my English is bad! :)
@GuillaumePressiat I was just glancing at your https://github.com/GuillaumePressiat/rust_fwf_polars_example and does the data file need to be that large? If you cut it down to only a couple of lines, does it still reproduce? If you try cutting out random portions of the code, can you still get it to reproduce? Remember, it doesn't need to produce similar output to reproduce; it just needs to exhibit the same regression.
@mlindner thanks! In fact, the data file is big because my real data is big, and that is where I observed the regression. And I doubt that hyperfine is significant with really small run times: with a file of just a few lines (20, 100, 1000), hyperfine shows no real difference. With thousands of lines (15000, for instance), the difference begins to appear.
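One way to keep a large input without shipping real data would be to generate a synthetic fixed-width file of whatever size reproduces the regression. This is a hypothetical sketch (the field layout and file name are invented, not the repo's actual format):

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

// Hypothetical generator for a synthetic fixed-width file, so the reproducer's
// input size can be scaled (e.g. 100 vs 15_000 lines) at will.
fn write_fixed_width(path: &str, lines: usize) -> std::io::Result<()> {
    let mut out = BufWriter::new(File::create(path)?);
    for i in 0..lines {
        // Three fixed-width fields: 8-digit id, 10-char left-padded name,
        // 6-digit value. Each line is exactly 24 characters wide.
        writeln!(out, "{:08}{:<10}{:06}", i, "name", i % 1000)?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    write_fixed_width("synthetic.fwf", 15_000)?;
    println!("wrote synthetic.fwf");
    Ok(())
}
```

Scaling `lines` up and down would also let you check where the timing difference starts to appear, as observed above around thousands of lines.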
> If you try cutting out random portions of the code can you still get it to reproduce?
For this, yes: I've tried and identified that str.slice seems to be where it slows down. https://github.com/GuillaumePressiat/rust_fwf_polars_example/blob/main/README.md#analysis-of-what-is-slow
WG-prioritization assigning priority (Zulip discussion).
@rustbot label -I-prioritize +P-medium
As someone not familiar with the process: would it make sense, and how would someone go about, bisecting the Rust compiler across different LLVM versions, given that this is presumed to be an LLVM bug? Could that distinguish an LLVM regression from an LLVM bug that has always existed?
Pinging @rust-lang/wg-llvm because I think this is an LLVM codegen change that seems specific to Apple Silicon. I don't think there's anything else the community is likely to do to narrow this down and we need a backend expert who has Apple hardware on hand.
Lowering the optimization level can improve performance (lol):
[profile.release.package.polars-ops]
opt-level = 1
But this weekend, I plan to address some other issues.
Indeed
# ## 2024-06-22
# [profile.release.package.polars-ops]
# opt-level = 1
% hyperfine -N -w1 -r20 ./target/release/rust_fwf_polars
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 72.6 ms ± 6.6 ms [User: 345.6 ms, System: 21.3 ms]
Range (min … max): 59.7 ms … 87.7 ms 20 runs
# ## 2024-06-21
# [profile.release.package.polars-ops]
# opt-level = 1
% hyperfine -N -w1 -r20 ./target/release/rust_fwf_polars
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 60.2 ms ± 5.8 ms [User: 261.9 ms, System: 20.9 ms]
Range (min … max): 51.8 ms … 79.1 ms 20 runs
# ## 2024-06-22
# [profile.release.package.polars-ops]
# opt-level = 2
% hyperfine -N -w1 -r20 ./target/release/rust_fwf_polars
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 101.1 ms ± 22.5 ms [User: 460.6 ms, System: 21.1 ms]
Range (min … max): 74.0 ms … 179.0 ms 20 runs
# ## 2024-06-21
# [profile.release.package.polars-ops]
# opt-level = 2
% hyperfine -N -w1 -r20 ./target/release/rust_fwf_polars
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 39.8 ms ± 4.5 ms [User: 150.7 ms, System: 21.1 ms]
Range (min … max): 34.2 ms … 53.7 ms 20 runs
# ## 2024-06-22
# [profile.release.package.polars-ops]
# opt-level = 3
% hyperfine -N -w1 -r20 ./target/release/rust_fwf_polars
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 92.8 ms ± 15.2 ms [User: 447.2 ms, System: 20.6 ms]
Range (min … max): 65.8 ms … 128.9 ms 20 runs
# ## 2024-06-21
# [profile.release.package.polars-ops]
# opt-level = 3
% hyperfine -N -w1 -r20 ./target/release/rust_fwf_polars
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 38.3 ms ± 3.6 ms [User: 150.6 ms, System: 20.2 ms]
Range (min … max): 32.7 ms … 44.9 ms 20 runs
I made some progress investigating this issue last night, so I looked at it further today.
I was lucky enough to find the relevant code quickly: substring_ternary and EarlyIfConversion.
BTW, I bisected polars-ops using the RUSTC environment variable and the following script:
#!/usr/bin/env bash
EXTRA_ARGS=""
for arg in "$@"
do
    # apply the extra flags only when compiling the polars_ops crate
    if [ "$arg" == "polars_ops" ]; then
        EXTRA_ARGS="-Copt-level=2 -Cllvm-args=-opt-bisect-limit=$OPT_BISECT_LIMIT -Zno-parallel-backend"
        break
    fi
done
~/.rustup/toolchains/nightly-2024-06-22-aarch64-apple-darwin/bin/rustc "$@" $EXTRA_ARGS
By adding -Cllvm-args=-disable-early-ifcvt, I believe the problem is here.
A minimally reproducible code is as follows:
#![allow(internal_features)]
#![feature(str_internals, core_intrinsics)]

use core::str::next_code_point;
use std::intrinsics::transmute_unchecked;

#[no_mangle]
#[inline(never)]
pub unsafe fn substring_ternary(opt_str_val: &[u8], opt_offset: usize) -> bool {
    let mut iter = opt_str_val.iter();
    for _ in 0..opt_offset {
        let d = next_code_point(&mut iter);
        let d = match d {
            Some(c) => {
                let c = transmute_unchecked::<u32, char>(c) as u32;
                c + 1
            }
            None => 0,
        };
        if d == 0 {
            return false;
        }
    }
    true
}

pub fn main() {
    let len = 10000000;
    let data = "000000000000000000000000";
    // let data = "😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀";
    let mut result = true;
    for _ in 0..len {
        let s = unsafe { substring_ternary(data.as_bytes(), 10) };
        result &= s;
    }
    println!("{result}");
}
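For readers without a nightly toolchain, here is a roughly equivalent stable-Rust sketch of the MCVE above. Note it may not reproduce the exact codegen, since the regression was bisected on the nightly-only `next_code_point` path; this just shows the shape of the hot loop.

```rust
// Stable approximation of substring_ternary: advance through the string by
// code points and report whether at least `offset` characters exist.
#[inline(never)]
pub fn has_at_least_chars(s: &str, offset: usize) -> bool {
    let mut iter = s.chars();
    for _ in 0..offset {
        if iter.next().is_none() {
            return false;
        }
    }
    true
}

fn main() {
    let data = "000000000000000000000000";
    let mut result = true;
    for _ in 0..1_000_000 {
        // black_box keeps the loop from being optimized away entirely
        result &= has_at_least_chars(std::hint::black_box(data), 10);
    }
    println!("{result}");
}
```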
@rustbot label -E-needs-mcve +S-has-mcve
Compare next_code_point to the assembly code: https://rust.godbolt.org/z/n4Gn8osdr. On apple-m1, LLVM hoists instructions on the x >= 0xE0 and x >= 0xF0 branches.
Although I can read the backend code, I don't understand the CPU features involved. From the comments, it looks like EarlyIfConversion is mainly used on some CPUs with weak branch prediction, but I believe modern CPUs have good branch prediction. By debugging the code, I found that one of the conditions is the branch misprediction penalty: EarlyIfConversion.cpp#L914. apple-m1's mispredict penalty is 16; cortex-x1's is 14.
I'm guessing that recent CPUs should not use the CycloneModel but define a new SchedMachineModel?
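For context on what the EarlyIfConversion pass being discussed actually does: it replaces a short branch with a conditional select (a csel on AArch64), so both sides execute unconditionally and the result is chosen by a data dependency. A source-level sketch of the two shapes (function names are invented for illustration; whether LLVM actually if-converts a given function depends on its heuristics):

```rust
// Branchy form: only one side is computed, but a mispredicted branch
// costs a pipeline flush (16 cycles on apple-m1, per the comment above).
#[inline(never)]
fn branchy(c: bool, a: u32, b: u32) -> u32 {
    if c { a + 1 } else { b * 2 }
}

// If-converted form: both sides are computed unconditionally, then a
// data-dependent select picks the result. Same semantics, no branch.
#[inline(never)]
fn if_converted(c: bool, a: u32, b: u32) -> u32 {
    let t = a + 1;
    let f = b * 2;
    if c { t } else { f } // this pattern typically lowers to a select/csel
}

fn main() {
    assert_eq!(branchy(true, 3, 5), if_converted(true, 3, 5));
    assert_eq!(branchy(false, 3, 5), if_converted(false, 3, 5));
    println!("ok");
}
```

The trade-off is exactly what the heuristic at EarlyIfConversion.cpp weighs: paying for both sides every time versus occasionally paying the misprediction penalty, which is why the modelled penalty value matters so much here.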
Thank you very much @DianQK! I suppose we have to wait a little bit for the LLVM team to focus on this issue.
FWIW (and I don't have time to provide any examples): when we switched from 1.79 to 1.81 we started getting OOM errors when compiling code with the newer version, in local Linux Docker containers, in Drone EC2 builds, and on Lambdas.
So for now we've decided to just pin to 1.79.
May or may not be related to this issue.
Hi,
I've built my Rust project with Rust 1.79 and am now testing with a Rust 1.81 nightly. I observe slower runtimes.
This project is about parsing text files with polars. Cargo.toml is the same in both contexts.
With Rust 1.79
With Rust 1.81
It's an issue just to keep track of this observation: is this feedback something you've encountered before? Maybe it's just normal as 1.81 is a beta.
If not, I can make a reproducible example I think.
Thanks