[Open] GuillaumePressiat opened this issue 3 months ago
> is this feedback something you've encountered before?
Not sure what you're asking. Yes, we do get performance regression reports and where possible we do want to fix them. But your report is not detailed enough to tell if it's similar to any existing issue.
> Maybe it's just normal as 1.81 is a beta.

Beta builds are not expected to be slower than stable builds.
> If not, I can make a reproducible example I think.
Yes, that would be necessary to narrow down what's causing it.
I'm on it!
Thanks for your response.
Here is a repo with a comparison, in a reproducible manner (as a Rust project):
https://github.com/GuillaumePressiat/rust_fwf_polars_example
I hope it's fine like this.
Do you want an export of a samply profile?
Hey @GuillaumePressiat, thanks for reporting and building a sample of your code. It depends heavily on polars, so I wish there were a way to reduce the polars dependencies and compile a truly minimal piece of code to point us to the rustc regression.
For this we tend to use our cargo-bisect-rustc tool to isolate where the regression started, but in this case I think it's a bit of a long process.
@rustbot label +I-slow +E-needs-mcve
Hi @apiraino
Thanks for your answer.
Do you mean that I have to isolate where the regression started using cargo-bisect-rustc (and lighten the reproducible example to drop polars), or is that a task that you'll do? I just want to be clear on what I need to do to move forward.
Anyone having time and motivation :-) can attempt to reduce the code to an actionable MCVE, then use that code to bisect the compiler using cargo-bisect-rustc and try narrowing down where the regression started. So, yes, if you want you can help, that would be greatly appreciated!
I did a quick test with your code, and on my machine the difference between the nightly and stable compilers doesn't look that big (both are around ~100ms), so the bisection would need to reduce this noise. Maybe your bench tool can help?
Thanks,
If there's no difference on your computer, maybe it's just on (my) Mac Apple Silicon M2 Pro? Are rustc nightlies optimized for Apple chips, I mean?
bench is here (for Mac): https://formulae.brew.sh/formula/bench
Also, starting out with cargo-bisect-rustc, I'm not able to find a regression.
On 64-bit Linux, values are all over the place for both 1.79 and a recent nightly, but I don't see any significant difference. Same range, same mean value.
Thanks!
To test another platform: on my old Intel Mac (2012 MacBook Pro, i7), timings are around 145ms for both stable and nightly.
On my Apple Silicon Mac, I've installed a Rust 1.80 nightly and it runs in normal times:
# nightly-2024-06-07-aarch64-apple-darwin installed - rustc 1.80.0-nightly (98489f248 2024-06-06)
cargo +nightly-2024-06-07 build --release
% bench ./target/release/rust_fwf_polars
benchmarking ./target/release/rust_fwf_polars
time 42.37 ms (40.73 ms .. 43.81 ms)
0.996 R² (0.992 R² .. 0.998 R²)
mean 43.12 ms (41.77 ms .. 46.17 ms)
std dev 3.807 ms (1.810 ms .. 6.321 ms)
variance introduced by outliers: 34% (moderately inflated)
Compared to Rust 1.79 stable:
% bench ./target/release/rust_fwf_polars
benchmarking ./target/release/rust_fwf_polars
time 41.27 ms (39.92 ms .. 42.74 ms)
0.994 R² (0.985 R² .. 0.998 R²)
mean 41.06 ms (40.32 ms .. 42.00 ms)
std dev 1.724 ms (1.251 ms .. 2.311 ms)
variance introduced by outliers: 12% (moderately inflated)
I think I've found the date of the "regression": between the 2024-06-21 and 2024-06-22 nightlies, building and running this Rust project in release mode: https://github.com/GuillaumePressiat/rust_fwf_polars_example
# rustup install nightly-2024-06-22
cargo +nightly-2024-06-22 run --release
% bench ./target/release/rust_fwf_polars
benchmarking ./target/release/rust_fwf_polars
time 94.54 ms (82.58 ms .. 105.6 ms)
0.970 R² (0.916 R² .. 0.994 R²)
mean 100.5 ms (92.48 ms .. 113.0 ms)
std dev 16.04 ms (10.54 ms .. 21.82 ms)
variance introduced by outliers: 53% (severely inflated)
# rustup install nightly-2024-06-21
cargo +nightly-2024-06-21 run --release
% bench ./target/release/rust_fwf_polars
benchmarking ./target/release/rust_fwf_polars
time 42.77 ms (41.82 ms .. 43.96 ms)
0.997 R² (0.994 R² .. 0.999 R²)
mean 42.45 ms (41.52 ms .. 43.25 ms)
std dev 1.729 ms (1.277 ms .. 2.614 ms)
variance introduced by outliers: 13% (moderately inflated)
# rustup install nightly-2024-06-22
cargo +nightly-2024-06-22 run --release
% bench ./target/release/rust_fwf_polars
...
# rustup install nightly-2024-06-21
cargo +nightly-2024-06-22 run --release
% bench ./target/release/rust_fwf_polars
One of these `cargo +nightly-2024-06-22` is not right.
It was a typo, thanks. Corrected.
> Anyone having time and motivation :-) can attempt to reduce the code to an actionable MCVE, then use that code to bisect the compiler using cargo-bisect-rustc and try narrowing down where the regression started. So, yes, if you want you can help, that would be greatly appreciated!
@apiraino I think I've done a piece of that; the regression started around the 2024-06-22 nightly build, at least on my M2 Apple Silicon chip. But it's unclear whether the regression affects all platforms or not.
I'm not sure I'm able to reduce the reprex without the polars dependency; I think it requires a bit more Rust expertise than I have.
> At least on my m2 silicon chip. But it's unclear if regression is for all platform or not..

Can you reproduce the regression without the bench tool?
Yes of course.
I used the bench tool here just to produce more reliable statistics about run times.
I have access to an M1. On nightly-2024-06-21 I see
╰ ➤ hyperfine -w1 -r20 --shell=none ./target/release/rust_fwf_polars
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 48.3 ms ± 4.6 ms [User: 153.9 ms, System: 19.5 ms]
Range (min … max): 41.2 ms … 60.2 ms 20 runs
And on nightly-2024-06-22 I see
╰ ➤ hyperfine -w1 -r20 --shell=none ./target/release/rust_fwf_polars
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 113.3 ms ± 15.2 ms [User: 478.7 ms, System: 20.5 ms]
Range (min … max): 98.7 ms … 148.6 ms 20 runs
I'll see if I can bisect this.
Thanks!
I've tried and been unable to reproduce this on a c6g AWS instance, which suggests that the codegen issue is somehow specific to aarch64-apple-darwin. Wouldn't be the first.
searched nightlies: from nightly-2024-06-20 to nightly-2024-06-23
regressed nightly: nightly-2024-06-22
searched commit range: https://github.com/rust-lang/rust/compare/684b3553f70148ded97a80371c2405984d4f6aa7...c1b336cb6b491b3be02cd821774f03af4992f413
regressed commit: https://github.com/rust-lang/rust/commit/7a08f84627ff3035de4d66ff3209e5fc93165dcb
bisected with cargo-bisect-rustc v0.6.8
Host triple: aarch64-apple-darwin
Reproduce with: cargo bisect-rustc --start=2024-06-20 --end=2024-06-23 --prompt --script script
Beyond this bisection, I don't know what else I can do to help here. What I'd want next is an instruction-level profile, but I do not know how to gather such data on macOS. Ideally we'd have a small program; the existing reproducer of this regression is colossal. But I'm not sure how to minimize this program.
You can vendor everything with `cargo vendor`, then start trying to time internal bits with `Instant` or similar. This way you can easily modify not only the repro code but all the dependencies. One more thing to try is to poke at suspicious bits with `cargo-show-asm`. It should be able to show MIR as well.
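To make the "time internal bits with `Instant`" suggestion concrete, here is a minimal sketch. The `timed` helper and `parse_fixed_width` are hypothetical names invented for illustration; the latter stands in for whatever vendored polars internals you'd actually be measuring.

```rust
use std::time::Instant;

// Hypothetical helper: run a closure, print its elapsed time with a label,
// and return its result. Sprinkle calls like this through vendored code.
fn timed<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    println!("{label}: {:?}", start.elapsed());
    out
}

// Stand-in for the real work under test: slice each field by char offsets,
// roughly like fixed-width-file parsing does.
fn parse_fixed_width(line: &str, offsets: &[(usize, usize)]) -> Vec<String> {
    offsets
        .iter()
        .map(|&(start, len)| line.chars().skip(start).take(len).collect())
        .collect()
}

fn main() {
    let line = "ABC123XYZ";
    let fields = timed("parse_fixed_width", || {
        parse_fixed_width(line, &[(0, 3), (3, 3), (6, 3)])
    });
    assert_eq!(fields, vec!["ABC", "123", "XYZ"]);
}
```

After vendoring, the same pattern can wrap successively smaller regions of the dependency's code to narrow down which part regressed.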
These are based on cold builds with target/ and the lockfile deleted.
Intel Core i5-1135G7 results: 8 GB RAM.
Intel Core i5-3470 results: 32 GB RAM.
Using the latest stable, elapsed time compiling tauri: i5-1135G7: 1 minute; i5-3470: 3 minutes.
@x4exr Please open a separate issue, your comment seems entirely unrelated to this issue.
thanks @saethlin.
You have precisely identified the regression commit https://github.com/rust-lang/rust/commit/7a08f84627ff3035de4d66ff3209e5fc93165dcb which contains multiple edits, so naively I guess it's possible to build rustc dropping the edits one by one? I'm just asking out of curiosity (I have no expertise here).
I strongly advise against attempting to fix this regression by playing whack-a-mole with the contents of that PR. The change there is a change to the MIR inliner heuristics, which should not be capable of producing a 100% runtime regression. Most likely the actual bug here is in LLVM, and by dumb luck that MIR inliner change happens to not produce the IR pattern that a very important Apple Silicon architecture-specific optimization is looking for. Understanding that and fixing it in LLVM is almost certainly the solution.
The right way to investigate this and most bugs in the compiler is to bisect to the regressing commit then reduce the reproducer, that's why we have issue labels for needs-bisection and needs-mcve. This issue needs a minimal reproducer.
Very clear and interesting. Thanks for your explanation.
For the minimal reproducer, I understand, but I don't know at all how to do it. Would a way to do this be to extract and use minimal polars internals without the polars dependency? I don't know if that's feasible.
Here I've studied the polars expressions to find which one slows down in the regression, but all expressions seem to be involved.
> Here I've studied polars expression to find which one slow down in regression but all expressions seems to be involved.
That's good, pick the slowest and keep measuring its internals.
Tested on my Intel Lubuntu machine (Lubuntu 22.04.4 LTS x86_64, Intel i7-7700K (8) @ 4.50GHz) and I can also see a difference between 1.81-nightly and 1.79:
hyperfine --style basic --warmup 10 -m 100 target/release/rust_fwf_polars_179
Benchmark 1: target/release/rust_fwf_polars_179
Time (mean ± σ): 82.3 ms ± 7.5 ms [User: 355.5 ms, System: 45.7 ms]
Range (min … max): 72.1 ms … 110.0 ms 100 runs
hyperfine --style basic --warmup 10 -m 100 target/release/rust_fwf_polars_181-nightly
Benchmark 1: target/release/rust_fwf_polars_181-nightly
Time (mean ± σ): 93.3 ms ± 12.0 ms [User: 419.2 ms, System: 45.6 ms]
Range (min … max): 81.7 ms … 125.2 ms 100 runs
Based on the confidence intervals provided, those benchmark reports are not significantly different.
Indeed. To be clear for external readers of this issue: that is not the case on Apple Silicon Macs, where the difference is clear.
nightly 2024-06-22 (slower):
% hyperfine -w1 -r50 --shell=none ./target/release/rust_fwf_polars
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 94.7 ms ± 18.4 ms [User: 445.0 ms, System: 20.6 ms]
Range (min … max): 67.4 ms … 157.6 ms 50 runs
than nightly 2024-06-21:
% hyperfine -w1 -r50 --shell=none ./target/release/rust_fwf_polars
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 39.0 ms ± 4.0 ms [User: 149.5 ms, System: 20.1 ms]
Range (min … max): 31.7 ms … 51.0 ms 50 runs
Here, multiple branches at https://github.com/GuillaumePressiat/rust_fwf_polars_example show that the problem seems to be in the polars slice. I've tried to study the polars slice internals to make a minimal reprex, but it's difficult for me and I don't have much time to work on it.
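For readers unfamiliar with why a string slice could be hot at all: polars' `str.slice` counts offsets in characters (Unicode scalar values), not bytes, so each call must walk the UTF-8 sequence from the start. A rough plain-Rust sketch of the equivalent operation (`substr_by_chars` is a hypothetical name, not the polars API):

```rust
// Hypothetical stand-in for a `str.slice(offset, length)` by character count.
// Offsets are in code points, so byte boundaries must be found by walking
// the string; this is the kind of per-code-point loop the MCVE exercises.
fn substr_by_chars(s: &str, offset: usize, length: usize) -> &str {
    let start = s
        .char_indices()
        .nth(offset)
        .map(|(i, _)| i)
        .unwrap_or(s.len());
    let end = s[start..]
        .char_indices()
        .nth(length)
        .map(|(i, _)| start + i)
        .unwrap_or(s.len());
    &s[start..end]
}

fn main() {
    assert_eq!(substr_by_chars("hello world", 6, 5), "world");
    assert_eq!(substr_by_chars("héllo", 1, 2), "él");
    println!("ok");
}
```

Because the per-character loop is so tight, small codegen changes (like the branch layout inside UTF-8 decoding) can have outsized effects on it.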
No difference with the default branch.
Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
elakelaiset% lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
elakelaiset% cat nightly-2024-06-21
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 76.6 ms ± 9.0 ms [User: 340.6 ms, System: 53.2 ms]
Range (min … max): 61.1 ms … 99.9 ms 50 runs
elakelaiset% cat nightly-2024-06-22
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 73.1 ms ± 8.5 ms [User: 331.5 ms, System: 51.0 ms]
Range (min … max): 55.5 ms … 90.2 ms 50 runs
Yes, the problem is inside LLVM, for Mac M chips only.

> yes the problem is inside LLVM for Mac M chips only.

I misread your comment :)

> I misread your comment :)

Maybe my English is bad! :)
@GuillaumePressiat I was just glancing at your https://github.com/GuillaumePressiat/rust_fwf_polars_example and does the data file need to be that large? If you cut it down to only a couple of lines, does it still reproduce? If you try cutting out random portions of the code, can you still get it to reproduce? Remember, it doesn't need to produce similar output to reproduce; it just needs to exhibit the same regression.
@mlindner thanks! In fact, the data file is big because my real data is big, and that is where I observed the regression. And I doubt that hyperfine is significant with really small run times: with a file of just a few lines (20, 100, 1000), hyperfine shows no real difference. With thousands of lines (15000, for instance), the difference begins to appear.
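One way to keep a large input without shipping real data would be to generate a synthetic fixed-width file of whatever size reproduces the regression. This is a hypothetical sketch (the field layout and file name are invented, not the repo's actual format):

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

// Hypothetical generator for a synthetic fixed-width file, so the reproducer's
// input size can be scaled (e.g. 100 vs 15_000 lines) at will.
fn write_fixed_width(path: &str, lines: usize) -> std::io::Result<()> {
    let mut out = BufWriter::new(File::create(path)?);
    for i in 0..lines {
        // Three fixed-width fields: 8-digit id, 10-char left-padded name,
        // 6-digit value. Each line is exactly 24 characters wide.
        writeln!(out, "{:08}{:<10}{:06}", i, "name", i % 1000)?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    write_fixed_width("synthetic.fwf", 15_000)?;
    println!("wrote synthetic.fwf");
    Ok(())
}
```

Scaling `lines` up and down would also let you check where the timing difference starts to appear, as observed above around thousands of lines.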
> If you try cutting out random portions of the code can you still get it to reproduce?
For this, yes: I've tried and identified that str.slice seems to be where it slows down. https://github.com/GuillaumePressiat/rust_fwf_polars_example/blob/main/README.md#analysis-of-what-is-slow
WG-prioritization assigning priority (Zulip discussion).
@rustbot label -I-prioritize +P-medium
As someone not familiar with the process: would it make sense, and how would someone go about, bisecting the Rust compiler across different LLVM versions, given that this is presumed to be an LLVM bug? Could that distinguish an LLVM regression from an LLVM bug that has always existed?
Pinging @rust-lang/wg-llvm because I think this is an LLVM codegen change that seems specific to Apple Silicon. I don't think there's anything else the community is likely to do to narrow this down and we need a backend expert who has Apple hardware on hand.
Lowering the optimization level can improve performance (lol):
[profile.release.package.polars-ops]
opt-level = 1
But this weekend, I plan to address some other issues.
Indeed
# ## 2024-06-22
# [profile.release.package.polars-ops]
# opt-level = 1
% hyperfine -N -w1 -r20 ./target/release/rust_fwf_polars
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 72.6 ms ± 6.6 ms [User: 345.6 ms, System: 21.3 ms]
Range (min … max): 59.7 ms … 87.7 ms 20 runs
# ## 2024-06-21
# [profile.release.package.polars-ops]
# opt-level = 1
% hyperfine -N -w1 -r20 ./target/release/rust_fwf_polars
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 60.2 ms ± 5.8 ms [User: 261.9 ms, System: 20.9 ms]
Range (min … max): 51.8 ms … 79.1 ms 20 runs
# ## 2024-06-22
# [profile.release.package.polars-ops]
# opt-level = 2
% hyperfine -N -w1 -r20 ./target/release/rust_fwf_polars
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 101.1 ms ± 22.5 ms [User: 460.6 ms, System: 21.1 ms]
Range (min … max): 74.0 ms … 179.0 ms 20 runs
# ## 2024-06-21
# [profile.release.package.polars-ops]
# opt-level = 2
% hyperfine -N -w1 -r20 ./target/release/rust_fwf_polars
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 39.8 ms ± 4.5 ms [User: 150.7 ms, System: 21.1 ms]
Range (min … max): 34.2 ms … 53.7 ms 20 runs
# ## 2024-06-22
# [profile.release.package.polars-ops]
# opt-level = 3
% hyperfine -N -w1 -r20 ./target/release/rust_fwf_polars
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 92.8 ms ± 15.2 ms [User: 447.2 ms, System: 20.6 ms]
Range (min … max): 65.8 ms … 128.9 ms 20 runs
# ## 2024-06-21
# [profile.release.package.polars-ops]
# opt-level = 3
% hyperfine -N -w1 -r20 ./target/release/rust_fwf_polars
Benchmark 1: ./target/release/rust_fwf_polars
Time (mean ± σ): 38.3 ms ± 3.6 ms [User: 150.6 ms, System: 20.2 ms]
Range (min … max): 32.7 ms … 44.9 ms 20 runs
I made some progress investigating this issue last night, so I looked at it further today.
I was lucky enough to find the relevant code quickly: substring_ternary and EarlyIfConversion.
BTW, I bisected polars-ops using the RUSTC environment variable and the following script:
#!/usr/bin/env bash
EXTRA_ARGS=""
for arg in "$@"
do
    # apply the extra flags only when compiling the polars_ops crate
    if [ "$arg" == "polars_ops" ]; then
        EXTRA_ARGS="-Copt-level=2 -Cllvm-args=-opt-bisect-limit=$OPT_BISECT_LIMIT -Zno-parallel-backend"
        break
    fi
done
~/.rustup/toolchains/nightly-2024-06-22-aarch64-apple-darwin/bin/rustc "$@" $EXTRA_ARGS
By adding -Cllvm-args=-disable-early-ifcvt, I believe the problem is here.
A minimally reproducible code is as follows:
#![allow(internal_features)]
#![feature(str_internals, core_intrinsics)]

use core::str::next_code_point;
use std::intrinsics::transmute_unchecked;

#[no_mangle]
#[inline(never)]
pub unsafe fn substring_ternary(opt_str_val: &[u8], opt_offset: usize) -> bool {
    let mut iter = opt_str_val.iter();
    for _ in 0..opt_offset {
        let d = next_code_point(&mut iter);
        let d = match d {
            Some(c) => {
                let c = transmute_unchecked::<u32, char>(c) as u32;
                c + 1
            }
            None => 0,
        };
        if d == 0 {
            return false;
        }
    }
    true
}

pub fn main() {
    let len = 10000000;
    let data = "000000000000000000000000";
    // let data = "😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀😀";
    let mut result = true;
    for _ in 0..len {
        let s = unsafe { substring_ternary(data.as_bytes(), 10) };
        result &= s;
    }
    println!("{result}");
}
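For readers without a nightly toolchain, here is a roughly equivalent stable-Rust sketch of the MCVE above. Note it may not reproduce the exact codegen, since the regression was bisected on the nightly-only `next_code_point` path; this just shows the shape of the hot loop.

```rust
// Stable approximation of substring_ternary: advance through the string by
// code points and report whether at least `offset` characters exist.
#[inline(never)]
pub fn has_at_least_chars(s: &str, offset: usize) -> bool {
    let mut iter = s.chars();
    for _ in 0..offset {
        if iter.next().is_none() {
            return false;
        }
    }
    true
}

fn main() {
    let data = "000000000000000000000000";
    let mut result = true;
    for _ in 0..1_000_000 {
        // black_box keeps the loop from being optimized away entirely
        result &= has_at_least_chars(std::hint::black_box(data), 10);
    }
    println!("{result}");
}
```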
@rustbot label -E-needs-mcve +S-has-mcve
Compare next_code_point to the assembly code: https://rust.godbolt.org/z/n4Gn8osdr. On apple-m1, LLVM hoists instructions on the x >= 0xE0 and x >= 0xF0 branches.
Although I can read the backend code, I don't understand the CPU features involved. From the comments, it looks like EarlyIfConversion is mainly used on some CPUs with weak branch prediction, but I believe modern CPUs have good branch prediction. By debugging the code, I found that one of the conditions is the branch misprediction penalty: EarlyIfConversion.cpp#L914. apple-m1's mispredict penalty is 16; cortex-x1's is 14.
I'm guessing that recent CPUs should not use the CycloneModel but define a new SchedMachineModel?
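For context on what the EarlyIfConversion pass being discussed actually does: it replaces a short branch with a conditional select (a csel on AArch64), so both sides execute unconditionally and the result is chosen by a data dependency. A source-level sketch of the two shapes (function names are invented for illustration; whether LLVM actually if-converts a given function depends on its heuristics):

```rust
// Branchy form: only one side is computed, but a mispredicted branch
// costs a pipeline flush (16 cycles on apple-m1, per the comment above).
#[inline(never)]
fn branchy(c: bool, a: u32, b: u32) -> u32 {
    if c { a + 1 } else { b * 2 }
}

// If-converted form: both sides are computed unconditionally, then a
// data-dependent select picks the result. Same semantics, no branch.
#[inline(never)]
fn if_converted(c: bool, a: u32, b: u32) -> u32 {
    let t = a + 1;
    let f = b * 2;
    if c { t } else { f } // this pattern typically lowers to a select/csel
}

fn main() {
    assert_eq!(branchy(true, 3, 5), if_converted(true, 3, 5));
    assert_eq!(branchy(false, 3, 5), if_converted(false, 3, 5));
    println!("ok");
}
```

The trade-off is exactly what the heuristic at EarlyIfConversion.cpp weighs: paying for both sides every time versus occasionally paying the misprediction penalty, which is why the modelled penalty value matters so much here.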
Thank you very much @DianQK! I suppose we have to wait a little bit for the LLVM team to focus on this issue.
FWIW (and I don't have time to provide any examples): when we switched from 1.79 to 1.81 we started getting OOM errors when compiling code with the newer version, in local Linux Docker containers, in Drone EC2 builds, and on Lambdas.
So for now we've decided to just pin to 1.79.
May or may not be related to this issue.
Hi,
I've built my Rust project with Rust 1.79 and am now testing with a Rust 1.81 nightly. I observe slower runtimes.
This project is about parsing text files with polars. Cargo.toml is the same in both contexts.
With Rust 1.79
With Rust 1.81
It's an issue just to keep track of this observation: is this feedback something you've encountered before? Maybe it's just normal as 1.81 is a beta.
If not, I can make a reproducible example I think.
Thanks