rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
98.2k stars 12.7k forks source link

Performance regression between v1.76.0 and v1.77.2 #125543

Open jmillikin opened 5 months ago

jmillikin commented 5 months ago

Code

I'm working on some low-level bit-manipulation code, and discovered that v1.77 generates code that runs at about half the speed compared to the output of v1.76. Since the v1.77 release notes don't mention anything about an LLVM upgrade I figure there must be something going wrong in the rustc LLVM IR generation.

Compilable reproduction:

The inner loop is decode_u32(), which is sort of ugly but isn't doing anything especially complex.

pub fn decode_u32(buf: &[u8; 5]) -> (u32, usize) {
    if buf[0] < 0x80 {
        return (buf[0] as u32, 1);
    }
    if buf[0] < 0xF0 {
        let x = u32::from_le(unsafe {
            ptr::from_ref(buf).cast::<u32>().read_unaligned()
        });
        if buf[0] < 0b11000000 {
            let decoded = (x & 0x3F) | ((x & 0xFF00) >> 2);
            return (decoded, 2);
        }
        if buf[0] < 0b11100000 {
            let decoded = (x & 0x1F) | ((x & 0xFFFF00) >> 3);
            return (decoded, 3);
        }
        let decoded = (x & 0x0F) | ((x & 0xFFFFFF00) >> 4);
        return (decoded, 4);
    }
    let decoded = u32::from_le(unsafe {
        ptr::from_ref(buf)
            .cast::<u8>()
            .add(1)
            .cast::<u32>()
            .read_unaligned()
    });
    (decoded, ((buf[0] & 0x0F) + 2) as usize)
}

The exact timing varies depending on the distribution of input values. The attached reproduction uses a distribution for which the new code is about twice as slow.

$ RUSTFLAGS="-Copt-level=3" cargo +1.76.0 build --release
$ time ./target/release/main
real    0m4.182s
user    0m4.181s
sys     0m0.000s
$ RUSTFLAGS="-Copt-level=3" cargo +1.77.2 build --release
$ time ./target/release/main
real    0m7.699s
user    0m7.694s
sys     0m0.004s

Version it worked on

It most recently worked on: Rust v1.76.0

Version with regression

rustc --version --verbose:

$ rustc +1.77.2 --version --verbose
rustc 1.77.2 (25ef9e3d8 2024-04-09)
binary: rustc
commit-hash: 25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04
commit-date: 2024-04-09
host: x86_64-unknown-linux-gnu
release: 1.77.2
LLVM version: 17.0.6
blyxyas commented 5 months ago

Here are the contents of main.rs:

/*

[package] name = "perf-regression-debug" version = "0.1.0" edition = "2021"

[dependencies] rand = { version = "0.8.5", features = ["alloc", "small_rng"] }

[[bin]] name = "main" path = "main.rs"

$ RUSTFLAGS="-Copt-level=3" cargo +1.76.0 build --release $ time ./target/release/main real 0m4.182s user 0m4.181s sys 0m0.000s

$ RUSTFLAGS="-Copt-level=3" cargo +1.77.2 build --release $ time ./target/release/main real 0m7.699s user 0m7.694s sys 0m0.004s

$ RUSTFLAGS="-Copt-level=3" cargo +1.78.0 build --release $ time ./target/release/main real 0m7.634s user 0m7.628s sys 0m0.005s


*/

use core::hint::black_box;
use core::ptr;

use rand::distributions::{Distribution, WeightedIndex};
use rand::{Rng, SeedableRng};

const RAND_VEC_LEN: usize = 10000;
const BENCHMARK_LOOP_ITERATIONS: usize = 250000;

#[inline]
#[no_mangle]
pub fn decode_u32(buf: &[u8; 5]) -> (u32, usize) {
    if buf[0] < 0x80 {
        return (buf[0] as u32, 1);
    }
    if buf[0] < 0xF0 {
        let x = u32::from_le(unsafe {
            ptr::from_ref(buf).cast::<u32>().read_unaligned()
        });
        if buf[0] < 0b11000000 {
            let decoded = (x & 0x3F) | ((x & 0xFF00) >> 2);
            return (decoded, 2);
        }
        if buf[0] < 0b11100000 {
            let decoded = (x & 0x1F) | ((x & 0xFFFF00) >> 3);
            return (decoded, 3);
        }
        let decoded = (x & 0x0F) | ((x & 0xFFFFFF00) >> 4);
        return (decoded, 4);
    }
    let decoded = u32::from_le(unsafe {
        ptr::from_ref(buf)
            .cast::<u8>()
            .add(1)
            .cast::<u32>()
            .read_unaligned()
    });
    (decoded, ((buf[0] & 0x0F) + 2) as usize)
}

#[inline(never)]
#[no_mangle]
pub fn benchmark_decode_u32(bufs: &[[u8; 5]]) {
    for buf in bufs {
        black_box(decode_u32(black_box(buf)));
    }
}

const U32_MASKS: [u32; 5] = [
    u32::MAX >> 25,
    u32::MAX >> 24,
    u32::MAX >> 16,
    u32::MAX >> 8,
    u32::MAX,
];

fn main() {
    let mixed_u32 = WeightedIndex::new(&[15, 60, 15, 5, 5])
        .unwrap()
        .map(|ii| U32_MASKS[ii]);

    let mut rng = rand::rngs::SmallRng::seed_from_u64(0);
    let encoded_values: Vec<[u8; 5]> =
        std::iter::from_fn(|| Some(rng.sample(&mixed_u32)))
            .map(|value| {
                let mut buf = [0u8; 5];
                encode_u32(&mut buf, value);
                buf
            })
            .take(RAND_VEC_LEN)
            .collect();

    for _ii in 0..BENCHMARK_LOOP_ITERATIONS {
        benchmark_decode_u32(&encoded_values);
    }
}

fn encode_u32(buf: &mut [u8; 5], mut x: u32) -> usize {
    if x & 0b01111111 == x {
        buf[0] = x as u8;
        return 1;
    }
    if x < 0x10000000 {
        if x < 0x00004000 {
            x <<= 2;
            buf[0] = 0b10000000 | ((x as u8) >> 2);
            buf[1] = (x >> 8) as u8;
            return 2;
        }
        if x < 0x00200000 {
            x <<= 3;
            buf[0] = 0b11000000 | ((x as u8) >> 3);
            buf[1] = (x >> 8) as u8;
            buf[2] = (x >> 16) as u8;
            return 3;
        }
        x <<= 4;
        buf[0] = 0b11100000 | ((x as u8) >> 4);
        buf[1] = (x >> 8) as u8;
        buf[2] = (x >> 16) as u8;
        buf[3] = (x >> 24) as u8;
        return 4;
    }
    let bytes = x.to_le_bytes();
    buf[0] = 0b11110011;
    buf[1] = bytes[0];
    buf[2] = bytes[1];
    buf[3] = bytes[2];
    buf[4] = bytes[3];
    5
}
workingjubilee commented 5 months ago

@jmillikin How does the performance go on 1.78.0, the current stable?

jmillikin commented 5 months ago

v1.78.0 generates the same assembly for {benchmark_,}decode_u32 as v1.77.2: https://godbolt.org/z/a5hcnh9az.

workingjubilee commented 5 months ago

wow, and that's the version that actually has the LLVM upgrade.

jmillikin commented 5 months ago

I tried using cargo-bisect-rustc and it pointed to https://github.com/rust-lang/rust/commit/8d76d076665f862ec9619f2de68d6d9ca1db4601, which is a fairly large (+1,088 -1,581 delta) change to how constant propagation works.

cc @cjgillot who may have some ideas on whether the new code is working as intended or not.

My suspicion is that unification of buf[0] with the buf[0:4] as u32 is causing either branch mispredictions or an unwanted data dependency, but I don't have enough knowledge of compiler internals to speak with any confidence.


searched nightlies: from nightly-2023-12-21 to nightly-2024-02-01 regressed nightly: nightly-2023-12-31 searched commit range: https://github.com/rust-lang/rust/compare/3cdd004e55c869faa2b7b25efd3becf50346e7d6...2a3e63551fe21458637480a97b65a2d15dec8062 regressed commit: https://github.com/rust-lang/rust/commit/8d76d076665f862ec9619f2de68d6d9ca1db4601

bisected with cargo-bisect-rustc v0.6.8 Host triple: x86_64-unknown-linux-gnu Reproduce with: ```bash cargo bisect-rustc --start=1.76.0 --end=1.77.0 --script=./test.sh ```
blyxyas commented 5 months ago

@jmillikin I also tried bisecting this, but had some problems with the script. If you don't mind, what's the content of your test.sh? :sweat_smile:

jmillikin commented 5 months ago

It is super duper hacky -- literally grepping for an instruction pattern that only shows up in the fast version.

#!/bin/sh
set -eux

RUSTFLAGS="-Copt-level=3" cargo build --release
objdump -M intel "$CARGO_TARGET_DIR/x86_64-unknown-linux-gnu/release/main"  --no-addresses --no-show-raw-insn --disassemble='benchmark_decode_u32' | grep 'd,DWORD PTR'
exit $?

A more portable version might be to invoke /bin/time -f '%e' "$CARGO_TARGET_DIR/x86_64-unknown-linux-gnu/release/main" in a subshell and then check if the resulting wall-clock time measurement is too large.

jmillikin commented 5 months ago

Not sure if this is useful, but I found a way to make v1.76.0 emit the same code as v1.77.2:

pub fn decode_u32(buf: &[u8; 5]) -> (u32, usize) {
    let prefix_0 = buf[0];
    if prefix_0 < 0x80 {
        return (prefix_0 as u32, 1);
    }

    if prefix_0 < 0xF0 {
        let x = u32::from_le(unsafe {
            ptr::from_ref(buf).cast::<u32>().read_unaligned()
        });

        let prefix_1 = buf[0];
        // let prefix_1 = prefix_0;
        if prefix_1 < 0b11000000 {
            let decoded = (x & 0x3F) | ((x & 0xFF00) >> 2);
            return (decoded, 2);
        }
        if prefix_1 < 0b11100000 {
            let decoded = (x & 0x1F) | ((x & 0xFFFF00) >> 3);
            return (decoded, 3);
        }
        let decoded = (x & 0x0F) | ((x & 0xFFFFFF00) >> 4);
        return (decoded, 4);
    }
    let decoded = u32::from_le(unsafe {
        ptr::from_ref(buf)
            .cast::<u8>()
            .add(1)
            .cast::<u32>()
            .read_unaligned()
    });
    (decoded, ((prefix_0 & 0x0F) + 2) as usize)
}

This version is compiled to the same assembly as the original in both v1.76.0 and v1.77.2.

If the statement let prefix_1 = prefix_0; is uncommented, then both compiler versions emit the slow assembly.

apiraino commented 5 months ago

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-medium

saethlin commented 5 months ago

Disabling GVN via -Zmir-enable-passes=-GVN produces the fast version on nightly. The MIR inliner is not relevant in case anyone was wondering that (I was).