Open jmillikin opened 5 months ago
Here are the contents of main.rs:
/*
[package]
name = "perf-regression-debug"
version = "0.1.0"
edition = "2021"

[dependencies]
rand = { version = "0.8.5", features = ["alloc", "small_rng"] }

[[bin]]
name = "main"
path = "main.rs"

$ RUSTFLAGS="-Copt-level=3" cargo +1.76.0 build --release
$ time ./target/release/main
real    0m4.182s
user    0m4.181s
sys     0m0.000s

$ RUSTFLAGS="-Copt-level=3" cargo +1.77.2 build --release
$ time ./target/release/main
real    0m7.699s
user    0m7.694s
sys     0m0.004s

$ RUSTFLAGS="-Copt-level=3" cargo +1.78.0 build --release
$ time ./target/release/main
real    0m7.634s
user    0m7.628s
sys     0m0.005s
*/
use core::hint::black_box;
use core::ptr;
use rand::distributions::{Distribution, WeightedIndex};
use rand::{Rng, SeedableRng};
const RAND_VEC_LEN: usize = 10000;
const BENCHMARK_LOOP_ITERATIONS: usize = 250000;
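// Decodes a length-prefixed u32: the high bits of buf[0] select a 1..=5 byte
// encoding, as written by encode_u32() below.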
#[inline]
#[no_mangle]
pub fn decode_u32(buf: &[u8; 5]) -> (u32, usize) {
    if buf[0] < 0x80 {
        return (buf[0] as u32, 1);
    }
    if buf[0] < 0xF0 {
        let x = u32::from_le(unsafe {
            ptr::from_ref(buf).cast::<u32>().read_unaligned()
        });
        if buf[0] < 0b11000000 {
            let decoded = (x & 0x3F) | ((x & 0xFF00) >> 2);
            return (decoded, 2);
        }
        if buf[0] < 0b11100000 {
            let decoded = (x & 0x1F) | ((x & 0xFFFF00) >> 3);
            return (decoded, 3);
        }
        let decoded = (x & 0x0F) | ((x & 0xFFFFFF00) >> 4);
        return (decoded, 4);
    }
    let decoded = u32::from_le(unsafe {
        ptr::from_ref(buf)
            .cast::<u8>()
            .add(1)
            .cast::<u32>()
            .read_unaligned()
    });
    (decoded, ((buf[0] & 0x0F) + 2) as usize)
}

#[inline(never)]
#[no_mangle]
pub fn benchmark_decode_u32(bufs: &[[u8; 5]]) {
    for buf in bufs {
        black_box(decode_u32(black_box(buf)));
    }
}

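// One representative value for each encoded length (1..=5 bytes); main() samples
// these with fixed weights to control the distribution of encoded lengths.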
const U32_MASKS: [u32; 5] = [
    u32::MAX >> 25,
    u32::MAX >> 24,
    u32::MAX >> 16,
    u32::MAX >> 8,
    u32::MAX,
];

fn main() {
    let mixed_u32 = WeightedIndex::new(&[15, 60, 15, 5, 5])
        .unwrap()
        .map(|ii| U32_MASKS[ii]);
    let mut rng = rand::rngs::SmallRng::seed_from_u64(0);
    let encoded_values: Vec<[u8; 5]> =
        std::iter::from_fn(|| Some(rng.sample(&mixed_u32)))
            .map(|value| {
                let mut buf = [0u8; 5];
                encode_u32(&mut buf, value);
                buf
            })
            .take(RAND_VEC_LEN)
            .collect();
    for _ii in 0..BENCHMARK_LOOP_ITERATIONS {
        benchmark_decode_u32(&encoded_values);
    }
}

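// Encodes x into 1..=5 bytes; the prefix bits of the first byte record the length.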
fn encode_u32(buf: &mut [u8; 5], mut x: u32) -> usize {
    if x & 0b01111111 == x {
        buf[0] = x as u8;
        return 1;
    }
    if x < 0x10000000 {
        if x < 0x00004000 {
            x <<= 2;
            buf[0] = 0b10000000 | ((x as u8) >> 2);
            buf[1] = (x >> 8) as u8;
            return 2;
        }
        if x < 0x00200000 {
            x <<= 3;
            buf[0] = 0b11000000 | ((x as u8) >> 3);
            buf[1] = (x >> 8) as u8;
            buf[2] = (x >> 16) as u8;
            return 3;
        }
        x <<= 4;
        buf[0] = 0b11100000 | ((x as u8) >> 4);
        buf[1] = (x >> 8) as u8;
        buf[2] = (x >> 16) as u8;
        buf[3] = (x >> 24) as u8;
        return 4;
    }
    let bytes = x.to_le_bytes();
    buf[0] = 0b11110011;
    buf[1] = bytes[0];
    buf[2] = bytes[1];
    buf[3] = bytes[2];
    buf[4] = bytes[3];
    5
}
@jmillikin How does the performance look on 1.78.0, the current stable?
v1.78.0 generates the same assembly for {benchmark_,}decode_u32 as v1.77.2: https://godbolt.org/z/a5hcnh9az.
wow, and that's the version that actually has the LLVM upgrade.
I tried using cargo-bisect-rustc and it pointed to https://github.com/rust-lang/rust/commit/8d76d076665f862ec9619f2de68d6d9ca1db4601, which is a fairly large (+1,088 -1,581 delta) change to how constant propagation works.
cc @cjgillot who may have some ideas on whether the new code is working as intended or not.
My suspicion is that unification of the buf[0] byte load with the wide buf[0..4]-as-u32 load is causing either branch mispredictions or an unwanted data dependency, but I don't have enough knowledge of compiler internals to speak with any confidence.
searched nightlies: from nightly-2023-12-21 to nightly-2024-02-01
regressed nightly: nightly-2023-12-31
searched commit range: https://github.com/rust-lang/rust/compare/3cdd004e55c869faa2b7b25efd3becf50346e7d6...2a3e63551fe21458637480a97b65a2d15dec8062
regressed commit: https://github.com/rust-lang/rust/commit/8d76d076665f862ec9619f2de68d6d9ca1db4601
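To make the suspected unification concrete, here is a hand-written source-level approximation of the shape I have in mind (an illustrative sketch only, not actual MIR or LLVM output; decode_u32_unified is just an illustrative name): the byte driving the length branches is re-derived from the wide unaligned load instead of staying a separate one-byte read, so every comparison gains a data dependency on the 4-byte load.

// Hypothetical sketch only: approximates the suspected unification of buf[0]
// with the low byte of the 4-byte unaligned load. Not compiler output.
pub fn decode_u32_unified(buf: &[u8; 5]) -> (u32, usize) {
    if buf[0] < 0x80 {
        return (buf[0] as u32, 1);
    }
    if buf[0] < 0xF0 {
        let x = u32::from_le(unsafe {
            core::ptr::from_ref(buf).cast::<u32>().read_unaligned()
        });
        // The branch byte now comes out of `x`, so the comparisons below
        // cannot execute until the wide load has completed.
        let prefix = (x & 0xFF) as u8;
        if prefix < 0b11000000 {
            return ((x & 0x3F) | ((x & 0xFF00) >> 2), 2);
        }
        if prefix < 0b11100000 {
            return ((x & 0x1F) | ((x & 0xFFFF00) >> 3), 3);
        }
        return ((x & 0x0F) | ((x & 0xFFFFFF00) >> 4), 4);
    }
    let decoded = u32::from_le(unsafe {
        core::ptr::from_ref(buf)
            .cast::<u8>()
            .add(1)
            .cast::<u32>()
            .read_unaligned()
    });
    (decoded, ((buf[0] & 0x0F) + 2) as usize)
}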
@jmillikin I also tried bisecting this, but had some problems with the script. If you don't mind, what's the content of your test.sh? :sweat_smile:
It is super duper hacky -- literally grepping for an instruction pattern that only shows up in the fast version.
#!/bin/sh
set -eux
RUSTFLAGS="-Copt-level=3" cargo build --release
objdump -M intel "$CARGO_TARGET_DIR/x86_64-unknown-linux-gnu/release/main" --no-addresses --no-show-raw-insn --disassemble='benchmark_decode_u32' | grep 'd,DWORD PTR'
exit $?
A more portable version might be to invoke /bin/time -f '%e' "$CARGO_TARGET_DIR/x86_64-unknown-linux-gnu/release/main" in a subshell and then check if the resulting wall-clock time measurement is too large.
Not sure if this is useful, but I found a way to make v1.76.0 emit the same code as v1.77.2:
pub fn decode_u32(buf: &[u8; 5]) -> (u32, usize) {
    let prefix_0 = buf[0];
    if prefix_0 < 0x80 {
        return (prefix_0 as u32, 1);
    }
    if prefix_0 < 0xF0 {
        let x = u32::from_le(unsafe {
            ptr::from_ref(buf).cast::<u32>().read_unaligned()
        });
        let prefix_1 = buf[0];
        // let prefix_1 = prefix_0;
        if prefix_1 < 0b11000000 {
            let decoded = (x & 0x3F) | ((x & 0xFF00) >> 2);
            return (decoded, 2);
        }
        if prefix_1 < 0b11100000 {
            let decoded = (x & 0x1F) | ((x & 0xFFFF00) >> 3);
            return (decoded, 3);
        }
        let decoded = (x & 0x0F) | ((x & 0xFFFFFF00) >> 4);
        return (decoded, 4);
    }
    let decoded = u32::from_le(unsafe {
        ptr::from_ref(buf)
            .cast::<u8>()
            .add(1)
            .cast::<u32>()
            .read_unaligned()
    });
    (decoded, ((prefix_0 & 0x0F) + 2) as usize)
}
This version is compiled to the same assembly as the original in both v1.76.0 and v1.77.2. If the statement let prefix_1 = prefix_0; is uncommented, then both compiler versions emit the slow assembly.
WG-prioritization assigning priority (Zulip discussion).
@rustbot label -I-prioritize +P-medium
Disabling GVN via -Zmir-enable-passes=-GVN produces the fast version on nightly. The MIR inliner is not relevant, in case anyone was wondering (I was).
Code
I'm working on some low-level bit-manipulation code, and discovered that v1.77 generates code that runs at about half the speed of the output of v1.76. Since the v1.77 release notes don't mention anything about an LLVM upgrade, I figure there must be something going wrong in rustc's LLVM IR generation. Compilable reproduction: see the main.rs above.
The inner loop is decode_u32(), which is sort of ugly but isn't doing anything especially complex. The exact timing varies depending on the distribution of input values. The attached reproduction uses a distribution for which the new code is about twice as slow.
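As a quick sanity check of the format, here is a minimal round-trip sketch (it assumes the encode_u32/decode_u32 definitions from the main.rs above; roundtrip_example is only an illustrative name and is not part of the reproduction):

// Encode a value that fits the 2-byte form, then decode it back.
fn roundtrip_example() {
    let mut buf = [0u8; 5];
    let encoded_len = encode_u32(&mut buf, 0x1234); // 0x1234 < 0x4000, so 2 bytes
    let (decoded, decoded_len) = decode_u32(&buf);
    assert_eq!(decoded, 0x1234);
    assert_eq!(decoded_len, encoded_len);
}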
Version it worked on
It most recently worked on: Rust v1.76.0
Version with regression
rustc --version --verbose: