rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
97.6k stars 12.62k forks source link

Regression: `tests/codegen/issues/issue-101082.rs` fails with `-Ctarget-cpu=x86-64-v3` #131563

Open cuviper opened 1 day ago

cuviper commented 1 day ago

Code

I tried this test with -Ctarget-cpu=x86-64-v3 (which we have on by default in the upcoming RHEL 10):

tests/codegen/issues/issue-101082.rs

//@ compile-flags: -O

#![crate_type = "lib"]

#[no_mangle]
pub fn test() -> usize {
    // CHECK-LABEL: @test(
    // CHECK: ret {{i64|i32}} 165
    let values = [23, 16, 54, 3, 60, 9];
    let mut acc = 0;
    for item in values {
        acc += item;
    }
    acc
}

I expected to see this happen: FileCheck pass

Instead, this happened:

issue-101082.rs:8:12: error: CHECK: expected string not found in input

As of rustc 1.83.0-nightly (52fd99839 2024-10-10), that LLVM IR is:

; Function Attrs: nofree norecurse nosync nounwind nonlazybind memory(inaccessiblemem: write) uwtable
define noundef i64 @test() unnamed_addr #0 personality ptr @rust_eh_personality {
start:
  %iter = alloca [64 x i8], align 8
  call void @llvm.lifetime.start.p0(i64 64, ptr nonnull %iter)
  store i64 23, ptr %iter, align 8
  %_3.sroa.0.sroa.4.0.iter.sroa_idx = getelementptr inbounds i8, ptr %iter, i64 8
  store i64 16, ptr %_3.sroa.0.sroa.4.0.iter.sroa_idx, align 8
  %_3.sroa.0.sroa.5.0.iter.sroa_idx = getelementptr inbounds i8, ptr %iter, i64 16
  store i64 54, ptr %_3.sroa.0.sroa.5.0.iter.sroa_idx, align 8
  %_3.sroa.0.sroa.6.0.iter.sroa_idx = getelementptr inbounds i8, ptr %iter, i64 24
  store i64 3, ptr %_3.sroa.0.sroa.6.0.iter.sroa_idx, align 8
  %_3.sroa.0.sroa.7.0.iter.sroa_idx = getelementptr inbounds i8, ptr %iter, i64 32
  store i64 60, ptr %_3.sroa.0.sroa.7.0.iter.sroa_idx, align 8
  %_3.sroa.0.sroa.8.0.iter.sroa_idx = getelementptr inbounds i8, ptr %iter, i64 40
  store i64 9, ptr %_3.sroa.0.sroa.8.0.iter.sroa_idx, align 8
  %unmaskedload = load <4 x i64>, ptr %iter, align 8, !alias.scope !3
  %0 = getelementptr inbounds i8, ptr %iter, i64 32
  %unmaskedload10 = load <4 x i64>, ptr %0, align 8, !alias.scope !3
  %1 = add <4 x i64> %unmaskedload10, %unmaskedload
  %2 = shufflevector <4 x i64> %1, <4 x i64> %unmaskedload, <4 x i32> <i32 0, i32 1, i32 6, i32 7>
  %3 = tail call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %2)
  call void @llvm.lifetime.end.p0(i64 64, ptr nonnull %iter)
  ret i64 %3
}

Reducing to x86-64-v2 does get the expected output:

; Function Attrs: nofree norecurse nosync nounwind nonlazybind memory(inaccessiblemem: write) uwtable
define noundef i64 @test() unnamed_addr #0 personality ptr @rust_eh_personality {
start:
  ret i64 165
}

Version it worked on

It most recently worked on: Rust 1.79.0

Version with regression

rustc --version --verbose:

rustc 1.80.0 (051478957 2024-07-21)
binary: rustc
commit-hash: 051478957371ee0084a7c0913941d2a8c4757bb9
commit-date: 2024-07-21
host: x86_64-unknown-linux-gnu
release: 1.80.0
LLVM version: 18.1.7

Note that the original issue #101082 was fixed by an LLVM upgrade. That version didn't change between 1.79.0 and 1.80.0, but there were some additional cherry-picks: https://github.com/rust-lang/llvm-project/compare/rustc-1.79.0...rustc-1.80.0

However, cargo-bisect-rustc narrowed down to something else.

Bisection

searched nightlies: from nightly-2024-04-28 to nightly-2024-10-11 regressed nightly: nightly-2024-05-26 searched commit range: https://github.com/rust-lang/rust/compare/36153f1a4e3162f0a143c7b3e468ecb3beb0008e...1ba35e9bb44d416fc2ebf897855454258b650b01 regressed commit: https://github.com/rust-lang/rust/commit/48f00110d0dae38b3046a9ac05d20ea321fd6637 (#121571)

bisected with cargo-bisect-rustc v0.6.9 Host triple: x86_64-unknown-linux-gnu Reproduce with: ```bash cargo bisect-rustc --script test.sh --start 1.79.0 ``` ```sh #!/bin/sh rustc --emit=llvm-ir -O -Ctarget-cpu=x86-64-v3 -o- issue-101082.rs | FileCheck issue-101082.rs ```

@rustbot modify labels: +A-codegen +regression-from-stable-to-stable -regression-untriaged

cuviper commented 1 day ago

Part of #121571 added UB checks on indexing, and AIUI these survive to LLVM IR due to core's #![rustc_preserve_ub_checks], though we expect it to still be optimized away. I'm guessing that x86-64-v3 enables some transformation that disrupts the overall optimization, which may be buggy or just unfortunate.

Some of that was undone in #126299, but the changes in index_range.rs remain. In a quick test reverting those that are definitely superfluous, I do get back to full optimization here. I'll prepare a PR for that, but there's still probably an LLVM improvement to be had too.

saethlin commented 23 hours ago

and AIUI these survive to LLVM IR due to core's #![rustc_preserve_ub_checks]

They shouldn't survive. That attribute is supposed to get them past MIR optimizations, but when we lower to LLVM IR with debug assertions disabled there should be at most a zombie br bbX.

I added the A-LLVM label because I suspect that the desirable sequence of LLVM behavior here was relying on MIR inlining. It wouldn't be the first time, and impeding the MIR inliner is the primary thing that these assertions do when disabled.

This produces that bad IR:

RUSTFLAGS="-Ctarget-cpu=x86-64-v3 --emit=llvm-ir" cargo +nightly b --release -Zbuild-std --target=x86_64-unknown-linux-gnu

And this produces the good IR (the normal inline-mir-hint-threshold is 100):

RUSTFLAGS="-Zinline-mir-hint-threshold=1000 -Ctarget-cpu=x86-64-v3 --emit=llvm-ir" cargo +nightly b --release -Zbuild-std --target=x86_64-unknown-linux-gnu
cuviper commented 22 hours ago

They shouldn't survive. That attribute is supposed to get them past MIR optimizations, but when we lower to LLVM IR with debug assertions disabled there should be at most a zombie br bbX.

Well, it looks like a lot more here with -Cno-prepopulate-passes: https://rust.godbolt.org/z/n8reE6nzG

But yes, I would still hope that LLVM could chew through it, since it does with other CPUs. AFAICS our IR does not change from the target-cpu, apart from the expected function attributes.

saethlin commented 21 hours ago

The IR size difference for <core::ops::index_range::IndexRange as core::slice::index::SliceIndex<[T]>>::get_unchecked_mut is due to the use of the unchecked math functions in both library/core/src/slice/index.rs and library/core/src/ops/index_range.rs.

The IR for that link looks a lot better in nightly, I wonder if that's https://github.com/rust-lang/rust/pull/129283. https://github.com/rust-lang/rust/pull/126299 is also helping even if you just go to 1.81 I think.

Also the fact that there's a UB check in that IR at all is a bug. I'm looking into it.

saethlin commented 21 hours ago

Filed https://github.com/rust-lang/rust/issues/131578

saethlin commented 1 hour ago

I've locally added enough post-mono MIR optimizations to grind down the -Cno-prepopulate-passes LLVM IR for the IndexRange as SliceIndex>::get_unchecked to:

; <core::ops::index_range::IndexRange as core::slice::index::SliceIndex<[T]>>::get_unchecked_mut
; Function Attrs: inlinehint nonlazybind uwtable
define internal { ptr, i64 } @"_ZN104_$LT$core..ops..index_range..IndexRange$u20$as$u20$core..slice..index..SliceIndex$LT$$u5b$T$u5d$$GT$$GT$17get_unchecked_mut17hfdf5440029c89a21E"(i64 noundef %0, i64 noundef %1, ptr noundef %slice.0, i64 noundef %slice.1) unnamed_addr #0 {
start:
  %self = alloca [16 x i8], align 8
  store i64 %0, ptr %self, align 8
  %2 = getelementptr inbounds i8, ptr %self, i64 8
  store i64 %1, ptr %2, align 8
  %offset = load i64, ptr %self, align 8, !noundef !3
  %3 = getelementptr inbounds i8, ptr %self, i64 8
  %self1 = load i64, ptr %3, align 8, !noundef !3
  %len = sub nuw i64 %self1, %offset
  %ptr = getelementptr inbounds i64, ptr %slice.0, i64 %offset
  %4 = insertvalue { ptr, i64 } poison, ptr %ptr, 0
  %5 = insertvalue { ptr, i64 } %4, i64 %len, 1
  ret { ptr, i64 } %5
}

As far as I can tell, this is at least as good input to LLVM as we provided on 1.79, but we still have a missed optimization with -Ctarget-cpu=x86-64-v3.

I've also tried your branch. I cannot even find this function in the LLVM IR, because in your branch it always gets inlined in MIR, which is basically what I was speculating originally about LLVM relying on MIR inlining.

So I think there's a deeper problem here with LLVM on this target-cpu, and this change to the standard library is papering over the reproducer we have for it.