rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
98.03k stars 12.68k forks source link

coreos-installer test segfaults on s390x-unknown-linux-gnu #77382

Closed cuviper closed 4 years ago

cuviper commented 4 years ago

xref: https://bugzilla.redhat.com/show_bug.cgi?id=1883457

In the Fedora build of coreos-installer 0.7.0, one of the tests crashed with SIGSEGV on s390x, while it passed on every other architecture. The reporter in Red Hat bugzilla found that it happens equally with system LLVM 11 or 10. The crate's Cargo.toml also has lto = true, but they didn't see any improvement after turning that off.

The distro builds run tests in release mode (to avoid recompiling from the main build) with flags in .cargo/config:

[build]
rustc = "/usr/bin/rustc"
rustdoc = "/usr/bin/rustdoc"
rustflags = ["-Copt-level=3", "-Cdebuginfo=2", "-Clink-arg=-Wl,-z,relro,-z,now", "-Ccodegen-units=1", "--cap-lints=warn", ]

The tests pass if we don't set -Ccodegen-units=1, which suggests to me that some relevant (mis)optimization is only triggered with the full unit.

The tests are also fine with Fedora's rust-1.45.2, only crashing on rust-1.46.0. I also reproduced the crash on rustup's stable 1.46.0, and I used cargo bisect-rustc to narrow down to nightly-2020-07-05 passing, while nightly-2020-07-06 crashes. That's 0cd7ff7ddfb75a38dca81ad3e76b1e984129e939...2753fab7ce3647033146b07c8b6c9f4856a910b0, which is just #73879 (cc @ecstatic-morse). I don't know if codegen-units would affect that, or if it's just triggering something new in LLVM.

Here's the reporter's backtrace:

#0  0x000002aa0c3616c4 in core::ptr::drop_in_place () at /builddir/build/BUILD/rustc-1.46.0-src/src/libcore/ptr/mod.rs:184
#1  core::ptr::drop_in_place () at /builddir/build/BUILD/rustc-1.46.0-src/src/libcore/ptr/mod.rs:184
#2  core::ptr::drop_in_place () at /builddir/build/BUILD/rustc-1.46.0-src/src/libcore/ptr/mod.rs:184
#3  core::ptr::drop_in_place () at /builddir/build/BUILD/rustc-1.46.0-src/src/libcore/ptr/mod.rs:184
#4  core::ptr::drop_in_place () at /builddir/build/BUILD/rustc-1.46.0-src/src/libcore/ptr/mod.rs:184
#5  <libcoreinst::source::FileLocation as libcoreinst::source::ImageLocation>::sources (self=0x3ffcc3fc4f8) at src/source.rs:135
#6  libcoreinst::download::tests::test_write_image_limit () at src/download.rs:489
  1. https://github.com/coreos/coreos-installer/blob/79b59763bd17afba9ac6b13dfdfef401e2d9bde9/src/source.rs#L135
  2. https://github.com/coreos/coreos-installer/blob/79b59763bd17afba9ac6b13dfdfef401e2d9bde9/src/download.rs#L489
ecstatic-morse commented 4 years ago

cc #74551, another miscompilation on a tier 2 architecture that appeared after a similar optimization. If the root cause is the same, we should have a better chance of diagnosing it now, since it occurs in a crate's test case instead of while compiling Firefox. cc @pnkfelix as well.

apiraino commented 4 years ago

Assigning P-critical as discussed as part of the Prioritization Working Group procedure and removing I-prioritize.

cuviper commented 4 years ago

Is -Zmir-opt-level=0 supposed to avoid #73879's changes? That was mentioned in zulip #t-compiler/meetings, but I still get the crash with that option on nightly-2020-07-06.

ecstatic-morse commented 4 years ago

No. That person was mistaken. Drop elaboration runs unconditionally as part of "post_borrowck_cleanup".

cuviper commented 4 years ago

Updating for cross-reference: I tried -Zprecise-enum-drop-elaboration=no in https://github.com/rust-lang/rust/pull/77423#issuecomment-702900803, but it didn't help.

ecstatic-morse commented 4 years ago

As a sanity check, was there any difference in the generated MIR for <libcoreinst::source::FileLocation as libcoreinst::source::ImageLocation>::sources? I tested the flag locally and it worked, but we should make sure. If there's no trivial error with the flag, I'm not sure what's going on.

I guess we could with the more precise version of MaybeInitializedPlaces but the less precise version of MaybeUninitializedPlaces (that was the status quo before #73879 ). I would be surprised if we had a miscompilation when using the precise version of both (master) or the imprecise version of both (master with -Zprecise-enum-drop-elaboration or a release prior to #68528) but not with one precise and one imprecise (a release between #68528 and #73879).

cuviper commented 4 years ago

Here are the test's target/release/deps/libcoreinst* after --emit=mir,llvm-ir -Csave-temps:

toolchain download
:heavy_check_mark: nightly-2020-07-05 target-2020-07-05.tar.gz
:x: nightly-2020-07-06 target-2020-07-06.tar.gz
:x: nightly-2020-10-01 target-2020-10-01.tar.gz
:x: #77423 -Zprecise-enum-drop-elaboration=no target-pr77423.tar.gz

They were compiled on Fedora Rawhide (34), in case anyone wants to find libraries to actually run the test executables.

cuviper commented 4 years ago

Here's std::io::error::Repr for reference:

enum Repr {
    Os(i32),
    Simple(ErrorKind),
    Custom(Box<Custom>),
}

From <libcoreinst::source::FileLocation as libcoreinst::source::ImageLocation>::sources:

            Err(err) => {
                eprintln!("Couldn't read signature file: {}", err);
                None
            }

The error we get during the test case is Repr::Os(2), ENOENT, "No such file or directory", which prints fine.

With nightly-2020-07-05, the assembly after printing is:

+3102>       brasl   %r14,0x2aa0034d340 <std::io::stdio::_eprint>
+3108>       cli     504(%r15),2
+3112>       mvghi   704(%r15),0
+3118>       jl      0x2aa000cd5b8 <libcoreinst::download::tests::test_write_image_limit+3480>
+3122>       lg      %r13,512(%r15)
+3128>       lg      %r1,8(%r13)
+3134>       lg      %r2,0(%r13)

I believe that cli 2 is comparing the Repr tag, where only Custom needs dropping. When I step through, that jl branch is taken to skip ahead. Otherwise, you can see that it would load 512(%r15) (offset 8 from before) for the Box into %r13, then proceed to dereference that pointer.

With nightly-2020-07-06, the assembly after printing is:

+4514>       brasl   %r14,0x2aa00349d40 <std::io::stdio::_eprint>
+4520>       cli     552(%r15),2
+4524>       llilh   %r1,16
+4528>       agr     %r1,%r15
+4532>       mvghi   768(%r1),0
+4538>       jl      0x2aa000cc2ce <libcoreinst::download::tests::test_write_image_limit+4718>
+4542>       lg      %r13,560(%r15)
+4548>       lg      %r1,8(%r13)
+4554>       lg      %r2,0(%r13)

Looks similar with cli 2, but I don't really speak SystemZ to guess what llilh and agr are doing. But in the debugger I see that this jl branch is not taken, and then it loads a garbage pointer into %r13 (always seems to be 0x21). The load of 8(%r13) is where we SIGSEGV.

cuviper commented 4 years ago

Found the instruction reference...

So it seems to be clobbering the CLI condition.

I'm guessing the revised enum drops just exposed an existing codegen bug.

ecstatic-morse commented 4 years ago

Nice!

Since I didn't want you to have all the fun, here is the LLVM IR for the relevant section. The first basic block is the comparison with the enum discriminant of io::Error and the second is the first piece of drop glue. The structure looks to be the exact same, but there are additional !noalias annotations in the one with the miscompilation. I think these are correct in a vacuum, but I don't have enough LLVM knowledge to say for sure. I certainly don't have enough to hypothesize why they caused the instruction scheduler to forget all about the need to preserve flags/condition codes.

2020-07-05:

bb60.i:                                           ; preds = %bb59.i
  call void @llvm.lifetime.end.p0i8(i64 48, i8* nonnull %456), !dbg !99333, !noalias !98802
  call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %383), !dbg !99333, !noalias !98802
  %464 = bitcast %"core::option::Option<alloc::vec::Vec<u8>>"* %signature.i to {}**, !dbg !99342
  store {}* null, {}** %464, align 8, !dbg !99342, !noalias !98802
  %465 = load i8, i8* %453, align 8, !dbg !99343, !range !15343, !noalias !98802
  %switch.i.i254.i = icmp ult i8 %465, 2, !dbg !99343
  br i1 %switch.i.i254.i, label %bb61.i, label %bb2.i.i.i235, !dbg !99343

bb2.i.i.i235:                                     ; preds = %bb60.i
  %466 = getelementptr inbounds %"std::io::error::Error", %"std::io::error::Error"* %err4.i, i64 0, i32 1, i32 2, i64 7, !dbg !99343
  %467 = bitcast i8* %466 to %"std::io::error::Custom"**, !dbg !99343
  %468 = load %"std::io::error::Custom"*, %"std::io::error::Custom"** %467, align 8, !dbg !99346, !noalias !98802, !nonnull !6
  %469 = bitcast %"std::io::error::Custom"* %468 to {}**, !dbg !99348
  %470 = load {}*, {}** %469, align 8, !dbg !99348, !nonnull !6
  %471 = getelementptr inbounds %"std::io::error::Custom", %"std::io::error::Custom"* %468, i64 0, i32 1, i32 1, !dbg !99348
  %472 = bitcast [3 x i64]** %471 to void ({}*)***, !dbg !99348
  %473 = load void ({}*)**, void ({}*)*** %472, align 8, !dbg !99348, !nonnull !6
  %474 = load void ({}*)*, void ({}*)** %473, align 8, !dbg !99348, !invariant.load !6, !nonnull !6
  invoke void %474({}* nonnull %470)
          to label %bb3.i.i.i.i.i255.i unwind label %cleanup.i.i.i.i.i.i238, !dbg !99348

2020-07-06:

bb63.i:                                           ; preds = %bb62.i
  call void @llvm.lifetime.end.p0i8(i64 48, i8* nonnull %413), !dbg !100148, !noalias !99592
  call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %335), !dbg !100148, !noalias !99592
  %421 = bitcast %"core::option::Option<alloc::vec::Vec<u8>>"* %signature.i to {}**, !dbg !100157
  store {}* null, {}** %421, align 8, !dbg !100157, !noalias !99592
  %422 = load i8, i8* %410, align 8, !dbg !100158, !range !19089, !noalias !99592
  %switch.i.i283.i = icmp ult i8 %422, 2, !dbg !100158
  br i1 %switch.i.i283.i, label %bb64.i, label %bb2.i.i.i, !dbg !100158

bb2.i.i.i:                                        ; preds = %bb63.i
  %423 = getelementptr inbounds %"std::io::error::Error", %"std::io::error::Error"* %err4.i, i64 0, i32 1, i32 2, i64 7, !dbg !100158
  %424 = bitcast i8* %423 to %"std::io::error::Custom"**, !dbg !100158
  %425 = load %"std::io::error::Custom"*, %"std::io::error::Custom"** %424, align 8, !dbg !100161, !noalias !99592, !nonnull !6
  %426 = bitcast %"std::io::error::Custom"* %425 to {}**, !dbg !100163
  %427 = load {}*, {}** %426, align 8, !dbg !100163, !noalias !99730, !nonnull !6
  %428 = getelementptr inbounds %"std::io::error::Custom", %"std::io::error::Custom"* %425, i64 0, i32 1, i32 1, !dbg !100163
  %429 = bitcast [3 x i64]** %428 to void ({}*)***, !dbg !100163
  %430 = load void ({}*)**, void ({}*)*** %429, align 8, !dbg !100163, !noalias !99730, !nonnull !6
  %431 = load void ({}*)*, void ({}*)** %430, align 8, !dbg !100163, !invariant.load !6, !noalias !99730, !nonnull !6
  invoke void %431({}* nonnull %427)
          to label %bb3.i.i.i.i.i285.i unwind label %cleanup.i.i.i.i.i287.i, !dbg !100163, !noalias !99730
cuviper commented 4 years ago

there are additional !noalias annotations in the one with the miscompilation.

Box gets special treatment -- PointerKind::UniqueOwned -- did something about your patch affect that visibility?

I'm guessing, maybe before it was dropped indirectly through &mut (Shared pending #54878), but with the more precise drops it may be dropped by value? (UniqueOwned does get noalias.) Or if it used a raw *mut before (like drop_in_place), that doesn't get any PointerKind annotations. Of course the Drop trait uses &mut, but I don't know what it looks like when an implicit MIR drop is elaborated.

ecstatic-morse commented 4 years ago

In the MIR for the "precise" version, we move the error variant into a local, then drop the local.

_63 = move ((_41 as Err).0)
...
drop(_63)

In the MIR for the imprecise version, we check the drop flag for the Err variant, then drop that projection.

drop(((_41 as Err).0))

As you alluded to, perhaps there's some difference in how we were emitting LLVM IR for drops of locals vs drops of projections? I'm not an expert on codegen. However, given that this miscompilation is present on the latest master even without "precise" drop elaboration, we must have eliminated this difference at some point. I'll check the LLVM IR for the latest nightly to be sure. Indeed, the noalias declarations persist even with precise drop elaboration disabled on a recent master:

#77423 -Zprecise-enum-drop-elaboration=no ```llvm bb63.i: ; preds = %bb62.i call void @llvm.lifetime.end.p0i8(i64 48, i8* nonnull %460), !dbg !104817, !noalias !104241 call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %353), !dbg !104817, !noalias !104241 %468 = bitcast %"std::option::Option>"* %signature.i to {}**, !dbg !104826 store {}* null, {}** %468, align 8, !dbg !104826, !noalias !104241 %469 = load i8, i8* %457, align 8, !dbg !104827, !range !3262, !noalias !104241 %switch.i.i.i = icmp ult i8 %469, 2, !dbg !104827 br i1 %switch.i.i.i, label %bb107.i, label %bb2.i.i283.i, !dbg !104827 bb2.i.i283.i: ; preds = %bb63.i %470 = getelementptr inbounds %"std::io::Error", %"std::io::Error"* %err4.i, i64 0, i32 1, i32 2, i64 7, !dbg !104827 %471 = bitcast i8* %470 to %"std::io::error::Custom"**, !dbg !104827 %472 = load %"std::io::error::Custom"*, %"std::io::error::Custom"** %471, align 8, !dbg !104669, !noalias !104241, !nonnull !6 %473 = bitcast %"std::io::error::Custom"* %472 to {}**, !dbg !104663 %474 = load {}*, {}** %473, align 8, !dbg !104663, !noalias !104430, !nonnull !6 %475 = getelementptr inbounds %"std::io::error::Custom", %"std::io::error::Custom"* %472, i64 0, i32 1, i32 1, !dbg !104663 %476 = bitcast [3 x i64]** %475 to void ({}*)***, !dbg !104663 %477 = load void ({}*)**, void ({}*)*** %476, align 8, !dbg !104663, !noalias !104430, !nonnull !6 %478 = load void ({}*)*, void ({}*)** %477, align 8, !dbg !104663, !invariant.load !6, !noalias !104430, !nonnull !6 invoke void %478({}* nonnull %474) to label %bb3.i.i.i.i.i284.i unwind label %bb44.i, !dbg !104663, !noalias !104430 ```

One thing occurs to me now that didn't last night: These lines seem to be checking the vtable for the boxed trait object in io::Error::Custom:

 %429 = bitcast [3 x i64]** %428 to void ({}*)***, !dbg !100163
 %430 = load void ({}*)**, void ({}*)*** %429, align 8, !dbg !100163, !noalias !99730, !nonnull !6
 %431 = load void ({}*)*, void ({}*)** %430, align 8, !dbg !100163, !invariant.load !6, !noalias !99730, !nonnull !6

However, trait object vtables can definitely alias globally. I don't see how this kind of error would cause the miscompilation we're seeing, but as I said I don't have much LLVM expertise. Regardless, it seems like something we should fix?

cuviper commented 4 years ago

However, trait object vtables can definitely alias globally.

The actual vtable should be shared read-only, which is fine for LLVM noalias.

cuviper commented 4 years ago

Let's see what LLVM devs say: https://bugs.llvm.org/show_bug.cgi?id=47736

Aaron1011 commented 4 years ago

It looks like an LLVM patch was accepted: https://reviews.llvm.org/D89034

cuviper commented 4 years ago

I'll backport that patch after the LLVM 11.0.0 final rebase lands -- #77948. (I thought about doing that together, but figured it would be better to separate concerns.)