Closed cuviper closed 4 years ago
cc #74551, another miscompilation on a tier 2 architecture that appeared after a similar optimization. If the root cause is the same, we should have a better chance of diagnosing it now, since it occurs in a crate's test case instead of while compiling Firefox. cc @pnkfelix as well.
Assigning P-critical
as discussed as part of the Prioritization Working Group procedure and removing I-prioritize
.
Is -Zmir-opt-level=0
supposed to avoid #73879's changes? That was mentioned in zulip #t-compiler/meetings
, but I still get the crash with that option on nightly-2020-07-06.
No. That person was mistaken. Drop elaboration runs unconditionally as part of "post_borrowck_cleanup
".
Updating for cross-reference: I tried -Zprecise-enum-drop-elaboration=no
in https://github.com/rust-lang/rust/pull/77423#issuecomment-702900803, but it didn't help.
As a sanity check, was there any difference in the generated MIR for <libcoreinst::source::FileLocation as libcoreinst::source::ImageLocation>::sources
? I tested the flag locally and it worked, but we should make sure. If there's no trivial error with the flag, I'm not sure what's going on.
I guess we could with the more precise version of MaybeInitializedPlaces
but the less precise version of MaybeUninitializedPlaces
(that was the status quo before #73879 ). I would be surprised if we had a miscompilation when using the precise version of both (master
) or the imprecise version of both (master
with -Zprecise-enum-drop-elaboration
or a release prior to #68528) but not with one precise and one imprecise (a release between #68528 and #73879).
Here are the test's target/release/deps/libcoreinst*
after --emit=mir,llvm-ir -Csave-temps
:
toolchain | download | |
---|---|---|
:heavy_check_mark: | nightly-2020-07-05 | target-2020-07-05.tar.gz |
:x: | nightly-2020-07-06 | target-2020-07-06.tar.gz |
:x: | nightly-2020-10-01 | target-2020-10-01.tar.gz |
:x: | #77423 -Zprecise-enum-drop-elaboration=no |
target-pr77423.tar.gz |
They were compiled on Fedora Rawhide (34), in case anyone wants to find libraries to actually run the test executables.
Here's std::io::error::Repr
for reference:
enum Repr {
Os(i32),
Simple(ErrorKind),
Custom(Box<Custom>),
}
From <libcoreinst::source::FileLocation as libcoreinst::source::ImageLocation>::sources
:
Err(err) => {
eprintln!("Couldn't read signature file: {}", err);
None
}
The error we get during the test case is Repr::Os(2)
, ENOENT
, "No such file or directory", which prints fine.
With nightly-2020-07-05
, the assembly after printing is:
+3102> brasl %r14,0x2aa0034d340 <std::io::stdio::_eprint>
+3108> cli 504(%r15),2
+3112> mvghi 704(%r15),0
+3118> jl 0x2aa000cd5b8 <libcoreinst::download::tests::test_write_image_limit+3480>
+3122> lg %r13,512(%r15)
+3128> lg %r1,8(%r13)
+3134> lg %r2,0(%r13)
I believe that cli
2 is comparing the Repr
tag, where only Custom
needs dropping. When I step through, that jl
branch is taken to skip ahead. Otherwise, you can see that it would load 512(%r15)
(offset 8 from before) for the Box
into %r13
, then proceed to dereference that pointer.
With nightly-2020-07-06
, the assembly after printing is:
+4514> brasl %r14,0x2aa00349d40 <std::io::stdio::_eprint>
+4520> cli 552(%r15),2
+4524> llilh %r1,16
+4528> agr %r1,%r15
+4532> mvghi 768(%r1),0
+4538> jl 0x2aa000cc2ce <libcoreinst::download::tests::test_write_image_limit+4718>
+4542> lg %r13,560(%r15)
+4548> lg %r1,8(%r13)
+4554> lg %r2,0(%r13)
Looks similar with cli
2, but I don't really speak SystemZ to guess what llilh
and agr
are doing. But in the debugger I see that this jl
branch is not taken, and then it loads a garbage pointer into %r13
(always seems to be 0x21
). The load of 8(%r13)
is where we SIGSEGV
.
Found the instruction reference...
LLILH
is Load Logical Immediate (low high)AGR
is Add (64), Condition code setSo it seems to be clobbering the CLI
condition.
I'm guessing the revised enum drops just exposed an existing codegen bug.
Nice!
Since I didn't want you to have all the fun, here is the LLVM IR for the relevant section. The first basic block is the comparison with the enum discriminant of io::Error
and the second is the first piece of drop glue. The structure looks to be the exact same, but there are additional !noalias
annotations in the one with the miscompilation. I think these are correct in a vacuum, but I don't have enough LLVM knowledge to say for sure. I certainly don't have enough to hypothesize why they caused the instruction scheduler to forget all about the need to preserve flags/condition codes.
2020-07-05:
bb60.i: ; preds = %bb59.i
call void @llvm.lifetime.end.p0i8(i64 48, i8* nonnull %456), !dbg !99333, !noalias !98802
call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %383), !dbg !99333, !noalias !98802
%464 = bitcast %"core::option::Option<alloc::vec::Vec<u8>>"* %signature.i to {}**, !dbg !99342
store {}* null, {}** %464, align 8, !dbg !99342, !noalias !98802
%465 = load i8, i8* %453, align 8, !dbg !99343, !range !15343, !noalias !98802
%switch.i.i254.i = icmp ult i8 %465, 2, !dbg !99343
br i1 %switch.i.i254.i, label %bb61.i, label %bb2.i.i.i235, !dbg !99343
bb2.i.i.i235: ; preds = %bb60.i
%466 = getelementptr inbounds %"std::io::error::Error", %"std::io::error::Error"* %err4.i, i64 0, i32 1, i32 2, i64 7, !dbg !99343
%467 = bitcast i8* %466 to %"std::io::error::Custom"**, !dbg !99343
%468 = load %"std::io::error::Custom"*, %"std::io::error::Custom"** %467, align 8, !dbg !99346, !noalias !98802, !nonnull !6
%469 = bitcast %"std::io::error::Custom"* %468 to {}**, !dbg !99348
%470 = load {}*, {}** %469, align 8, !dbg !99348, !nonnull !6
%471 = getelementptr inbounds %"std::io::error::Custom", %"std::io::error::Custom"* %468, i64 0, i32 1, i32 1, !dbg !99348
%472 = bitcast [3 x i64]** %471 to void ({}*)***, !dbg !99348
%473 = load void ({}*)**, void ({}*)*** %472, align 8, !dbg !99348, !nonnull !6
%474 = load void ({}*)*, void ({}*)** %473, align 8, !dbg !99348, !invariant.load !6, !nonnull !6
invoke void %474({}* nonnull %470)
to label %bb3.i.i.i.i.i255.i unwind label %cleanup.i.i.i.i.i.i238, !dbg !99348
2020-07-06:
bb63.i: ; preds = %bb62.i
call void @llvm.lifetime.end.p0i8(i64 48, i8* nonnull %413), !dbg !100148, !noalias !99592
call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %335), !dbg !100148, !noalias !99592
%421 = bitcast %"core::option::Option<alloc::vec::Vec<u8>>"* %signature.i to {}**, !dbg !100157
store {}* null, {}** %421, align 8, !dbg !100157, !noalias !99592
%422 = load i8, i8* %410, align 8, !dbg !100158, !range !19089, !noalias !99592
%switch.i.i283.i = icmp ult i8 %422, 2, !dbg !100158
br i1 %switch.i.i283.i, label %bb64.i, label %bb2.i.i.i, !dbg !100158
bb2.i.i.i: ; preds = %bb63.i
%423 = getelementptr inbounds %"std::io::error::Error", %"std::io::error::Error"* %err4.i, i64 0, i32 1, i32 2, i64 7, !dbg !100158
%424 = bitcast i8* %423 to %"std::io::error::Custom"**, !dbg !100158
%425 = load %"std::io::error::Custom"*, %"std::io::error::Custom"** %424, align 8, !dbg !100161, !noalias !99592, !nonnull !6
%426 = bitcast %"std::io::error::Custom"* %425 to {}**, !dbg !100163
%427 = load {}*, {}** %426, align 8, !dbg !100163, !noalias !99730, !nonnull !6
%428 = getelementptr inbounds %"std::io::error::Custom", %"std::io::error::Custom"* %425, i64 0, i32 1, i32 1, !dbg !100163
%429 = bitcast [3 x i64]** %428 to void ({}*)***, !dbg !100163
%430 = load void ({}*)**, void ({}*)*** %429, align 8, !dbg !100163, !noalias !99730, !nonnull !6
%431 = load void ({}*)*, void ({}*)** %430, align 8, !dbg !100163, !invariant.load !6, !noalias !99730, !nonnull !6
invoke void %431({}* nonnull %427)
to label %bb3.i.i.i.i.i285.i unwind label %cleanup.i.i.i.i.i287.i, !dbg !100163, !noalias !99730
there are additional
!noalias
annotations in the one with the miscompilation.
Box
gets special treatment -- PointerKind::UniqueOwned
-- did something about your patch affect that visibility?
I'm guessing, maybe before it was dropped indirectly through &mut
(Shared
pending #54878), but with the more precise drops it may be dropped by value? (UniqueOwned
does get noalias
.) Or if it used a raw *mut
before (like drop_in_place
), that doesn't get any PointerKind
annotations. Of course the Drop
trait uses &mut
, but I don't know what it looks like when an implicit MIR drop
is elaborated.
In the MIR for the "precise" version, we move the error variant into a local, then drop the local.
_63 = move ((_41 as Err).0)
...
drop(_63)
In the MIR for the imprecise version, we check the drop flag for the Err
variant, then drop that projection.
drop(((_41 as Err).0))
As you alluded to, perhaps there's some difference in how we were emitting LLVM IR for drops of locals vs drops of projections? I'm not an expert on codegen. However, given that this miscompilation is present on the latest master even without "precise" drop elaboration, we must have eliminated this difference at some point. I'll check the LLVM IR for the latest nightly to be sure. Indeed, the noalias
declarations persist even with precise drop elaboration disabled on a recent master
:
One thing occurs to me now that didn't last night: These lines seem to be checking the vtable for the boxed trait object in io::Error::Custom
:
%429 = bitcast [3 x i64]** %428 to void ({}*)***, !dbg !100163
%430 = load void ({}*)**, void ({}*)*** %429, align 8, !dbg !100163, !noalias !99730, !nonnull !6
%431 = load void ({}*)*, void ({}*)** %430, align 8, !dbg !100163, !invariant.load !6, !noalias !99730, !nonnull !6
However, trait object vtables can definitely alias globally. I don't see how this kind of error would cause the miscompilation we're seeing, but as I said I don't have much LLVM expertise. Regardless, it seems like something we should fix?
However, trait object vtables can definitely alias globally.
The actual vtable should be shared read-only, which is fine for LLVM noalias
.
Let's see what LLVM devs say: https://bugs.llvm.org/show_bug.cgi?id=47736
It looks like an LLVM patch was accepted: https://reviews.llvm.org/D89034
I'll backport that patch after the LLVM 11.0.0 final rebase lands -- #77948. (I thought about doing that together, but figured it would be better to separate concerns.)
xref: https://bugzilla.redhat.com/show_bug.cgi?id=1883457
In the Fedora build of coreos-installer 0.7.0, one of the tests crashed with
SIGSEGV
on s390x, while it passed on every other architecture. The reporter in Red Hat bugzilla found that it happens equally with system LLVM 11 or 10. The crate'sCargo.toml
also haslto = true
, but they didn't see any improvement after turning that off.The distro builds run tests in release mode (to avoid recompiling from the main build) with flags in
.cargo/config
:The tests pass if we don't set
-Ccodegen-units=1
, which suggests to me that some relevant (mis)optimization is only triggered with the full unit.The tests are also fine with Fedora's rust-1.45.2, only crashing on rust-1.46.0. I also reproduced the crash on rustup's stable 1.46.0, and I used
cargo bisect-rustc
to narrow down tonightly-2020-07-05
passing, whilenightly-2020-07-06
crashes. That's 0cd7ff7ddfb75a38dca81ad3e76b1e984129e939...2753fab7ce3647033146b07c8b6c9f4856a910b0, which is just #73879 (cc @ecstatic-morse). I don't know ifcodegen-units
would affect that, or if it's just triggering something new in LLVM.Here's the reporter's backtrace: