Closed novacrazy closed 4 years ago
Here is the verbose build output:
The errors are the same with a clean build, but I used a subsequent attempt to cut down on the log size.
A simplified test case is simply adding cargo_metadata to an empty crate.
[package]
name = "cpu-bug"
version = "0.1.0"
authors = ["novacrazy <novacrazy@gmail.com>"]
edition = "2018"
[dependencies]
cargo_metadata = "0.8.2"
[profile.release] # My release profile
opt-level = 3
lto = 'fat'
incremental = false
debug-assertions = false
codegen-units = 1
extern crate cargo_metadata;
fn main() {
    println!("Hello, world!");
}
$env:RUSTFLAGS = "-C target-cpu=znver1"
cargo run --release
codegen-units=1 seems to be partially responsible. Removing that fixes it. So it's not LTO at least.
I can also trigger this on the dev profile by changing:
[profile.dev]
opt-level = 3
codegen-units = 1
with RUSTFLAGS = "-C target-cpu=znver1"
opt-level=2 does not trigger it.
Checking in from @rust-lang/compiler triage:
This seems to be related to our LLVM upgrade. The linked issue (#63361) was blamed on LLVM bug 42935 and fixed by @nikic via an LLVM submodule update (https://github.com/rust-lang/rust/pull/63415).
cc @nikic and @nagisa -- Any thoughts on what's going on here?
Tagging as P-high for now. Not sure who to assign to.
(Should this be labeled as I-unsound?)
Is it possible to get a backtrace for the segfault? I don't have a windows system (or a zen system for that matter) to reproduce this on.
I don't know whether Windows has assertion-enabled builds, but if it does, it might be worth calling https://github.com/kennytm/rustup-toolchain-install-master with the -a argument and checking whether the toolchain it downloads triggers an assertion failure.
(Should this be labeled as I-unsound?)
Probably not. The crash is in the compiler process, not in code it output, and presumably in C++ code at that, so there's no reason to expect there's anything going wrong with any safe Rust code. While it's the scary sort of crash that sounds like it could hypothetically also result in completely bogus machine code being generated, there's no evidence of this actually happening / being possible. I mean I guess we could decide to tag stuff I-unsound merely because "something's gone really wrong in the C++ and that could have arbitrarily bad consequences", but if we do that we should also blanket tag all LLVM assertion failures as I-unsound, but we don't currently do that nor do I think it would be useful.
Could not reproduce with https://github.com/rust-lang/rust/issues/63959#issuecomment-525482889 or https://github.com/rust-lang/rust/issues/63959#issuecomment-525515756 on a Zen 2000 based system using Linux GNU and by cross compiling to Windows GNU. I'll check the native Windows GNU toolchain later.
Happens with both MSVC and GNU builds on Windows for me.
How would I go about enabling backtraces on a Windows build of rustc? EDIT: Would rustc even produce a backtrace on segfault? I know I've seen proper backtraces with ICEs, but this is different.
9b91b9c10e3c87ed333a1e34c4f46ed68f1eee06-alt (just the alt version of the last nightly I had) does not appear to respond to RUST_BACKTRACE=1
It's not a rustc panic but an LLVM segfault, so you should use gdb with the Windows GNU toolchain; no idea about MSVC.
On Windows rustc exits with 0xc0000005 and GDB only prints: No stack. There are no alternative builds for the Windows GNU toolchain, so I won't be able to do anything until I do a debug build.
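For readers unfamiliar with the exit codes quoted in this thread: on Windows, a process that dies from an unhandled exception exits with the NTSTATUS value of that exception. Below is a minimal sketch of decoding the values that matter here. The helper name describe_ntstatus is hypothetical, and the table lists only three well-known codes.

```rust
/// Hypothetical helper: map a Windows process exit code to the NTSTATUS
/// name it corresponds to. Only a few well-known values are listed.
fn describe_ntstatus(exit_code: i32) -> Option<&'static str> {
    // `ExitStatus::code()` yields an `i32`, so values such as 0xC0000005
    // show up as negative numbers; reinterpret the bits as `u32` first.
    match exit_code as u32 {
        0xC000_0005 => Some("STATUS_ACCESS_VIOLATION"),
        0xC000_0374 => Some("STATUS_HEAP_CORRUPTION"),
        0xC000_0409 => Some("STATUS_STACK_BUFFER_OVERRUN"),
        _ => None,
    }
}

fn main() {
    for code in [0xC000_0005u32, 0xC000_0374, 0] {
        match describe_ntstatus(code as i32) {
            Some(name) => println!("{:#010x}: {}", code, name),
            None => println!("{:#010x}: not a status code known to this sketch", code),
        }
    }
}
```

So the 0xc0000005 reported above is an access violation, the Windows counterpart of a segfault.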
On Windows rustc exits with 0xc0000005 and GDB only prints: No stack.. There are no alternative builds for Windows GNU toolchain so I won't be able to do anything until I do debug build.
Make sure you are running the real rustc and not the wrapper from rustup. What you’re seeing here is a typical symptom of failing to account for the wrapper.
Hmm, I could swear I could debug rustc crash on Linux without caring about the wrapper.
Anyway the stack is corrupt:
Other threads are just waiting.
I'll return with debug Rust build if I somehow manage to build LLVM in debug mode on PC with 16 GiB of RAM...
What led you to the conclusion of a corrupt stack? It looks fairly reasonable to me.
Hmm, I could swear I could debug rustc crash on Linux without caring about the wrapper.
I think the wrapper switched to exec at some point (i.e. the process gets reused, without forking).
AFAIK, that would allow debugging to continue to the real rustc.
I don't think anything like this is possible on Windows (without manually loading the executable you want to run in your address space, of course).
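The difference between the two launch strategies can be sketched as follows. This is an illustration only, not rustup's actual code: the launch function and the REAL_RUSTC variable are made up for the example, and whether rustup really uses exec is only the guess stated above.

```rust
use std::io;
use std::process::Command;

/// Hypothetical wrapper step on Unix: replace this process with the real binary.
#[cfg(unix)]
fn launch(program: &str, args: &[String]) -> io::Result<i32> {
    use std::os::unix::process::CommandExt;
    // `exec` swaps the process image in place, so a debugger attached to the
    // wrapper keeps following the same process. It returns only on failure.
    Err(Command::new(program).args(args).exec())
}

/// Hypothetical wrapper step elsewhere: run the real binary as a child.
#[cfg(not(unix))]
fn launch(program: &str, args: &[String]) -> io::Result<i32> {
    // The real binary is a separate process here, so a debugger attached to
    // the wrapper is watching the wrong process unless told to follow children.
    let status = Command::new(program).args(args).status()?;
    Ok(status.code().unwrap_or(1))
}

fn main() {
    // REAL_RUSTC is a made-up variable for this sketch; rustup does not read it.
    match std::env::var("REAL_RUSTC") {
        Ok(real) => {
            let args: Vec<String> = std::env::args().skip(1).collect();
            match launch(&real, &args) {
                Ok(code) => std::process::exit(code),
                Err(err) => {
                    eprintln!("failed to launch {}: {}", real, err);
                    std::process::exit(1);
                }
            }
        }
        Err(_) => eprintln!("set REAL_RUSTC to the binary this sketch should launch"),
    }
}
```

On Windows the practical consequence is the one described earlier in the thread: point the debugger at the real rustc binary rather than at the wrapper.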
I'll return with debug Rust build if I somehow manage to build LLVM in debug mode on PC with 16 GiB of RAM...
You don't need to, this is not in LLVM, it's in syn. You can probably reproduce with just cargo check (or rustc --emit=meta / rustc --pretty=expanded).
What led you to the conclusion of a corrupt stack? It looks fairly reasonable to me.
Process gave me exit code for stack corruption before, took quick look at the trace and it didn't make any sense to me. On second look I noticed the crash happened inside proc macro...
You don't need to, this is not in LLVM, it's in syn. You can probably reproduce with just cargo check (or rustc --emit=meta / rustc --pretty=expanded).
Yeah, it hit me later. Running cargo check on the cargo-metadata crate with the changes from https://github.com/rust-lang/rust/issues/63959#issuecomment-525515756 reliably reproduces it.
In case you find it useful here is trace from debug build and disassembly: https://gist.github.com/mati865/e93d3bf12408df00ecf47327fa196af7
Assembly for znver1 and generic get_ident is the same, so the problem is somewhere earlier. What is the best way to proceed here: a compiler with assertions, or tearing down cargo-metadata to something more handy?
@mati865 Since AFAICT the bug happens during the macro expansion in cargo-metadata, you should be able to get rid of most of it.
Not even names need to be resolvable, other than invoking serde_derive's macros.
So, for example, you can remove all dependencies of cargo-metadata, other than serde_derive, because they're not needed in the reproduction.
I'm worried this is a miscompilation of rustc/std itself, at this point.
EDIT: wait, no, it must be code compiled with -C target-cpu that's getting miscompiled, so it's all within serde_derive/syn.
Could you try to run ./x.py test --stage 1 src/test/ui with -C target-cpu=znver1 hardcoded somewhere? (presumably in src/tools/compiletest)
I've been struggling for the past 2 days to build Rust because of https://github.com/rust-lang/rust/issues/61561, so I'm unable to progress on this issue.
Any luck with this? I'm still stuck on 07e0c3651ce2a7b326f7678e135d8d5bbbbe5d18 because of it.
Odd. On my personal project, this appears mostly fixed on the latest nightly (6ef275e6c 2019-09-24), but the issue still occurs on the cargo_metadata example.
visiting for triage. It's not clear to me who on our team can investigate this; has anyone from @rust-lang/compiler managed to reproduce this locally? Sounds like the answer to that is likely "no."
Just noticed it's fully broken again on 702b45e40 2019-10-01.
triage: @mati865 are you still looking at this? should I try to identify someone else with this hardware setup who might be able to help further?
I'll try to find time to set up an environment with an older C toolchain today (to work around https://github.com/rust-lang/rust/issues/61561).
Status update:
I built current master for windows-gnu with verify-llvm-ir and assertions for both rustc and LLVM but it does not reproduce the crash.
I can still reproduce it on nightly though.
visiting for triage.
@mati865, things certainly sound fishy (and frustrating) when a local build does not replicate the problem but the CI-produced nightly artifacts do.
Would it be feasible for you to replicate something closer to what the CI setup is, e.g. via a docker image?
Finally reproduced with a local build, but the experiment proposed by @eddyb only broke 4 unrelated tests (cross compilation, simd detection...).
Tested passing either -Ctarget-cpu=znver1 or -Ctarget-cpu=znver1 -Copt-level=3 -Ccodegen-units=1.
Back to the drawing board I guess.
Also had this happen spuriously on the Windows GNU toolchain (latest nightly), but running cargo build again continued where it left off and finished just fine.
Is there anything I can do to help speed up fixing this issue? It's been nearly two months since I've been able to compile benchmarks, and it's interfering with my work.
I don't actually use cargo_metadata directly, it was just one of the many crates that was previously crashing rustc. However, on my system criterion almost always crashes rustc, or even somehow compiles but hangs during execution (#65618). It's very unpredictable, and sometimes works on tiny one-crate projects, but for my primary work project workspace, it's totally unusable. Even compiling non-criterion benchmarks crashes rustc sometimes. I can't even compile Criterion as a regular crate binary (rather than using the bench profile).
Amazingly my primary binary somehow started compiling again about a month ago, or else I'd be totally screwed, but everything else is broken on Zen 1. At this rate I'll have to roll my own benchmarking tools to continue my work.
Removed previous comment as it was just PEBCAK.
@eddyb not sure if it helps but I reduced cargo_metadata to this:
extern crate serde;
#[macro_use]
extern crate serde_derive;
#[derive(Serialize)]
pub enum A {
    ///
    B,
}
It does not reproduce directly but crashes when another crate depends on a crate with this snippet.
It appears it requires a specific combination: derive (de)serialize, enum, doc comment inside, the -Ctarget-cpu=znver1 rustc argument and Windows as the OS. Otherwise it just works.
Anybody got ideas?
You should use cargo expand to get the expanded version of that and start reducing.
I doubt you need serde, just a few traits from it that you could copy over etc.
EDIT: disregard me, that makes no sense, the crash happens in the proc macro.
@eddyb I tried cargo expand while working on latest comment. It produced this snippet but it does not crash when using it from another crate:
#![feature(prelude_import)]
#![no_std]
#[prelude_import]
use ::std::prelude::v1::*;
#[macro_use]
extern crate std;
extern crate serde;
#[macro_use]
extern crate serde_derive;
pub enum A {
    ///
    B,
}
#[allow(non_upper_case_globals, unused_attributes, unused_qualifications)]
const _IMPL_SERIALIZE_FOR_A: () = {
    #[allow(unknown_lints)]
    #[allow(rust_2018_idioms)]
    extern crate serde as _serde;
    #[automatically_derived]
    impl _serde::Serialize for A {
        fn serialize<__S>(&self, __serializer: __S) -> _serde::export::Result<__S::Ok, __S::Error>
        where
            __S: _serde::Serializer,
        {
            match *self {
                A::B => _serde::Serializer::serialize_unit_variant(__serializer, "A", 0u32, "B"),
            }
        }
    }
};
I'm available on both Discord and Zulip btw.
I encountered another odd behavior with these crashes when creating a crate specifically for custom benchmarks in my workspace.
The file structure is simply the following, with only one binary, because I just started:
src/
lib.rs
bin/
bench1.rs
Running the specific binary like cargo run bench1 --release appears to work fine right now, no rustc crashes.
However, earlier I accidentally typed cargo run --release instead of cargo run bench1 --release, and it crashed on compiling one of the other crates in the workspace. One that was already compiled and cached, at that.
Normally it would simply compile the only binary available, but instead it crashes, as if something about specifying the binary name prevents it from crashing. I thought that odd enough to mention.
Actually, I take that back. I was passing the binary name wrong.
If I instead do cargo run --bin bench1 --release, it also crashes, even with a single binary.
This makes it impossible to have more than one binary, so I guess I'm screwed again.
One binary magically works, but only if I pass its name incorrectly.
Apparently the presence of some crates can magically allow other (previously crashing) crates to successfully compile. Adding structopt unbreaks things in some cases, probably because of some chain of features reaching all the way down into syn.
Something is seriously broken in rustc, and I'm amazed this isn't getting more attention. I'm betting my personal future on Rust, and it can't even compile benchmarks on my system.
Is AMD not a supported platform anymore?
EDIT: I apologize. It's easy to become emotional with so much personal time invested in projects.
Is AMD not a supported platform anymore?
It works perfectly fine on Linux when targeting znver1.
Something is seriously broken in rustc
Something is broken in LLVM but it's very hard to narrow it down.
I'm betting my personal future on Rust, [...]
I do not recommend doing this.
@eddyb I tried cargo expand while working on latest comment. It produced this snippet but it does not crash when using it from another crate:
Oops, while suggesting cargo expand I forgot that the crash is in the proc macro itself.
So far, I think what's happening here is:
serde_derive
) is compiled with target-cpu=znver1
Ideally we should be able to get a reproduction without part 2, i.e. without the compiler being involved in executing the miscompiled code. This could be done with e.g. proc-macro2.
I realized recently that we happen to have one laptop with a Ryzen CPU in the office, if I get access to it I'll post the results of trying to reproduce/reduce this on it.
Confirmed repro on that laptop with RUSTFLAGS=-Ctarget-cpu=znver1 and:
[package]
edition = "2018"
[dependencies]
serde = "1"
serde_derive = "1"
[profile.release]
codegen-units = 1
#[derive(serde_derive::Serialize)]
enum A {
    #[allow()] X,
}
If you need something simpler than serde, I learned this morning that no_panic also triggers the miscompilation sometimes:
[package]
edition = "2018"
[dependencies]
no-panic = "0.1.11"
[profile.release]
codegen-units = 1
use no_panic::no_panic;
#[no_panic]
fn demo(s: &str) -> &str {
    &s[1..]
}
fn main() {
    println!("{}", demo("input string"));
}
Running `rustc --edition=2018 --crate-name cpu_bug src\main.rs --color always --crate-type bin --emit=dep-info,link -C opt-level=3 -C codegen-units=1 -C debuginfo=2 -C debug-assertions=on -C metadata=2a806a81570c94f1 --out-dir F:\code\projects\bugs\cpu-bug\target\debug\deps -C incremental=F:\code\projects\bugs\cpu-bug\target\debug\incremental -L dependency=F:\code\projects\bugs\cpu-bug\target\debug\deps --extern no_panic=F:\code\projects\bugs\cpu-bug\target\debug\deps\no_panic-9d376b9b716382ce.dll -C target-cpu=znver1`
error: could not compile `cpu-bug`.
Caused by:
process didn't exit successfully: `rustc --edition=2018 --crate-name cpu_bug src\main.rs --color always --crate-type bin --emit=dep-info,link -C opt-level=3 -C codegen-units=1 -C debuginfo=2 -C debug-assertions=on -C metadata=2a806a81570c94f1 --out-dir F:\code\projects\bugs\cpu-bug\target\debug\deps -C incremental=F:\code\projects\bugs\cpu-bug\target\debug\incremental -L dependency=F:\code\projects\bugs\cpu-bug\target\debug\deps --extern no_panic=F:\code\projects\bugs\cpu-bug\target\debug\deps\no_panic-9d376b9b716382ce.dll -C target-cpu=znver1` (exit code: 0xc0000374, STATUS_HEAP_CORRUPTION)
So far reduced to cargo run --release with RUSTFLAGS=-Ctarget-cpu=znver1 and:
[dependencies]
syn = "1"
[profile.release]
codegen-units = 1
use syn::parse::{Parse, Parser};
fn main() {
    syn::DeriveInput::parse.parse_str("enum A { X }").unwrap();
}
No more proc macros are involved, eliminating part 2 from https://github.com/rust-lang/rust/issues/63959#issuecomment-549774373.
However, this leaves the entirety of syn to reduce.
I'm now left with ~1000 lines of proc-macro2 and ~300 lines of syn, no dependencies other than std left, not even on the builtin proc_macro (which is good, because I'd rather not reduce that).
As you might be able to tell from types such as [u64; 26], certain types appear to only matter for their size, and there's also a dance like let x = *Box::new(x); at some point.
Would be interesting to see if I can remove all of the unsafe code from the syn side, because until then it will remain a potential suspect, however unlikely that may be.
The actual crash appears to be happening as a direct result of Cursor::bump. Somehow self.ptr is pointing at the end of an array, and incrementing past that causes the corruption. You can check with
unsafe fn bump(self) -> Cursor<'a> {
    assert_ne!(self.ptr, self.scope);
    Cursor::create(self.ptr.offset(1), self.scope)
}
which shows different output with/without codegen-units=1
EDIT: No, I changed something and that caused different results... but the following is still true
It looks as if the comparison of ptr == scope here is being optimized away:
unsafe fn create(mut ptr: *const Entry, scope: *const Entry) -> Self {
    while let Entry::End(exit) = *ptr {
        if ptr == scope {
            break;
        }
        ptr = exit;
    }
    Cursor {
        ptr,
        scope,
        marker: PhantomData,
    }
}
If you change it to std::hint::black_box(ptr) == std::hint::black_box(scope), that fixes everything.
So it could be that the creation of ptr and scope is undefined behavior.
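To make the workaround concrete, here is a self-contained sketch of the same loop with the comparison wrapped in black_box. Entry and Cursor are simplified stand-ins rather than syn's real types, the ident helper is added only for the demonstration, and a toolchain where std::hint::black_box is available is assumed. Note that the standard library documents black_box as best-effort, so this can hide a bad optimization but should not be read as a fix.

```rust
use std::hint::black_box;
use std::marker::PhantomData;

/// Simplified stand-in for the reduced token buffer entry.
enum Entry {
    Ident(&'static str),
    /// End of a group; holds the entry to continue from in the parent buffer.
    End(*const Entry),
}

struct Cursor<'a> {
    ptr: *const Entry,
    scope: *const Entry,
    marker: PhantomData<&'a Entry>,
}

impl<'a> Cursor<'a> {
    /// Safety: `ptr` and `scope` must point into live buffers, and every
    /// `End` entry reached before `scope` must hold a valid exit pointer.
    unsafe fn create(mut ptr: *const Entry, scope: *const Entry) -> Self {
        unsafe {
            while let Entry::End(exit) = *ptr {
                // Ask the optimizer to treat both pointers as opaque values.
                if black_box(ptr) == black_box(scope) {
                    break;
                }
                ptr = exit;
            }
        }
        Cursor { ptr, scope, marker: PhantomData }
    }

    /// Demonstration helper (not part of the thread's code).
    fn ident(&self) -> Option<&'a str> {
        match unsafe { &*self.ptr } {
            Entry::Ident(name) => Some(*name),
            Entry::End(_) => None,
        }
    }
}

fn main() {
    // Parent buffer: two identifiers followed by the end-of-scope marker.
    let outer = vec![
        Entry::Ident("enum"),
        Entry::Ident("A"),
        Entry::End(std::ptr::null()),
    ];
    let scope: *const Entry = &outer[2];
    // Nested buffer whose End entry exits to `A` in the parent.
    let inner = vec![Entry::Ident("X"), Entry::End(&outer[1])];

    let cursor = unsafe { Cursor::create(&inner[1], scope) };
    println!("after leaving the group: {:?}", cursor.ident());

    let at_end = unsafe { Cursor::create(scope, scope) };
    println!("stopped at scope end: {}", at_end.ptr == at_end.scope);
}
```

The `ptr == scope` check is what keeps the loop from following the null exit pointer stored in the final End entry, which is why losing that comparison leads to memory corruption.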
After lots of fiddling with it, and some false leads, the highest up I can fix it is by changing:
let value = Box::new(value);
inner.push(*value);
to
let value = Box::new(value);
inner.push(std::hint::black_box(*value));
Seems like that is the root cause of the invalid optimization, which was even stated in your last comment @eddyb . Suppose I should have tried that first.
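The change described in this comment can be isolated into a small helper. This is only a sketch: push_via_box is a hypothetical name, and [u64; 26] is borrowed from the earlier comment as a type that appears to matter only for its size. Here black_box hides the moved-out value from the optimizer, which may mask the bad transformation rather than cure it.

```rust
use std::hint::black_box;

/// Hypothetical stand-in for the reduced code path: move a value through a
/// heap allocation, then out of the box and into a Vec.
fn push_via_box<T>(inner: &mut Vec<T>, value: T) {
    let value = Box::new(value);
    // Without `black_box` this is the `*Box::new(x)` dance from the thread.
    inner.push(black_box(*value));
}

fn main() {
    let mut inner: Vec<[u64; 26]> = Vec::new();
    push_via_box(&mut inner, [7u64; 26]);
    println!("stored {} value(s), last word = {}", inner.len(), inner[0][25]);
}
```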
As of Rust 1.39.0, this is now in Stable.
FWIW I hit this on a Ryzen 3000 processor as well. I couldn't repro the code in this comment. My repro is a bit larger (sorry, I spent a couple hours minimizing as much as I could) but is essentially:
#![feature(specialization)]
use pyo3::ffi::Py_TYPE;
use pyo3::prelude::*;
use pyo3::types::{IntoPyDict, PyType};
#[pyfunction]
pub fn loads<'a>(s: PyObject, py: Python) -> PyResult<PyType> {
    unsafe {
        let p = s.as_ptr();
        let tp = Py_TYPE(p);
        PyType::from_type_ptr(py, tp)
    }
}
[package]
name = "simple"
version = "0.1.0"
authors = ["None <None>"]
edition = "2018"
[dependencies.pyo3]
version = "0.8"
features = ["extension-module"]
[profile.release]
codegen-units = 1
On a different note, given tier 1 platforms are "guaranteed to work", I am surprised this was allowed into stable given x86_64-pc-windows-msvc is a tier 1 platform. I totally understand that resources are spread thin and people are busy, but I just wanted to point this out so that if it was undesired, some thought could go into ways of avoiding release blocker bugs creeping into stable. Thank you for your awesome programming language :)
On any recent MSVC nightly, compiling with release profile with RUSTFLAGS = "-C target-cpu=native" results in either STATUS_ACCESS_VIOLATION or STATUS_HEAP_CORRUPTION depending on the crate. Many crates work, but others don't. Among those that fail some use SIMD. target-cpu=native resolves to target-cpu=znver1 on my machine.
This seems to be related to #63361 and the LLVM upgrade, again. It does not happen when target-cpu is not set.
Everything works on https://github.com/rust-lang/rust/commit/07e0c3651ce2a7b326f7678e135d8d5bbbbe5d18 but fails after https://github.com/rust-lang/rust/commit/38798c6d68394874686dfa3d03e56e12a3ff3d54, same as the aforementioned issue. I am not sure how to reproduce it in a single crate, but I will look into it.
LLVM 9 just doesn't like AMD.
Although, another issue of mine: https://github.com/CraneStation/cranelift/issues/900 also fails in a similar manner before the LLVM upgrade, so it's worth noting.