rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
99.08k stars 12.79k forks source link

`thread_local!` initialization code panics on some `aarch64-apple-darwin` envs, works on others #132704

Open dj8yfo opened 3 weeks ago

dj8yfo commented 3 weeks ago

I tried this code:

https://github.com/near/near-sdk-rs/blob/master/near-sdk/src/environment/mock/mod.rs#L11-L16

thread_local! {
    static BLOCKCHAIN_INTERFACE: RefCell<MockedBlockchain>
         = RefCell::new(MockedBlockchain::default());
}

I expected to see this happen:

thread_local! initialization code for RefCell::new(MockedBlockchain::new) works

Instead, this happened:

on unclarified aarch64-apple-darwin environment(s) thread_local! panics on https://doc.rust-lang.org/src/core/ptr/mod.rs.html#1277 .

The issue doesn't reproduce on other aarch64-apple-darwin real macs, and on macos-14-arm64 github actions vm.

Original issue: https://github.com/near/near-sdk-rs/issues/1252

Meta

rustc --version --verbose:

stable: 1.80, 1.81, 1.82 panics
beta 1.83 panics
nightly 1.84 panics

backtrace from original issue: https://github.com/near/near-sdk-rs/issues/1252#issuecomment-2454692564

dj8yfo commented 3 weeks ago

Somewhat limited pre-analysis (can be safely ignored)

ub_checks::assert_unsafe_precondition! check inside of https://github.com/rust-lang/rust/blob/master/library/std/src/sys/thread_local/native/lazy.rs#L66 fails, because earlier LazyStorage::new failed to put State::Initial into an adress with correct alignment for State<T,D> https://github.com/rust-lang/rust/blob/master/library/std/src/sys/thread_local/native/lazy.rs#L37, where T is RefCell<MockedBlockchain<Memory = MockedMemory>>

related case

earlier we had a similar issue with aarch64-apple-darwin with somewhat similar panics:

0x102f569ec - core::ptr::replace::he9e2c42f9fe96d9b
                  at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ub_checks.rs:73:17
0x102f569ec - core::ptr::mut_ptr::<impl *mut T>::replace::hcc3425bab485c184
                  at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ptr/mut_ptr.rs:1560:18
0x102f569ec - std::sys::thread_local::lazy::LazyKeyInner<T>::take::h11414c1b64d26b2a
                  at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys/thread_local/mod.rs:102:30

which originated in some register_dtor functions when trying to run doc-tests for the same near-sdk-rs project : https://github.com/rust-lang/rust/issues/126643

It was later confirmed that the issue stopped to manifest itself on exactly the same rustc 1.79.0 (129f3b996 2024-06-10) compiler after macos-14-arm64 github actions vm was updated.

madsmtm commented 2 weeks ago

I tried going through the linked issues, but I couldn't figure out how you reproduce the issue (what commands do you run?).

A hypothetical minimal reproducer based on the code you posted doesn't reproduce on my Mac (M2, macOS 14.7):

use std::cell::RefCell;

#[repr(align(16))]
#[derive(Debug)]
struct MockedBlockchain {
    _size: [u8; 2512],
}

thread_local! {
    static BLOCKCHAIN_INTERFACE: RefCell<MockedBlockchain> = RefCell::new(MockedBlockchain {
        _size: [1; 2512],
    });
}

fn main() {
    BLOCKCHAIN_INTERFACE.with(|b| println!("{b:?}"));
}

So it sounds like it might be something else that is causing the issue?

madsmtm commented 2 weeks ago

on unclarified aarch64-apple-darwin environment(s) thread_local! panics on https://doc.rust-lang.org/src/core/ptr/mod.rs.html#1277 .

What do you mean by "unclarified"? Knowing that might help us to reproduce the issue?

dj8yfo commented 2 weeks ago

@madsmtm you might ask @peitalin from the linked issue about specifics of his environment, where he reproduced it. We didn't reproduce it on our macs either.

The command sequence to run the same scenario which triggered panic would be

# install cargo-near to $CARGO_HOME/bin
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/near/cargo-near/releases/latest/download/cargo-near-installer.sh | sh
cargo near new test
cd test/
cargo test

If the binary installer doesn't work for some reason, it can be installed from source by cargo install cargo-near

dj8yfo commented 2 weeks ago

@madsmtm the code you tried is much simpler than the one from near-sdk repo. One from near-sdk repo uses generic parameter and also some unsafe to create references to self (without ouroborous). The self-referential part has been there for some odd 3 years, while the generic parameter has been added recently, a couple of months ago.

madsmtm commented 2 weeks ago

@peitalin:

madsmtm commented 2 weeks ago

Linking possibly related https://rust-lang.zulipchat.com/#narrow/channel/122651-general/topic/Weird.20crash.20using.20thread.20locals.20Apple.20Silicon.20debug / https://github.com/zama-ai/tfhe-rs/issues/1687.

So I'll also ask:

peitalin commented 2 weeks ago

@madsmtm I've just tried your example with RUSTFLAGS="-Clinker=/usr/bin/ld" cargo run and reproduced the errors.

My system specs are:

Xcode version: Version 15.2 (15C500b)

clang++ -v:

Apple clang version 15.0.0 (clang-1500.0.40.1)
Target: arm64-apple-darwin22.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

$ ld -v

@(#)PROGRAM:ld  PROJECT:dyld-1015.6
BUILD 20:37:42 Aug 14 2023
configured to support archs: armv6 armv7 armv7s arm64 arm64e arm64_32 i386 x86_64 x86_64h armv6m armv7k armv7m armv7em
will use ld-classic for: armv6 armv7 armv7s arm64_32 i386 armv6m armv7k armv7m armv7em
LTO support using: LLVM version 15.0.0 (static support for 29, runtime is 29)
TAPI support using: Apple TAPI version 15.0.0 (tapi-1500.0.12.3)
Library search paths:
Framework search paths:

sw_vers

ProductName:        macOS
ProductVersion:     13.5.1
BuildVersion:       22G90

I've added a full backtrace of your code snippet and system specs here: https://github.com/peitalin/madsmtm-demo

madsmtm commented 2 weeks ago

Nice! I reproduced it now (both in a VM on macOS 13.5, and on macOS 14.7), it's a problem with the linker in Xcode 15.0 and Xcode 15.1, which unfortunately happens to be the latest supported on macOS 13.5.

Snippet for reproducing

Copy [this](https://github.com/rust-lang/rust/issues/132704#issuecomment-2466877452) into `foo.rs`. ```sh # Command Line Tools for Xcode 15.0/15.1 # The latest Command Line Tools installation overwrites this path rustc foo.rs -Clinker=/Library/Developer/CommandLineTools/usr/bin/ld && ./foo # Xcode 15.0 rustc foo.rs -Clinker=/Applications/Xcode\ 15.0.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld && ./foo # Xcode 15.1 rustc foo.rs -Clinker=/Applications/Xcode\ 15.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld && ./foo ```

Things do seem to work in Rust 1.78 though, I'll run a bisection, and see if I can find the offending commit (to figure out if this is indeed a regression, and if there's anything we can do about it).

RalfJung commented 2 weeks ago

Rust recently gained extra assertions in debug builds to detect Undefined Behavior caused by misaligned pointers. That is likely why this started failing. Already on older versions of Rust, this code had Undefined Behavior, and could be the cause of subtle misbehavior -- newer Rust just makes this misbehavior easier to detect.

If the linker causes some pointer to be null/misaligned, that sounds like a critical bug in that linker. So it'd be good to know exactly which versions of the linker are affected.

RalfJung commented 2 weeks ago

Cc @saethlin, your checks found another bug. :)

Nice! I reproduced it now (both in a VM on macOS 13.5, and on macOS 14.7), it's a problem with the linker in Xcode 15.0 and Xcode 15.1, which unfortunately happens to be the latest supported on macOS 13.5.

Is this issue documented/tracked somewhere public? Are there newer linker versions available? I don't know anything about macOS, I am just surprised that an OS released just one year ago wouldn't be able to run the latest version of the linker.

EDIT: Actually https://developer.apple.com/documentation/xcode-release-notes/xcode-15_2-release-notes says that XCode 15.2 should work on macOS 13.5.

madsmtm commented 2 weeks ago

It takes quite a lot of time to test these things, Xcode is large ;)

But lemme write down exactly what I've tested so far:

And in between these there's a lot of betas and release candidates.

EDIT: Actually https://developer.apple.com/documentation/xcode-release-notes/xcode-15_2-release-notes says that XCode 15.2 should work on macOS 13.5.

Yeah, I should've clarified that I only really tested the Command Line Tools (and no Command Line Tools for Xcode 15.2 exist).

madsmtm commented 2 weeks ago

Is this issue documented/tracked somewhere public?

Not that I know of? Maybe rdar://110340167, though I don't have access to that myself, just guessing based on a similar issue found by googling?

Are there newer linker versions available?

With a newer Xcode, yes. Without, no (not practically, at least).

I don't know anything about macOS, I am just surprised that an OS released just one year ago wouldn't be able to run the latest version of the linker.

Xcode generally only wants to be run on the latest macOS release. Usually it works fine in older versions, but you have to edit a configuration file and run an "unsupported" build, so most people don't.

Also linking possibly related https://github.com/rust-lang/rust/issues/90959.

madsmtm commented 2 weeks ago

You're probably right that this is / has been UB for a long time. A bisection led me to a6cdd81eff52566542cecdc1ce381dbe42cf77fb, which sounds very much like a red herring (the only thing of relevance that PR changed is a bit of panic code, which might explain why alignment would change between that and the previous commit).

EDIT: Yeah, definitely a red herring, I've later found previous commits that also exhibit the problem. So not really something that can be bisected.

I'll try to take a look at the difference in assembly between a version that succeeds and one that fails, but it's difficult, since it seems like it requires the standard library to be linked in a certain way (I couldn't trigger the issue with -Zbuild-std).

saethlin commented 2 weeks ago

Yeah this looks exactly like what I previously poked at on Zulip, it's somewhat comforting that this seems to depend on the Apple Xcode bag of tools.

which sounds very much like a red herring

FWIW, the MIR inliner used to use item hash comparisons to prevent query cycles, which meant that literally any edit to a crate would cause the inliner to make different decisions. The only thing you can do in this scenario is turn off the MIR inliner with RUSTFLAGS=-Zinline-mir=no or RUSTFLAGS=-Zmir-enable-passes=-Inline. But of course the standard library in the range you're trying to bisect through has been precompiled with the unstable MIR optimization pipeline.

So you'd need to make -Zbuild-std work to get a usable bisection. By default, it builds the standard library with the optimization settings for the rest of the build. So if you don't set --release, you get a quite different standard library build. Are you just using cargo run -Zbuild-std without --release?

madsmtm commented 2 weeks ago

FWIW, the MIR inliner used to use item hash comparisons to prevent query cycles, which meant that literally any edit to a crate would cause the inliner to make different decisions. The only thing you can do in this scenario is turn off the MIR inliner with RUSTFLAGS=-Zinline-mir=no or RUSTFLAGS=-Zmir-enable-passes=-Inline. But of course the standard library in the range you're trying to bisect through has been precompiled with the unstable MIR optimization pipeline.

That explains a lot, thanks for the tips!

Are you just using cargo run -Zbuild-std without --release?

Yup, couldn't reproduce this issue at all with --release, but perhaps that's just 'cause I didn't try hard enough.