rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Segmentation fault when formatting u128 on aarch64 GNU/Linux #102196

Open prestontimmons opened 1 year ago

prestontimmons commented 1 year ago

Hello, we've noticed segmentation faults when running Rust binaries compiled on aarch64 GNU/Linux. We've seen this occur in multiple libraries that format or print SystemTime.

Architecture:

uname -a

5.10.135-122.509.amzn2.aarch64 #1 SMP Thu Aug 11 22:41:14 UTC 2022 aarch64 GNU/Linux

Reproducible example:

fn main() {
    let millis: u128 = 87329875;
    println!("{}", millis);
}

The segmentation fault occurs when fmt_u128 is called.
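As a stopgap while this is being investigated, a narrowing cast to u64 before formatting avoids the fmt_u128 path entirely. This is a workaround sketch, not a fix, and it is only valid when the value fits in 64 bits (true for millisecond timestamps for many millennia):

```rust
fn main() {
    let millis: u128 = 87329875;
    // Casting to u64 sidesteps core::fmt::num::fmt_u128, where the
    // crash occurs. Only do this when the value is known to fit.
    println!("{}", millis as u64); // prints 87329875
}
```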

I tested this on 1.62.0 and nightly:

/builds/scratch# cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.23s
     Running `target/debug/scratch`
Segmentation fault (core dumped)

# rustc --version --verbose
rustc 1.62.0 (a8314ef7d 2022-06-27)
binary: rustc
commit-hash: a8314ef7d0ec7b75c336af2c9857bfaf43002bfc
commit-date: 2022-06-27
host: aarch64-unknown-linux-gnu
release: 1.62.0
LLVM version: 14.0.5
# cargo +nightly run
    Finished dev [unoptimized + debuginfo] target(s) in 0.24s
     Running `target/debug/scratch`
Segmentation fault (core dumped)

# rustc +nightly --version --verbose
rustc 1.66.0-nightly (e7119a030 2022-09-22)
binary: rustc
commit-hash: e7119a0300b87a3d670408ee8e847c6821b3ae80
commit-date: 2022-09-22
host: aarch64-unknown-linux-gnu
release: 1.66.0-nightly
LLVM version: 15.0.0

The segmentation fault does not occur in release mode:

# cargo run --release
    Finished release [optimized] target(s) in 0.23s
     Running `target/release/scratch`
87329875

It also does not occur if opt-level is set to greater than 0:

[profile.dev]
opt-level = 1

It also does not occur on Darwin aarch64:

uname -a

Darwin TC-4000660 21.6.0 Darwin Kernel Version 21.6.0: Wed Aug 10 14:28:23 PDT 2022; root:xnu-8020.141.5~2/RELEASE_ARM64_T6000 arm64

Meta

Valgrind traceback:

Backtrace

```
# valgrind target/debug/scratch
==5157== Memcheck, a memory error detector
==5157== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==5157== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==5157== Command: target/debug/scratch
==5157==
==5157== Invalid read of size 4
==5157==    at 0x112474: alternate (mod.rs:1893)
==5157==    by 0x112474: core::fmt::Formatter::pad_integral (mod.rs:1366)
==5157==    by 0x111BBB: core::fmt::num::fmt_u128 (num.rs:641)
==5157==    by 0x112347: core::fmt::write (mod.rs:1202)
==5157==    by 0x15D5FB: write_fmt (mod.rs:1679)
==5157==    by 0x15D5FB: <&std::io::stdio::Stdout as std::io::Write>::write_fmt (stdio.rs:715)
==5157==    by 0x15E133: write_fmt (stdio.rs:689)
==5157==    by 0x15E133: print_to (stdio.rs:1017)
==5157==    by 0x15E133: std::io::stdio::_print (stdio.rs:1030)
==5157==    by 0x10CDCB: scratch::main (main.rs:3)
==5157==    by 0x10CEA3: core::ops::function::FnOnce::call_once (function.rs:251)
==5157==    by 0x11B3AB: std::sys_common::backtrace::__rust_begin_short_backtrace (backtrace.rs:122)
==5157==    by 0x17925F: std::rt::lang_start::{{closure}} (rt.rs:166)
==5157==    by 0x15B38B: call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (function.rs:286)
==5157==    by 0x15B38B: do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panicking.rs:464)
==5157==    by 0x15B38B: try + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (panicking.rs:428)
==5157==    by 0x15B38B: catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panic.rs:137)
==5157==    by 0x15B38B: {closure#2} (rt.rs:148)
==5157==    by 0x15B38B: do_call (panicking.rs:464)
==5157==    by 0x15B38B: try (panicking.rs:428)
==5157==    by 0x15B38B: catch_unwind (panic.rs:137)
==5157==    by 0x15B38B: std::rt::lang_start_internal (rt.rs:148)
==5157==    by 0x17922B: std::rt::lang_start (rt.rs:165)
==5157==    by 0x10CE07: main (in /builds/scratch/target/debug/scratch)
==5157==  Address 0x31 is not stack'd, malloc'd or (recently) free'd
==5157==
==5157==
==5157== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==5157==  Access not within mapped region at address 0x57
==5157==    at 0x112474: alternate (mod.rs:1893)
==5157==    by 0x112474: core::fmt::Formatter::pad_integral (mod.rs:1366)
==5157==    by 0x111BBB: core::fmt::num::fmt_u128 (num.rs:641)
==5157==    by 0x112347: core::fmt::write (mod.rs:1202)
==5157==    by 0x15D5FB: write_fmt (mod.rs:1679)
==5157==    by 0x15D5FB: <&std::io::stdio::Stdout as std::io::Write>::write_fmt (stdio.rs:715)
==5157==    by 0x15E133: write_fmt (stdio.rs:689)
==5157==    by 0x15E133: print_to (stdio.rs:1017)
==5157==    by 0x15E133: std::io::stdio::_print (stdio.rs:1030)
==5157==    by 0x10CDCB: scratch::main (main.rs:3)
==5157==    by 0x10CEA3: core::ops::function::FnOnce::call_once (function.rs:251)
==5157==    by 0x11B3AB: std::sys_common::backtrace::__rust_begin_short_backtrace (backtrace.rs:122)
==5157==    by 0x17925F: std::rt::lang_start::{{closure}} (rt.rs:166)
==5157==    by 0x15B38B: call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (function.rs:286)
==5157==    by 0x15B38B: do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panicking.rs:464)
==5157==    by 0x15B38B: try + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (panicking.rs:428)
==5157==    by 0x15B38B: catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panic.rs:137)
==5157==    by 0x15B38B: {closure#2} (rt.rs:148)
==5157==    by 0x15B38B: do_call (panicking.rs:464)
==5157==    by 0x15B38B: try (panicking.rs:428)
==5157==    by 0x15B38B: catch_unwind (panic.rs:137)
==5157==    by 0x15B38B: std::rt::lang_start_internal (rt.rs:148)
==5157==    by 0x17922B: std::rt::lang_start (rt.rs:165)
==5157==    by 0x10CE07: main (in /builds/scratch/target/debug/scratch)
==5157==  If you believe this happened as a result of a stack
==5157==  overflow in your program's main thread (unlikely but
==5157==  possible), you can try to increase the size of the
==5157==  main thread stack using the --main-stacksize= flag.
==5157==  The main thread stack size used in this run was 10485760.
==5157==
==5157== HEAP SUMMARY:
==5157==     in use at exit: 1,109 bytes in 4 blocks
==5157==   total heap usage: 9 allocs, 5 frees, 2,997 bytes allocated
==5157==
==5157== LEAK SUMMARY:
==5157==    definitely lost: 0 bytes in 0 blocks
==5157==    indirectly lost: 0 bytes in 0 blocks
==5157==      possibly lost: 0 bytes in 0 blocks
==5157==    still reachable: 1,109 bytes in 4 blocks
==5157==         suppressed: 0 bytes in 0 blocks
==5157== Rerun with --leak-check=full to see details of leaked memory
==5157==
==5157== For lists of detected and suppressed errors, rerun with: -s
==5157== ERROR SUMMARY: 2 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault
```

Nilstrieb commented 1 year ago

I can't reproduce this segfault inside an arm64 docker container on an x86_64 host, so this seems to require real hardware and doesn't reproduce under QEMU.

Linux 8ae19ce495c5 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

saethlin commented 1 year ago

Thus far I cannot reproduce this issue. Perhaps because I'm on

Linux alarm 5.19.8-1-aarch64-ARCH #1 SMP PREEMPT Thu Sep 8 18:20:33 MDT 2022 aarch64 GNU/Linux

Is the above output from a graviton2 instance?

thomcc commented 1 year ago

Are you using mold as your linker by any chance? Seems somewhat similar to https://github.com/rust-lang/rust/issues/101247.

prestontimmons commented 1 year ago

Thanks for looking into this.

1) I also have not been able to reproduce this on x86_64 or in an emulated docker running on x86_64.

2) Yes, it is a graviton2 instance using 5.10.135-122.509.amzn2.aarch64.

3) No, this is using the default linker. mold has not been added.

I did some more testing and found an interesting result. When using cargo run directly on the host, the segmentation fault does not occur, but I see it consistently in the Docker runner that runs on the host (this is part of our CI). The Docker image is based on rust:1.62-slim-bullseye.

I'll dig deeper and find a more specific setup that reproduces it.
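For reference, a minimal setup along these lines (hypothetical; our actual CI configuration is more involved and was not posted) would be a rust:1.62-slim-bullseye image that builds and runs the crate in debug mode:

```
# Hypothetical reproduction image, run on the Graviton2 host
FROM rust:1.62-slim-bullseye
WORKDIR /builds/scratch
COPY . .
# Debug build, since the crash only occurs at opt-level = 0
CMD ["cargo", "run"]
```

Built and run on the aarch64 host with `docker build -t scratch . && docker run scratch`.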

Dirreke commented 9 months ago

I hit a similar issue on the csky arch when using println!.

  1. A small u128 prints the wrong value.

    let a = 0_u128;
    println!("{a}"); //14082568811966739713
    let a = 1_u128;
    println!("{a}"); //14082568811966739714
    let a = 10_u128.pow(18);
    println!("{a}"); //15082568811966739713
    let a = 10_u128.pow(19);
    println!("{a}"); //140825688119667397140000000421709631291
  2. A large u128 causes a segmentation fault.

    let a = 2_u128.pow(84);
    println!("{a}"); //segmentation fault
  3. u128 arithmetic is correct, and other format types (e.g. binary) print correctly.

    let a = 2_u128.pow(84) ;
    println!("{:b}", a); //1000000000000000000000000000000000000000000000000000000000000000000000000000000000000
    let a = (0_u128 + 1_u128 ) as u64;
    println!("{:b}", a); //1
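For comparison, the values above should format as follows on a correctly working target (expected outputs computed independently of any particular backend; useful as a quick sanity check when bringing up a new target):

```rust
fn main() {
    // Expected decimal renderings of the values from the report above.
    assert_eq!(format!("{}", 0_u128), "0");
    assert_eq!(format!("{}", 1_u128), "1");
    assert_eq!(format!("{}", 10_u128.pow(18)), "1000000000000000000");
    assert_eq!(format!("{}", 10_u128.pow(19)), "10000000000000000000");
    assert_eq!(format!("{}", 2_u128.pow(84)), "19342813113834066795298816");
    // Binary rendering of 2^84 is a 1 followed by 84 zeros (85 digits).
    assert_eq!(format!("{:b}", 2_u128.pow(84)).len(), 85);
    println!("all u128 formatting checks passed");
}
```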

For context, I'm porting code to the csky arch, which is a niche architecture. I added csky support to Rust in https://github.com/rust-lang/rust/pull/113658 and to libc in https://github.com/rust-lang/libc/pull/3301.

I'm not sure what caused this issue; it looks similar to yours. I don't know whether it's an oversight in my PR or a bug somewhere else.


saethlin commented 9 months ago

Nobody ever came up with a reproducer for the original report. I just spun up a few Graviton instances and tried again, and I couldn't reproduce the originally-reported crash.

I'm sure we could help out if you can come up with a reproducer that doesn't require owning some niche hardware. Is there an emulator people can run?

Failing that, I'd try reporting this problem to your local expert on your arch. I strongly suspect that whatever is going on here is not Rust-specific; this is probably an LLVM or linker problem. Anyone who can reproduce the problem and is experienced with low-level debugging could really help us out by identifying what has gone wrong with the codegen. Since this happens without optimizations, the problem is probably fairly localized. For example, someone could point out: "The instructions look good up until this one, at which point it makes no sense. The executable should contain these instructions instead."

Dirreke commented 8 months ago

> I met a similar issue on csky-arch when using println!. […]

Fixed it by https://github.com/llvm/llvm-project/pull/69732.

kpreid commented 6 months ago

Triage: Relabeling issues which don't have a runnable reproduction (as opposed to having a non-minimized one) to the new label S-needs-repro. @rustbot label +S-needs-repro -E-needs-mcve