Backtracing after stack overflow does not work on macOS

losfair commented 4 years ago

I'm trying to get a backtrace from a SIGSEGV caused by a stack overflow (hitting the guard page). This does not seem to work on macOS.

My reproduction case:

use backtrace::Backtrace;
use std::{mem, ptr};

#[inline(never)]
fn f(x: i32) -> i32 {
    if x == 0 || x == 1 {
        1
    } else {
        f(x - 1) + f(x - 2)
    }
}

fn main() {
    unsafe {
        let mut handler: libc::sigaction = mem::zeroed();
        handler.sa_flags = libc::SA_ONSTACK;
        handler.sa_sigaction = trap_handler as usize;
        libc::sigemptyset(&mut handler.sa_mask);
        assert_eq!(libc::sigaction(libc::SIGSEGV, &handler, ptr::null_mut()), 0);

        // Backtracing from a normal SIGSEGV works
        //println!("Before invalid write");
        //ptr::write_volatile(0 as *mut u32, 0);
        //println!("After invalid write");

        // Backtracing from a stack overflow crashes
        println!("Before stack overflow");
        println!("{}", f(0xfffffff));
        println!("After stack overflow");
    }
}

unsafe extern "C" fn trap_handler(
    _: libc::c_int
) {
    println!("Backtrace begin");
    let backtrace = Backtrace::new_unresolved();
    println!("Backtrace result: {:?}", backtrace);
}

Output:

% ./target/release/backtrace-stackoverflow-bug
Before stack overflow
Backtrace begin
zsh: segmentation fault  ./target/release/backtrace-stackoverflow-bug

Rust version:

rustc 1.46.0-nightly (16957bd4d 2020-06-30)
binary: rustc
commit-hash: 16957bd4d3a5377263f76ed74c572aad8e4b7e59
commit-date: 2020-06-30
host: x86_64-apple-darwin
release: 1.46.0-nightly
LLVM version: 10.0

alexcrichton commented 4 years ago

Thanks for the report! Can you perhaps get a stack trace in a debugger for this?

One common issue I've seen is that the sigaltstack is too small, so it may be a "double" stack overflow where the trap_handler overflows the sigaltstack, causing a second segfault.

losfair commented 4 years ago

I tried to allocate a 1MB sigaltstack, but the error persists:

        let mut stack_space = vec![0u8; 1048576];
        let new_stack = libc::stack_t {
            ss_sp: stack_space.as_mut_ptr() as *mut _,
            ss_flags: 0,
            ss_size: 1048576,
        };

        assert_eq!(libc::sigaltstack(&new_stack, ptr::null_mut()), 0);

I wasn't able to get a stack trace because the debugger can't resume execution from the signal handler after an EXC_BAD_ACCESS exception, due to a Darwin kernel bug.

alexcrichton commented 4 years ago

Ah sorry, but without the ability to reproduce or debug this I'm not really sure what's going on here, so I can't really help a whole lot :(

workingjubilee commented 1 year ago

Excerpting relevant comments from the PR that adds a test to demonstrate this:

This library is not async signal safe, but it is safe for synchronous signals. In this case generating a backtrace from a segfault handler is intended to work.

—alexcrichton

Whether a signal is generated in a synchronous or asynchronous manner doesn't change the fact that the signal handler can only use async-signal-safe functions.

Take for example one reason why this crate isn't safe to use from a signal handler: the use of memory allocation routines. If a signal is generated during the execution of malloc, which holds an internal lock, and the signal handler then allocates memory and needs to acquire the same lock, a deadlock will occur.

—tmiasko
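To make the deadlock scenario above concrete, here is a minimal sketch that simulates it without real signals. The mutex is a hypothetical stand-in for the allocator's internal lock, and `simulated_handler` and `interrupted_inside_malloc` are illustrative names, not part of this crate or of libc. The handler uses `try_lock` so the demo reports the conflict instead of hanging; a real handler that blocked on the lock would never return.

```rust
use std::sync::{Mutex, TryLockError};

/// Stands in for a signal handler that needs the allocator lock.
/// Returns true if the lock could be taken, false if it is already held.
fn simulated_handler(alloc_lock: &Mutex<()>) -> bool {
    match alloc_lock.try_lock() {
        Ok(_guard) => true,
        // A blocking `lock()` here would wait forever: the owner is the
        // interrupted code on this same thread, which cannot make progress.
        Err(TryLockError::WouldBlock) => false,
        Err(TryLockError::Poisoned(_)) => false,
    }
}

/// Stands in for code that is "inside malloc" (holding the lock) at the
/// moment the signal arrives and the handler runs on the same thread.
fn interrupted_inside_malloc(alloc_lock: &Mutex<()>) -> bool {
    let _held = alloc_lock.lock().unwrap();
    simulated_handler(alloc_lock)
}

fn main() {
    let alloc_lock = Mutex::new(());
    println!(
        "handler acquired lock while malloc was interrupted: {}",
        interrupted_inside_malloc(&alloc_lock)
    );
    println!(
        "handler acquired lock with nothing interrupted: {}",
        simulated_handler(&alloc_lock)
    );
}
```

The point of the sketch is only that the outcome depends on what the interrupted code was doing, which is why a handler cannot safely rely on anything that takes such a lock.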

The segfault here is in the libunwind unwinder itself, and after researching a bit as to what's going on, it looks like the segfault is happening 16 bytes below the end of the stack. I believe the sequence of events can be reconstructed as:

1. Using libunwind we can get a handful of frames.

2. The segfault happens when we unwind the first frame of f.

3. The frame f faulted in the middle of the function prologue.

4. The unwind information for f is stored in a "compact format".

5. The compact format does not have a way to describe how to unwind in the middle of the prologue; it only defines how to unwind "during" the function.

6. While interpreting the compact unwind information, libunwind hits a segfault again, trying to access the memory that the function itself faulted on while trying to push.

The issue here is that a stack overflow exception can happen anywhere in the prologue of a function, but unwind tables are generally not designed to describe a fault at an arbitrary point in the prologue (there's the notion of "async unwind tables" on some systems for this). This means that the unwinder can't reliably unwind frames that are interrupted in the prologue.

Oh, what I mean is that to generate a backtrace from a function that segfaulted in its prologue, libunwind needs to know how to unwind from every single instruction in the function, not just the "body" after the prologue. AFAIK that's only supported with async unwind tables (and maybe full DWARF unwind tables?), and I'm not sure how to get LLVM to generate non-compact or async unwind tables.

—alexcrichton
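The sequence described above can be sketched with a toy model. This is not libunwind's code, and the frame layout is an assumption made purely for illustration: a word-addressed stack, a frame pointer saved in the last valid word, and two callee-saved registers that the prologue would normally push below it. `ToyStack`, `unwind`, and `faulting_stack` are hypothetical names.

```rust
/// Toy word-addressed stack that grows downward. Any address below `limit`
/// stands in for the guard page: reading it is reported as a fault.
struct ToyStack {
    limit: usize,
    words: Vec<u64>,
}

impl ToyStack {
    /// Returns the word at `addr`, or `Err(addr)` for a guard-page access.
    fn read(&self, addr: usize) -> Result<u64, usize> {
        if addr < self.limit {
            Err(addr)
        } else {
            Ok(self.words[addr])
        }
    }
}

#[derive(Debug, PartialEq)]
struct CallerState {
    return_addr: u64,
    saved_fp: u64,
}

/// Unwinds one frame, assuming `saved_regs` callee-saved registers were
/// already pushed just below the frame pointer `fp`.
fn unwind(stack: &ToyStack, fp: usize, saved_regs: usize) -> Result<CallerState, usize> {
    for i in 1..=saved_regs {
        stack.read(fp - i)?; // "restore" a callee-saved register
    }
    Ok(CallerState {
        saved_fp: stack.read(fp)?,
        return_addr: stack.read(fp + 1)?,
    })
}

/// State at the fault: the old frame pointer (7) sits in the last valid
/// word (address 4), the return address (0x1000) sits above it, and the
/// next push would have landed in the guard page.
fn faulting_stack() -> (ToyStack, usize) {
    let words = vec![0, 0, 0, 0, 7, 0x1000, 0, 0];
    (ToyStack { limit: 4, words }, 4)
}

fn main() {
    let (stack, fp) = faulting_stack();
    // Rule written for the function body: expects two saved registers.
    println!("body rule:      {:?}", unwind(&stack, fp, 2));
    // Per-instruction rule: knows nothing below fp was pushed yet.
    println!("prologue-aware: {:?}", unwind(&stack, fp, 0));
}
```

In this model the body rule touches the guard page and fails, while a rule that knows how far the prologue got recovers the caller. That is the extra information per-instruction (async) unwind tables would have to carry.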

I do not see a reason to close this issue but to be frank, it is the sort of enhancement request that is likely to be open for a long, long time.

bjorn3 commented 1 year ago

Ignoring Apple's compact unwind info, I did expect backtraces to work in the prologue even without asynchronous unwinding support. Asynchronous unwinding is only necessary when popping stack frames and running cleanup code for faults at arbitrary instructions. I very much expect backtrace generation to unconditionally work at arbitrary locations. Sampling profilers depend on this.

bjorn3 commented 1 year ago

Also, I believe LLVM is going to stop emitting compact unwind info for Rust code, or any other code not using the C, C++ or Obj-C personality functions, as there is a limit of 3 personality functions in the compact unwind info format and these personality functions take up all the room when used in the same executable/dylib.
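For context on where the limit of 3 comes from, here is a small sketch. It assumes the encoding described in LLVM's compact_unwind_encoding.h, where each 32-bit compact unwind entry selects its personality function through a 2-bit field (`UNWIND_PERSONALITY_MASK`), with 0 meaning "no personality". The shift constant and the helper functions are names introduced here for illustration only.

```rust
// Mask of the 2-bit personality index inside a 32-bit compact unwind
// encoding, as I understand it from compact_unwind_encoding.h.
const UNWIND_PERSONALITY_MASK: u32 = 0x3000_0000;
// Hypothetical helper constant: bit position of that field.
const UNWIND_PERSONALITY_SHIFT: u32 = 28;

/// Extracts the personality index; 0 means the entry has no personality.
fn personality_index(encoding: u32) -> u32 {
    (encoding & UNWIND_PERSONALITY_MASK) >> UNWIND_PERSONALITY_SHIFT
}

/// Largest index the field can hold, i.e. how many distinct personality
/// functions one image's compact unwind info can refer to.
fn max_personalities() -> u32 {
    UNWIND_PERSONALITY_MASK >> UNWIND_PERSONALITY_SHIFT
}

fn main() {
    println!("max personalities: {}", max_personalities());
    println!("index of 0x2100_0000: {}", personality_index(0x2100_0000));
}
```

With indices 1 through 3 taken by the C, C++ and Obj-C personalities, a fourth one such as Rust's has no slot left in the same image.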

workingjubilee commented 1 year ago

I do think that we should try to improve the situation, FWIW, and I am aware incremental improvements may be sufficient for many use-cases. It just seems like fixing all this is a nontrivial haul.

workingjubilee commented 1 year ago

"Also I believe LLVM is going to stop emitting compact unwind info for rust code or any other code not using the C, C++ or Obj-C personality functions as there is a limit of 3 personality functions in the compact unwind info format and these personality functions take up all room when used in the same executable/dylib."

Can you confirm this and if so, open a new issue for that?

bjorn3 commented 1 year ago

If I understand correctly it got merged, then reverted because of a build error and a revert of the revert has been posted but not yet merged: https://github.com/rust-lang/rust/issues/102754#issuecomment-1580914857

rust-lang / backtrace-rs

Backtracing after stack overflow does not work on macOS #356