rust-lang / project-error-handling

Error handling project group
Apache License 2.0
268 stars 18 forks source link

Should we add an panic interface that reports the error via the panic handler but unconditionally aborts? #34

Open yaahc opened 3 years ago

yaahc commented 3 years ago

Right now panic behavior is defined by binary authors and is applied globally for an application. This can mean either all panic!s abort the application or that all panics unwind the application. This design can cause issues for library authors where they need to assume that any panic they invoke could potentially start unwinding, requiring library authors to plan for exception safety. They can work around this by directly aborting the application with std::process::exit, but this means they're no longer leveraging the same panic reporting logic that the rest of the application is.

The error handling project group should look into adding an interface for panics that always causes an aborting panic, even if the rest of the application has been compiled with panic = unwind.

dtolnay commented 3 years ago

The current way library authors do this is:

macro_rules! panic_abort {
    ($fmt:expr $(, $args:expr)* $(,)?) => {{
        struct StopUnwind;
        impl Drop for StopUnwind {
            fn drop(&mut self) {
                panic!("treating panic as abort");
            }
        }
        let _abort = StopUnwind;
        panic!($fmt $(, $args)*);
    }};
}

fn main() {
    panic_abort!("...");
}
luser commented 3 years ago

Firefox wound up with a variety of macros in C++ to cover a range of possibilities:

(After typing that all out I realize that this is not exactly the same topic but I'm going to leave this here because it might be informative to someone.)

We always built Firefox with C++ exceptions disabled, which meant we had to audit STL data structure implementations to ensure that they were safe for use without exceptions. We also had issues with ObjC exceptions and wound up with an unfortunate set of macros to wrap all Cocoa API accesses with the ObjC equivalent of try/catch to ensure that ObjC exceptions didn't cause us to unwind past the FFI barrier.

Question: assuming a panic_abort! is hit, what does the user running a program built with panic = unwind see? The panic = abort case was fine for Firefox because we had a robust crash reporting system, so we would capture and triage panic!s from Rust code just like any other crash.

yaahc commented 3 years ago

Question: assuming a panic_abort! is hit, what does the user running a program built with panic = unwind see? The panic = abort case was fine for Firefox because we had a robust crash reporting system, so we would capture and triage panic!s from Rust code just like any other crash.

My expectation is that it would work the same as if that single panic had been compiled with panic = abort. It would invoke the panic handler and then abort instead of unwinding, so you would still be able to get a crash report out of it before it kills the process.

thomcc commented 2 years ago

I must admit, I'm a little I'm unsure[^1] what the situation where you need to abort (you cannot panic) but are still in a position where it's fine to call an arbitrary panic handler (particuarly one set by safe code...).

[^1]: I suppose cases like Arc's refcount overflow, where the problem is detected far in advance, and it will take another isize::MAX increments before the unsafety occurs? This seems rare.

Personally, I've wanted an aborting-panic many times, but in most of them I need to abort because I have hard limitations on what functions can be called, as it may be unsafe to call an arbitrary panic handler written by safe code -- If I could call arbitrary code, I'd just panic. Some examples are:

  1. In a signal handler panicing would be bad because it's UB to call anything that isn't signal safe, which is most stuff -- anything that takes locks, or allocates memory (unless it does so directly from the OS, such as with anonymous mmap/VirtualAlloc).

  2. A global allocator (either #[global_allocator] or overridden malloc). In these, if I want to abort, it's likely because state is screwed, probably indicating the user's code has corrupted state hopelessly (via memory errors like heap overflows, use after free, etc).

    Calling back into user code is wildly unsafe in this case, and you really want to just dump a message and die as fast as possible. Critically, you should assume everything allocated by the allocator has also been corrupted, and especially not call any function pointers stored in the heap (attackers often turn heap overflows into code execution by overwriting function pointers stored in heap-allocated objects), and it's also probably unsafe to call back into the allocator[^2].

[^2]: This one can be addressed by the allocator using a global flag in principal, but there's not much of a point.

In either of these cases, calling the panic hook and handler are not good ideas, because that stuff isn't likely to be prepared for this. Certainly, the panic hook is not, as it is set by safe code.

We can still do some stuff though, such as can print out a message out (using libc::write on unix -- actually, I've always used it on windows too, but stdlib probably shouldn't, and there's likely some windows-specific alternative), maybe inform the OS of some extra crash info (macOS supports this in abort_report_np or something like that), before aborting.

This all may sound niche, but low level code hits it a lot I think (it's common enough that off the top, both folly and abseil have code for this kind of case), and the lack of support for writing this kind of low-level code has been painful in Rust for a long time (there is more to fixing that than just adding this functionality[^more] it would be a good start).

[^more]: Specifically, the fact that nearly everything can panics, but the panics will do all kinds of things that aren't allowed... I have no idea how to fix this though, since thread locals, compiler flags, and per-crate/function attributes are all non-viable for different reasons. This is not project-error-handling's problem, though

so you would still be able to get a crash report out of it before it kills the process

It's worth noting that the Firefox crash reporter is probably a separate process, so it wouldn't really be an example of a case that wants its panic handler called before an abort. In-process crash reporters exist, but (for langs without a runtime) they generally are running as signal handlers, but, well, see the limitations above.

yaahc commented 2 years ago

I must admit, I'm a little I'm unsure1 what the situation where you need to abort (you cannot panic) but are still in a position where it's fine to call an arbitrary panic handler (particuarly one set by safe code...).

the example that came up recently was wishing we could panic instead of abort from safety checks that are essentially debug asserts contributed in https://github.com/rust-lang/rust/pull/92686

I also vaguely remember some case where we had an async application running on tokio and we had certain tasks that when they panicked we wanted it to take down the whole system immediately, but the way the panics were caught and propagated significantly delayed the teardown. That said, I can't recall why this was and I wouldn't be surprised if the issue was specific to that application's architecture rather than being a general issue with panic propagation latency across task/thread boundaries.

thomcc commented 2 years ago

That said, I can't recall why this was and I wouldn't be surprised if the issue was specific to that application's architecture rather than being a general issue with panic propagation latency across task/thread boundaries.

Ah, this is a general issue with tokio I think: (https://github.com/tokio-rs/tokio/issues/2002). That said unless people are actually calling nobikeshed::aborting_panic!() you still have the same issue (however panic=abort handles it well for the most part, but doesn't play well with other tooling 😢)

wishing we could panic instead of abort from safety checks that are essentially debug asserts

Hm, according to https://github.com/rust-lang/rust/pull/92686/files#diff-7e8efe3110a5b74834b0a6e63768df9982c1599758a6634337ce356a0e22e0f4R1994 it's about code size. While I actually think there's are likely improvements[^1] we can do to reduce the cost of panic handling... But I don't think this is ever not going to be the case for these sorts of checks. It's going to be hard to get anywhere near to core::intrinsics::abort()'s code size in general (for example, parameter passing, PIC stuff and/or GOT/PLT shenanigans, ...), and for cases like this, it's... probably the right call.

[^1]: I've wanted for a while to measure the impact of switching #[cold] noreturn functions to use a calling convention which is designed to be used for functions which are rarely called (perhaps coldcc), for example. (Last time I looked, LLVM will end up putting fastcc on many of these which is... pretty goofy)

yaahc commented 2 years ago

Hm, according to https://github.com/rust-lang/rust/pull/92686/files#diff-7e8efe3110a5b74834b0a6e63768df9982c1599758a6634337ce356a0e22e0f4R1994 it's about code size. While I actually think there's are likely improvements1 we can do to reduce the cost of panic handling... But I don't think this is ever not going to be the case for these sorts of checks. It's going to be hard to get anywhere near to core::intrinsics::abort()'s code size in general (for example, parameter passing, PIC stuff and/or GOT/PLT shenanigans, ...), and for cases like this, it's... probably the right call.

My assumption was the codesize concern has to do with the landing pads or w/e that get generated to help with unwinding, not the actual invocation of the panic handler to report an error. Also, fwiw you lost me on the second half of this paragraph 😅, not sure what PIC/GOT/PLT mean.

nbdd0121 commented 2 years ago

Yep, lots of code size issues arise from landing pads. A noreturn nounwind call wouldn't be terribly bad.

PLT shouldn't matter here, but if you enable PIC in x86 then there might be a thunk call due to lack of PC-relative encoding. In x64 or other archs there are PC-relative encoding, so even with PIC the code size should still be reasonable.

thomcc commented 2 years ago

not sure what PIC/GOT/PLT mean.

Sorry. I mean the overhead associated with position independent code (PIC) and dynamic loading and linking. To handwave a bit the GOT (global offset table) and PLT (procedure linkage table) are functionality associated with this which may make a function calls somewhat more costly in code size, especially when compared to intrinsics::abort() which is often a single instruction.

The functions where those checks are used seemed quite important, so it seems plausible to me that the overhead would add up, but it's certainly possible I'm overestimating the impact on modern systems, or underestimating the size cost of the landing pads.

Anyway I'm not sure any of that actually matters for this issue.


I thought of a better explanation for my position, and why I found it surprising that this would run the hook. It boils down to that I consider abort and panic to be for slightly different uses. (Caveat: essentially-philosophical personal opinions on aborting and panicking below)

Essentially, I see these as APIs which are for indicating two different types of errors:

  1. panic is for unexpected errors which may be recoverable and are presumed to only impact the panicking thread. I don't know that this is how we explain it anymore, but it's basically how we behave under the default compiler flags. That is we attempt recovery by unwinding the thread until one of the following things happens:

    • The panic goes past main() (or the program exits for whatever reason), meaning recovery wasn't possible
    • The panic gets caught, meaning recovery worked at least well enough for the thread to continue execution on its own
    • The panic goes past the thread's entry point, meaning recovery for the thread wasn't possible. Recovery stops here though, because the assumption is that the issue only impacted the panicking thread (and if it impacted anything else, they'll figure it out on their own)
  2. OTOH, abort is for unexpected errors are completely unrecoverable and are assumed to impact the entire program's global state. It is worth noting that hitting this kind of issue essentially requires unsafe code -- if you never run unsafe code, there should be no way for global state to be corrupted to this extent.

    In this context, recovery can't be attempted. Reporting is possible but must be done extremely carefully[^careful].

    Generally, only signal-safe functions can be used, no locks may be taken, no memory may be allocated, etc. Whether or not a function is callable in this context is something only unsafe code can promise, which rules out the current hook. (That said, it doesn't rule out some hook for reporting a fatal error, but setting it would need to require unsafe code)

Now, under this lens, I see -Cpanic=abort as semantically equivalent to the normal panic (case 1), but optimized to not bother with the recovery process under the assumption the program does not care about recovering from those kinds of failures.

A aborting-panic that behaves like -Cpanic=abort would be viable few[^2] more cases than the unwinding panic, but to me it would be a little confusing since usually explicit abort is reserved for truly fatal errors like case 2 -- otherwise you'd find some more graceful way to end the program.

So I think we should have a pretty good reason not to want to support case 2. (And if we decide that's not a supported use case, we should document this fact clearly, as it seems pretty surprising otherwise)


Hopefully that made a little more sense. Sorry for how long it was, and if I said anything twice)

[^2]: It's useful in at least two non-optimization scenarios:

  1. FFI does not want to panic over the extern "C" boundary. It used to be UB and is now... hard to say -- we claim in the reference it will abort when you cross the boundary, but IIUC that behavior is not enabled yet. (This will be fixed when feature(c_unwind) stabilizes, at which point a panic that tries to unwind across extern "C" will abort).

  2. Cases where you don't expect anybody to catch the panic. Like tokio::spawn (in the the multi-threaded runtime). This is pretty rare aside from the notable exception of tokio::spawn and std::thread::spawn (and in the 2nd case it tends not to bite you IME, since you don't usually treat threads as fire-and-forget as you do for async tasks)

luser commented 2 years ago

I think I agree with @thomcc's summary of affairs, but also:

  1. So, functionally, the only difference between this proposed API and simply calling std::process::abort or core::intrinsics::abort is that the panic handler would be invoked prior to aborting the process?
  2. I believe very strongly that the existing design where the binary crate gets to choose the panic behavior is the correct one. It is extremely frustrating to have arbitrary code decide to terminate your entire process in a manner inconsistent with your choices. (Yes, I know anyone can call std::process::exit and the like, but I don't think we should add more ways to do this.) Having this API would mean that now you no longer know what the panic behavior of your application is without static analysis. It's already annoying enough to audit dependencies for panic!s, this would make that harder.
  3. The codesize overhead of panic! is rough, and I'm sympathetic to that. I'm not really sure it's possible to have our cake and eat it, too in this situation. If you're not going to let the unwinder print a nice stack trace and you're going to take down my entire application, you might as well be honest about it and just abort directly.

Aside: Firefox does have an in-process exception handler for crashes that are not happening in a child process with a parent to handle them, but it's not a Rust panic handler because it wants to handle all sorts of signals/exceptions/etc, so it's not really germane to the discussion here.

nbdd0121 commented 2 years ago

I think it's obligatory for me to reference rust-lang/rust#92988 in case you aren't aware of that. We already have an aborting panic mechanism in place as part of project-ffi-unwind, this is currently used to report unwind in a context that unwinding is forbidden (e.g. extern "C"), just like how std::terminate() is called when unwinding goes past noexcept functions in C++.

yaahc commented 2 years ago

I think it's obligatory for me to reference rust-lang/rust#92988 in case you aren't aware of that. We already have an aborting panic mechanism in place as part of project-ffi-unwind, this is currently used to report unwind in a context that unwinding is forbidden (e.g. extern "C"), just like how std::terminate() is called when unwinding goes past noexcept functions in C++.

Yeah, this is how I think @Amanieu wants aborting panics to work. You just annotate the function with the proper attribute to prevent unwinding then invoke the normal panic.

Also, I went ahead and changed the title to make it clear this is more of an open question than a proposal, since I don't want people to worry that I'm pushing this in a certain direction.

8573 commented 2 years ago

both folly and abseil have code for this kind of case

For us following along at home, what are folly and abseil? The crates with those names on Crates.io appear to be not only old and abandoned but yanked.

yaahc commented 2 years ago

both folly and abseil have code for this kind of case

For us following along at home, what are folly and abseil? The crates with those names on Crates.io appear to be not only old and abandoned but yanked.

I believe those are C++ libraries, basically google and facebook's supplemental c++ std libraries, sorta like boost.

CAD97 commented 2 years ago

A slightly smaller alternative would be to provide a std::panic::call_hook function (or a macro to construct the PanicInfo internally) to call the hook without starting an unwind.

But extern "C"'s "#[nounwind] shim" doing the equivalent of panic_abort! is a good argument for the functionality being available. This would just be a blessed application of the polyfill shim without relying on panic-during-panic aborting (which I think technically isn't guaranteed?).

(The shim can be adjusted to not rely on it either by writing forget(catch_unwind(|| panic!("treating unwinds as an abort"))); abort().)