microsoft / windows-drivers-rs

Platform that enables Windows driver development in Rust. Developed by Surface.
Apache License 2.0
1.52k stars 70 forks source link

Panic Strategy #6

Open jxy-s opened 1 year ago

jxy-s commented 1 year ago

"Panic" in the Windows Kernel (or "bugchecking" the system KeBugCheckEx) is a last resort and should be reserved only for cases where the kernel is in a corrupted and unrecoverable state. It is usually never acceptable to bugcheck the system. I find it unlikely that any Rust crate would ever need to bugcheck the kernel.

Today this library notes a "TODO" on this topic: https://github.com/microsoft/windows-drivers-rs/blob/cd1fd23f000c64ea244b5cbfc33d64acb1104b8b/crates/wdk-panic/src/lib.rs#L29-L33

Ultimately the kernel should handle errors cleanly and, by all means necessary, prevent itself from getting into a corrupted or unrecoverable state. I worry that the introduction of a panic handler and the interactions between other crates written for the Windows Kernel could degrade the overall reliability of the kernel.

I'm not certain the best way to approach this problem. My initial reaction is disallow panic handlers when developing against the Windows Kernel and reserve "panics" (bugchecks) for the rare and extreme cases. To achieve this the crates used in the kernel would need to always communicate errors rather than giving up with a panic. Thankfully, Rust has robust and capable error handling patterns with things like match and Result. I'd like to raise this as an open topic to the community to discuss strategies for panic handling for the Windows Kernel.

wmmc88 commented 1 year ago

I 100% agree that it is typically not acceptable to trigger a bugcheck like this and I agree that code should always prevent itself from getting to a corrupted or unrecoverable state. However, panicking is a core part of the rust language, and a panic handler is required to compile rust code as it defines what happens when a rust function call triggers a panic (ie. via panic!).

My initial reaction is disallow panic handlers when developing against the Windows Kernel and reserve "panics" (bugchecks) for the rare and extreme cases.

I am unsure what you meant by disallow here. AFAIK, there is no mechanism in Rust to prevent a developer from calling panic!. Although, per the rust book, a developer should typically be writing code that returns Result<T, E> and Option<T>. I personally like to think of panic similar to assert in C/C++, and should typically only be used during development or exceptional circumstances. With that said, we still must have a panic handler defined in order to define a behaviour when panic is called.

jxy-s commented 1 year ago

I am unsure what you meant by disallow here. AFAIK, there is no mechanism in Rust to prevent a developer from calling panic!.

Here is an example of a panic handler implementation that statically asserts at compile time that the resulting program never panics: https://github.com/lachlansneff/no-panics-whatsoever/blob/main/src/lib.rs

I personally like to think of panic similar to assert in C/C++, and should typically only be used during development or exceptional circumstances.

I understand, however asserts are not present in release builds, but panics are. So, maybe a possible strategy here is to ensure that a release binary can never panic?

0xf005ba11 commented 1 year ago

My interpretation of what jxy-s is saying is panic is likely overused in the rust language right now, and this could lead to quality issues with spurious BSODs from hidden panic calls. A common example of this would be memory allocation with the rust containers (Vec, etc). Rust has added custom allocators to containers, but it has been considered experimental for quite some time now.

I think it would be beneficial to consider options with how to make this a bit safer for developers to avoid some mishaps that are likely going to be very easy to make (e.g. accidentally using the wrong memory allocator).

jxy-s commented 1 year ago

For posterity, Rust support has been already been discussed in length for the Linux kernel, I'll quote Linus here from his email:

I do think that the "run-time failure panic" is a fundamental issue.

I may not understand the ramifications of when it can happen, so maybe
it's less of an issue than I think it is, but very fundamentally I
think that if some Rust allocation can cause a panic, this is simply
_fundamentally_ not acceptable.

Allocation failures in a driver or non-core code - and that is by
definition all of any new Rust code - can never EVER validly cause
panics. Same goes for "oh, some case I didn't test used 128-bit
integers or floating point".

So if the Rust compiler causes hidden allocations that cannot be
caught and returned as errors, then I seriously think that this whole
approach needs to be entirely NAK'ed, and the Rust infrastructure -
whether at the compiler level or in the kernel wrappers - needs more
work.

So if the panic was just some placeholder for things that _can_ be
caught, then I think that catching code absolutely needs to be
written, and not left as a to-do.

And if the panic situation is some fundamental "this is what the Rust
compiler does for internal allocation failures", then I think it needs
more than just kernel wrapper work - it needs the Rust compiler to be
*fixed*.

Because kernel code is different from random user-space system tools.
Running out of memory simply MUST NOT cause an abort.  It needs to
just result in an error return.

I don't know enough about how the out-of-memory situations would be
triggered and caught to actually know whether this is a fundamental
problem or not, so my reaction comes from ignorance, but basically the
rule has to be that there are absolutely zero run-time "panic()"
calls. Unsafe code has to either be caught at compile time, or it has
to be handled dynamically as just a regular error.

With the main point of Rust being safety, there is no way I will ever
accept "panic dynamically" (whether due to out-of-memory or due to
anything else - I also reacted to the "floating point use causes
dynamic panics") as a feature in the Rust model.

The concern I have and the reason I raised this topic is in the same vein that Linus raises in the quoted thread.

If Microsoft provides this crate with the statement of (from the README), "This repo is a collection of Rust crates that enable developers to develop Windows Drivers in Rust. It is the intention to support both WDM and WDF driver development models." This subject must be considered carefully. And Microsoft should provide ergonomically safe and appropriate facilities to "develop Windows Drivers in Rust".

jgarvin commented 3 months ago

I have a related question -- I ended up here because I was wondering whether panics in Rust code in windows kernel space would be fundamentally unsafe because of the kernel APIs not being written with unwinding in mind? Most C code typically isn't. But IIRC windows SEH involves something like unwinding so I wasn't sure if that's true or not. Most of my dev experience is on Linux so out of my element here.

jxy-s commented 3 months ago

Windows SEH does not unwind in the way I believe you're expecting it to. It only invokes "Termination Handlers" (__finally blocks) during stack unwinding. See: Microsoft Docs on Structured Exception Handling.

For Rust in the Windows kernel to use panic_unwind - which I assume is what you're referring to - the kernel would need to support all the necessary accounting to unwind Rust objects on the stack when a panic occurs. My concern with this strategy is that it seems to involve attempting to handle undefined behavior and recover from it in a non-obvious manner. If something neglects to handle a unwinding panic, for any reason, what is the system to do? - Bugcheck - which is unacceptable.

This crate currently requires panic_abort, and its implementation causes the kernel to enter an infinite loop. This is unacceptable. Imagine if your kernel entered an infinite loop when you moved your mouse... or, worse yet, while it's monitoring your vitals during surgery. Handling panic unwinding might be a suitable solution, but I have concerns about its practicality for the Windows kernel. As I suggested earlier in this thread, I believe the most appropriate strategy for kernel development is to ensure there are no panics whatsoever. This approach maximizes well-behaved programs, which is especially important for the kernel.

jgarvin commented 3 months ago

This is a bit pedantic but I think it's an important distinction: technically a panic does not imply UB. The panic happening prevents UB (because e.g. the out of bounds access doesn't occur), and unwinding and even catch_unwind is considered "safe", which by definition means no UB. Panicking and catching panics does not mean you've entered a UB state, so how the compiler must behave is still well defined. That said anything in the call stack that's not Rust requires an unsafe FFI call somewhere, so in practice if there's some C code on the call stack that doesn't expect to be unwound, or has a type of unwinding handler that isn't invoked for some reason, you should assume unwinding through that frame triggers UB.

There's 3 aspects to "playing nicely" with panic:

I think you would want all 3 to be true for it to be safe.

oberrich commented 2 months ago

While having no panics at all obviously solves this issue, when there is a panic what does that mean? You've fundamentally violated some invariant and have to assume your program, which is not just your driver but the kernel itself, is now in an invalid state.

When we encounter an unhandled exception inside a module of a program, we crash that program. When we encounter an unhandled exception inside a kernel module, we should also crash that program (ntoskrnl).

In my opinion crashing the kernel in this case is the only sensible option when actually encountering a panic.

For debug mode, I personally prefer to print the actual panic info, emit an int3 and then infinite loop so I don't have to reboot my VM every time I do an oopsie.