rust-embedded / wg

Coordination repository of the embedded devices Working Group
1.87k stars 96 forks source link

Making `panic-never` a requirement or convention for `rust-embedded` libraries where feasible #551

Open mitchmindtree opened 3 years ago

mitchmindtree commented 3 years ago

TL;DR

It would be great if rust-embedded adopted panic-never as a standard for libraries. I found it impossible to take advantage of panic-never while also taking advantage of the rust-embedded libraries necessary to build my first non-trivial Rust embedded project. This was due to frequent uses of panic! throughout libraries. This is totally understandable as Rust itself provides very little tooling to avoid this and almost encourages it (i.e. indexing, slicing). While panic! is very useful for quick iteration in software, it can be detrimental to firmware without significant tooling/logging in the panic-handler which isn't always feasible for embedded projects, especially due to severe lack of context in the panic handler compared to regular error handling. The newish no-panic crate may help with retrofitting panic-never, and const generics may help to avoid common causes of panicking code branches.


Context & Motivation

This suggestion comes from my experience writing firmware for a kinetic artwork over the past few months. I finally have had some time to reflect and thought I'd open an issue here to get others' thoughts on this :) It seems particularly good timing considering that avoiding Rust panics seems to be the hot topic for landing Rust support in the Linux kernel today https://lkml.org/lkml/2021/4/14/1099.

This was one of my first major firmware projects, involving a pretty tall stack of protocols including SPI for LEDs, I2C for time of flight, one-wire UART for motor driver control and Ethernet for real-time TCP/IP communication with the master software. To achieve this project in the deadline that we had would have been impossible without all the awesome existing work in the rust-embedded ecosystem. The fact that I could include an Ethernet bootloader, serialization between software and firmware, use a real-time scheduler and more thanks to existing work made the project possible! Naturally, this required leaning on quite a few dependencies, including postcard (and in turn serde), rtic, smoltcp and loads of others.

During early prototype testing, I quickly learned just how drastic panic!ing could be in firmware compared to my experience with writing software, particularly when controlling a large number of motors attached to expensive parts. This lead me to search for solutions to ensure that I could avoid panicking entirely, which lead me to the panic-never crate.

After a few days of commenting out the entire project and trying to add modules back one by one with panic-never included, I quickly realised that, while I could track down and address all of the panic! sites in my own code, it would be impossible for me to track down and address all panic! sites throughout all the dependencies that I required for the project to function - especially considering the limited, cryptic linker errors that panic-never could provide, resulting in an approach that consisted of commenting everything out and re-adding parts one at a time until the linker error showed up.

Following the realisation that I would have to accept the possibility of panic!s, I began work on a custom panic handler. Easily the largest problem with the custom panic handler was the lack of context, and not knowing what state the device was in when the panic occurred... This lead to the need for moving parts of the application state into global state. This was necessary to 1. send some indication of an error back to the master via Ethernet (provided it was even possible to do so in the panicking state) and 2. disable the motor via UART! This was of utmost importance as the motor driver has it's own step generator, and if the last thing it received was some high velocity before the panic, then there was nothing else stopping it from endlessly driving out the motors until someone freaks out and cuts the power :scream:

Beyond the obvious reasons why moving state into a global context was unpleasant, I was using RTIC to handle scheduling. RTIC requires managing state in a certain way in order for its priority task system to function in a safe-yet-efficient manner. This meant lots of acrobatics with mutexes and critical sections in order to expose the necessary networking and motor state to the panic handler through a global context, much of which I'm still uncertain is actually safe to this day.


I want to acknowledge that all of these problems are ultimately our own fault. Specifically, for cornering ourselves by accepting a timeline for a project that meant I simply couldn't both 1. take advantage of many of the awesome existing crates throughout the rust ecosystem that were necessary to make such a sophisticated project possible in a short amount of time and 2. actually review all of these dependencies and develop enough familiarity with their src to guarantee there could be no panic! conditions throughout. It is this choice that lead to the need for the aformentioned hacks and awkward panic handling solution.

That said, I think it is at least worth checking whether or not it is possible to have our cake and eat it too by investigating the feasibility of having panic-never as a standard practise for embedded libraries. I cannot tell you how much of a relief it would be to know for certain that it simply wasn't possible to panic, particularly when the firmware is moving 100s of motors around on an artist's budget that provides very little room for repairs :joy: While custom panic handlers help, they provide almost no context about the state of the system during a panic by default and encourage some serious anti-patterns in order to handle those cases.

no-panic

I think perhaps this is more achievable now that no-panic exists, allowing for a more granular approach to narrowing down panic! sites, also with slightly better error messages. The function attribute approach allows for achieving a panic-less codebase one function at a time, without having to solve everything at once as is required with panic-neveralone.

Indexing, slicing and const generics

I think const generics may also play a large role in making this possible. Perhaps the sneakiest and most prevalent culprit for introducing panicking code is rust's core Indexing and slicing methods. This is especially frustrating when most embedded code works with fixed size arrays, where the author performing the indexing/slicing knows that it is safe to do so and that it is impossible for the panic to actually occur. I wonder if we can come up with some const-generic based approach to bounds checking for indexing and slicing of fixed size arrays that avoids the need for generating panicking branches.

rustc

Another approach might be to instead focus on landing support for avoiding panicking in rustc itself? While no-panic is already a big improvement over panic-never, it is still a long-shot from having a nicely formatted call-stack with line-numbered links to the source code of each function call that leads to each panic. I'm yet to investigate existing proposals for such a tool.


My aim with this issue is mostly to begin a discussion. I'm curious to hear others' thoughts, i.e. Have you had similar experirences? Is this a worthy/pracitcal goal? Or perhaps infeasible for reasons I haven't touched on yet?

Dirbaio commented 3 years ago

I'd like to add that another common reason for using panic-never is that rustc isn't normally able to remove bloat associated with panic handling (panic message strings, some fmt code). This is the case even if the panic handler is something barebones that doesn't log panic messages at all.

However, it is actually possible to remove this bloat in nightly, putting this in .cargo/config. This makes panics compile down to a single udf instruction, just 2 bytes! No strings, no fmt code.

[unstable]
build-std = ["core"]
build-std-features = ["panic_immediate_abort"]

I'm pointing this out because I've seen people on Matrix try to get rid of panics due to the bloat, and the build-std debloat trick seems not very widely known.

This gets rid of the bloat argument, the safety/correctness argument does remain.


Another important thing is that the concept of "panic" or "fatal error" is not only a Rust concept, it's an ARM architectural concept. Even if you completely get rid of Rust panics in your code, you still have HardFault and the other exceptions! They can be triggered by invalid memory accesses, invalid peripheral register reads/writes, division by 0, floating point exceptions, and lots of other chip-specific things. These are NOT caught by panic-never.

Therefore if you're writing safety-critical code (like stop a motor on crash), you still have to handle HardFaults, and you still have all the associated problems (need to make state global, etc), even if you use panic-never.

HardFaults can be even triggered by hardware issues causing bitflips or other glitches (eg EMI interference, severe power supply noise). Even if you're ultra 100% pinkie-promise sure your code can never ever HardFault, you should still handle HardFaults :)

If you still have to handle HardFaults, why bother going through the extreme pain of trying to get rid of Rust panics?

tarcieri commented 3 years ago

Something of a sidebar, but: is anyone aware of a tracking issue for adding a first-class feature for this to rustc?

I know there are a few options for this in crates, including panic-never and no-panic, but as noted above it seems like if the compiler were aware that code shouldn't panic and emitted errors in situations where it were to panic, it could generate better code with the knowledge that everything that could cause a panic has been checked in advance.

mitchmindtree commented 3 years ago

If you still have to handle HardFaults, why bother going through the extreme pain of trying to get rid of Rust panics?

Great point @Dirbaio - and slightly terrifying! Makes a note to look into HardFault handling on stm32f107 for the artwork.

On a related note, maybe in our case RTIC could help here by allowing the user to declare a dedicated task for handling panic/fault events, the difference being RTIC could provide some access to the current context (i.e. Resources)? I'm imagining it would be similar to handling any other interrupt, but in the case of panic or fault it would have the highest priority? I guess handling the possibility of a panic! occuring during Drop might make this particularly tricky... that said I'm not sure RTIC allows for Resources to be dropped at all, so maybe that's not a big deal.

While HardFaults do bring into question the argument that panicking encourages anti-patterns / global state, I think the benefit of allowing the user to know for sure that at least panicking is not possible is still a worthy one. I.e. having the option between having to consider unsuspecting, subtle logic errors leading to panics in dependencies as well as cosmic rays, vs only having to consider cosmic rays, I'd prefer the latter. I do acknowledge it's not at all a trivial ask though.

panic-never/no-panic and optimisation levels

Another issue worth considering that was brought up on Matrix is that panic-never may report differently depending on the optimisation level, which is likely to lead to some really confusing experiences... I.e. perhaps some seemingly unrelated code affects whether or not a panic branch can be optimised away, or a new user becomes confused when compiling with --release works but debug results in a giant linker error. I believe no-panic has this same issue. This probably improves the case for landing panic-awareness tools in the compiler itself.

mitchmindtree commented 3 years ago

#[allow_panic]?

A possible alternative approach to progressively adding #[no_panic] everywhere might be to do the opposite and disallow panicking by default somehow (via panic-never or otherwise), but provide an #[allow_panic] function decorator that made it clear which functions contain a panic! branch. While this wouldn't help the need for implementing a panic handler (even if there's only one panic branch, it needs to be handled), it would make it easier to find and review code where panicking branches exist, similarly to how unsafe functions make it easier to review sources of undefined behaviour.

This would of course be far less useful in software, as you could just do RUST_BACKTRACE=1 to find the source of a panic, however without manually setting up a lot of code specifically to do so, it can be really tricky to find where a panic occurs in firmware, especially in space constrained environments that get blown out if you touch the PanicInfo instance provided to the panic handler.

I don't know if something like this is even feasible with the panic-never-style linker-based approached, or if something like this would require some sort of panic-awareness support in rustc as previously mentioned.

nickray commented 3 years ago

If you still have to handle HardFaults, why bother going through the extreme pain of trying to get rid of Rust panics?

That's kind of whataboutism, isn't it? Typical panics are due to logic/programming errors, which I'd like the compiler's help with in avoiding, since they're my mistake (or of libraries I depend on).

ryan-summers commented 3 years ago

Hey there, I'm reading through your post and addressing specific points as I see fit.

As some background, I've been a firmware developer for the last ~8 years, most of which has been in C/C++, and the last 2 years shifted more towards Rust. Oftentimes, my work finds me in safety-critical systems.

Following the realisation that I would have to accept the possibility of panic!s, I began work on a custom panic handler. Easily the largest problem with the custom panic handler was the lack of context, and not knowing what state the device was in when the panic occurred... This lead to the need for moving parts of the application state into global state.

Custom panic handlers, which I would generally refer to as an error handler, is quite common practice in the industry and is often seen in embedded designs. Putting the device into a "known safe state" is a common approach. If this is difficult for your design, I would attribute this to the hardware that you are working on (e.g. no simple way to power off motors quickly) and not a fault of independently developed libraries or languages. Ideally, you should have a fail-free way to get the system into a safe, controlled state when fatal errors occur. This typically involves hooking up power supply shutdown/enable pins to GPIOs of the controller to quickly power off motors etc.

I think it is unfair to claim that because of your underlying hardware design, you were forced to make unpleasant firmware compromises (e.g. moving UART drivers into global scope etc.), although I do agree that panic handlers in rust are particularly more complicated than other languages.

I cannot tell you how much of a relief it would be to know for certain that it simply wasn't possible to panic

I would argue this is a false sense of relief, as it just offloads the error handling to your code, but that's a bit besides the point ;)

I wonder if we can come up with some const-generic based approach to bounds checking for indexing and slicing of fixed size arrays that avoids the need for generating panicking branches.

Based on my understanding, this is exactly how rustc generates panics for indexing. It internally generates bounds checks in the code and branches to panic if bounds are not met. How would you recommend addressing this otherwise? If you simply ignore bounds issues, you go back to languages like C and throw out memory safety. But if you don't panic, you have to somehow propagate every array index into an analyzeable Result that higher layers have to handle. I fail to see the advantage of this over simply panic!ing instead.

In regard to panics in libraries

I've written a number of external libraries at this point, and I want to point out that I tend to use the panic handler as a means of verifying that my code has been written correctly. For example, I will often unwrap() types and document the reason for those unwraps to be safe. Based on correct usage, these should be impossible to trigger, etc., but I'm not convinced that rustc would be able to optimize all of the panics away in the end. Oftentimes, panics are used as a means of enforcing logical reasoning in the code. When changes are made such that those promises are broken, panics serve as a backup for catching those faults. If you try to force libraries to never panic, you can impose code structures on them that are much more complex and don't facilitate well-designed software (e.g. deeply nested conditionals, long dependency chains, difficult internal state management).

Summary

In summary, I honestly think things are not bad the way they are. Yes, I agree that panics definitely introduce some level of binary bloat due to formatting, and yes, accessing internal resources in the panic handler is oftentimes cumbersome. However, I don't think this is a reason that panicking shouldn't exist, as error handling is a problem that has been around since long before Rust. Ultimately, as @Dirbaio pointed out, there will always need to be some way of handling unexpected events in a system.

I think the right approach here is to address:

  1. How to eliminate bloat brought about by panics
  2. How to facilitate simple resource usage in panic handlers

While I agree that (1) is likely a problem that needs to be addressed at the language level, I am of the opinion that (2) should be entirely up to the developer, and we shouldn't impose any specific way of doing this on them. Writing panic handlers to get systems into a safe state when you have simple GPIO toggles to turn everything off actually turns a custom panic handler into a very reasonable, failure-free means of getting your device into a safe state.

mitchmindtree commented 3 years ago

Thanks for your insight @ryan-summers! I just want to add/get a bit of clarification on some points:

Moving panic!s to error handling

I cannot tell you how much of a relief it would be to know for certain that it simply wasn't possible to panic

I would argue this is a false sense of relief, as it just offloads the error handling to your code, but that's a bit besides the point ;)

My relief here would not come from a place of presuming I have less problems, but that I can at least compose the handling of these problems into the rest of my code and see where they can actually occur while I write my code. That is, offloading error handling to the code is partly the point. This would be quite a contrast to potentially panicking anywhere at any time, and trying to track down the source with only the panic handler and limited resources.

Compile-time bounds checking

I wonder if we can come up with some const-generic based approach to bounds checking for indexing and slicing of fixed size arrays that avoids the need for generating panicking branches.

Based on my understanding, this is exactly how rustc generates panics for indexing.

To clarify, I'm picturing an indexing/slicing solution whose bounds checking occurs at compile time, not runtime, and not relying on the optimiser to figure it out. I'm picturing const generics getting us a little closer to something like dependent types, though admittedly I think Rust is probably still along way off of being able to do compile time bounds checking in a way that is ergonomic as a language like Idris for example.

panic!s as logic verification

I tend to use the panic handler as a means of verifying that my code has been written correctly

Oftentimes, panics are used as a means of enforcing logical reasoning in the code

As a fellow library author, I do sympathize with this, and I think it's a really strong case for disregarding this proposal - especially as enforcing no panics may lead authors to find ways around the issue with unnecessary unsafe code as you mention.

In my opinion, the fact that it's the norm to use panic!s for these use cases is a real indicator of the limitations of the expressiveness of Rust's type system. When used for this reason, you can almost see panic!s as a signal of hmmm, looks like I've reached the limit of what I can express in Rust types here, I've no choice but to say "trust me" and hope my reasoning does actually hold up, and isn't broken by further refactoring, etc. But maybe this is obvious and is probably better discussed in the proposal for a design of some new dependently typed language targeted at embedded.

I wonder just how infeasible it would be to replace assert!(condition, "this condition must be met"); with something like assert(condition, "this condition must be met")?;. The function approach would be much more limited in terms of formatting, especially if the Error::Assert/Assert return type message was limited to &'static str - that said, it appears to be this formatting code that causes a lot of the bloat in panic handling in the first place.

Accessing resources in panic!

  1. How to facilitate simple resource usage in panic handlers

I especially like this suggestion. Having a convention where, "if a rust-embedded library exposes some resources through its API then it should also provide an example that clearly demonstrates how to access these resources during a panic", would go a long way - particularly if we're all going to settle for panicking code being inevitable. I imagine having these examples/tests would not only be useful for users, but also for encouraging authors to keep that case in mind during API design.


If it does seem unreasonable to avoid all panics, do you think it's worth considering a granular approach? I'm thinking of either:

  1. disabling panics by default and using something like a pervasive #[allow_panic], where to call a function with #[allow_panic] the parent function must also be #[allow_panic] or
  2. encourage #[no_panic] where possible in libraries

where the goal of both approaches would be to narrow down the surface area that must be reviewed for panicking code? Admittedly, I'm not sure of the feasibility of the 1st option or if it's possible with what's available from the language today.

hannobraun commented 3 years ago

How to facilitate simple resource usage in panic handlers

Just a thought: Wouldn't RTIC be in the perfect position to address this? If RTIC handled panics, then the application developer can provide a panic handler that works like any other RTIC task (i.e. specifies which resources it needs and gets them from RTIC).

cr1901 commented 3 years ago

As much as I disdain the bloat, I don't think getting rid of panics or even panic_immediate_abort is desirable.

Rather, I wish rustc was smart enough to recognize (even if only with LTO enabled, for now) where there is no output device to emit strings to, and get rid of all strings plus formatting code. Perhaps getting rid of most strings/formatting code could be combined with say, panic_persist.

tarcieri commented 3 years ago

If RTIC handled panics

How? Most Rust embedded projects I've seen implement very simple panic handlers that do something like specify an eh_personality lang item that goes into an infinite loop.

Unwinding in such contexts still seems to be an open issue

therealprof commented 3 years ago

I'm not quite sure why bound checks seem to be deemed the largest cause of undesirable panic causes since they're rather easy to avoid: You can use explicit length check or slice to a known size to get bounds check elision. Or you can use fallible accessors or any of the wide range of "functional" tools.

From where I'm standing, the use of unwrap is pretty much always a bug; you can document why this will not cause a panic as much as you want, in the end it's always up to the user (or co-developer or even yourself) to adhere to the invariants you've once defined and of course yourself to always be totally right about them...

tarcieri commented 3 years ago

Or you can use fallible accessors

The get and get_mut methods of the slice primitive are fantastic for this, especially since thanks to the SliceIndex trait they can be used to access individual elements or subsclices.

Subslice patterns are also great for this if they fit your use cases.

mitchmindtree commented 3 years ago

I'm not quite sure why bound checks seem to be deemed the largest cause of undesirable panic causes

I didn't mean to imply that it's difficult to work around them, more that indexing and slicing are rather common and difficult to search for compared to easily searchable keywords like unwrap, unimplemented, panic, etc. That said, clippy does have some lints that can help find them from memory.

You can use explicit length check or slice to a known size to get bounds check elision

I realise there are lots of ways to help the compiler to optimise out bounds checks - my interest in const generics stemmed from aiming for a type-level solution that is safe yet doesn't rely on panic branches to allow consistent error behaviour regardless of the optimisation level.

The get and get_mut methods of the slice primitive are fantastic for this

subslices

Agreed, these are both very useful

hannobraun commented 3 years ago

@tarcieri

If RTIC handled panics

How? Most Rust embedded projects I've seen implement very simple panic handlers that do something like specify an eh_personality lang item that goes into an infinite loop.

Unwinding in such contexts still seems to be an open issue

Please note that I was replying to a comment about making it easier to access resources in panic handlers. I don't think it makes sense for RTIC to be involved, if the panic handling is very simple.

I could imagine that RTIC could (optionally) generate the panic handler for me though, if I specify it:

#[task(binds = panic, resources = [motor])]
fn handle_panic(cx: handle_panic::Context) {
    cx.resources.motor.disable();
}

I haven't researched this, so it might not be practical. As I said, just a thought.

hannobraun commented 3 years ago

I did some research. This has been discussed before: https://github.com/rtic-rs/rfcs/issues/27

TeXitoi commented 3 years ago
#[task(binds = panic, resources = [motor])]
fn handle_panic(cx: handle_panic::Context) {
    cx.resources.motor.disable();
}

I haven't researched this, so it might not be practical. As I said, just a thought.

That's not possible because RTIC can't mask HardFault, and masking interuptions guaranties unique &mut borrow

korken89 commented 3 years ago

@hannobraun The panic handler binding in RTIC would indeed be possible, I can add it to a discussion point in the next RTIC meeting and see how it fits into the current development plan. :)

TeXitoi commented 3 years ago

Note that you can use https://crates.io/crates/panic-persist or something similar to check at boot if there was a panic, and stopping the motor at this moment, in a sane environment.

coolCucumber-cat commented 2 months ago

If you still have to handle HardFaults, why bother going through the extreme pain of trying to get rid of Rust panics?

Nirvana fallacy. I'm never going to wear a seatbelt because I could also be shot in the head. Just because it's impossible to prevent every panic, doesn't mean we can't stop the preventable ones.