nrc / error-docs

Documentation of Rust error handling
Creative Commons Attribution 4.0 International

Seeking Best Practices for Error and Warning Handling in Complex Rust Projects #15

Open simonsan opened 1 week ago

simonsan commented 1 week ago

Hi @nrc!

I’m reaching out to get some expert advice on the challenges we’re facing with error and warning handling in our Rust-based project, rustic_core. Our project is relatively complex, and we’re struggling to find the right balance between propagating errors (both soft and hard errors), handling warnings, and maintaining a good user experience with clear error messages.

Context of Our Problem

  1. Error Handling:

    • In our current setup, we primarily rely on returning Result<T, RusticError> to propagate errors. This is a kind of god-enum approach: we convert sub-errors into that one god error so we can hand it over at our API boundary. However, we often find ourselves in scenarios where multiple errors can occur (e.g., batch operations or validation of data collections), and handling only the first error results in lost context.
    • We are considering three primary options:
      1. Returning a single error (Result<T, RusticError>)
      2. Returning a list of errors (Result<T, Vec<RusticError>>)
      3. Returning nested Results (Result<Result<T, Vec<RusticSoftError>>, RusticHardError>)
    • We're also facing cases where we need to continue execution in the presence of some errors but fail fast in others.
  2. Warning Handling:

    • We need a consistent way to handle warnings. So far, we’ve identified three potential approaches:
      1. Logging warnings locally and not passing them back to the caller.
      2. Returning a boolean flag (is_warn) to indicate if warnings occurred.
      3. Returning a list of warnings to provide detailed information about all non-critical issues that the caller can process.
    • We’re trying to decide if warnings should be purely for operational visibility (handled via logging), or if the caller should be made aware of warnings explicitly.
  3. General Pain Points:

    • We struggle with missing contextual information in error messages, leaving the end-users without actionable guidance.
    • We want to include error codes or links to documentation in error messages for better guidance and debugging.
    • In scenarios like async operations, the logging and error handling become more difficult to manage, especially when errors are collected from multiple spawned tasks.
    • Finally, we want to reconsider how we handle warnings and errors over function boundaries, thinking we may need to simplify or keep more localized handling without propagating too much information upward.

Questions

  1. Error Propagation:

    • When should we prefer returning a single error (e.g., Result<T, RusticError>) vs. returning a list of errors (e.g., Result<T, Vec<RusticError>>)? Are there performance or architectural concerns that we should consider when deciding between these two approaches?
    • In complex async operations or batch processing, where multiple errors might occur, what would be the best way to handle error accumulation without losing key context? Is there a common pattern in Rust for handling this elegantly? Like spawning an error handling thread and communicating with it via a channel?
  2. Warnings:

    • When handling warnings, would you recommend keeping them local (i.e., logging only) or propagating them back to the caller? Under what circumstances is it better to pass warnings up vs. treating them as internal operational feedback?
    • How would you handle situations where a function should continue executing but may want to indicate that warnings occurred (e.g., via an is_warn boolean flag or a list of warnings)? What is the best approach here to maintain simplicity while giving the caller enough control over decision-making?
  3. Async/Concurrency:

    • In async tasks and concurrent operations, how do you typically manage error propagation and structured logging, especially when errors are collected from multiple spawned tasks? How can we ensure we get full visibility into errors without complicating error management?
  4. General Best Practices:

    • Are there any best practices or patterns you would recommend for error and warning handling that balance performance, code maintainability, and user experience in Rust-based systems?
    • How can we maintain a simple API for callers while ensuring we capture all relevant issues (both errors and warnings) during complex or long-running operations?
    • We also thought about a nested Result, where the outer Result can contain hard errors that lead to aborting the program, while the inner Result would contain a list of errors that came up during the processing of data collections. This is inspired by http://sled.rs/errors

We appreciate any guidance or patterns you’ve found useful in these situations!

nrc commented 1 day ago

@simonsan Hey, sorry for the delay in replying. I have been meaning to write a blog post about this and was hoping I could get that out and just point at it, but I haven't even started and, if I'm going to be honest, it's not going to get done very soon.

Anyway, my perspective on error handling has changed a little bit from when I wrote these docs and I should update them. Here are some notes:

Some specific answers (all of which are very much 'IMO'):

When should we prefer returning a single error (e.g., Result<T, RusticError>) vs. returning a list of errors (e.g., Result<T, Vec<RusticError>>)?

Always a single error. If you have multiple errors, they are probably not true errors in the error-handling sense of the term, but rather expected errors in user input, which should be handled as part of the 'happy path'.
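A minimal sketch of what this could look like, assuming a batch validation step. All names here (`validate_paths`, `BatchReport`, `Rejected`, `BatchError`, the `backend_online` flag) are hypothetical and not taken from rustic_core. Per-item problems in user input are collected into the returned report, while the `Err` side is reserved for a single unexpected failure.

```rust
use std::fmt;

/// Hypothetical per-item problem found in user input. It is ordinary output.
#[derive(Debug, PartialEq)]
struct Rejected {
    index: usize,
    reason: &'static str,
}

/// Hypothetical happy-path result: what was accepted and what was rejected.
#[derive(Debug, Default, PartialEq)]
struct BatchReport {
    accepted: Vec<String>,
    rejected: Vec<Rejected>,
}

/// Hypothetical error type, kept for the unexpected case only.
#[derive(Debug, PartialEq)]
enum BatchError {
    BackendUnavailable,
}

impl fmt::Display for BatchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            BatchError::BackendUnavailable => write!(f, "backend is unavailable"),
        }
    }
}

impl std::error::Error for BatchError {}

/// Validates every path. Invalid entries do not abort the batch; they are
/// recorded in the report. `backend_online` simulates an unexpected failure.
fn validate_paths(paths: &[&str], backend_online: bool) -> Result<BatchReport, BatchError> {
    if !backend_online {
        return Err(BatchError::BackendUnavailable);
    }
    let mut report = BatchReport::default();
    for (index, path) in paths.iter().enumerate() {
        if path.is_empty() {
            report.rejected.push(Rejected { index, reason: "path is empty" });
        } else if path.starts_with('/') {
            report.rejected.push(Rejected { index, reason: "path must be relative" });
        } else {
            report.accepted.push((*path).to_string());
        }
    }
    Ok(report)
}
```

With this shape the caller still sees every rejected entry, and the function signature stays a plain `Result<T, E>` with one error.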

In complex async operations or batch processing, where multiple errors might occur, what would be the best way to handle error accumulation without losing key context?

Basically avoid this at all costs. Handle the error close to where it occurred so you don't need to propagate it. Treat errors as a form of the regular output where appropriate. If it's a library crate, let the user handle this; the API should just look like single async functions which might error in a simple way. If you've got complex concurrent futures stuff going on, that is a smell that the library is doing too much orchestration.
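One way to read this, sketched under some assumptions: the library exposes a single operation that fails in a simple way, and the caller does the fan-out and keeps each result next to the item that produced it. The names `check_pack`, `check_all` and `CheckError` are hypothetical, the failure condition is simulated, and scoped threads from the standard library stand in for an async runtime so the example stays self-contained.

```rust
use std::thread;

/// Hypothetical error for a single operation.
#[derive(Debug, PartialEq)]
struct CheckError {
    pack_id: u32,
    reason: &'static str,
}

/// Library side: one operation, one simple error.
/// Odd ids fail here purely to simulate a problem.
fn check_pack(pack_id: u32) -> Result<u64, CheckError> {
    if pack_id % 2 == 0 {
        Ok(u64::from(pack_id) * 100)
    } else {
        Err(CheckError { pack_id, reason: "checksum mismatch" })
    }
}

/// Caller side: run the checks concurrently and keep each result paired with
/// its input, so nothing has to be accumulated and propagated upward.
fn check_all(pack_ids: &[u32]) -> Vec<(u32, Result<u64, CheckError>)> {
    thread::scope(|scope| {
        // Spawn one scoped thread per pack and remember which id it belongs to.
        let handles: Vec<_> = pack_ids
            .iter()
            .map(|&id| (id, scope.spawn(move || check_pack(id))))
            .collect();
        // Join in spawn order; the output order matches the input order.
        handles
            .into_iter()
            .map(|(id, handle)| {
                let result = handle.join().expect("check_pack does not panic in this sketch");
                (id, result)
            })
            .collect()
    })
}
```

The per-item `Result` values are regular output here: the caller can log them, retry individual items, or summarize them, and the library never needs a `Vec` of errors in its signature.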

When handling warnings, would you recommend keeping them local (i.e., logging only) or propagating them back to the caller?

In an app, process them locally or treat them as part of the 'happy path' code rather than as errors. In a library, just the latter.

How would you handle situations where a function should continue executing but may want to indicate that warnings occurred (e.g., via an is_warn boolean flag or a list of warnings)?

Warnings should be accumulated somewhere and returned as part of the normal execution flow, not treated as errors.
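A minimal sketch of that idea, assuming a function that sums file sizes and can skip entries it cannot measure. `WithWarnings`, `Warning` and `total_size` are hypothetical names chosen for illustration. The warnings travel alongside the value in the normal return type, so no `Result` is involved at all.

```rust
/// Hypothetical non-critical issue the caller may want to know about.
#[derive(Debug, PartialEq)]
enum Warning {
    SizeUnknown { path: String },
}

/// Hypothetical wrapper: the normal result plus any warnings raised on the way.
#[derive(Debug)]
struct WithWarnings<T> {
    value: T,
    warnings: Vec<Warning>,
}

/// Sums the known sizes. Entries without a size are skipped and reported
/// as warnings instead of stopping the computation.
fn total_size(entries: &[(&str, Option<u64>)]) -> WithWarnings<u64> {
    let mut total: u64 = 0;
    let mut warnings = Vec::new();
    for (path, size) in entries {
        match size {
            Some(bytes) => total += *bytes,
            None => warnings.push(Warning::SizeUnknown { path: (*path).to_string() }),
        }
    }
    WithWarnings { value: total, warnings }
}
```

A caller that only needs an `is_warn` style flag can check `warnings.is_empty()`, while a caller that wants details can inspect or log the list, so the list subsumes the boolean.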

In async tasks and concurrent operations, how do you typically manage error propagation and structured logging, especially when errors are collected from multiple spawned tasks? How can we ensure we get full visibility into errors without complicating error management?

This is very hard! Let me know if you figure it out :-) Especially for a library rather than an app.

We also thought about a nested Result where the outer Result can contain hard errors that lead to aborting the program.

I would avoid over-engineering your error types. Keep it simple and keep error types just for unexpected errors.

Again, this is just my PoV and it is a rather opinionated one (some would call it extreme). Reasonable people may disagree, and the specifics of a project take priority over general principles. However, I think this is a good starting point.

alilleybrinker commented 1 day ago

I'll chime in to say that I generally agree with Nick's perspective here. Errors are for things that get passed up the call chain 80% of the time or more. Recoverability for errors is not common, and when it's needed you generally would create a small error type indicating the recoverable cases. The rest of the time, in a binary crate, just use anyhow.
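As a rough illustration of a small error type for the recoverable cases, here is a sketch with hypothetical names (`LockError`, `try_lock`, `lock_with_retry`) and simulated failure conditions. `Box<dyn Error>` is used as a std-only stand-in for the catch-all error; in a binary crate, anyhow would fill that role.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical small error type: only the cases a caller might react to.
#[derive(Debug, PartialEq)]
enum LockError {
    Busy,
    Stale,
}

impl fmt::Display for LockError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LockError::Busy => write!(f, "lock is busy"),
            LockError::Stale => write!(f, "lock is stale"),
        }
    }
}

impl Error for LockError {}

/// Simulated operation: old locks are stale, and the lock stays busy
/// for the first two attempts.
fn try_lock(attempt: u32, lock_age_secs: u64) -> Result<(), LockError> {
    if lock_age_secs > 3600 {
        Err(LockError::Stale)
    } else if attempt < 3 {
        Err(LockError::Busy)
    } else {
        Ok(())
    }
}

/// Recovers from `Busy` by retrying; everything else is passed up the call
/// chain as an opaque error. Returns the attempt number that succeeded.
fn lock_with_retry(max_attempts: u32, lock_age_secs: u64) -> Result<u32, Box<dyn Error>> {
    for attempt in 1..=max_attempts {
        match try_lock(attempt, lock_age_secs) {
            Ok(()) => return Ok(attempt),
            Err(LockError::Busy) => continue,
            Err(other) => return Err(other.into()),
        }
    }
    Err(format!("lock still busy after {max_attempts} attempts").into())
}
```

Only the code that actually recovers needs to match on `LockError`; everything above it just forwards the opaque error.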

I maintain a crate called woah which is intended to be an ergonomic version of Result<Result<T, LocalErr>, FatalErr> as a single enum, but unfortunately the relevant trait, Try, is not stable (and likely won't be stable soon), so while it's ergonomic on nightly Rust builds, it's not very easy to use on stable. You can use it on stable, but you can't apply the ? operator to it.
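For comparison, the plain nested shape already works with `?` on stable Rust, because `?` acts on the outer `Result` only. The sketch below uses just the standard library and does not show woah's API; `LocalErr`, `FatalErr`, `read_entry` and `sum_entries` are hypothetical names, and a missing entry (`None`) simulates the fatal case.

```rust
/// Hypothetical local error: the caller can skip the entry and carry on.
#[derive(Debug, PartialEq)]
enum LocalErr {
    Malformed(String),
}

/// Hypothetical fatal error: processing should stop.
#[derive(Debug, PartialEq)]
enum FatalErr {
    StorageGone,
}

/// Outer `Result` carries the fatal case, inner `Result` the local one.
fn read_entry(raw: Option<&str>) -> Result<Result<u32, LocalErr>, FatalErr> {
    // `?` returns early with the fatal error when the entry is missing.
    let raw = raw.ok_or(FatalErr::StorageGone)?;
    Ok(raw
        .trim()
        .parse::<u32>()
        .map_err(|_| LocalErr::Malformed(raw.to_string())))
}

/// Sums all readable entries. Local errors are collected, fatal ones abort.
fn sum_entries(raws: &[Option<&str>]) -> Result<(u32, Vec<LocalErr>), FatalErr> {
    let mut sum: u32 = 0;
    let mut skipped = Vec::new();
    for raw in raws {
        // `?` peels off the outer (fatal) layer; the inner layer is matched.
        match read_entry(*raw)? {
            Ok(n) => sum += n,
            Err(local) => skipped.push(local),
        }
    }
    Ok((sum, skipped))
}
```

The cost of this shape is the explicit `match` on the inner layer at every call site, which is the ergonomic gap a dedicated enum with `Try` support would close.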