statrs-dev / statrs

Statistical computation library for Rust
https://docs.rs/statrs/latest/statrs/
MIT License
603 stars 84 forks source link

Consider improving error handling and `StatsError` type #221

Closed FreezyLemon closed 2 months ago

FreezyLemon commented 7 months ago

enum StatsError is supposed to be an

Enumeration of possible errors thrown within the statrs library.

This indicates that it should not try to be a generic error type for any kind of statistics calculations, but instead only concern itself with the errors produced in statrs.

With that basic assumption, there are currently some inconsistencies and outdated practices though (API guidelines for reference):

I realize that most of these are breaking changes, but seeing that the crate is pre-1.0, I don't think there's a big problem doing this.

Other things that could be improved: StatsError is big: 40 bytes on a Linux x64 target. This is because there are variants which contain 2x &str (2 x 16 = 32 bytes plus discriminant and padding). Is it really necessary to have strings in the error type? The implementation could be replaced mostly 1:1 with some sort of ArgName enum, but there might be an even better solution that does not need this argument name wrapping at all.

All new functions seem to just return StatsError::BadParams whenever the params are outside of the defined/allowed range. Is there a good reason for these to be so vague when compared to the more specific errors returned by other functions? After all, the more specific errors already exist, why not return more exact error information? There might even be value in providing multiple error types, to have errors that are more specific to the exact API being used.

YeungOnion commented 7 months ago

I do see good reason for all of these and I'd be open to making changes for all but the infallible new not returning a result.

All other public structs defined in the distribution module have a new method implemented that returns Result so it does provide consistency. Perhaps if Empirical were in a different module or it's new had a different name?

FreezyLemon commented 7 months ago

All other public structs defined in the distribution module have a new method implemented that returns Result so it does provide consistency. Perhaps if Empirical were in a different module or it's new had a different name?

I see your point about consistency. I would personally value the expressiveness ("this call cannot fail") over the consistency ("all constructors return a Result and might need error handling"), but it doesn't matter much tbh.

Hmm I'm not sure about renaming the new function, it's a widespread naming convention in the Rust ecosystem and what most users would expect. Maybe a similar name like new_<something> so people can quickly find it in their IDEs.

YeungOnion commented 7 months ago

Perhaps an impl Default over having Empirical::new?

Regardless, the overall discussion you bring up on the error type is valid. I'd merge an effort that does any of

YeungOnion commented 5 months ago

During these changes, we should also work on removing support for degenerate distributions #102 and returning errors.

YeungOnion commented 4 months ago

Tried a different approach, even wrote out an example to test using it, but I didn't have a great motivating case, so I just made something up, but I wanted to see what it was like to actually handle those such errors.

YeungOnion commented 3 months ago

Maybe we should figure out what the errors should be able to convey first. Since @FreezyLemon posted on #242 I've realized I've not done enough of this bigger picture stuff for decisions. I think it's fair to narrow the discussion to errors from distribution construction since that's at a module scope and most of where errors are handled.

The value in #258 is to make the errors flat, and delegates identifying what we call "bad parameters" to the caller. This is quite reasonable, as the conditions for errors are statically specified (up to decisions about what constitutes degeneracy), but there is no program state in which parameter conditions change. However it means that users of the library may repeat their work among projects where there's some input-validation cycle if they want useful error messages.

The option I have in #247 is pretty dense, and it means all distributions' new have a different type for the Err variant. Would need quite a few imports just to do something that creates multiple distributions from different families. But it has the value of granularity.


A few lists to help organize my thoughts:

Assumptions

Possible targeted behaviors adjacent to error API

I'm very open to modifying these lists.

YeungOnion commented 3 months ago

I realize yet another option is to have the distribution error have too many variants for any one distribution, but use shared language location, scale, shape, etc. in the variants. Suppose this is called ParameterError. Then in each distribution's module, impl From<super::ParameterError> for the distribution's specific error or use some generics to impl Error with source converting to the right error.

This allows a shallow error, small step from unwrap for those writing analyses. And allows for deeper error which can rely more on handling programmatically. But I don't think this is a common idiom, the more I think about this, the more certain there's something easier I'm missing here and I don't think it's needed to replicate std::io's pattern of error handling.

FreezyLemon commented 3 months ago

the more I think about this, the more certain there's something easier I'm missing here and I don't think it's needed to replicate std::io's pattern of error handling.

Most usages of StatsError are currently for the return value of SomeDistribution::new(...). Since this function is implemented per-distribution and not to satisfy some trait, the simplest solution here would be creating a concrete error type for each and every distribution in case the construction fails. For example:

Let's turn

// beta.rs

#[derive(Copy, Clone, PartialEq, Debug)]
pub struct Beta {
    shape_a: f64,
    shape_b: f64,
}

impl Beta {
    pub fn new(shape_a: f64, shape_b: f64) -> Result<Beta> {
        if shape_a.is_nan()
            || shape_b.is_nan()
            || shape_a.is_infinite() && shape_b.is_infinite()
            || shape_a <= 0.0
            || shape_b <= 0.0
        {
            return Err(StatsError::BadParams);
        };
        Ok(Beta { shape_a, shape_b })
    }
}

into

#[derive(Copy, Clone, PartialEq, Debug)]
pub struct Beta {
    shape_a: f64,
    shape_b: f64,
}

#[derive(Copy, Clone, PartialEq, Eq, Debug)]
pub enum BetaError {
    ShapeANan,
    ShapeAInfinite,
    ShapeALessThanZero,
    ShapeBNan,
    ShapeBInfinite,
    ShapeBLessThanZero,
}

impl std::fmt::Display for BetaError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            BetaError::ShapeANan => write!(f, "Shape A is NaN"),
            // ...
        }
    }
}

impl std::error::Error for BetaError {}

impl Beta {
    pub fn new(shape_a: f64, shape_b: f64) -> Result<Beta, BetaError> {
        if shape_a.is_nan() {
            Err(BetaError::ShapeANan)
        } else if shape_a.is_infinite() {
            Err(BetaError::ShapeAInfinite)
        } else if shape_a <= 0.0 {
            Err(BetaError::ShapeALessThanZero)
        } else if shape_b.is_nan() {
            Err(BetaError::ShapeBNan)
        } else if shape_b.is_infinite() {
            Err(BetaError::ShapeBInfinite)
        } else if shape_b <= 0.0 {
            Err(BetaError::ShapeBLessThanZero)
        } else {
            Ok(Beta { shape_a, shape_b })
        }
    }
}

Maybe this example is too verbose (i.e. maybe BetaError::ShapeAInvalid would be enough), but it gets the point across.

This will return very specific errors, and users that don't care about it can still .unwrap() or use pattern matching to handle whichever errors. This will not impact the distributions' traits (because new is not a part of any trait) and should allow for dynamic dispatch.

If new methods of creating a distribution are added later, the related error type could be extended or a separate error type, specific to the method of construction, could be added.

It will need a fair bit of boiler plate, especially now that it needs to be implemented all at once, but is otherwise very simple to implement. A crate like thiserror could cut down on boilerplate, but is honestly not needed. The creation errors are simple and thiserror is/has a proc-macro which means it has a negative impact on compilation times.

I might be missing something here, but this is the obvious KISS solution IMHO

YeungOnion commented 3 months ago

Ah, it's true that the generic part can only come up once the type is correctly initialized.

I'm good with only specifying which parameter is invalid, so we can just have one variant per parameter, the one message denoting that one of its (potentially multiple) constraints was violated.

impl std::fmt::Display for BetaError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            BetaError::ShapeAInvalid => write!(f, "Shape A must be positive finite"),
            BetaError::ShapeBInvalid => write!(f, "Shape B must be positive finite"),
        }
    }
}

It will be a little boilerplatey, but it need not be flexible, sounds like a text editor problem.

FreezyLemon commented 3 months ago

Alright, I'll prepare another draft pr to compare against the rest if you don't mind

YeungOnion commented 3 months ago

Thanks for this, I'll keep an eye out.