statrs-dev / statrs

Statistical computation library for Rust
https://docs.rs/statrs/latest/statrs/
MIT License
603 stars 84 forks

Traits in `statistics::traits` should not require return type be `Option` #295

Open YeungOnion opened 2 months ago

YeungOnion commented 2 months ago

Many distributions have certain summary statistics regardless of their parameters, so Option<T> should not be required. For example, it is not semantically accurate to require unwrapping the mean of a binomial: regardless of the success probability, a valid Binomial type will always have a mean. By contrast, Student's t distribution has a mean only with more than one degree of freedom, and the Cauchy distribution's tails are too heavy for it ever to have a mean.

proposed solution

In lieu of generics, we can use associated types for the statistics::traits::Distribution trait[^renameDistribution]. I don't believe this would constrain us in terms of generic numerics in the future either, as long as the associated type can support the generic as well, e.g.

impl<T: Float> Distribution for DistributionType<T> {
    type Mu = T;   // or Vector<T, Dimension, ...>
    type Var = T;  // or TensorCube<T, Const<2>, Dimension, ...>
    type Skew = T; // or TensorCube<T, Const<3>, Dimension, ...>
    type Kurt = T; // or TensorCube<T, Const<4>, Dimension, ...>
}

Sample code of the traits I propose adding.

pub trait Moments<T: Float> {
    type Mu;
    type Var: Covariance<T>;
    type Kurt;
    type Skew;
    fn mean(&self) -> Self::Mu;
    fn variance(&self) -> Self::Var;
    fn std_dev(&self) -> <Self::Var as Covariance<T>>::V;
    fn excess_kurtosis(&self) -> Self::Kurt;
    fn skewness(&self) -> Self::Skew;
}

pub trait Covariance<T: Float> {
    type M;
    type V;
    fn dense(&self) -> Self::M;
    fn sparse(&self) -> Self::V;
    fn forward(&self, other: Self::V) -> Self::V;
    fn inverse(&self, other: Self::V) -> Self::V;
    fn determinant(&self) -> T;
}

pub trait Entropy<T: Float> {
    fn entropy(&self) -> T;
}

[^renameDistribution]: I also think renaming it could be helpful, though I'm unsure what name is better. Since most of these statistics are moments or central moments, we could have a Moments trait and give entropy its own trait, as entropy is always scalar. I think much of the current design comes from reference implementations in Math.NET's interfaces, since that's how this project started.

YeungOnion commented 2 months ago

I'll also say that I'm not tied to this implementation, but I stand by the issue title: I don't think it makes sense to return Option for all of them.

FreezyLemon commented 2 months ago

I think your proposal is good. It should be a clear benefit to avoid returning Option if it's statically known to not be needed. Some unordered thoughts:

[^1]: Excluding std_dev.

[^2]: Another thing: The default implementations have function documentation following the pattern "Returns X, if it exists". But we know that all of these return None unless overridden by a manual implementation (which should have its own, overriding docs), so this text almost always actually means "Returns None".
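To make the point about default implementations concrete, here is a minimal sketch of the pattern being described. `OptionalStats` is a hypothetical stand-in, not the actual statrs trait, and the two structs are illustrative only.

```rust
/// Hypothetical stand-in for the current trait shape: every statistic is
/// optional, and each default implementation returns `None`.
pub trait OptionalStats {
    /// Returns the mean, if it exists.
    fn mean(&self) -> Option<f64> {
        None
    }

    /// Returns the variance, if it exists.
    fn variance(&self) -> Option<f64> {
        None
    }
}

/// Cauchy has neither mean nor variance, so the defaults are kept.
pub struct Cauchy;

impl OptionalStats for Cauchy {}

/// Binomial with `n` trials and success probability `p`.
pub struct Binomial {
    pub n: u64,
    pub p: f64,
}

impl OptionalStats for Binomial {
    /// Always `Some(n * p)`, yet callers still have to unwrap it.
    fn mean(&self) -> Option<f64> {
        Some(self.n as f64 * self.p)
    }

    /// Always `Some(n * p * (1 - p))`.
    fn variance(&self) -> Option<f64> {
        Some(self.n as f64 * self.p * (1.0 - self.p))
    }
}
```

In this sketch the doc comment on the default method only ever describes the `None` case, which is the mismatch the footnote points out.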

YeungOnion commented 2 months ago

I hadn't thought of this one, but the API I designed doesn't work well for fallible types.

I can't impl<T> Covariance<Option<T>> for Option<T> separately from impl<T> Covariance<T> for T because of the standard deviation[^impl_cov_for_option_t]. Standard deviation could be its own trait, or simply be implemented on the types where it is useful[^1]; then I could drop the "note to implementors" on the std_dev method. I've quoted what I came up with below.

[^impl_cov_for_option_t]: I don't think it would be good semantics to do so, even if it is more ergonomic. Calling covariance methods inside map on Option<T: Covariance> types would be better than Option<f64>::default().forward(1).

[^1]: At the moment, this is my preference.
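The alternative mentioned in the first footnote, calling covariance methods inside `map` instead of implementing the trait for `Option<T>`, might look roughly like the sketch below. `ScalarCovariance` is a hypothetical single-method reduction of the proposed `Covariance` trait, and the meaning given to `forward` here (scaling by the standard deviation) is an assumption made for illustration. `students_t_variance` and `scaled` are likewise hypothetical helpers.

```rust
/// Hypothetical, single-method reduction of the proposed `Covariance` trait.
pub trait ScalarCovariance {
    /// Assumed semantics: scale a standardized value by the standard deviation.
    fn forward(&self, z: f64) -> f64;
}

/// Treats an `f64` as a scalar variance.
impl ScalarCovariance for f64 {
    fn forward(&self, z: f64) -> f64 {
        self.sqrt() * z
    }
}

/// Finite variance of a standard Student's t distribution, which exists
/// only for more than two degrees of freedom (hypothetical helper).
pub fn students_t_variance(dof: f64) -> Option<f64> {
    if dof > 2.0 {
        Some(dof / (dof - 2.0))
    } else {
        None
    }
}

/// The fallible case is handled by the caller through `Option::map`,
/// so the trait never needs an impl for `Option<f64>`.
pub fn scaled(dof: f64, z: f64) -> Option<f64> {
    students_t_variance(dof).map(|var| var.forward(z))
}
```

The trait stays implemented only on the infallible type, and the `Option` is visible at the call site.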

Note for implementors

Associated types should capture the semantics of the distribution, e.g. [Option] should be used where a moment is defined or undefined depending on parameter values. [()] is used for moments that are never defined.
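As a rough illustration of that note, the sketch below uses a hypothetical mean-only trait, `Mean`, rather than the full `Moments` trait proposed above. The three structs are illustrative, not the statrs types.

```rust
/// Hypothetical mean-only reduction of the proposed `Moments` trait.
pub trait Mean {
    type Mu;
    fn mean(&self) -> Self::Mu;
}

/// Binomial with `n` trials and success probability `p`.
pub struct Binomial {
    pub n: u64,
    pub p: f64,
}

/// Standard Student's t distribution (location 0, scale 1).
pub struct StudentsT {
    pub dof: f64,
}

/// Standard Cauchy distribution.
pub struct Cauchy;

impl Mean for Binomial {
    // The mean exists for every valid parameter choice.
    type Mu = f64;
    fn mean(&self) -> Self::Mu {
        self.n as f64 * self.p
    }
}

impl Mean for StudentsT {
    // The mean exists only for more than one degree of freedom.
    type Mu = Option<f64>;
    fn mean(&self) -> Self::Mu {
        if self.dof > 1.0 {
            Some(0.0)
        } else {
            None
        }
    }
}

impl Mean for Cauchy {
    // The mean is never defined, so there is nothing to return.
    type Mu = ();
    fn mean(&self) -> Self::Mu {}
}
```

With this shape, `Binomial { n: 10, p: 0.5 }.mean()` is a plain `f64` with nothing to unwrap, while the Student's t case still forces the caller to handle `None`.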

FreezyLemon commented 2 months ago

I can't impl<T> Covariance<Option<T>> for Option<T> separately from impl<T> Covariance<T> for T because of the standard deviation.

I'm not sure I understand this point. Can you elaborate or maybe give an example? It seems possible, but I'm not sure I understand your proposal correctly.

EDIT: I just saw #304. I'll take a look and see if I can't figure it out

YeungOnion commented 2 months ago

To couple the return types of variance and std_dev, I have type Var: Covariance<T>, and the return type of std_dev is <Self::Var as Covariance<T>>::V.

Perhaps variance should instead return the type that dense returns, Covariance::M. The other option is not to constrain the type relationship between variance and std_dev at all, either by removing std_dev or by adding another associated type (which I think is unnecessarily flexible).
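A stripped-down sketch of that coupling, under some simplifying assumptions: the generic parameter `T` is dropped, `Covariance` is reduced to a single hypothetical method `root`, and `std_dev` gets a default body that the proposal above does not have. `Exponential` is an illustrative struct, not the statrs type.

```rust
/// Hypothetical reduction of the proposed `Covariance` trait: the variance
/// type decides what the standard deviation type is.
pub trait Covariance {
    /// Type of the standard deviation derived from this variance.
    type V;
    /// Hypothetical method producing the standard deviation.
    fn root(&self) -> Self::V;
}

impl Covariance for f64 {
    type V = f64;
    fn root(&self) -> f64 {
        self.sqrt()
    }
}

pub trait Moments {
    type Var: Covariance;

    fn variance(&self) -> Self::Var;

    /// The return type is tied to `Var`, so implementors cannot pick it freely.
    fn std_dev(&self) -> <Self::Var as Covariance>::V {
        self.variance().root()
    }
}

/// Exponential distribution with the given rate.
pub struct Exponential {
    pub rate: f64,
}

impl Moments for Exponential {
    type Var = f64;
    fn variance(&self) -> f64 {
        1.0 / (self.rate * self.rate)
    }
}
```

Here choosing `Var = f64` fixes the type of `std_dev` as well. A fallible variance would need its own `Covariance` impl for `Option<f64>`, which is where the difficulty described above shows up.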

EDIT: I'm on mobile, but within 24h I can push the WIP commit. It won't compile, but it's where I first noticed the problem, while implementing moments for the F-distribution.