statrs-dev / statrs

Statistical computation library for Rust
https://docs.rs/statrs/latest/statrs/
MIT License
578 stars 83 forks

API discussion #117

Closed troublescooter closed 3 years ago

troublescooter commented 3 years ago

From my point of view, some issues with the current API make implementing new distributions harder than it should be, make the library less suited to extensions of functionality, and make it more difficult to use than necessary. This issue is meant to start a discussion of these problems and to suggest changes.

Traits

Traits generally serve to increase the upfront investment into an API to reduce the running costs of implementation. The API of a library becomes more complex the more public traits it exposes. This complexity usually pays off by reducing code duplication in generic programming, but in order to do that the implementations of composed structures must be in some way derivable from the implementations of the substructures.

  1. The global properties of distributions Mean, Variance, Entropy and Skewness are not modular. There are very few ways to compose distributions that allow the mean, variance, ... of the composition to be derived from the means and variances of its component distributions. Even if this were desired, expressing such a composition in types is a challenge.
  2. The fragmentation of traits increases the overhead of implementing a new distribution. Implementors must pick the right traits and assemble them. The compiler cannot help determine whether anything is missing.
  3. It's not obvious to me what a user of the library gains from being able to distinguish whether a given property (a) doesn't exist in closed form (according to the current documentation, implementing the trait means that a closed form exists), (b) is known not to exist as a limit, or (c) has a closed-form solution that relies on special functions which simply have not been implemented yet. It's debatable what use it is to encode any definition of closed form in a trait whose method outputs T, and defining what counts as closed form is difficult in the first place. The current definition also precludes implementing the trait for empirical distributions or for arbitrary distributions constructed from a closure, which seems like a reasonable extension to support. In any case, users will likely want to determine the property numerically, whether or not a nice algebraic formula represents the number. What the user probably cares about most is this: does the library provide an output for this distribution that can reasonably be calculated with? If not, the user should be prevented from accidentally performing calculations with the output, but no complex control flow is involved, so Option<T> should suffice. Being able to use ? to short-circuit a calculation of the variance that depends on the mean is one small improvement gained. Result<T> seems of limited additional use compared to Option<T>, so I do not see the Checked versions as necessary.
  4. The output types of the traits are too general, which rules out boilerplate-saving default implementations. For example, a default implementation of std_dev could be provided if T were restricted to T: num_traits::float::Float.
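
Points 3 and 4 can be illustrated with a minimal, std-only sketch. Everything here is hypothetical: the trait name Stats1D, the second_moment helper and the two example types are not statrs API, and f64 is used in place of a generic T: Float to keep the example free of external crates. It only shows how Option plus ? lets the variance short-circuit on a missing mean, and how fixing the output to a float type allows a default std_dev.

```rust
// Hypothetical names throughout; none of this is statrs API.

/// One trait instead of separate Mean/Variance/... traits.
pub trait Stats1D {
    /// `None` if the mean does not exist or cannot be provided.
    fn mean(&self) -> Option<f64>;

    /// Raw second moment E[X^2]; `None` if no finite value can be given.
    fn second_moment(&self) -> Option<f64>;

    /// Default: Var[X] = E[X^2] - E[X]^2. The `?` operator returns `None`
    /// as soon as either ingredient is missing.
    fn variance(&self) -> Option<f64> {
        let mean = self.mean()?;
        let m2 = self.second_moment()?;
        Some(m2 - mean * mean)
    }

    /// Default made possible by fixing the output to a float type.
    fn std_dev(&self) -> Option<f64> {
        self.variance().map(f64::sqrt)
    }
}

/// Continuous uniform distribution on [a, b].
pub struct Uniform {
    pub a: f64,
    pub b: f64,
}

impl Stats1D for Uniform {
    fn mean(&self) -> Option<f64> {
        Some((self.a + self.b) / 2.0)
    }
    fn second_moment(&self) -> Option<f64> {
        let (a, b) = (self.a, self.b);
        Some((a * a + a * b + b * b) / 3.0)
    }
}

/// Standard Cauchy distribution: the mean is undefined and the second
/// moment is infinite, so neither is reported.
pub struct Cauchy;

impl Stats1D for Cauchy {
    fn mean(&self) -> Option<f64> {
        None
    }
    fn second_moment(&self) -> Option<f64> {
        None
    }
}
```

With this shape, Uniform { a: 0.0, b: 1.0 }.std_dev() yields a value without Uniform implementing std_dev itself, while Cauchy.std_dev() is None and cannot be fed into further arithmetic by accident.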

Putting the above together, my suggestions are to

  1. ... merge at least the traits Mean, Variance, Entropy and Skewness of the 1D distributions into a single trait, and make it a subtrait of Distribution. Remove the Checked* traits.
  2. ... make the return types of these 1D methods Option<T>, where T: num_traits::float::Float.
  3. ... treat the non-existence of a limit as None in the output of these methods.
  4. ... make this single trait an extension trait to Distribution, with default implementations that use Distribution::sample to produce a Monte Carlo estimate of the desired statistical property, or output None if the estimate shows no sign of converging after a given number of steps. An implementation for a concrete distribution will likely have more precise knowledge of the actual values and can override this numerical estimate. Higher moments can then easily be provided even without closed forms. Points 1.-4. summarised (simplified code):
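    // Assumes rand 0.8 and num-traits; STEPS is an arbitrary sample count.
    use num_traits::Float;
    use rand::distributions::Distribution;

    const STEPS: usize = 10_000;

    pub trait Bikeshed1D<T: Float>: Distribution<T> {
        // Default: Monte Carlo estimate; concrete distributions can override.
        fn mean(&self) -> Option<T> {
            let mut rng = rand::thread_rng();
            let mut sum = T::zero();
            for _ in 0..STEPS {
                sum = sum + self.sample(&mut rng);
            }
            // A convergence check returning None would go here.
            Some(sum / T::from(STEPS)?)
        }
        fn variance(&self) -> Option<T> {
            todo!()
        }
        fn std_dev(&self) -> Option<T> {
            self.variance().map(|var| var.sqrt())
        }
        fn entropy(&self) -> Option<T> {
            todo!()
        }
        fn skewness(&self) -> Option<T> {
            todo!()
        }
        fn moment(&self, i: usize) -> Option<T> {
            todo!()
        }
    }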

    This makes it possible to get all these statistical functions for free on struct GenericDistribution(Box<dyn Fn(f64) -> f64>) by implementing Distribution.

  5. ... analyse the multidimensional distributions separately. If algebraic operations such as forming vector-valued distributions from lower-dimensional ones are to be implemented, it would make sense to generalise the return type of the mean from the 1D case accordingly. The most general return type I can currently see being supported is matrix-valued output.
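
The extension-trait idea from suggestion 4 can also be tried out without external crates. The following is a rough sketch under several assumptions: Sample stands in for rand's Distribution<f64>, SplitMix64 is a small hand-rolled generator used only to keep the example deterministic, the closure is assumed to be an inverse CDF (the issue does not say what the closure represents), and the names Estimated1D, FromInverseCdf and UnitExponential are made up for illustration.

```rust
// Hypothetical, std-only sketch; none of these names are statrs or rand API.

/// Tiny deterministic PRNG (SplitMix64) standing in for a `rand` RNG.
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform float in [0, 1) built from the top 53 bits.
    pub fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Stand-in for `Distribution<f64>`: the only required method.
pub trait Sample {
    fn sample(&self, rng: &mut SplitMix64) -> f64;
}

/// Extension trait: the statistic has a sampling-based default.
pub trait Estimated1D: Sample {
    fn mean(&self) -> Option<f64> {
        const STEPS: u32 = 100_000; // arbitrary sample count
        let mut rng = SplitMix64(42); // fixed seed for reproducibility
        let mut sum = 0.0;
        for _ in 0..STEPS {
            sum += self.sample(&mut rng);
        }
        let estimate = sum / f64::from(STEPS);
        // Crude guard only; a real convergence test would be stricter.
        if estimate.is_finite() {
            Some(estimate)
        } else {
            None
        }
    }
}

/// Distribution defined by a closure, here assumed to be the inverse CDF.
pub struct FromInverseCdf(pub Box<dyn Fn(f64) -> f64>);

impl Sample for FromInverseCdf {
    fn sample(&self, rng: &mut SplitMix64) -> f64 {
        (self.0)(rng.next_f64())
    }
}

// All defaults are inherited; no statistic is written by hand.
impl Estimated1D for FromInverseCdf {}

/// A distribution with a known mean overrides the numerical estimate.
pub struct UnitExponential;

impl Sample for UnitExponential {
    fn sample(&self, rng: &mut SplitMix64) -> f64 {
        -(1.0 - rng.next_f64()).ln()
    }
}

impl Estimated1D for UnitExponential {
    fn mean(&self) -> Option<f64> {
        Some(1.0)
    }
}
```

For example, FromInverseCdf(Box::new(|u: f64| -(1.0 - u).ln())).mean() gives an estimate close to 1, whereas UnitExponential.mean() returns the exact value. The is_finite guard is only a placeholder: a heavy-tailed distribution can still produce a finite but meaningless sample mean, so the convergence criterion mentioned in the issue would need more thought.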

Other

troublescooter commented 3 years ago

Closed by #120 .