rust-ndarray / ndarray-stats

Statistical routines for ndarray
https://docs.rs/ndarray-stats
Apache License 2.0
201 stars 25 forks source link

Replace min/max, make quantile_mut return None for empty arrays, and fix docs #13

Closed jturner314 closed 6 years ago

jturner314 commented 6 years ago

This PR does a few things:

Edit: I'm uncomfortable with the histogram strategies panicking in so many cases, but we can fix that in a later version.

LukeMathWalker commented 6 years ago

I'll review it tomorrow morning.

On histogram's panics: should we change the signature to return a Result? We panic in those occasions where it is not possible to provide an answer that makes sense:

Returning a Result would probably give a chance to the user to specify a default that makes sense to them, especially if we provide a useful error.

jturner314 commented 6 years ago

For empty arrays, I think the right thing to do is to have Edges be an empty Vec (no bins).

For constant arrays, I think the right thing to do is have Edges be vec![the_constant_value] (no bins) or vec![the_constant_value, the_constant_value + 1] (one bin). Alternatively, if we change Edges to allow duplicate edges, we could use vec![the_constant_value; 2] (one bin).

For Freedman Diaconis with IQR=0, I'm not sure; that requires some thought. Panicking definitely isn't the right thing, because IMO panics shouldn't be a function of the data in the array. I think I'd return an Option/Result or have a fallback strategy. At first glance, the Option/Result appeals to me more.

By the way, in a future version, I'm thinking about replacing GridBuilder with a fit_array method on Grid and simplifying the BinsBuildingStrategy trait:

impl<A: Ord> Grid<A> {
    fn fit_array<S, B>(array: &ArrayBase<S, Ix2>, strategy: &B) -> Result<Self, Error>
    where
        S: Data<Elem = A>,
        B: BinsBuildingStrategy<A>,
    {
        // ...
    }
}

pub trait BinsBuildingStrategy<A: Ord> {
    fn fit_array<S>(&self, array: &ArrayBase<S, Ix1>) -> Result<Bins<A>, Error>
    where
        S: Data<Elem = A>;
}

The BinsBuildingStrategy types wouldn't contain the min/max/n_bins/etc. derived from a specific array; they'd contain whatever metadata would be necessary to calculate the Bins in fit_array. This would simplify handling of special cases and would simplify the public API a little. What do you think about this?

LukeMathWalker commented 6 years ago

For empty arrays, I think the right thing to do is to have Edges be an empty Vec (no bins).

It makes sense, I agree.

For constant arrays, I think the right thing to do is have Edges be vec![the_constant_value] (no bins) or vec![the_constant_value, the_constant_value + 1] (one bin). Alternatively, if we change Edges to allow duplicate edges, we could use vec![the_constant_value; 2] (one bin).

We could use a "default" width of 1, that could be the way. I'd avoid duplicate edges since they would still lead to empty bins and a degenerate grid (no observation would be captured), thus surprising the user.

By the way, in a future version, I'm thinking about replacing GridBuilder with a fit_array method on Grid and simplifying the BinsBuildingStrategy trait:

impl<A: Ord> Grid<A> {
    fn fit_array<S, B>(array: &ArrayBase<S, Ix2>, strategy: &B) -> Result<Self, Error>
    where
        S: Data<Elem = A>,
        B: BinsBuildingStrategy<A>,
    {
        // ...
    }
}

pub trait BinsBuildingStrategy<A: Ord> {
    fn fit_array<S>(&self, array: &ArrayBase<S, Ix1>) -> Result<Bins<A>, Error>
    where
        S: Data<Elem = A>;
}

The BinsBuildingStrategy types wouldn't contain the min/max/n_bins/etc. derived from a specific array; they'd contain whatever metadata would be necessary to calculate the Bins in fit_array. This would simplify handling of special cases and would simplify the public API a little. What do you think about this?

A fit_array method sounds good to me, as a shortcut on Grid to simplify the public API, but you still need extremes and a bin width to execute all the strategies we have implemented, hence I think the new strategy would end up quite similar to the ones we have right now. They would probably just squeeze everything in a single method.