Replace min/max, make quantile_mut return None for empty arrays, and fix docs

jturner314 commented 6 years ago

This PR does a few things:

Replace the old min/max methods with the old min_partialord/max_partialord. (Rename min_partialord/max_partialord to min/max.) See this comment for more explanation. The primary issue with the old min/max methods is that they panicked on empty arrays.
Make quantile_mut return None for empty arrays. It's easy to forget about the possibility of empty arrays, so it's important to remind the user of the possibility and force them to handle it. See this comment for more justification.
Fix the conditions for panicking in the docs of histogram strategies and quantile_mut.

Edit: I'm uncomfortable with the histogram strategies panicking in so many cases, but we can fix that in a later version.

LukeMathWalker commented 6 years ago

I'll review it tomorrow morning.

On histogram's panics: should we change the signature to return a Result? We panic in those occasions where it is not possible to provide an answer that makes sense:

what is the optimal bin width for a constant/empty array?
how do I handle degenerate cases (e.g. IQR=0 in Freedman Diaconis)?

Returning a Result would probably give a chance to the user to specify a default that makes sense to them, especially if we provide a useful error.

jturner314 commented 6 years ago

For empty arrays, I think the right thing to do is to have Edges be an empty Vec (no bins).

For constant arrays, I think the right thing to do is have Edges be vec![the_constant_value] (no bins) or vec![the_constant_value, the_constant_value + 1] (one bin). Alternatively, if we change Edges to allow duplicate edges, we could use vec![the_constant_value; 2] (one bin).

For Freedman Diaconis with IQR=0, I'm not sure; that requires some thought. Panicking definitely isn't the right thing, because IMO panics shouldn't be a function of the data in the array. I think I'd return an Option/Result or have a fallback strategy. At first glance, the Option/Result appeals to me more.

By the way, in a future version, I'm thinking about replacing GridBuilder with a fit_array method on Grid and simplifying the BinsBuildingStrategy trait:

impl<A: Ord> Grid<A> {
    fn fit_array<S, B>(array: &ArrayBase<S, Ix2>, strategy: &B) -> Result<Self, Error>
    where
        S: Data<Elem = A>,
        B: BinsBuildingStrategy<A>,
    {
        // ...
    }
}

pub trait BinsBuildingStrategy<A: Ord> {
    fn fit_array<S>(&self, array: &ArrayBase<S, Ix1>) -> Result<Bins<A>, Error>
    where
        S: Data<Elem = A>;
}

The BinsBuildingStrategy types wouldn't contain the min/max/n_bins/etc. derived from a specific array; they'd contain whatever metadata would be necessary to calculate the Bins in fit_array. This would simplify handling of special cases and would simplify the public API a little. What do you think about this?

LukeMathWalker commented 6 years ago

For empty arrays, I think the right thing to do is to have Edges be an empty Vec (no bins).

It makes sense, I agree.

For constant arrays, I think the right thing to do is have Edges be vec![the_constant_value] (no bins) or vec![the_constant_value, the_constant_value + 1] (one bin). Alternatively, if we change Edges to allow duplicate edges, we could use vec![the_constant_value; 2] (one bin).

We could use a "default" width of 1, that could be the way. I'd avoid duplicate edges since they would still lead to empty bins and a degenerate grid (no observation would be captured), thus surprising the user.

By the way, in a future version, I'm thinking about replacing GridBuilder with a fit_array method on Grid and simplifying the BinsBuildingStrategy trait:
impl<A: Ord> Grid<A> {
    fn fit_array<S, B>(array: &ArrayBase<S, Ix2>, strategy: &B) -> Result<Self, Error>
    where
        S: Data<Elem = A>,
        B: BinsBuildingStrategy<A>,
    {
        // ...
    }
}

pub trait BinsBuildingStrategy<A: Ord> {
    fn fit_array<S>(&self, array: &ArrayBase<S, Ix1>) -> Result<Bins<A>, Error>
    where
        S: Data<Elem = A>;
}
The BinsBuildingStrategy types wouldn't contain the min/max/n_bins/etc. derived from a specific array; they'd contain whatever metadata would be necessary to calculate the Bins in fit_array. This would simplify handling of special cases and would simplify the public API a little. What do you think about this?

A fit_array method sounds good to me, as a shortcut on Grid to simplify the public API, but you still need extremes and a bin width to execute all the strategies we have implemented, hence I think the new strategy would end up quite similar to the ones we have right now. They would probably just squeeze everything in a single method.

rust-ndarray / ndarray-stats

Replace min/max, make quantile_mut return None for empty arrays, and fix docs #13