Histogram (revisited) - Githubissues

LukeMathWalker commented 6 years ago

Based on our discussion in #8, I have revised the implementation.

It needs a testing suite and a couple more helper methods on HistogramCounts but the skeleton it's there. Let me know your thoughts @jturner314

jturner314 commented 6 years ago

I think this is a good approach.

A few thoughts:

Edges needs to be careful about repeated elements. There are two possible strategies:
1. Edges can make sure all of the elements are unique, in addition to being in order. The constructor could do this by removing duplicates or by returning an error when there are duplicates. This would make the rest of the implementation simpler but would be less convenient for the user.
2. Alternatively, it needs to be careful about identifying the correct bin. For example, the current implementation of .indexes() doesn't correctly handle repeated edges, since in the case of repeated elements, .binary_search() can return the index of any of those elements.
  
  Assuming this strategy isn't much more difficult to implement, I prefer it (to make things simple for the user). It's also worth noting that NumPy uses this strategy.
I'd recommend removing the ndim member from HistogramCounts and instead adding a .ndim() method that calls self.counts.ndim(). (This means there are fewer things to keep in-sync.)

Edit: In the .ndim() method, you could also have a line debug_assert_eq!(counts.ndim(), bins.len());. (Even though this probably isn't necessary, I like using debug assertions like this because they're zero-cost in release mode and may catch bugs as things change in the future.)
In the future, it would be worth adding the dimension of counts to the type of HistogramCounts instead of always using ArrayD. This isn't necessary for a first implementation, though.
In #8, I suggested separate HistogramCounts and HistogramDensity types. On further reflection, I don't think a HistogramDensity type is worth adding. Instead, I'd suggest adding a .density() method on HistogramCounts that returns an array of densities. This does mean that if all you want is density, there's a conversion cost from counts -> densities once all the data have been counted, but I think the simplicity is worth the cost (and this approach may in-fact be cheaper anyway). (This also means that we can rename HistogramCounts to just Histogram.)

LukeMathWalker commented 6 years ago

I agree on almost all points - I think it makes sense for Edges to drop duplicates in the constructor (without panicking or raising an error): we pay a little more upfront (O(n) where n is the number of bins) but we benefit from faster add_observation calls, which is where the bulk of the computation will be happening.

I'll work on the other points and then I think we can merge.

jturner314 commented 6 years ago

we benefit from faster add_observation calls

I'm not sure the difference would be significant. I think all we'd need to do is change

Ok(i) => Some((i, i + 1)),

to

Ok(mut i) => {
    while i + 1 < n_edges && self.edges[i] == self.edges[i + 1] {
        i += 1;
    }
    Some((i, i + 1))
}

which, if all the elements are unique, requires only two extra comparisons in this branch (and this branch is executed only when the value equals an edge).

That said, I'm fine with requiring unique edges for the time being. We can benchmark later to see if it really matters.

LukeMathWalker commented 6 years ago

I agree that the performance gain is probably minimal, given that it is only triggered when we are adding an edge as an observation, but I don't see any real benefit from a user perspective in keeping duplicated edges around.

I was trying to write a test on add_observation to assert that the Histogram::counts matrix was equal to array![[0, 1], [0, 0]] but the compiler of course complained that one is an instance of ArrayD and the other one is an instance of Array2. Is there any way to convert a fixed-dimension array to a dynamically sized array in ndarray? I tried to have a look in the docs, but I couldn't find anything related to it.

With respect to this PR:

What strategy would you suggest to add a dimension parameter to Histogram?
Would it make sense to introduce a Grid<A> struct to encapsulate Vec<Bins<A>> and provide a bunch of helper methods?

jturner314 commented 6 years ago

I don't see any real benefit from a user perspective in keeping duplicated edges around.

My primary concern is a user writing code that assumes that the number of edges in the histogram is the same as the number of edges they provided.

I agree that deduplicating edges in the Edges constructor is reasonable behavior, though. We just need to make it clear in the docs that this is what we're doing so that the user doesn't assume the number of edges is unchanged.

Is there any way to convert a fixed-dimension array to a dynamically sized array in ndarray?

Is .into_dyn() what you're looking for?

What strategy would you suggest to add a dimension parameter to Histogram?

I'm thinking something along the lines of this:

pub struct Histogram<A: Ord, D: Dimension> {
    counts: Array<usize, D>,
    bins: Vec<Bins<A>>,
}

impl<A: Ord, D: Dimension> Histogram<A, D> {
    /// **Panics** if the length of `bins` does not match `D`.
    pub fn new(bins: Vec<Bins<A>>) -> Self {
        let mut shape = D::zero_index_with_ndim(bins.len());
        bins.iter().zip(shape.slice_mut()).for_each(|b, s| *s = b.len());
        let counts = Array::zeros(shape);
        Histogram { counts, bins }
    }
}

pub trait HistogramExt<A, S>
where
    S: Data<Elem = A>,
{
    /// **Panics** if `d` is different from `bins.len()` or if it does not match `D`.
    fn histogram<D>(&self, bins: Vec<Bins<A>>) -> Histogram<A, D>
    where
        A: Ord;
}

The implementation for new uses some undocumented methods of the Dimension trait. I don't see any good way to do it otherwise; I can modify ndarray to expose these.

Usage looks like this:

// dynamic dimension
let hist = arr.histogram::<IxDyn>(bins);
// or, if we define type aliases,
let hist: HistogramD<_> = arr.histogram(bins);
// fixed dimension; panics if `bins.len() != 2` or if `arr.len_of(Axis(1)) != 2`
let hist = arr.histogram::<Ix2>(bins);

Would it make sense to introduce a Grid<A> struct to encapsulate Vec<Bins<A>> and provide a bunch of helper methods?

I think so. For example, as a user, it would be nice to be able to easily figure out the bin corresponding to a data point (using a method on Grid), and then look up the corresponding count. We could just place all these methods on Histogram instead of a separate Grid type, but I could see there being cases where the user would want to decompose the Histogram into separate Grid and Array pieces.

LukeMathWalker commented 6 years ago

I have added strategies to infer the optimal bin edges as well as a builder to get a grid using those strategies on each coordinate axis (right now it is constrained to use the same strategy across all axes, but we can make it more versatile in the future).

I have documented all methods and added key docs example where needed.

I think we are good to go to merge it as a first implementation @jturner314.

LukeMathWalker commented 6 years ago

I managed to address all comments (and I fixed the bug you spotted) - they were all on point, thanks for taking the time to go through the PR with that level of attention!

I have also reduced the number of types parameter on BinsBuildingStrategy from 2 to 1 using the strategy you suggested (associated type on trait).

With respect to the issue of bin boundaries I proceeded as follows: all strategies now provide an optimal bin width (either directly through the rule they specify or indirectly recasting the optimal number of bins to an optimal bin width) and we make sure that the last edge is strictly greater than the maximum of the array that has been passed to the builder. This basically means that we might add an extra bin to the right if it is required to account for the maximum value, but we don't modify the grid parameters to achieve it. What do you think?

jturner314 commented 6 years ago

I managed to address all comments (and I fixed the bug you spotted) - they were all on point, thanks for taking the time to go through the PR with that level of attention!

You're welcome. Thanks for working on this! Would you like me to review the updated version, or do you think it's ready to merge?

This basically means that we might add an extra bin to the right if it is required to account for the maximum value, but we don't modify the grid parameters to achieve it. What do you think?

I think that's fine.

For future versions, I think it would be worth investigating how Julia's StatsBase does things, because StatsBase is similar to our implementation in using all half-open intervals (unlike NumPy, which considers the last bin to be a closed interval).

LukeMathWalker commented 6 years ago

Let me add a couple of tests and then I think we are good to go!

rust-ndarray / ndarray-stats

Histogram (revisited) #9