I think this is a good approach.
A few thoughts:
`Edges` needs to be careful about repeated elements. There are two possible strategies:

1. `Edges` can make sure all of the elements are unique, in addition to being in order. The constructor could do this by removing duplicates or by returning an error when there are duplicates. This would make the rest of the implementation simpler but would be less convenient for the user.
2. Alternatively, it needs to be careful about identifying the correct bin. For example, the current implementation of `.indexes()` doesn't correctly handle repeated edges, since in the case of repeated elements, `.binary_search()` can return the index of any of those elements.

Assuming the second strategy isn't much more difficult to implement, I prefer it (to make things simple for the user). It's also worth noting that NumPy uses this strategy.
I'd recommend removing the `ndim` member from `HistogramCounts` and instead adding a `.ndim()` method that calls `self.counts.ndim()`. (This means there are fewer things to keep in sync.)

Edit: In the `.ndim()` method, you could also have a line `debug_assert_eq!(counts.ndim(), bins.len());`. (Even though this probably isn't necessary, I like using debug assertions like this because they're zero-cost in release mode and may catch bugs as things change in the future.)
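A minimal sketch of what that could look like (field names follow the discussion above; `Bins<A>` is the bins type from this PR, and the body is illustrative rather than the PR's actual code):

```rust
use ndarray::ArrayD;

pub struct HistogramCounts<A: Ord> {
    counts: ArrayD<usize>,
    bins: Vec<Bins<A>>, // `Bins<A>` as defined elsewhere in this PR
}

impl<A: Ord> HistogramCounts<A> {
    /// Number of dimensions of the histogram, derived from `counts`
    /// instead of a separately stored `ndim` field.
    pub fn ndim(&self) -> usize {
        // Zero-cost in release builds; catches `counts` and `bins` drifting apart.
        debug_assert_eq!(self.counts.ndim(), self.bins.len());
        self.counts.ndim()
    }
}
```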
In the future, it would be worth adding the dimension of `counts` to the type of `HistogramCounts` instead of always using `ArrayD`. This isn't necessary for a first implementation, though.
In #8, I suggested separate `HistogramCounts` and `HistogramDensity` types. On further reflection, I don't think a `HistogramDensity` type is worth adding. Instead, I'd suggest adding a `.density()` method on `HistogramCounts` that returns an array of densities. This does mean that if all you want is density, there's a conversion cost from counts -> densities once all the data have been counted, but I think the simplicity is worth the cost (and this approach may in fact be cheaper anyway). (This also means that we can rename `HistogramCounts` to just `Histogram`.)
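To make the conversion cost concrete, here is a rough sketch of the density computation reduced to the 1-D case, following the usual convention density_i = count_i / (N * width_i); the real method would be generic over element type and dimension, and `density_1d` is purely illustrative:

```rust
/// 1-D illustration only: density_i = count_i / (N * width_i),
/// where N is the total count and width_i is the width of bin i.
fn density_1d(counts: &[usize], edges: &[f64]) -> Vec<f64> {
    let total = counts.iter().sum::<usize>() as f64;
    counts
        .iter()
        .zip(edges.windows(2))
        .map(|(&count, edge_pair)| count as f64 / (total * (edge_pair[1] - edge_pair[0])))
        .collect()
}
```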
I agree on almost all points - I think it makes sense for `Edges` to drop duplicates in the constructor (without panicking or raising an error): we pay a little more upfront (O(n), where n is the number of bins) but we benefit from faster `add_observation` calls, which is where the bulk of the computation will be happening.
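A minimal sketch of such a deduplicating constructor, assuming an `Edges` newtype over a `Vec<A>` (the `From<Vec<A>>` shape mirrors the discussion; the body is illustrative, not the PR's actual code):

```rust
pub struct Edges<A: Ord> {
    edges: Vec<A>,
}

impl<A: Ord> From<Vec<A>> for Edges<A> {
    fn from(mut edges: Vec<A>) -> Self {
        // Sort first so that `dedup` (which only removes consecutive
        // duplicates) leaves a strictly increasing sequence of edges.
        edges.sort();
        edges.dedup();
        Edges { edges }
    }
}
```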
I'll work on the other points and then I think we can merge.
> we benefit from faster `add_observation` calls

I'm not sure the difference would be significant. I think all we'd need to do is change
```rust
Ok(i) => Some((i, i + 1)),
```

to

```rust
Ok(mut i) => {
    while i + 1 < n_edges && self.edges[i] == self.edges[i + 1] {
        i += 1;
    }
    Some((i, i + 1))
}
```
which, if all the elements are unique, requires only two extra comparisons in this branch (and this branch is executed only when the value equals an edge).
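For context, here is a rough, self-contained sketch of how the whole lookup could read with that change in place. This is not the PR's exact code; the `Err` handling and the out-of-range checks here are assumptions:

```rust
/// Returns the pair of edge indexes delimiting the bin containing `value`,
/// or `None` if `value` falls outside the grid. Bins are half-open: [left, right).
fn indexes<A: Ord>(edges: &[A], value: &A) -> Option<(usize, usize)> {
    let n_edges = edges.len();
    match edges.binary_search(value) {
        // `value` equals an edge; with duplicates, `binary_search` may return
        // the index of any matching element, so advance to the last duplicate.
        Ok(mut i) => {
            while i + 1 < n_edges && edges[i] == edges[i + 1] {
                i += 1;
            }
            if i + 1 < n_edges {
                Some((i, i + 1))
            } else {
                None // equal to the last edge: no half-open bin to its right
            }
        }
        // `value` falls strictly between two edges, or outside the grid.
        Err(i) if i > 0 && i < n_edges => Some((i - 1, i)),
        Err(_) => None,
    }
}
```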
That said, I'm fine with requiring unique edges for the time being. We can benchmark later to see if it really matters.
I agree that the performance gain is probably minimal, given that the extra comparisons are only triggered when the observation being added is equal to an edge, but I don't see any real benefit from a user perspective in keeping duplicated edges around.
I was trying to write a test on `add_observation` to assert that the `Histogram::counts` matrix was equal to `array![[0, 1], [0, 0]]`, but the compiler of course complained that one is an instance of `ArrayD` and the other one is an instance of `Array2`.

Is there any way to convert a fixed-dimension array to a dynamically sized array in `ndarray`? I tried to have a look in the docs, but I couldn't find anything related to it.
With respect to this PR:

1. What strategy would you suggest to add a dimension parameter to `Histogram`?
2. Would it make sense to introduce a `Grid<A>` struct to encapsulate `Vec<Bins<A>>` and provide a bunch of helper methods?

> I don't see any real benefit from a user perspective in keeping duplicated edges around.
My primary concern is a user writing code that assumes that the number of edges in the histogram is the same as the number of edges they provided.
I agree that deduplicating edges in the `Edges` constructor is reasonable behavior, though. We just need to make it clear in the docs that this is what we're doing so that the user doesn't assume the number of edges is unchanged.
> Is there any way to convert a fixed-dimension array to a dynamically sized array in `ndarray`?

Is `.into_dyn()` what you're looking for?
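For example, the test described above could convert the fixed-dimension expectation before comparing (`histogram` here stands in for whatever value the test builds):

```rust
use ndarray::array;

// `.into_dyn()` turns the `Array2` literal into an `ArrayD`, so it can be
// compared directly against the dynamically sized `Histogram::counts`.
let expected = array![[0, 1], [0, 0]].into_dyn();
assert_eq!(histogram.counts, expected);
```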
> What strategy would you suggest to add a dimension parameter to `Histogram`?

I'm thinking something along the lines of this:
```rust
use ndarray::{Array, Data, Dimension};

pub struct Histogram<A: Ord, D: Dimension> {
    counts: Array<usize, D>,
    bins: Vec<Bins<A>>, // `Bins<A>` as defined elsewhere in this PR
}

impl<A: Ord, D: Dimension> Histogram<A, D> {
    /// **Panics** if the length of `bins` does not match `D`.
    pub fn new(bins: Vec<Bins<A>>) -> Self {
        // `zero_index_with_ndim` is currently an undocumented `Dimension` method.
        let mut shape = D::zero_index_with_ndim(bins.len());
        bins.iter()
            .zip(shape.slice_mut())
            .for_each(|(b, s)| *s = b.len());
        let counts = Array::zeros(shape);
        Histogram { counts, bins }
    }
}

pub trait HistogramExt<A, S>
where
    S: Data<Elem = A>,
{
    /// **Panics** if `d` is different from `bins.len()` or if it does not match `D`.
    fn histogram<D>(&self, bins: Vec<Bins<A>>) -> Histogram<A, D>
    where
        A: Ord,
        D: Dimension;
}
```
The implementation for `new` uses some undocumented methods of the `Dimension` trait. I don't see any good way to do it otherwise; I can modify `ndarray` to expose these.
Usage looks like this:

```rust
// dynamic dimension
let hist = arr.histogram::<IxDyn>(bins);
// or, if we define type aliases,
let hist: HistogramD<_> = arr.histogram(bins);

// fixed dimension; panics if `bins.len() != 2` or if `arr.len_of(Axis(1)) != 2`
let hist = arr.histogram::<Ix2>(bins);
```
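The type aliases hinted at in the comment could be as simple as the following (the names are suggestions, not an agreed API):

```rust
use ndarray::{Ix2, IxDyn};

// `HistogramD` matches the dynamic-dimension usage above.
pub type HistogramD<A> = Histogram<A, IxDyn>;
pub type Histogram2<A> = Histogram<A, Ix2>;
```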
> Would it make sense to introduce a `Grid<A>` struct to encapsulate `Vec<Bins<A>>` and provide a bunch of helper methods?

I think so. For example, as a user, it would be nice to be able to easily figure out the bin corresponding to a data point (using a method on `Grid`), and then look up the corresponding count. We could just place all these methods on `Histogram` instead of a separate `Grid` type, but I could see there being cases where the user would want to decompose the `Histogram` into separate `Grid` and `Array` pieces.
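A rough sketch of the shape such a `Grid<A>` could take; the method names (`index_of` on both `Grid` and `Bins`, returning `Option`) are illustrative assumptions, not an agreed API:

```rust
pub struct Grid<A: Ord> {
    projections: Vec<Bins<A>>, // one `Bins<A>` (from this PR) per axis
}

impl<A: Ord> Grid<A> {
    /// Multi-dimensional index of the bin containing `point`, or `None` if the
    /// point falls outside the grid along any axis.
    /// Assumes a `Bins::index_of(&A) -> Option<usize>` helper per axis.
    pub fn index_of(&self, point: &[A]) -> Option<Vec<usize>> {
        debug_assert_eq!(point.len(), self.projections.len());
        point
            .iter()
            .zip(&self.projections)
            .map(|(coordinate, bins)| bins.index_of(coordinate))
            .collect()
    }
}

// A user could then decompose a histogram into grid + counts and look up one bin:
// if let Some(idx) = grid.index_of(&observation) {
//     let count = counts[&idx[..]];
// }
```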
I have added strategies to infer the optimal bin edges as well as a builder to get a grid using those strategies on each coordinate axis (right now it is constrained to use the same strategy across all axes, but we can make it more versatile in the future).
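As a rough illustration of how such a builder might be used (every name here, `GridBuilder`, the `Sturges` strategy, `from_array`, `build`, is a placeholder rather than necessarily this PR's API):

```rust
use ndarray::{array, Array2};

// `observations` is an (n_observations, n_dimensions) array of samples.
// In practice a totally ordered float wrapper would be needed, since f64 is not Ord.
let observations: Array2<f64> = array![[1.0, 2.0], [0.5, 1.5], [2.5, 3.0]];

// Build a grid by applying the same bin-inference strategy to every axis,
// then feed the resulting grid to the histogram computation.
let grid = GridBuilder::<Sturges<f64>>::from_array(&observations).build();
let histogram = observations.histogram(grid);
```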
I have documented all methods and added key doc examples where needed.
I think we are good to go to merge it as a first implementation @jturner314.
I managed to address all comments (and I fixed the bug you spotted) - they were all on point, thanks for taking the time to go through the PR with that level of attention!
I have also reduced the number of type parameters on `BinsBuildingStrategy` from 2 to 1 using the strategy you suggested (an associated type on the trait).
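For reference, a sketch of what that change amounts to (the method name and bounds are illustrative, not necessarily the trait's actual signature):

```rust
// Before: the element type was a second trait type parameter, so every bound
// had to spell out `BinsBuildingStrategy<A, ...>`.
// After: a single associated type ties each strategy to its element type.
pub trait BinsBuildingStrategy {
    type Elem: Ord;

    /// Build the bins for one coordinate axis.
    fn build(&self) -> Bins<Self::Elem>;
}
```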
With respect to the issue of bin boundaries, I proceeded as follows: all strategies now provide an optimal bin width (either directly, through the rule they specify, or indirectly, by recasting the optimal number of bins into an optimal bin width), and we make sure that the last edge is strictly greater than the maximum of the array that has been passed to the builder. This basically means that we might add an extra bin to the right if it is required to account for the maximum value, but we don't modify the grid parameters to achieve it. What do you think?
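A sketch of that rule for `f64` data (illustrative only; it ignores floating-point accumulation issues and whatever ordered-float wrapper the real code uses):

```rust
/// Generate edges starting at `min`, stepping by `bin_width`, and stopping only
/// once the last edge is strictly greater than `max`. When `max` lands exactly
/// on an edge, this appends one extra bin on the right instead of changing the width.
fn edges_from_width(min: f64, max: f64, bin_width: f64) -> Vec<f64> {
    let mut edges = vec![min];
    let mut edge = min;
    while edge <= max {
        edge += bin_width;
        edges.push(edge);
    }
    edges
}
```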
> I managed to address all comments (and I fixed the bug you spotted) - they were all on point, thanks for taking the time to go through the PR with that level of attention!
You're welcome. Thanks for working on this! Would you like me to review the updated version, or do you think it's ready to merge?
> This basically means that we might add an extra bin to the right if it is required to account for the maximum value, but we don't modify the grid parameters to achieve it. What do you think?
I think that's fine.
For future versions, I think it would be worth investigating how Julia's StatsBase does things, because StatsBase is similar to our implementation in using all half-open intervals (unlike NumPy, which considers the last bin to be a closed interval, so a value equal to the last edge is still counted in the final bin).
Let me add a couple of tests and then I think we are good to go!
Based on our discussion in #8, I have revised the implementation.
It needs a testing suite and a couple more helper methods on `HistogramCounts`, but the skeleton is there. Let me know your thoughts @jturner314