Roadmap - Githubissues

LukeMathWalker commented 6 years ago

In terms of functionality, the mid-term end goal is to achieve feature parity with the statistics routines in NumPy and Julia's StatsBase.

For the next version:

Order statistics:
- [ ] PartialOrd version of the quantile methods;
Histograms:
- [ ] merge method;

For version 0.2.0:

Order statistics:
- [x] optimized computations of multiple quantiles if requested all at once (#26) ;
- [x] argmin / argmax (#30);
Summary statistics:
- [x] harmonic mean (#20);
- [x] geometric mean (#20);
- [x] higher order central moments (#23);
- [x] standardized moments (they include kurtosis and skewness) (#23);
Histograms:
- [x] Fix error handling (Issue: https://github.com/jturner314/ndarray-stats/issues/16 - PR: #25 )
Entropy:
- [x] Feature parity with StatsBase.jl (#24)

For version 0.1.0:

[x] max / nanmax (@jturner314)
[x] min / nanmin (@jturner314)
[x] quantile / nanquantile (it includes percentile / nanpercentile as a special case) (@LukeMathWalker & @jturner314)
[x] correlation-methods:
- [x] cov (@LukeMathWalker) - ~One last fix to be made (#3)~ [On hold for now]
- [x] corrcoef (@LukeMathWalker - #5)
[x] histogram-methods (@LukeMathWalker - #9)

jturner314 commented 6 years ago

> With respect to mean, average, std: var is implemented in the main ndarray crate - would it make sense to port it here?

I think it makes sense for ndarray-stats to provide *_skipnan variants (or whatever you want to call them) of those methods. However, it would make sense to add std_axis to ndarray since ndarray already has var_axis.

For methods that are already in ndarray, we could duplicate these methods as a trait in ndarray-stats for people who want to write generic code (where the implementations just call the instance methods). I'm ambivalent on this.

> I'll slowly start working on this next week and then I'll get serious the week afterwards. Could you please give me commit/PR permissions to the repository @jturner314?

Okay, that sounds good. I've given you push access. Alternatively, if you'd like to have your repo be the main one instead of this one, that would be fine with me.

LukeMathWalker commented 6 years ago

Once #9 gets merged I think we are in a good position to officially release version 0.1.0 on crates.io - what do you think? @jturner314

jturner314 commented 6 years ago

I agree.

By the way, I recently came across Julia's StatsBase.jl library. It's a good source of ideas in addition to NumPy/SciPy.

LukeMathWalker commented 6 years ago

Added a bunch of tests to #9 and merged 🎉 It feels like ages since I started to work on it :sweat_smile: Your contribution was extremely helpful to get it in the shape it is right now, thanks a lot @jturner314!

What do we need to do in order to release on crates.io? I am going to open a small PR to add crate-level documentation - a couple of lines, nothing major.

jturner314 commented 6 years ago

Yay! :tada: That was a big job; great work.

> What do we need to do in order to release on crates.io?

Ideally, we'd eliminate the [patch.crates-io] section from the Cargo.toml before we release on crates.io. (This might even be required, I'm not sure.) #11 removed the patch for noisy_float, but a new version of ndarray will need to be released for us to remove its patch. It would be nice to merge a couple more ndarray PRs before release; I'll take a look.

It would also be good to merge #12 and #13 before releasing.

LukeMathWalker commented 6 years ago

Merged #12 and #13 - looking around it seems we can publish with [patch.crates-io] section in Cargo.toml, but I agree it is much nicer to point to ndarray 0.12.1 as a dependency instead of a revision on master.

Let's wait for that release and then we are good to go.

jturner314 commented 6 years ago

ndarray-stats 0.1.0 is now on crates.io. :tada: Thanks for all your hard work @LukeMathWalker!

LukeMathWalker commented 6 years ago

💯 💯 I think it's safe to say it would have never got there without your help 😛 I'll drop a post on r/rust as well 👍

LukeMathWalker commented 6 years ago

I have drafted a tentative roadmap with the features I'd like to add in the next release - please edit it with your comments and suggestions @jturner314

jturner314 commented 6 years ago

The roadmap looks good to me. I'm not familiar with the applications of higher order central moments (I'd usually use a histogram instead), but I don't mind adding them if people find them useful.

By the way, I invited you as an owner for the ndarray-stats crate, but I just realized that crates.io may not have sent the invitation if you haven't logged in before. Please let me know if you need me to re-send it.

LukeMathWalker commented 5 years ago

Somehow I didn't receive an email notification, but the invite was on my dashboard - accepted it!

The main objective in that area is getting kurtosis and skewness, and given the kind of computation required to achieve that it makes sense to also roll out higher order central moments I'd say :)
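
To illustrate why skewness and kurtosis come almost for free once central moments exist, here is a minimal sketch on a plain `f64` slice. The function names and signatures are assumptions made for this example, not the crate's API; it uses the population (biased) definitions: skewness = m3 / m2^(3/2) and kurtosis = m4 / m2^2.

```rust
/// p-th central moment: mean((x - mean)^p). Returns None for empty input.
/// Hypothetical helper for illustration only.
fn central_moment(xs: &[f64], p: i32) -> Option<f64> {
    if xs.is_empty() {
        return None;
    }
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    Some(xs.iter().map(|x| (x - mean).powi(p)).sum::<f64>() / n)
}

/// Skewness: third central moment divided by the variance to the power 3/2.
fn skewness(xs: &[f64]) -> Option<f64> {
    let m2 = central_moment(xs, 2)?;
    let m3 = central_moment(xs, 3)?;
    Some(m3 / m2.powf(1.5))
}

/// Kurtosis (not excess kurtosis): fourth central moment over variance squared.
fn kurtosis(xs: &[f64]) -> Option<f64> {
    let m2 = central_moment(xs, 2)?;
    let m4 = central_moment(xs, 4)?;
    Some(m4 / (m2 * m2))
}
```

Both standardized moments reuse `central_moment`, which is the shared computation referred to above. For constant input the variance is zero and the ratios are NaN; a real implementation would have to decide how to report that.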

phungleson commented 5 years ago

Hey mate, argmin / argmax look simple enough to look into. Do you have any suggestions on where to start?

jturner314 commented 5 years ago

Thanks for your interest! You'll want to add argmin and argmax methods to the QuantileExt trait and implement them. Please include documentation for the methods and some tests (in tests/quantile.rs).

I'd suggest starting with the existing implementation for min as a basis, but using .indexed_iter().fold() or .indexed_iter().try_fold() instead of .fold().

It would also be good to add argmin_skipnan and argmax_skipnan methods (analogous to min_skipnan and max_skipnan), but that's not necessary for the first PR.

Please feel free to ask if you have any questions.
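
As a rough sketch of the fold-over-indices idea, the following uses a plain slice with `.iter().enumerate()` standing in for ndarray's `.indexed_iter()`. The free function `argmin` and its signature are assumptions for illustration; the real method would live on the QuantileExt trait and return an n-dimensional index.

```rust
use std::cmp::Ordering;

/// Index of the smallest element, or None if the slice is empty or
/// some comparison is undefined (e.g. a NaN is involved).
/// Hypothetical stand-in for a trait method on arrays.
fn argmin<A: PartialOrd>(xs: &[A]) -> Option<usize> {
    let mut iter = xs.iter().enumerate();
    let first = iter.next()?;
    // try_fold stops early and yields None as soon as partial_cmp fails.
    iter.try_fold(first, |(best_i, best), (i, x)| {
        match x.partial_cmp(best)? {
            Ordering::Less => Some((i, x)),
            _ => Some((best_i, best)),
        }
    })
    .map(|(i, _)| i)
}
```

With this sketch, `argmin(&[3, 1, 2])` gives `Some(1)`, while a slice containing a NaN gives `None` because the comparison against it is undefined. Ties keep the first index encountered.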

phungleson commented 5 years ago

Hey mates, I have added argmin_skipnan and argmax_skipnan. I wonder why you use PartialOrd for min, but Ord for min_skipnan?

And what is meant by this roadmap item: "partialord version for quantiles"?

LukeMathWalker commented 5 years ago

It's because we require the data type to be MaybeNan: it basically means that, apart from a subset of elements (e.g. NaN for floats), we are dealing with a data type that is totally ordered (all pairs of elements can be compared, Ord).

This reduces the failure scope:

- min can return None if a comparison fails (as can happen with PartialOrd) or if there is no element in the array.
- min_skipnan returns None if and only if the array has no non-NaN element (because no comparison will be undefined).

This can be useful when you are dealing with floats or arrays with potentially missing values (e.g. Option<A>, where A: Ord).

Re: quantiles - the current implementation requires A to implement Ord. We'd like to relax it to allow A to be PartialOrd instead of Ord.
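
The difference in failure scope between min and min_skipnan can be sketched on `f64` slices as follows. These free functions follow the behaviour described in this comment, not the crate's actual signatures; the name `min_partial` in particular is made up for the example.

```rust
use std::cmp::Ordering;

/// Minimum under PartialOrd: None if the slice is empty *or* any
/// comparison is undefined.
fn min_partial(xs: &[f64]) -> Option<f64> {
    let (&first, rest) = xs.split_first()?;
    rest.iter().try_fold(first, |acc, &x| match x.partial_cmp(&acc)? {
        Ordering::Less => Some(x),
        _ => Some(acc),
    })
}

/// Minimum that ignores NaN: None only if no non-NaN element exists.
fn min_skipnan(xs: &[f64]) -> Option<f64> {
    xs.iter()
        .copied()
        .filter(|x| !x.is_nan())
        // Safe: NaN was filtered out, so every comparison is defined.
        .min_by(|a, b| a.partial_cmp(b).unwrap())
}
```

For `[2.0, NaN, 1.0]`, `min_partial` returns `None` while `min_skipnan` returns `Some(1.0)`: once the NaN values are set aside, the remaining elements are totally ordered, which is why an Ord-like guarantee is enough there.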

phungleson commented 5 years ago

Thanks @LukeMathWalker for the last point, if we change A: Ord to A: PartialOrd and refactor the code + test to allow that change, it would complete the task right?

LukeMathWalker commented 5 years ago

Exactly! @phungleson I'd suggest you wait until #26 is merged before tackling this task, otherwise you are in for some nasty merge conflicts :stuck_out_tongue: I am almost there, I am just investigating some stack overflow errors in the revised version I have been writing.

phungleson commented 5 years ago

Cool thanks @LukeMathWalker so seems like everything is more or less complete? Let me know if there are any doable features, cheers.

BTW, the "merge method" item seems to be straightforward, but do you have any thoughts yet about the implementation?

phungleson commented 5 years ago

For merge I read quickly, so basically just adding the weights?

for h in others
  target.weights .+= h.weights
end

LukeMathWalker commented 5 years ago

Yes @phungleson, it basically boils down to summing together the weight matrices (plus or minus checking that their dimension/bins are compatible, I haven't looked into it). If you want to give it try, please go ahead!
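
A minimal sketch of that idea, assuming a simplified 1-D histogram type invented for this example (`Histogram1D`, with explicit bin edges and integer counts). The crate's real Histogram type is n-dimensional and is not reproduced here.

```rust
/// Hypothetical 1-D histogram: `edges` has one more entry than `counts`.
#[derive(Debug, Clone, PartialEq)]
struct Histogram1D {
    edges: Vec<f64>,
    counts: Vec<u64>,
}

impl Histogram1D {
    /// Adds `other`'s counts into `self`; returns Err if the bins differ.
    fn merge(&mut self, other: &Histogram1D) -> Result<(), String> {
        // Compatibility check: same bin edges and same number of bins.
        if self.edges != other.edges || self.counts.len() != other.counts.len() {
            return Err("histograms have incompatible bins".to_string());
        }
        // The merge itself is an element-wise sum of the counts.
        for (c, o) in self.counts.iter_mut().zip(&other.counts) {
            *c += *o;
        }
        Ok(())
    }
}
```

Returning a `Result` rather than panicking is a design choice of this sketch; it keeps the bin-compatibility failure visible to the caller.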

LukeMathWalker commented 5 years ago

I'd like to close existing work streams and cut a release - what does your bandwidth look like @jturner314 to review open PRs?

jturner314 commented 5 years ago

I've been meaning to look over the open PRs but haven't had a chance. I'll reserve time on Sunday to review them.

LukeMathWalker commented 5 years ago

It seems I managed to publish 0.2.0 without making a mess :muscle: Thanks @jturner314 @phungleson and @munckymagik for all the work done on this release :heart:

I'd say we have done a major leap forward in terms of features - there are things that can be polished, the API design can be further improved and we can optimize the existing code, but ndarray-stats is definitely a viable solution right now :rocket:

I'll clean up the parent post to move items that we didn't manage to include in this release to the roadmap for the next one. I am not sure what we should be covering next in terms of major new functionality :thinking:

munckymagik commented 5 years ago

Well done all 👏

jturner314 commented 5 years ago

Great job on 0.2.0 everyone!

> I am not sure what we should be covering next in terms of major new functionality

A couple of ideas from StatsBase.jl:

- Deviation functions
- Weighted calculations (mean/std/etc.)

We could also add statistical models (e.g. linear regression), but that might be best put in a separate crate.

phungleson commented 5 years ago

Well done! cheers!

munckymagik commented 5 years ago

> A couple of ideas from StatsBase.jl:
>
> - Deviation functions
> - Weighted calculations (mean/std/etc.)

Unless any of you have made a start on these, I'd be interested in having a go at either, or contributing. I'll try to spend some time in the next couple of days looking at what is involved with the Deviation functions.

❓ Does anyone have any implementation suggestions other than just trying to port from StatsBase.jl?

If anyone wants to collaborate on the code then let me know.

munckymagik commented 5 years ago

Ok I made a start: https://github.com/jturner314/ndarray-stats/pull/41

Any advice for choosing trait bounds for the A element types? Is it ok to use Copy or do we need to support any types that would be Clone?

LukeMathWalker commented 5 years ago

I'd say use Clone @munckymagik

munckymagik commented 5 years ago

@LukeMathWalker thanks. What led you to that decision? Is there a particular data type you've seen used in ndarrays that would need this? If so I'm thinking I might use it in the test fixtures to make sure all methods have the same bounds.

LukeMathWalker commented 5 years ago

I see it as a tradeoff between convenience and generality - I am not personally aware of any "popular" numerical type that is not Copy, but the cost of weakening it to Clone is so low that I see it as safe future-proofing @munckymagik
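
To make the tradeoff concrete, here is a small sketch of a generic routine bounded by `Clone` instead of `Copy`. The function `sum_all` and the `Poly` type are hypothetical, invented only to show a numeric-like type that owns heap memory and therefore cannot be `Copy`.

```rust
use std::ops::Add;

/// Sums a slice while requiring only `Clone`, not `Copy`.
fn sum_all<A>(xs: &[A], zero: A) -> A
where
    A: Clone + Add<Output = A>,
{
    xs.iter().fold(zero, |acc, x| acc + x.clone())
}

/// Hypothetical non-`Copy` numeric type (it owns heap memory).
#[derive(Debug, Clone, PartialEq)]
struct Poly(Vec<i64>); // polynomial coefficients, lowest degree first

impl Add for Poly {
    type Output = Poly;
    fn add(self, rhs: Poly) -> Poly {
        let n = self.0.len().max(rhs.0.len());
        // Missing coefficients are treated as zero.
        let at = |v: &Vec<i64>, i: usize| v.get(i).copied().unwrap_or(0);
        Poly((0..n).map(|i| at(&self.0, i) + at(&rhs.0, i)).collect())
    }
}
```

The same `sum_all` works for `f64` (where `clone()` is as cheap as a copy) and for `Poly`, which would be rejected by an `A: Copy` bound.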

nilgoyette commented 5 years ago

I wanted to code a simple weighted_mean for myself then contribute it

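use ndarray::{ArrayBase, Data, Ix1};
use num_traits::Float;

// Weighted sum of `data`; this equals the weighted mean only when `weights` sum to 1.
pub fn weighted_mean<A, S>(data: &ArrayBase<S, Ix1>, weights: &[A]) -> A
where
    S: Data<Elem = A>,
    A: Float,
{
    data.iter().zip(weights).fold(A::zero(), |acc, (&d, &w)| acc + d * w)
}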

but I realize that it's too simple. This code is only useful for 1D arrays, or flattened matrices/images, etc. I can change the Ix1 for a D: Dimension, so that we don't need to flatten anything. It's still a one-liner though and it doesn't offer any "axis" feature, like Numpy. I think we need 2 functions here, because they won't return the same type.

- n-d data with n-d weight, returns a number
- axis mode: n-d data with n-d weight, returns (n-1)-d array.

What did you guys have in mind?

LukeMathWalker commented 5 years ago

> I think we need 2 functions here, because they won't return the same type.
>
> - n-d data with n-d weight, returns a number
> - axis mode: n-d data with n-d weight, returns (n-1)-d array.
>
> What did you guys have in mind?

I think it makes perfect sense to have two functions. @nilgoyette It is also consistent with the rest of the API: we have mean and mean_axis, var and var_axis, etc. :+1:
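
The two-function split can be sketched without ndarray, using a slice for the scalar case and a row-major `Vec<Vec<f64>>` for a 2-D axis case. Everything here is an assumption for illustration: the names, the `Option` return types, and the simplification that the axis version takes one weight per row rather than n-d weights.

```rust
/// Weighted mean over all elements: sum(d * w) / sum(w).
/// None if the lengths differ or the weights sum to zero.
fn weighted_mean(data: &[f64], weights: &[f64]) -> Option<f64> {
    if data.len() != weights.len() {
        return None;
    }
    let total: f64 = weights.iter().sum();
    if total == 0.0 {
        return None;
    }
    let sum: f64 = data.iter().zip(weights).map(|(d, w)| d * w).sum();
    Some(sum / total)
}

/// Weighted mean along axis 0 of a 2-D table stored as rows:
/// one weight per row, one result per column (so 2-D in, 1-D out).
fn weighted_mean_axis0(rows: &[Vec<f64>], weights: &[f64]) -> Option<Vec<f64>> {
    let ncols = rows.first()?.len();
    if rows.iter().any(|r| r.len() != ncols) {
        return None; // ragged input
    }
    (0..ncols)
        .map(|j| {
            let col: Vec<f64> = rows.iter().map(|r| r[j]).collect();
            weighted_mean(&col, weights)
        })
        .collect()
}
```

The different return types (`f64` versus `Vec<f64>`) are the reason a single function cannot cover both modes, mirroring the existing mean / mean_axis pairing.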

aeroaks commented 5 years ago

I was thinking of picking up the histogram merge method. I am relatively new to Rust and ndarray. With this exercise, I want to pick up ndarray and Rust, and also start contributing to ndarray-* libraries. What do you guys think? Or are there more high-level good first issues in other ndarray-* libraries?

munckymagik commented 5 years ago

@aeroaks I'd say go for it 💯

You could raise a draft PR if you get something working and want some early feedback.

RolfStierle commented 4 years ago

I would like to implement something like scipy.stats.binned_statistic_dd based on ndarray_stats::histogram::Histogram, allowing one to calculate running means, variances, sums, and max/min values in each bin. Would that be of interest?

LukeMathWalker commented 4 years ago

It would! @RolfStierle

lebensterben commented 4 years ago

According to https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram_bin_edges.html#numpy.histogram_bin_edges

Some of the bin-building strategies are not yet implemented by rust-ndarray:

- doane
- scott
- stone

humphreylee commented 1 year ago

Thanks very much for sharing the good work. Would it be possible to add univariate, bivariate and multivariate kernel density estimation functions? Thanks.

rust-ndarray / ndarray-stats

Roadmap #1