Open LukeMathWalker opened 6 years ago
With respect to mean, average, std: var is implemented in the main ndarray crate - would it make sense to port it here?
I think it makes sense for ndarray-stats to provide *_skipnan variants (or whatever you want to call them) of those methods. However, it would make sense to add std_axis to ndarray since ndarray already has var_axis.
For methods that are already in ndarray, we could duplicate these methods as a trait in ndarray-stats for people who want to write generic code (where the implementations just call the instance methods). I'm ambivalent on this.
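A minimal sketch of the "duplicate as a trait" idea, for illustration only. Samples and MeanExt are hypothetical names standing in for an array type and an extension trait; neither is the actual ndarray or ndarray-stats API. The trait implementation simply forwards to the inherent method, which is what lets generic code be written against the trait.

```rust
/// Hypothetical stand-in for an array type that already has an inherent `mean`.
pub struct Samples(pub Vec<f64>);

impl Samples {
    /// Inherent method: arithmetic mean, `None` for empty input.
    pub fn mean(&self) -> Option<f64> {
        if self.0.is_empty() {
            None
        } else {
            Some(self.0.iter().sum::<f64>() / self.0.len() as f64)
        }
    }
}

/// Hypothetical trait that mirrors the inherent method.
pub trait MeanExt {
    fn mean(&self) -> Option<f64>;
}

impl MeanExt for Samples {
    fn mean(&self) -> Option<f64> {
        // `Samples::mean` resolves to the inherent method, so this just forwards.
        Samples::mean(self)
    }
}

/// Generic code only needs the trait bound, not the concrete type.
pub fn mean_of<T: MeanExt>(x: &T) -> Option<f64> {
    x.mean()
}
```

The cost of this design is a second, parallel API surface to keep in sync, which may be why it is only worth it if people actually write generic code.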
I'll slowly start working on this next week and then I'll get serious the week afterwards. Could you please give me commit/PR permissions to the repository @jturner314?
Okay, that sounds good. I've given you push access. Alternatively, if you'd like to have your repo be the main one instead of this one, that would be fine with me.
Once #9 gets merged I think we are in a good position to officially release version 0.1.0 on crates.io - what do you think? @jturner314
I agree.
By the way, I recently came across Julia's StatsBase.jl library. It's a good source of ideas in addition to NumPy/SciPy.
Added a bunch of tests to #9 and merged 🎉 It feels like ages since I started to work on it :sweat_smile: Your contribution was extremely helpful to get it in the shape it is right now, thanks a lot @jturner314!
What do we need to do in order to release on crates.io? I am going to open a small PR to add crate-level documentation - a couple of lines, nothing major.
Yay! :tada: That was a big job; great work.
What do we need to do in order to release on crates.io?
Ideally, we'd eliminate the [patch.crates-io] section from the Cargo.toml before we can release on crates.io. (This might even be required, I'm not sure.) #11 removed the patch for noisy_float, but a new version of ndarray will need to be released for us to remove its patch. It would be nice to merge a couple more ndarray PRs before release; I'll take a look.
It would also be good to merge #12 and #13 before releasing.
Merged #12 and #13 - looking around, it seems we can publish with a [patch.crates-io] section in Cargo.toml, but I agree it is much nicer to point to ndarray 0.12.1 as a dependency instead of a revision on master.
Let's wait for that release and then we are good to go.
ndarray-stats 0.1.0 is now on crates.io. :tada: Thanks for all your hard work @LukeMathWalker!
💯 💯 I think it's safe to say it would have never got there without your help 😛 I'll drop a post on r/rust as well 👍
I have drafted a tentative roadmap with the features I'd like to add in the next release - please edit it with your comments and suggestions @jturner314
The roadmap looks good to me. I'm not familiar with the applications of higher order central moments (I'd usually use a histogram instead), but I don't mind adding them if people find them useful.
By the way, I invited you as an owner for the ndarray-stats crate, but I just realized that crates.io may not have sent the invitation if you haven't logged in before. Please let me know if you need me to re-send it.
Somehow I didn't receive an email notification, but the invite was on my dashboard - accepted it!
The main objective in that area is getting kurtosis and skewness, and given the kind of computation required to achieve that it makes sense to also roll out higher order central moments I'd say :)
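To illustrate why skewness and kurtosis come almost for free once central moments exist, here is a rough sketch over a plain f64 slice. The function names and the slice-based signatures are assumptions made for this example, not the ndarray-stats API; it uses the population (biased) definitions, skewness = m3 / m2^1.5 and kurtosis = m4 / m2^2.

```rust
/// p-th central moment: the mean of (x - mean)^p. `None` for empty input.
pub fn central_moment(data: &[f64], p: i32) -> Option<f64> {
    if data.is_empty() {
        return None;
    }
    let n = data.len() as f64;
    let mean = data.iter().sum::<f64>() / n;
    Some(data.iter().map(|x| (x - mean).powi(p)).sum::<f64>() / n)
}

/// Population skewness, built from the 2nd and 3rd central moments.
pub fn skewness(data: &[f64]) -> Option<f64> {
    let m2 = central_moment(data, 2)?;
    let m3 = central_moment(data, 3)?;
    Some(m3 / m2.powf(1.5))
}

/// Population kurtosis (not excess), built from the 2nd and 4th central moments.
pub fn kurtosis(data: &[f64]) -> Option<f64> {
    let m2 = central_moment(data, 2)?;
    let m4 = central_moment(data, 4)?;
    Some(m4 / (m2 * m2))
}
```

This naive version recomputes the mean for each moment; a real implementation would likely compute all the required moments in one pass.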
Hey mate, argmin / argmax look simple enough to look into, do you have any suggestions of where to start?
Thanks for your interest! You'll want to add argmin and argmax methods to the QuantileExt trait and implement them. Please include documentation for the methods and some tests (in tests/quantile.rs).
I'd suggest starting with the existing implementation for min as a basis, but using .indexed_iter().fold() or .indexed_iter().try_fold() instead of .fold().
It would also be good to add argmin_skipnan and argmax_skipnan methods (analogous to min_skipnan and max_skipnan), but that's not necessary for the first PR.
Please feel free to ask if you have any questions.
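The try_fold approach suggested above can be sketched as follows. To keep the block self-contained it works on a plain slice, with .iter().enumerate() standing in for .indexed_iter(); the free function and its signature are assumptions for illustration, not the eventual QuantileExt method.

```rust
use std::cmp::Ordering;

/// Index of the minimum element, or `None` if the slice is empty
/// or any comparison is undefined (e.g. a NaN is involved).
pub fn argmin<A: PartialOrd>(data: &[A]) -> Option<usize> {
    let mut iter = data.iter().enumerate();
    // Seed the fold with the first (index, element) pair; empty input yields `None`.
    let first = iter.next()?;
    iter.try_fold(first, |(best_i, best), (i, x)| {
        // `partial_cmp` returns `None` for incomparable pairs, which aborts the fold.
        match x.partial_cmp(best)? {
            Ordering::Less => Some((i, x)),
            _ => Some((best_i, best)),
        }
    })
    .map(|(i, _)| i)
}
```

On ties this keeps the first occurrence, because only a strictly smaller element replaces the current best.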
Hey mates, I have added argmin_skipnan and argmax_skipnan. I wonder why you use PartialOrd for min, but Ord for min_skipnan?
And what is meant by this: partialord version for quantiles?
It's because we require the data type to be MaybeNan: it basically means that, apart from a subset of elements (e.g. NaN for floats), we are dealing with a data type that is totally ordered (all pairs of elements can be compared, Ord).
This reduces the failure scope:
- min can return None if a comparison fails (as can happen with PartialOrd) or if there is no element in the array.
- min_skipnan returns None if and only if the array has no non-NaN element (because no comparison will be undefined).
This can be useful when you are dealing with floats or arrays with potentially missing values (e.g. Option<A>, where A: Ord).
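The difference in failure modes can be sketched on plain f64 slices. These free functions are a simplification written for this example (the real methods live on arrays and are generic over MaybeNan types); they only aim to show when each variant returns None.

```rust
use std::cmp::Ordering;

/// Minimum via `PartialOrd`: `None` if the slice is empty
/// or if any pairwise comparison is undefined.
pub fn min(data: &[f64]) -> Option<f64> {
    let (&first, rest) = data.split_first()?;
    rest.iter().try_fold(first, |acc, &x| {
        // An undefined comparison (NaN involved) yields `None` and stops the fold.
        x.partial_cmp(&acc)
            .map(|ord| if ord == Ordering::Less { x } else { acc })
    })
}

/// Minimum ignoring NaNs: `None` only if no non-NaN element exists.
pub fn min_skipnan(data: &[f64]) -> Option<f64> {
    data.iter()
        .copied()
        .filter(|x| !x.is_nan())
        // After filtering, every remaining pair of values is comparable.
        .fold(None, |acc: Option<f64>, x| Some(acc.map_or(x, |m| m.min(x))))
}
```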
Re: quantiles - the current implementation requires A to implement Ord. We'd like to relax it to allow A to be PartialOrd instead of Ord.
Thanks @LukeMathWalker for the last point, if we change A: Ord to A: PartialOrd and refactor the code + test to allow that change, it would complete the task right?
Exactly! @phungleson I'd suggest you wait until #26 is merged before tackling this task, otherwise you are in for some nasty merge conflicts :stuck_out_tongue: I am almost there, I am just investigating some stack overflow errors in the revised version I have been writing.
Cool thanks @LukeMathWalker so seems like everything is more or less complete? Let me know if there are any doable features, cheers.
BTW, the merge method seems to be straightforward, but do you have any thoughts yet about the implementation?
For merge, I read it quickly; so it is basically just adding the weights?
for h in others
    target.weights .+= h.weights
end
Yes @phungleson, it basically boils down to summing together the weight matrices (plus or minus checking that their dimension/bins are compatible, I haven't looked into it). If you want to give it a try, please go ahead!
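A rough sketch of what "sum the weights after checking compatibility" could look like. The Histogram struct here is a hypothetical, simplified 1-D stand-in (bin edges plus one count per bin), not ndarray_stats::histogram::Histogram, and the String error type is chosen only to keep the example short.

```rust
/// Hypothetical, simplified 1-D histogram: bin edges plus one count per bin.
#[derive(Debug, Clone, PartialEq)]
pub struct Histogram {
    edges: Vec<f64>,
    counts: Vec<usize>,
}

impl Histogram {
    /// `edges` of length n + 1 define n half-open bins [edges[i], edges[i + 1]).
    pub fn new(edges: Vec<f64>) -> Self {
        let n_bins = edges.len().saturating_sub(1);
        Histogram { edges, counts: vec![0; n_bins] }
    }

    /// Adds one observation; values outside all bins are ignored.
    pub fn add(&mut self, x: f64) {
        if let Some(i) = self.edges.windows(2).position(|w| w[0] <= x && x < w[1]) {
            self.counts[i] += 1;
        }
    }

    /// Merges `other` into `self` by summing counts bin by bin.
    /// Fails if the two histograms do not share the same bin edges.
    pub fn merge(&mut self, other: &Histogram) -> Result<(), String> {
        if self.edges != other.edges {
            return Err("incompatible bin edges".to_string());
        }
        for (c, o) in self.counts.iter_mut().zip(&other.counts) {
            *c += *o;
        }
        Ok(())
    }

    pub fn counts(&self) -> &[usize] {
        &self.counts
    }
}
```

Requiring identical edges is the strictest possible compatibility check; whether a real implementation should also accept equivalent but differently constructed bins is left open here.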
I'd like to close existing work streams and cut a release - what does your bandwidth look like @jturner314 to review open PRs?
I've been meaning to look over the open PRs but haven't had a chance. I'll reserve time on Sunday to review them.
It seems I managed to publish 0.2.0 without making a mess :muscle:
Thanks @jturner314 @phungleson and @munckymagik for all the work done on this release :heart:
I'd say we have made a major leap forward in terms of features - there are things that can be polished, the API design can be further improved and we can optimize the existing code, but ndarray-stats is definitely a viable solution right now :rocket:
I'll clean up the parent post to move items that we didn't manage to include in this release to the roadmap for the next one. I am not sure what we should be covering next in terms of major new functionality :thinking:
Well done all 👏
Great job on 0.2.0 everyone!
I am not sure what we should be covering next in terms of major new functionality
A couple of ideas from StatsBase.jl:
We could also add statistical models (e.g. linear regression), but that might be best put in a separate crate.
Well done! cheers!
A couple of ideas from StatsBase.jl:
- Deviation functions
- Weighted calculations (mean/std/etc.)
Unless any of you have made a start on these, I'd be interested in having a go at either, or contributing. I'll try to spend some time in the next couple of days looking at what is involved with the Deviation functions.
❓ Does anyone have any implementation suggestions other than just trying to port from StatsBase.jl?
If anyone wants to collaborate on the code then let me know.
Ok I made a start: https://github.com/jturner314/ndarray-stats/pull/41
Any advice for choosing trait bounds for the A element types? Is it ok to use Copy or do we need to support any types that would be Clone?
I'd say to use Clone @munckymagik
@LukeMathWalker thanks. What led you to that decision? Is there a particular data type you've seen used in ndarrays that would need this? If so I'm thinking I might use it in the test fixtures to make sure all methods have the same bounds.
I see it as a tradeoff between convenience and generality - I am not personally aware of any "popular" numerical type that is not Copy, but the cost of weakening it to Clone is so low that I see it as safe future-proofing @munckymagik
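To make the tradeoff concrete, here is a small sketch of a reduction bounded on Clone rather than Copy. Both sum_cloned and the heap-backed Wrapped type are hypothetical, invented for this example; Wrapped plays the role of a numerical type that cannot be Copy.

```rust
use std::ops::Add;

/// Sums a slice using only `Clone`, so non-`Copy` element types are accepted.
/// The additive identity is passed in to avoid needing a `Zero`-like trait.
pub fn sum_cloned<A>(data: &[A], zero: A) -> A
where
    A: Clone + Add<Output = A>,
{
    data.iter().fold(zero, |acc, x| acc + x.clone())
}

/// Hypothetical heap-backed numeric type: owning a `Vec` rules out `Copy`.
#[derive(Clone, Debug, PartialEq)]
pub struct Wrapped(pub Vec<i64>);

impl Add for Wrapped {
    type Output = Wrapped;

    /// Element-wise addition (truncates to the shorter operand).
    fn add(self, rhs: Wrapped) -> Wrapped {
        Wrapped(self.0.iter().zip(&rhs.0).map(|(a, b)| a + b).collect())
    }
}
```

For Copy types such as f64 the clone is just a bitwise copy, so the weaker bound should add no practical overhead.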
I wanted to code a simple weighted_mean for myself then contribute it
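use ndarray::{ArrayBase, Data, Ix1};
use num_traits::Float;

// Weighted sum of `data`; this equals the weighted mean only if `weights` sum to 1.
pub fn weighted_mean<A, S>(data: &ArrayBase<S, Ix1>, weights: &[A]) -> A
where
    S: Data<Elem = A>,
    A: Float,
{
    data.iter().zip(weights).fold(A::zero(), |acc, (&d, &w)| acc + d * w)
}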
but I realize that it's too simple. This code is only useful for 1D arrays, or flattened matrices/images, etc. I can change the Ix1 for a D: Dimension, so that we don't need to flatten anything. It's still a one-liner though and it doesn't offer any "axis" feature, like NumPy. I think we need 2 functions here, because they won't return the same type.
What did you guys have in mind?
I think we need 2 functions here, because they won't return the same type.
- n-d data with n-d weight, returns a number
- axis mode: n-d data with n-d weight, returns (n-1)-d array.
What did you guys have in mind?
I think it makes perfect sense to have two functions. @nilgoyette
It is also consistent with the rest of the API: we have mean and mean_axis, var and var_axis, etc. :+1:
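The two-function split can be sketched like this. To stay self-contained the example uses plain slices and a 2-D input stored as rows instead of ndarray types; weighted_mean_rows is a hypothetical name for the "axis mode" and always reduces the last axis, whereas a real *_axis method would take an Axis argument.

```rust
/// Weighted mean over all elements: returns a single number.
/// `None` if the lengths differ or the weights sum to zero.
pub fn weighted_mean(data: &[f64], weights: &[f64]) -> Option<f64> {
    if data.len() != weights.len() {
        return None;
    }
    let w_sum: f64 = weights.iter().sum();
    if w_sum == 0.0 {
        return None;
    }
    let total: f64 = data.iter().zip(weights).map(|(d, w)| d * w).sum();
    Some(total / w_sum)
}

/// "Axis mode" for 2-D data stored as rows: one weighted mean per row,
/// so the result has one dimension fewer than the input.
pub fn weighted_mean_rows(data: &[Vec<f64>], weights: &[Vec<f64>]) -> Option<Vec<f64>> {
    if data.len() != weights.len() {
        return None;
    }
    // Collecting into `Option<Vec<_>>` fails as soon as any row fails.
    data.iter()
        .zip(weights)
        .map(|(d, w)| weighted_mean(d, w))
        .collect()
}
```

The different return types (a number versus a lower-dimensional collection) are the reason a single function cannot cover both cases cleanly.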
I was thinking of picking up the histogram merge method.
I am relatively new to Rust and ndarray. With this exercise, I want to pick up ndarray and Rust, and also start contributing to ndarray-* libraries.
What do you guys think? Or are there more high-level good first issues in other ndarray-* libraries?
@aeroaks I'd say go for it 💯
You could raise a draft PR if you get something working and want some early feedback.
I would like to implement something like scipy.stats.binned_statistic_dd based on ndarray_stats::histogram::Histogram, allowing one to calculate running means, variances, sums, and max/min values in each bin. Would that be of interest?
It would! @RolfStierle
According to https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram_bin_edges.html#numpy.histogram_bin_edges, some of the bin-building strategies are not implemented by rust-ndarray yet:
- doane
- scott
- stone
Thanks very much for sharing the good work. Would it be possible to add univariate, bivariate and multivariate kernel density estimation functions? Thanks.
In terms of functionality, the mid-term end goal is to achieve feature parity with the statistics routine in numpy (here) and Julia StatsBase (here).
For the next version:
- partialord version for quantiles methods;
- merge method;
For version 0.2.0:
For version 0.1.0:
- max/nanmax (@jturner314)
- min/nanmin (@jturner314)
- quantile/nanquantile (it includes percentile/nanpercentile as a special case) (@LukeMathWalker & @jturner314)
- correlation-methods:
  - cov (@LukeMathWalker) - ~One last fix to be made (#3)~ [On hold for now]
  - corrcoef (@LukeMathWalker - #5)
- histogram-methods (@LukeMathWalker - #9)