rust-ndarray / ndarray-stats

Statistical routines for ndarray
https://docs.rs/ndarray-stats
Apache License 2.0
201 stars 25 forks source link

Weighted mean #51

Closed nilgoyette closed 5 years ago

nilgoyette commented 5 years ago

Here's a first version of weighted_mean and weighted_mean_axis.

Disclaimers:

  1. I don't really know where weighted_mean and friends should go. Is summary_statistics ok?
  2. I had to move return_err_if_empty and return_err_unless_same_shape because ther are useful elsewhere.
  3. There's a little code-copy in weighted_mean_axis, to avoid (2 conditions + 1 unwrap) x nb_lanes. Maybe I could create an inner function called inner_weighted_mean or something, then call it in both functions?

Questions:

  1. Why are the summary_statistics tests (and others) not in /tests/*? I thought that the public API was supposed to be tested outside the crate. Is this not a "standard"?
fmorency commented 5 years ago

I think we should mimic the numpy behavior here and not assume the weights are pre-normalized. I think we should divide by the weight sum inside the weighted_mean method.

LukeMathWalker commented 5 years ago
  1. I don't really know where weighted_mean and friends should go. Is summary_statistics ok?

I'd say it makes sense for them to go there, given that it's where mean is.

  1. I had to move return_err_if_empty and return_err_unless_same_shape because ther are useful elsewhere.

It makes perfect sense.

  1. There's a little code-copy in weighted_mean_axis, to avoid (2 conditions + 1 unwrap) x nb_lanes. Maybe I could create an inner function called inner_weighted_mean or something, then call it in both functions?

I don't feel too strongly in either direction - if you feel like doing it, all the better :+1:

Questions:

  1. Why are the summary_statistics tests (and others) not in /tests/*? I thought that the public API was supposed to be tested outside the crate. Is this not a "standard"?

I'd say that it's standard for integration tests to go outside the crate, yes. On the other side, it's sometimes nice to have tests next to the code they refer to - we haven't been very consistent across the crate. It would probably makes sense to migrate them to the tests folder.

Overall, this looks good to me, thanks for working on it! I suggested one area of improvement around testing that would increase the robustness of our current check - let me know what you think about it @nilgoyette.

LukeMathWalker commented 5 years ago

I think we should mimic the numpy behavior here and not assume the weights are pre-normalized. I think we should divide by the weight sum inside the weighted_mean method.

I am not sure - I am concerned about introducing rounding errors doing a normalization step that might or might not be necessary. I would lean on the side of being explicit and letting the user taking care of normalisation, if they need it.

fmorency commented 5 years ago

I think we should mimic the numpy behavior here and not assume the weights are pre-normalized. I think we should divide by the weight sum inside the weighted_mean method.

I am not sure - I am concerned about introducing rounding errors doing a normalization step that might or might not be necessary. I would lean on the side of being explicit and letting the user taking care of normalisation, if they need it.

I see your point and I tend to agree.

However, my concern is that the weighted_mean function doesn't return an actual weighted mean, except then the weights are normalized. From the definition in the documentation, I would tend to

Thoughts?

LukeMathWalker commented 5 years ago

I think we should mimic the numpy behavior here and not assume the weights are pre-normalized. I think we should divide by the weight sum inside the weighted_mean method.

I am not sure - I am concerned about introducing rounding errors doing a normalization step that might or might not be necessary. I would lean on the side of being explicit and letting the user taking care of normalisation, if they need it.

I see your point and I tend to agree.

However, my concern is that the weighted_mean function doesn't return an actual weighted mean, except then the weights are normalized. From the definition in the documentation, I would tend to

  • Rename this function to weighted_sum and note in the documentation that if the weights are pre-normalized, the result is a weighted mean
  • Write an additional function weighted_mean that actually computes the weighted mean given any input - ie. divide by the sum of the weights.

Thoughts?

I think that makes perfect sense :+1:

nilgoyette commented 5 years ago

I think it's better now. I had to remove or if the length of the axis is zero and division by zero panics for type A. on weighted_mean_axis because the MultiInputError::EmptyInput error happens before any panic. Imo it's better that way.

Can you please check the two first assert_eq! of weighted_sum_dimension_zero? As I wrote, [0, N] arrays don't make sense to me so I'm not sure what's the right output.