Open · tommyod opened this issue 4 years ago
- The standard deviation is computed using `ddof = 1`, i.e. the sample standard deviation with `n - 1` in the denominator. With weights, my immediate generalization was `sum(weights) - 1`, but often the weights sum to unity. I'm considering scaling the weights so that the smallest weight equals one; that way the sample standard deviation subtracts the smallest weight. But I don't think that's a common way of doing it.
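For concreteness, a minimal sketch of that `sum(weights) - 1` generalization, treating the weights as frequencies (the function name is hypothetical, not part of KDEpy's API):

```python
import numpy as np

def weighted_std_freq(data, weights):
    """Sample std with frequency weights: for integer weights this
    matches np.std(np.repeat(data, weights), ddof=1)."""
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean = np.average(data, weights=weights)
    # Frequency-weight analogue of the n - 1 denominator
    variance = np.sum(weights * (data - mean) ** 2) / (weights.sum() - 1)
    return np.sqrt(variance)
```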
I think it would be helpful to define what is meant by the weights. I'm not a statistics expert, but there are two different meanings the weights can have here. Restricting to one case or the other might help, and documenting what is meant by these weights would be useful too. Wikipedia describes two different ways of calculating a weighted standard deviation, with either frequency weights or reliability weights (note that in some formulas on that page the weights are assumed to have been normalised to sum to 1). I personally think it might be best to go with reliability weights, which is also what GNU does in its scientific library. In some places reliability weights are referred to simply as weights, and frequency weights as frequencies; see this explanation in a SAS blog.
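For reference, a sketch of the reliability-weight variant, following the unbiased correction used in the GNU Scientific Library (again, the function name is hypothetical):

```python
import numpy as np

def weighted_std_reliability(data, weights):
    """Sample std with reliability weights, using the GSL-style
    unbiased correction V1 / (V1**2 - V2). The result is invariant
    to rescaling the weights, e.g. normalising them to sum to 1."""
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    v1, v2 = weights.sum(), np.sum(weights ** 2)
    mean = np.average(data, weights=weights)
    return np.sqrt(np.sum(weights * (data - mean) ** 2) * v1 / (v1 ** 2 - v2))
```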
- Weighted percentiles were also non-trivial. I found some code snippets online, but none that were very good. Many failed the property that repeated observations should equal integer weights, i.e. that `data = [0, 1, 1]` should equal `data = [0, 1]` with `weights = [1, 2]` (see the sketch below).
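One way to guarantee that property is to compute positions in the virtual repeated array directly. This is a sketch under the assumption that numpy's default 'linear' interpolation is the target behaviour; `weighted_quantile` is a hypothetical name:

```python
import numpy as np

def weighted_quantile(data, weights, q):
    """Weighted quantile that, for integer weights, reproduces
    np.quantile(np.repeat(data, weights), q) without materialising
    the repeated array."""
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights)
    sorter = np.argsort(data)
    data, weights = data[sorter], weights[sorter]
    cum = np.cumsum(weights)
    n = cum[-1]
    # Fractional position in the virtual repeated array, matching
    # numpy's default 'linear' interpolation: h = q * (n - 1).
    h = np.asarray(q, dtype=float) * (n - 1)
    lo = np.floor(h).astype(int)
    hi = np.ceil(h).astype(int)
    # The value at repeated position k belongs to the first observation
    # whose cumulative weight exceeds k.
    x_lo = data[np.searchsorted(cum, lo, side="right")]
    x_hi = data[np.searchsorted(cum, hi, side="right")]
    return x_lo + (h - lo) * (x_hi - x_lo)
```

For example, `weighted_quantile([0, 1], [1, 2], 0.25)` gives `0.5`, the same as `np.quantile([0, 1, 1], 0.25)`.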
I think this logic (of using reliability weights) should follow through naturally to calculating quantiles. One could think of sampling with these weights and taking quantiles from the sampled distributions. If you follow that logic through, it leads to something like this code snippet from SO.
- Weights should probably not be allowed to be zero (a zero weight is equivalent to the data point not being there in the first place). This choice should be consistent, but it matters most in the first check of the weights. (Many sub-routines also check the weights, just for sanity.)
I have some personal motivation to allow 0 weighting, which would correspond to ignoring that observation, as I'm planning on using this package. (I can implement this logic on my side, though.) There is evidence for this approach being "standard" or "expected" too, as numpy allows weights to be 0 (and probabilities to be 0 in the random module).
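A quick illustration of the numpy behaviour mentioned above: zero weights (and zero probabilities) are accepted and simply make the corresponding observation inert:

```python
import numpy as np

data = np.array([0.0, 1.0, 1.0])

# A zero weight leaves the result unchanged, as if the point were absent:
np.average(data, weights=[1, 2, 0])      # 0.666...
np.average(data[:2], weights=[1, 2])     # identical

# The random module likewise accepts zero probabilities:
rng = np.random.default_rng(0)
rng.choice(data, size=5, p=[0.5, 0.5, 0.0])  # never draws the third point
```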
Thanks for the comments, @lukedyer-peak.
This was not as straightforward as I first thought. If you have any more thoughts let me know.
The property that `data = [0, 1, 1]` should equal `data = [0, 1]` with `weights = [1, 2]` should apply to the entire KDEpy library. I don't see any other possible interpretation that makes sense.
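Stated as a test, using the hypothetical helpers sketched earlier, that invariant reads:

```python
import numpy as np

# Integer weights must behave exactly like repeated observations
# (weighted_std_freq and weighted_quantile are the sketches above).
data, weights = np.array([0.0, 1.0]), np.array([1, 2])
repeated = np.repeat(data, weights)  # [0., 1., 1.]

assert np.isclose(weighted_std_freq(data, weights),
                  np.std(repeated, ddof=1))
assert np.isclose(weighted_quantile(data, weights, 0.5),
                  np.quantile(repeated, 0.5))
```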