tommyod / KDEpy

Kernel Density Estimation in Python
https://kdepy.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Added weighting of Silverman and Scott #77

Open tommyod opened 4 years ago

tommyod commented 4 years ago

Thanks for the comments @lukedyer-peak .

This was not as straightforward as I first thought. If you have any more thoughts let me know.

lukedyer-peak commented 3 years ago
  • The standard deviation is computed using ddof = 1, i.e. the sample standard deviation with n - 1 in the denominator. With weights, my immediate generalization was sum(weights) - 1, but often the weights sum to unity. I'm considering scaling the weights so the smallest weight equals one; this way the sample standard deviation subtracts the smallest weight. But I don't think that's a common way of doing it.

I think it would be helpful to define what is meant by the weights. I'm not a statistical expert, but there are two different meanings the weights can have here. I think restricting to one case or the other might help, and documenting what is meant by these weights would be useful too. Wikipedia describes two different ways of calculating a weighted standard deviation, using either frequency or reliability weights (note that in some formulas on that page the weights are assumed to have been normalised so that they sum to 1). I personally think it might be best to go with reliability weights, which GNU also uses in its scientific library. In some places reliability weights are just called weights, and frequency weights are called frequencies - see this explanation in a SAS blog.
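For concreteness, here is a minimal sketch of that reliability-weighted estimator (the unbiased weighted variance formula GSL uses), assuming non-negative weights. The name `weighted_std` is hypothetical, not KDEpy API:

```python
import numpy as np

def weighted_std(data, weights):
    """Reliability-weighted sample standard deviation (GSL-style sketch).

    Invariant to rescaling the weights, and equal to
    np.std(data, ddof=1) when all weights are equal.
    """
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    W, W2 = weights.sum(), (weights ** 2).sum()
    mean = (weights * data).sum() / W
    # Unbiased reliability-weighted variance: the denominator
    # W - W2 / W plays the role of n - 1 in the unweighted case.
    variance = (weights * (data - mean) ** 2).sum() / (W - W2 / W)
    return np.sqrt(variance)
```

With frequency weights the denominator would instead be sum(weights) - 1, which is exactly the sum(weights) - 1 generalization discussed above; the reliability version has the advantage of not depending on the scale of the weights.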

  • Weighted percentiles were also non-trivial. I found some code snippets online, but none that were very good. Many failed the property that repeated observations should behave like integer weights, i.e. that data = [0, 1, 1] should give the same result as data = [0, 1] with weights = [1, 2].

I think this logic (of using reliability weights) should follow through naturally to calculating quantiles. One could think of sampling with these weights and taking quantiles from the sampled distribution. Following that logic through leads to something like this code snippet from SO; a rough version is sketched below.
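Not having the exact SO snippet at hand, here is a rough sketch of an inverse-CDF (type 1) weighted quantile that does satisfy the repeated-observation property mentioned above. The helper `weighted_quantile` is hypothetical, not KDEpy API:

```python
import numpy as np

def weighted_quantile(data, weights, q):
    """Inverse-CDF (type 1) weighted quantile.

    Repeated observations match integer weights, e.g. the median of
    data=[0, 1, 1] equals that of data=[0, 1] with weights=[1, 2].
    """
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    sorter = np.argsort(data)
    data, weights = data[sorter], weights[sorter]
    cdf = np.cumsum(weights) / weights.sum()
    # Smallest data value whose cumulative weight reaches q;
    # the clip guards against floating-point round-off at q = 1.
    idx = np.minimum(np.searchsorted(cdf, q, side="left"), len(data) - 1)
    return data[idx]

# Both calls return 1.0, as the property requires.
print(weighted_quantile([0, 1, 1], [1, 1, 1], 0.5))
print(weighted_quantile([0, 1], [1, 2], 0.5))
```

The property holds because both inputs produce the same weighted empirical CDF, so any quantile definition based purely on that CDF (here, its generalized inverse without interpolation) must agree on them.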

  • Weights should probably not be allowed to be zero (a zero weight is equivalent to the data point not being there in the first place). This choice should be consistent, but it's most important in the first check of the weights. (Many subroutines also check the weights, just for sanity.)

I have some personal motivation to allow 0 weights, which would correspond to ignoring that observation, given how I'm planning on using this package. (I can implement this logic on my side though.) There is evidence for this approach being "standard" or "expected" too, as numpy allows weights to be 0 (and probabilities to be 0 in the random module).
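For reference, a quick sketch of the numpy behaviour being referred to: zero weights in np.average and zero probabilities in random choice are both accepted without error:

```python
import numpy as np

# A zero weight simply drops the observation from the average: 7/3.
print(np.average([1.0, 2.0, 3.0], weights=[1.0, 0.0, 2.0]))

# A zero probability is accepted; that outcome is just never drawn.
rng = np.random.default_rng(0)
print(rng.choice([0, 1, 2], size=5, p=[0.5, 0.0, 0.5]))
```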