Expertium opened this issue 10 months ago

There is a rule of thumb which should, in theory, perform better than Silverman's rule. Here is the relevant paper: https://www.hindawi.com/journals/jps/2015/242683/ And here's my simple Python implementation for one-dimensional data. Note that I made two changes compared to the original paper:

1) The estimate of scale is not exactly the standard deviation: I changed it to make it more robust, similar to Silverman's rule.
2) I added a sample size correction to the coefficient of variation. However, it's only appropriate for normally distributed data, so I'm not entirely sure whether it should be used.
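A minimal sketch of that unweighted implementation, assuming it mirrors the weighted version posted later in this thread with all weights equal to one (the names and the two modifications are taken from that code):

```python
import numpy as np

def chen_rule(data):
    # Rule of thumb from https://www.hindawi.com/journals/jps/2015/242683/
    std = np.std(data)
    # Change 1: robust scale, as in Silverman's rule -- the smaller of std and normalized IQR
    IQR = (np.percentile(data, 75) - np.percentile(data, 25)) / 1.3489795003921634
    scale = min(IQR, std)
    mean = np.mean(data)
    n = len(data)
    if mean != 0 and scale > 0:
        # Change 2: coefficient of variation with a small-sample correction
        cv = (1 + 1 / (4 * n)) * scale / mean
        return ((4 * (2 + cv ** 2)) ** (1 / 5)) * scale * (n ** (-2 / 5))
    raise Exception("Chen's rule failed")
```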
Hi, thanks for bringing this to my attention.
However, if you want this to go into KDEpy, you will have to write it and submit a PR. I'm open to it if it seems to perform well on data. I would want to see it compared against Silverman on a plot.
Unfortunately, I'm not that good at coding; making a fully fleshed-out version of this function is beyond my abilities. The best I can do is add support for weighting, at least for one-dimensional data:
```python
import numpy as np

def chen_rule(data, weights=None):
    # Rule of thumb from https://www.hindawi.com/journals/jps/2015/242683/
    if weights is None:
        weights = np.ones_like(data, dtype=float)

    def weighted_percentile(data, weights, q):
        # q must be between 0 and 1
        ix = np.argsort(data)
        data = data[ix]  # sort data
        weights = weights[ix]  # sort weights accordingly
        C = 1
        # 'like' a CDF function
        cdf = (np.cumsum(weights) - C * weights) / (np.sum(weights) + (1 - 2 * C) * weights)
        # when all weights are equal to 1, this is equivalent to using 'linear' in np.percentile
        return np.interp(q, cdf, data)

    # Robust scale estimate: the smaller of the weighted std and the normalized IQR
    std = np.sqrt(np.cov(data, aweights=weights))
    IQR = (weighted_percentile(data, weights, 0.75) - weighted_percentile(data, weights, 0.25)) / 1.3489795003921634
    scale = min(IQR, std)
    mean = np.average(data, weights=weights)
    n = len(data)
    if mean != 0 and scale > 0:
        cv = (1 + 1 / (4 * n)) * scale / mean  # coefficient of variation, corrected for small sample size
        h = ((4 * (2 + cv ** 2)) ** (1 / 5)) * scale * (n ** (-2 / 5))
        return h
    else:
        raise Exception("Chen's rule failed")
```
As for plotting, here's a comparison using the standard normal distribution:
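A sketch of how such a comparison can be produced with KDEpy, assuming the `chen_rule` above is in scope; the sample size, seed, and grid are illustrative choices, not the ones behind the original plots:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from KDEpy import FFTKDE

rng = np.random.default_rng(42)
data = rng.standard_normal(1000)

# Fit a Gaussian KDE with each bandwidth rule on KDEpy's automatic grid
bandwidths = {"Chen": float(chen_rule(data)), "Silverman": "silverman", "ISJ": "ISJ"}
for label, bw in bandwidths.items():
    x, y = FFTKDE(kernel="gaussian", bw=bw).fit(data).evaluate()
    plt.plot(x, y, label=label)

# Overlay the "true" density for reference
grid = np.linspace(-4, 4, 500)
plt.plot(grid, norm.pdf(grid), "k--", label="true N(0, 1) density")
plt.legend()
plt.show()
```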
Just by eyeballing it, Chen seems to be a little closer to the "true" distribution than Silverman, although maybe I just got lucky with this sample. Improved Sheather-Jones doesn't perform well: the line is very "squiggly" rather than smooth. The issue with Chen's rule is that cv = (1 + 1/(4 * n)) * scale / mean approaches infinity as the mean approaches 0, so the bandwidth blows up for data centered near zero.
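One quick way to see the blow-up, assuming the `chen_rule` above (the sample and shifts are illustrative choices): shift a fixed sample so its mean approaches zero and watch the bandwidth grow even though the scale never changes.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(loc=5.0, scale=1.0, size=1000)

for shift in (0.0, 4.0, 4.5, 4.9):
    sample = base - shift  # scale is unchanged, but the mean shrinks toward 0
    print(f"mean ~ {sample.mean():+.2f}  ->  h = {chen_rule(sample):.3f}")
```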
However, I think it's more interesting to compare them on a distribution that is heavily skewed:
Chen is in between Silverman and ISJ in terms of smoothness. I think this region of the plot demonstrates it especially well: Silverman (orange) is the smoothest, Chen (blue) is more squiggly, and ISJ (green) is the most squiggly of the three. Obviously, this isn't a rigorous analysis.
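The skewed sample behind those plots isn't specified; as a hypothetical stand-in, a lognormal draw can be swapped into the comparison script above to reproduce this kind of plot:

```python
import numpy as np

rng = np.random.default_rng(7)
# A heavily right-skewed stand-in; use this as `data` in the comparison script above
data = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)
```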