tommyod / KDEpy

Kernel Density Estimation in Python
https://kdepy.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Add a new rule of thumb #161

Open Expertium opened 8 months ago

Expertium commented 8 months ago

There is a rule of thumb which should, in theory, perform better than Silverman's rule. Here is the relevant paper: https://www.hindawi.com/journals/jps/2015/242683/ And here's my simple Python implementation for one-dimensional data:

import numpy as np

def chens_rule(data):
    # Bandwidth rule of thumb from https://www.hindawi.com/journals/jps/2015/242683/
    data = np.asarray(data)
    std = np.std(data)
    # 1.3489... is the IQR of the standard normal distribution, so dividing
    # by it turns the IQR into a robust estimate of the standard deviation
    IQR = (np.percentile(data, q=75) - np.percentile(data, q=25)) / 1.3489795003921634
    scale = min(IQR, std)
    mean = np.mean(data)
    n = len(data)
    if mean != 0 and scale > 0:
        cv = (1 + 1 / (4 * n)) * scale / mean  # coefficient of variation, corrected for small sample size
        h = ((4 * (2 + cv ** 2)) ** (1 / 5)) * scale * (n ** (-2 / 5))
        return h
    else:
        raise ValueError("Chen's rule failed: mean is zero or scale is not positive")

Note that I made two changes compared to the original paper:

1. The estimate of scale is not exactly the standard deviation: I made it more robust, similar to Silverman's rule.
2. I added a sample size correction to the coefficient of variation. However, it's only appropriate for normally distributed data, so I'm not entirely sure whether it should be used.
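
For reference, a bandwidth from this rule can be passed to KDEpy like any scalar bw value. A minimal sketch, assuming the chens_rule function above is in scope (the data here is made up for illustration):

import numpy as np
from KDEpy import FFTKDE

data = np.random.default_rng(42).normal(loc=5.0, scale=2.0, size=1000)
h = chens_rule(data)  # scalar bandwidth from the rule above
x, y = FFTKDE(kernel="gaussian", bw=h).fit(data).evaluate()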

tommyod commented 8 months ago

Hi, thanks for bringing this to my attention.

However, if you want this to go into KDEpy you will have to write it and submit a PR. I'm open to it if it seems to perform well on data. I would want to see it compared against Silverman on a plot.

Expertium commented 8 months ago

Unfortunately, I'm not that good at coding; making a fully fleshed-out version of this function is beyond my abilities. The best I can do is add support for weights, at least for one-dimensional data:

import numpy as np

def chen_rule(data, weights=None):
    # https://www.hindawi.com/journals/jps/2015/242683/
    data = np.asarray(data)
    if weights is None:
        weights = np.ones_like(data, dtype=float)
    else:
        weights = np.asarray(weights, dtype=float)

    def weighted_percentile(data, weights, q):
        # q must be between 0 and 1
        ix = np.argsort(data)
        data = data[ix]  # sort data
        weights = weights[ix]  # sort weights accordingly
        C = 1
        # 'like' a CDF function
        cdf = (np.cumsum(weights) - C * weights) / (np.sum(weights) + (1 - 2 * C) * weights)
        # when all weights are equal to 1, this is equivalent to np.percentile with 'linear' interpolation
        return np.interp(q, cdf, data)

    std = float(np.sqrt(np.cov(data, aweights=weights)))
    # 1.3489... is the IQR of the standard normal distribution
    IQR = (weighted_percentile(data, weights, 0.75) - weighted_percentile(data, weights, 0.25)) / 1.3489795003921634
    scale = min(IQR, std)
    mean = np.average(data, weights=weights)
    n = len(data)
    if mean != 0 and scale > 0:
        cv = (1 + 1 / (4 * n)) * scale / mean  # coefficient of variation, corrected for small sample size
        h = ((4 * (2 + cv ** 2)) ** (1 / 5)) * scale * (n ** (-2 / 5))
        return h
    else:
        raise ValueError("Chen's rule failed: mean is zero or scale is not positive")
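
As a quick sanity check (a sketch, assuming both functions above are defined): with unit weights the weighted version should very nearly reproduce the unweighted one; they can differ slightly because np.cov applies a degrees-of-freedom correction that np.std does not.

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=500)

print(chens_rule(data))
print(chen_rule(data, weights=np.ones_like(data)))  # should be very close to the above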

As for plotting, here's a comparison using the standard normal distribution: [Figure 1: KDE fits with the Chen, Silverman, and ISJ bandwidths on a standard normal sample]
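
A sketch of how such a comparison could be reproduced (assuming the chen_rule function above; "silverman" and "ISJ" are bandwidth strings KDEpy accepts, and matplotlib/scipy are used only for plotting and the reference density):

import matplotlib.pyplot as plt
import numpy as np
from KDEpy import FFTKDE
from scipy import stats

rng = np.random.default_rng(1)
data = rng.standard_normal(1000)

for bw, label in [("silverman", "Silverman"), ("ISJ", "ISJ"), (chen_rule(data), "Chen")]:
    x, y = FFTKDE(kernel="gaussian", bw=bw).fit(data).evaluate()
    plt.plot(x, y, label=label)

grid = np.linspace(-4, 4, 200)
plt.plot(grid, stats.norm.pdf(grid), "k--", label="true density")
plt.legend()
plt.show()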

Just by eyeballing it, Chen seems to be a little closer to the "true" distribution than Silverman, although maybe I just got lucky with this sample. Improved Sheather-Jones doesn't perform well: the line is very "squiggly" rather than smooth. One issue with Chen's rule is that cv = (1 + 1/(4 * n)) * scale / mean approaches infinity as the mean approaches 0 (there is a small sketch of this failure mode at the end of this comment). However, I think it's more interesting to compare them on a heavily skewed distribution: [Figure 2: the same comparison on a heavily skewed sample]

Chen is in between Silverman and ISJ in terms of smoothness. I think this part demonstrates it especially well: [zoomed-in detail of Figure 2] Silverman (orange) is the smoothest, Chen (blue) is more squiggly, and ISJ (green) is the most squiggly. Obviously, this isn't a rigorous analysis.
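
Regarding the cv blow-up mentioned above, a small sketch (assuming chen_rule from earlier; the shift values are arbitrary, chosen only for illustration). Forcing the sample mean toward zero makes the bandwidth grow without bound:

rng = np.random.default_rng(2)
base = rng.standard_normal(1000)

for shift in [10.0, 1.0, 0.01, 0.001]:
    data = base - base.mean() + shift  # force the sample mean to be exactly `shift`
    print(shift, chen_rule(data))  # h grows as the mean approaches 0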