tommyod / KDEpy

Kernel Density Estimation in Python
https://kdepy.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Add a new rule of thumb #161

Open Expertium opened 10 months ago

Expertium commented 10 months ago

There is a rule of thumb which should, in theory, perform better than Silverman's rule. Here is the relevant paper: https://www.hindawi.com/journals/jps/2015/242683/ And here's my simple Python implementation for one-dimensional data:

import numpy as np

def chens_rule(data):
    # Chen's rule of thumb: https://www.hindawi.com/journals/jps/2015/242683/
    data = np.asarray(data)
    std = np.std(data)
    # Normalized IQR: a robust scale estimate, as in Silverman's rule
    # (1.3489... is the IQR of a standard normal distribution)
    IQR = (np.percentile(data, q=75) - np.percentile(data, q=25)) / 1.3489795003921634
    scale = min(IQR, std)
    mean = np.mean(data)
    n = len(data)
    if mean != 0 and scale > 0:
        cv = (1 + 1 / (4 * n)) * scale / mean  # coefficient of variation, corrected for small sample size
        h = ((4 * (2 + cv ** 2)) ** (1 / 5)) * scale * (n ** (-2 / 5))
        return h
    else:
        raise ValueError("Chen's rule failed: mean is zero or scale is not positive")

Note that I made two changes compared to the original paper:

1. The estimate of scale is not exactly the standard deviation: I changed it to a more robust estimate, similar to the one in Silverman's rule.
2. I added a small-sample-size correction to the coefficient of variation. However, it's only appropriate for normally distributed data, so I'm not entirely sure whether it should be used.
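
For reference, here is a minimal sketch comparing the bandwidth from chens_rule against the textbook Silverman bandwidth h = 0.9 * min(std, IQR/1.349) * n**(-1/5), written out inline rather than taken from KDEpy; the synthetic sample and the inline Silverman computation are illustrative assumptions, not part of the paper:

import numpy as np

rng = np.random.default_rng(123)
data = rng.normal(loc=10.0, scale=3.0, size=2_000)  # illustrative sample with nonzero mean

# Textbook Silverman rule of thumb, computed inline for comparison
std = np.std(data)
iqr = (np.percentile(data, q=75) - np.percentile(data, q=25)) / 1.3489795003921634
h_silverman = 0.9 * min(std, iqr) * len(data) ** (-1 / 5)

print(f"Chen:      {chens_rule(data):.4f}")
print(f"Silverman: {h_silverman:.4f}")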

tommyod commented 10 months ago

Hi, thanks for bringing this to my attention.

However, if you want this to go into KDEpy, you will have to write a PR for it. I'm open to it if it performs well on data; I would want to see it plotted against Silverman.

Expertium commented 10 months ago

Unfortunately, I'm not that good at coding; making a fully fleshed-out version of this function is beyond my abilities. The best I can do is add support for weights, at least for one-dimensional data:

import numpy as np

def chen_rule(data, weights=None):
    # Chen's rule of thumb: https://www.hindawi.com/journals/jps/2015/242683/
    data = np.asarray(data)
    if weights is None:
        weights = np.ones_like(data, dtype=float)
    else:
        weights = np.asarray(weights, dtype=float)

    def weighted_percentile(data, weights, q):
        # q must be between 0 and 1
        ix = np.argsort(data)
        data = data[ix]  # sort data
        weights = weights[ix]  # sort weights accordingly
        C = 1
        cdf = (np.cumsum(weights) - C * weights) / (np.sum(weights) + (1 - 2 * C) * weights)  # 'like' a CDF function
        # when all weights are equal to 1, this matches np.percentile with 'linear' interpolation
        return np.interp(q, cdf, data)

    std = np.sqrt(np.cov(data, aweights=weights))
    IQR = (weighted_percentile(data, weights, 0.75) - weighted_percentile(data, weights, 0.25)) / 1.3489795003921634
    scale = min(IQR, std)
    mean = np.average(data, weights=weights)
    n = len(data)
    if mean != 0 and scale > 0:
        cv = (1 + 1 / (4 * n)) * scale / mean  # coefficient of variation, corrected for small sample size
        h = ((4 * (2 + cv ** 2)) ** (1 / 5)) * scale * (n ** (-2 / 5))
        return h
    else:
        raise ValueError("Chen's rule failed: mean is zero or scale is not positive")
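
As a quick sanity check of the weighted version: with unit weights, chen_rule should essentially reproduce chens_rule from the first comment; small differences can remain because np.std uses ddof=0 while np.cov defaults to an unbiased (ddof=1) estimate. A minimal sketch, assuming both functions are defined and using an illustrative sample:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)  # nonzero mean so cv is well defined

print(chens_rule(data))                             # unweighted version
print(chen_rule(data))                              # weights default to ones
print(chen_rule(data, weights=np.ones_like(data)))  # explicit unit weights
# all three should agree to within a fraction of a percent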

As for plotting, here's a comparison on a sample from the standard normal distribution:

[Figure 1: KDE of a standard normal sample, comparing Chen, Silverman, and ISJ]

Just by eyeballing it, Chen seems to be a little closer to the "true" distribution than Silverman, although maybe I just got lucky with this sample. Improved Sheather-Jones doesn't perform well: the line is very "squiggly" rather than smooth. One issue with Chen's rule is that cv = (1 + 1/(4 * n)) * scale / mean approaches infinity as the mean approaches 0. However, I think it's more interesting to compare them on a heavily skewed distribution:

[Figure 2: KDE of a heavily skewed sample, comparing Chen, Silverman, and ISJ]

Chen is in between Silverman and ISJ in terms of smoothness. I think this detail demonstrates it especially well:

[Zoomed-in detail of Figure 2]

Silverman (orange) is the smoothest, Chen (blue) is more squiggly, and ISJ (green) is the squiggliest. Obviously, this isn't a rigorous analysis.
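
For anyone who wants to reproduce a comparison like the figures above, here is a minimal sketch using KDEpy's FFTKDE with the string bandwidths 'silverman' and 'ISJ' alongside the numeric bandwidth from chen_rule; the lognormal sample is an assumed stand-in for the skewed data in the plots:

import matplotlib.pyplot as plt
import numpy as np
from KDEpy import FFTKDE

rng = np.random.default_rng(42)
data = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)  # illustrative heavily skewed sample

for bw, label in [(chen_rule(data), "Chen"), ("silverman", "Silverman"), ("ISJ", "ISJ")]:
    x, y = FFTKDE(kernel="gaussian", bw=bw).fit(data).evaluate()
    plt.plot(x, y, label=label)

plt.legend()
plt.show()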