python / cpython


Add kernel density estimation to the statistics module #115532

Closed (rhettinger closed this issue 9 months ago)

rhettinger commented 9 months ago

Feature

Proposal:

I propose promoting the KDE statistics recipe to be an actual part of the statistics module.
See: https://docs.python.org/3.13/library/statistics.html#kernel-density-estimation

My thought is that KDE is an essential part of statistical reasoning about sample data. See: https://www.itm-conferences.org/articles/itmconf/pdf/2018/08/itmconf_sam2018_00037.pdf

Discussion of this feature:

This was discussed and reviewed with the module author who gave it his full support:

This looks really great, both as a concept and the code itself. It's a good showcase for match, and an important statistics function.

Thank you for working on this.

Working Draft:

from typing import Sequence, Callable
from bisect import bisect_left, bisect_right
from math import sqrt, exp, cos, pi, cosh

def kde(sample: Sequence[float],
        h: float,
        kernel_function: str='gauss',
    ) -> Callable[[float], float]:
    """Kernel Density Estimation.  Creates a continuous probability
    density function from discrete samples.

    The basic idea is to smooth the data using a kernel function
    to help draw inferences about a population from a sample.

    The degree of smoothing is controlled by the scaling parameter h
    which is called the bandwidth.  Smaller values emphasize local
    features while larger values give smoother results.

    Kernel functions that give some weight to every sample point:

       gauss or normal
       logistic
       sigmoid

    Kernel functions that only give weight to sample points within
    the bandwidth:

       rectangular or uniform
       triangular
       epanechnikov or parabolic
       biweight or quartic
       triweight
       cosine

    Example
    -------

    Given a sample of six data points, estimate the
    probability density at evenly spaced points from -6 to 10:

        >>> sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
        >>> f_hat = kde(sample, h=1.5)
        >>> total = 0.0
        >>> for x in range(-6, 11):
        ...     density = f_hat(x)
        ...     total += density
        ...     plot = ' ' * int(density * 400) + 'x'
        ...     print(f'{x:2}: {density:.3f} {plot}')
        ...
        -6: 0.002 x
        -5: 0.009    x
        -4: 0.031             x
        -3: 0.070                             x
        -2: 0.111                                             x
        -1: 0.125                                                   x
         0: 0.110                                            x
         1: 0.086                                   x
         2: 0.068                            x
         3: 0.059                        x
         4: 0.066                           x
         5: 0.082                                 x
         6: 0.082                                 x
         7: 0.058                        x
         8: 0.028            x
         9: 0.009    x
        10: 0.002 x
        >>> round(total, 3)
        0.999

    References
    ----------

    Kernel density estimation and its application:
    https://www.itm-conferences.org/articles/itmconf/pdf/2018/08/itmconf_sam2018_00037.pdf

    Kernel functions in common use:
    https://en.wikipedia.org/wiki/Kernel_(statistics)#Kernel_functions_in_common_use

    Interactive graphical demonstration and exploration:
    https://demonstrations.wolfram.com/KernelDensityEstimation/

    """

    kernel: Callable[[float], float]
    support: float | None

    match kernel_function:

        case 'gauss' | 'normal':
            c = 1 / sqrt(2 * pi)
            kernel = lambda t: c * exp(-1/2 * t * t)
            support = None

        case 'logistic':
            # 1.0 / (exp(t) + 2.0 + exp(-t))
            kernel = lambda t: 1/2 / (1.0 + cosh(t))
            support = None

        case 'sigmoid':
            # (2/pi) / (exp(t) + exp(-t))
            c = 1 / pi
            kernel = lambda t: c / cosh(t)
            support = None

        case 'rectangular' | 'uniform':
            kernel = lambda t: 1/2
            support = 1.0

        case 'triangular':
            kernel = lambda t: 1.0 - abs(t)
            support = 1.0

        case 'epanechnikov' | 'parabolic':
            kernel = lambda t: 3/4 * (1.0 - t * t)
            support = 1.0

        case 'biweight' | 'quartic':
            kernel = lambda t: 15/16 * (1.0 - t * t) ** 2
            support = 1.0

        case 'triweight':
            kernel = lambda t: 35/32 * (1.0 - t * t) ** 3
            support = 1.0

        case 'cosine':
            c1 = pi / 4
            c2 = pi / 2
            kernel = lambda t: c1 * cos(c2 * t)
            support = 1.0

        case _:
            raise ValueError(f'Unknown kernel function: {kernel_function!r}')

    n = len(sample)

    if support is None:

        def pdf(x: float) -> float:
            return sum(kernel((x - x_i) / h) for x_i in sample) / (n * h)

    else:

        sample = sorted(sample)
        bandwidth = h * support

        def pdf(x: float) -> float:
            i = bisect_left(sample, x - bandwidth)
            j = bisect_right(sample, x + bandwidth)
            supported = sample[i : j]
            return sum(kernel((x - x_i) / h) for x_i in supported) / (n * h)

    return pdf

def _test() -> None:
    from statistics import NormalDist

    def kde_normal(sample: Sequence[float], h: float) -> Callable[[float], float]:
        "Create a continuous probability density function from a sample."
        # Smooth the sample with a normal distribution kernel scaled by h.
        kernel_h = NormalDist(0.0, h).pdf
        n = len(sample)
        def pdf(x: float) -> float:
            return sum(kernel_h(x - x_i) for x_i in sample) / n
        return pdf

    D = NormalDist(250, 50)
    data = D.samples(100)
    h = 30
    pd = D.pdf

    fg = kde(data, h, 'gauss')
    fg2 = kde_normal(data, h)
    fr = kde(data, h, 'rectangular')
    ft = kde(data, h, 'triangular')
    fb = kde(data, h, 'biweight')
    fe = kde(data, h, 'epanechnikov')
    fc = kde(data, h, 'cosine')
    fl = kde(data, h, 'logistic')
    fs = kde(data, h, 'sigmoid')

    def show(x: float) -> None:
        for func in (pd, fg, fg2, fr, ft, fb, fe, fc, fl, fs):
            print(func(x))
        print()

    show(197)
    show(255)
    show(320)

    def integrate(func: Callable[[float], float],
                  low: float, high: float, steps: int=1000) -> float:
        width = (high - low) / steps
        xarr = (low + width * i for i in range(steps))
        return sum(map(func, xarr)) * width

    print('\nIntegrals:')
    for func in (pd, fg, fg2, fr, ft, fb, fe, fc, fl, fs):
        print(integrate(func, 0, 500, steps=10_000))

if __name__ == '__main__':

    import doctest

    _test()
    print(doctest.testmod())

Linked PRs

gpshead commented 9 months ago

This looks like a pretty clean, useful addition. Nice draft!

Some things to consider (better explored in a draft PR than inline in an issue): kernel_function= being a str is perhaps unusual. Could that instead be a direct reference to a Callable[[float], float] that also carries a .support: float | None attribute? Those could be pre-defined by name, perhaps within a statistics.kernel-ish namespace, to avoid the match/case and the on-the-fly definitions entirely. This would also allow people to bring their own kernel rather than only use our predefined common ones.
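For illustration only, here is a minimal sketch of what that alternative API might look like. The Kernel class, the attribute names, and the pre-defined kernel objects are hypothetical, not part of the draft; the pruning by support is omitted to keep the sketch short.

from dataclasses import dataclass
from math import exp, pi, sqrt
from typing import Callable, Sequence

@dataclass(frozen=True)
class Kernel:
    "A kernel callable bundled with its support (None means unbounded)."
    func: Callable[[float], float]
    support: float | None = None

    def __call__(self, t: float) -> float:
        return self.func(t)

# Pre-defined kernels that could live in a statistics.kernels-style namespace:
normal = Kernel(lambda t: exp(-t * t / 2) / sqrt(2 * pi))
epanechnikov = Kernel(lambda t: 3/4 * (1.0 - t * t), support=1.0)

def kde(sample: Sequence[float], h: float,
        kernel: Kernel = normal) -> Callable[[float], float]:
    "Plain (unpruned) KDE; kernel.support could drive the bisect pruning."
    n = len(sample)
    def pdf(x: float) -> float:
        return sum(kernel((x - x_i) / h) for x_i in sample) / (n * h)
    return pdf

# A user-supplied kernel would then plug in directly:
boxcar = Kernel(lambda t: 1/2, support=1.0)
f_hat = kde([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2], h=1.5, kernel=boxcar)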

rhettinger commented 9 months ago

Thanks for the compliment on the draft. I've been iterating on it for a few weeks.

FWIW, having a string kernel name is consistent with the API style for quantiles() and correlation() elsewhere in the module. I also like having the kde() function fully self-contained, which makes it easier to use. The listed kernels are the only ones consistently mentioned in the KDE literature, and SciPy implements only the normal kernel since it dominates the others in practice. So I think supporting custom kernels would be an over-generalization beyond what people actually use.
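For context, those existing string-based selectors look like this (a small usage snippet; correlation(method='ranked') requires Python 3.12+):

from statistics import correlation, quantiles

data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92]

# quantiles() chooses its algorithm with a string argument ...
cut_points = quantiles(data, n=4, method='exclusive')

# ... and correlation() selects Spearman's rank correlation the same way.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 7, 10]
rho = correlation(x, y, method='ranked')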

Side note: While the code volume makes it look like the kernel choice is the most important part, the referenced paper says that changing kernels makes only minor differences in the output. The most significant parameter by far is h, which has dramatic effects on the final result.
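A quick way to see that effect with the draft kde() above (a sketch that assumes the draft function is in scope; exact numbers will vary):

sample = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]

# Varying the bandwidth with a fixed kernel: large changes in the estimate.
for h in (0.5, 1.5, 5.0):
    f_hat = kde(sample, h)
    print(f'h={h}: f_hat(0.0) = {f_hat(0.0):.3f}')

# Varying the kernel with a fixed bandwidth: comparatively minor changes.
for name in ('gauss', 'epanechnikov', 'triangular'):
    f_hat = kde(sample, 1.5, kernel_function=name)
    print(f'{name}: f_hat(0.0) = {f_hat(0.0):.3f}')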