tommyod / KDEpy

Kernel Density Estimation in Python
https://kdepy.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

relationship to sklearn and scipy bandwidth parameters #108

Closed: chucklesoclock closed 2 years ago

chucklesoclock commented 2 years ago

Hello!

What an efficient and useful library you have here! I was looking through the code and must admit I was defeated by this question:

What is the relationship between your calculated kde.bw bandwidth value and the bandwidth parameters of scikit-learn and scipy? For example, scipy and sklearn are related in that the following invocations are equivalent (up to minor differences in implementation):

scipy.stats.gaussian_kde(x, bw_method=bandwidth / x.std(ddof=1))
sklearn.neighbors.KernelDensity(bandwidth=bandwidth, kernel='gaussian')

That is, scipy_bw = sklearn_bw / x.std(ddof=1). Do you have the relationship offhand? Otherwise I can do some experiments.
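
One way to run the experiment mentioned above is a quick check on fake data. This is only a sketch: the sample, the evaluation grid, and the bandwidth value 0.5 are arbitrary assumptions chosen for illustration, and it tests nothing beyond the Gaussian kernel in one dimension.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.neighbors import KernelDensity

# Fake data and an evaluation grid (arbitrary choices for this check).
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=200)
grid = np.linspace(x.min() - 3.0, x.max() + 3.0, 101)

bandwidth = 0.5  # absolute bandwidth, in the same units as x

# scipy: a scalar bw_method is a factor that multiplies the sample std (ddof=1).
scipy_density = gaussian_kde(x, bw_method=bandwidth / x.std(ddof=1))(grid)

# sklearn: expects 2D input and returns log-densities from score_samples.
skl = KernelDensity(bandwidth=bandwidth, kernel="gaussian").fit(x[:, None])
sklearn_density = np.exp(skl.score_samples(grid[:, None]))

# If the two conventions line up, this difference is at floating-point level.
max_abs_diff = float(np.max(np.abs(scipy_density - sklearn_density)))
print(max_abs_diff)
```

If the printed difference is tiny, the conversion scipy_bw = sklearn_bw / x.std(ddof=1) holds for that sample; repeating the check with other seeds and bandwidths gives more confidence.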

Thanks for all your work on the library, especially the Improved Sheather-Jones bandwidth selection. I'm not sure that exists elsewhere in Python.

tommyod commented 2 years ago

Glad you like the code. I hope it is helpful to you. In KDEpy, the bandwidth h is the standard deviation σ of the kernel function.

For instance, a KDE on a single data point at x=0 using a Gaussian kernel with bandwidth h=1 would give you an N(0, 1) distribution. But I haven't looked at sklearn and scipy in a while, so I forgot how they interpret bandwidth. My advice would be to check their implementations and test on some fake data to be sure you get the relationship right.
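
To make the "bandwidth is the kernel's standard deviation" convention concrete, here is a minimal pure-Python sketch. The function `gaussian_kde_at` is a hypothetical helper written for this illustration, not KDEpy's implementation; it only spells out the definition and compares it with the normal density from the standard library.

```python
import math
from statistics import NormalDist


def gaussian_kde_at(x, data, h):
    """Gaussian KDE at point x, where h is the kernel's standard deviation.

    Hypothetical helper for illustration only, not KDEpy code.
    """
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)


points = [-2.0, -0.5, 0.0, 1.0, 3.0]

# A single data point at 0 with h=1 should reproduce the N(0, 1) density.
max_err_h1 = max(
    abs(gaussian_kde_at(x, [0.0], 1.0) - NormalDist(0.0, 1.0).pdf(x))
    for x in points
)

# With h=2 the result should be a normal density with standard deviation 2.
max_err_h2 = max(
    abs(gaussian_kde_at(x, [0.0], 2.0) - NormalDist(0.0, 2.0).pdf(x))
    for x in points
)

print(max_err_h1, max_err_h2)
```

Under this convention the bandwidth is an absolute quantity in the units of the data, so it does not depend on the sample's spread. That is the property to check when comparing against another library on fake data.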