openturns/openturns: Uncertainty treatment library
http://openturns.github.io/openturns/latest/index.html

The Kernel Smoothing could be improved. #1480


mbaudin47 commented 4 years ago

http://openturns.github.io/openturns/master/theory/data_analysis/kernel_smoothing.html

This is the algorithm from [^Sheather1991]. It is described in [^WandJones1994], page 74, under the name "solve-the-equation rule". The implementation uses ideas from [^Raykar2006], but the fast selection is not implemented.

To see this, I implemented Silverman's rule and its robust variant. The difference is easier to see with a very small sample, n=10, generated from a Normal distribution in dimension d=1.

This might not make a big difference in practice. However, the implementation does not match the name. Moreover, the theory page presents this method for dimension d=1, leading the user to believe that this method is used when d=1. One option is to rename the method computeBandwidth, which would better fit the content. Another option would be to implement the special case d=1.
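For reference, here is a minimal sketch (plain NumPy, not the OpenTURNS implementation) of Silverman's normal reference rule and its robust variant for d=1 with a Gaussian kernel, applied to the same kind of tiny sample as above:

```python
# Minimal sketch of Silverman's normal reference rule and its robust variant
# for a univariate sample and a Gaussian kernel (illustrative only).
import numpy as np

def silverman_bandwidth(x):
    """Normal reference rule: h = 1.06 * sigma_hat * n^(-1/5)."""
    x = np.asarray(x)
    n = x.size
    return 1.06 * x.std(ddof=1) * n ** (-1.0 / 5.0)

def silverman_bandwidth_robust(x):
    """Robust variant: h = 0.9 * min(sigma_hat, IQR / 1.34) * n^(-1/5)."""
    x = np.asarray(x)
    n = x.size
    sigma = x.std(ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    return 0.9 * min(sigma, iqr / 1.34) * n ** (-1.0 / 5.0)

# Very small sample, as in the experiment above: n = 10 from a standard Normal.
rng = np.random.default_rng(0)
sample = rng.standard_normal(10)
print(silverman_bandwidth(sample), silverman_bandwidth_robust(sample))
```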

[^Sheather1991]: S. J. Sheather and M. C. Jones. "A reliable data-based bandwidth selection method for kernel density estimation." Journal of the Royal Statistical Society, Series B (Methodological), 53(3):683-690, 1991.

[^WandJones1994]: M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman and Hall/CRC, 1994.

[^Raykar2006]: Vikas Chandrakant Raykar and Ramani Duraiswami. "Very fast optimal bandwidth selection for univariate kernel density estimation." 2006.

mbaudin47 commented 4 years ago

The doc part is handled in PR #1484. The implementation part remains to be done.

regislebrun commented 4 years ago

Thanks for the careful reading of the doc. Some additional remarks are in order here:

If an LGPL-compatible implementation of the fast method is available, it could be integrated into OT.

mbaudin47 commented 4 years ago

I do not know whether it is LGPL-compatible, but it claims to be faster:

https://kdepy.readthedocs.io/en/latest/comparison.html

regislebrun commented 4 years ago

Thanks for the link. The FFT-based implementation needs the data points to lie on a lattice, which requires a binning procedure: you replace your unevenly distributed points with uniform weights by evenly distributed points with nonuniform weights (a small sketch follows below). This step is already tricky to perform, not only because you have to bin points in dimension d (which is already implemented in OT for d=1 or 2), but also because you have to control the error resulting from this binning, and controlling that error is precisely one of the advantages claimed by the authors of "Very fast optimal bandwidth selection for univariate kernel density estimation".

In addition to speed, the comparison should also include an accuracy measure with respect to the naive procedure; this was done graphically in that paper but not in the new comparison. But I agree with you: there must be a way to accelerate things here, in many directions (bandwidth selection, evaluation at a single point, evaluation over a set of points, evaluation over the nodes of a regular grid...).
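To make that binning step concrete, here is a minimal 1D linear-binning sketch (plain NumPy, illustrative only, not OT's binning code): each point spreads its unit weight over the two neighbouring grid nodes, so unevenly distributed points with uniform weights become evenly spaced nodes with nonuniform weights.

```python
# Minimal 1D linear binning sketch: unevenly spaced points with uniform weights
# are replaced by evenly spaced grid nodes carrying nonuniform weights.
import numpy as np

def linear_binning_1d(x, n_bins=64):
    x = np.asarray(x, dtype=float)
    grid = np.linspace(x.min(), x.max(), n_bins)
    step = grid[1] - grid[0]
    # Index of the left grid node and the fractional position inside the cell.
    pos = (x - grid[0]) / step
    left = np.clip(np.floor(pos).astype(int), 0, n_bins - 2)
    frac = pos - left
    # Each point splits its unit weight between the two surrounding nodes.
    weights = np.zeros(n_bins)
    np.add.at(weights, left, 1.0 - frac)
    np.add.at(weights, left + 1, frac)
    return grid, weights / x.size

grid, weights = linear_binning_1d(np.random.default_rng(1).standard_normal(1000))
print(weights.sum())  # total weight is preserved (sums to 1)
```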

mbaudin47 commented 4 years ago

I have many doubts about the criteria used in the benchmark:

The "number of kernels" criteria does not seem very important in practice, even if is detailed in each benchmark I read. With a MISE difference never larger than 20% depending on the kernel, this sounds like analysing a detail. However, some implementations allow only 1 kernel while OpenTURNS provide them all (since getRoughness is now implemented)- a fact which is rarely mentioned.

Looking for speed might be important in some cases (e.g. computing minimum volume level sets of a kernel estimate). However, focusing on it might overlook other interesting topics:

Confidence intervals and bands are presented in: Müller, M. XploRe - Learning Guide. Springer, Berlin, Heidelberg, 2000, chapter 6, page 181. This allows one to compute, for example, a pointwise confidence interval. Another related topic is kernel regression.
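Here is a minimal sketch of such a pointwise interval, based on the standard asymptotic variance Var[fhat(x)] ≈ f(x) R(K) / (n h) for a Gaussian kernel, with the bias term neglected (illustrative, not taken from the book above):

```python
# Sketch of an asymptotic pointwise 95% confidence interval for a Gaussian-kernel
# KDE, ignoring the bias term: fhat(x) +/- z * sqrt(fhat(x) * R(K) / (n * h)).
import numpy as np

def kde_with_pointwise_ci(x, sample, h, z=1.96):
    sample = np.asarray(sample)
    n = sample.size
    roughness = 1.0 / (2.0 * np.sqrt(np.pi))  # R(K) for the Gaussian kernel
    u = (x[:, None] - sample[None, :]) / h
    fhat = np.exp(-0.5 * u ** 2).sum(axis=1) / (n * h * np.sqrt(2.0 * np.pi))
    half_width = z * np.sqrt(fhat * roughness / (n * h))
    return fhat, fhat - half_width, fhat + half_width

rng = np.random.default_rng(2)
data = rng.standard_normal(200)
grid = np.linspace(-3.0, 3.0, 101)
fhat, lower, upper = kde_with_pointwise_ci(grid, data, h=0.3)
```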

mbaudin47 commented 4 years ago

I added OT to the benchmark:

https://nbviewer.jupyter.org/github/mbaudin47/KDEpy/blob/BenchmarkAddOT/docs/presentation/figs/profiling.ipynb

The results are interesting:

(figure: benchmark timings of the KDE implementations versus sample size, including OT)

When the binning kicks in, the timing stops increasing, then increases again for really large sample sizes. I suggested that the author include OT in the benchmark: https://github.com/tommyod/KDEpy/pull/52
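For reference, a minimal timing loop of the kind used for the OT entry (illustrative only; the sample sizes and the default ot.KernelSmoothing() settings are assumptions, not the exact notebook code):

```python
# Illustrative timing of ot.KernelSmoothing().build() over growing sample sizes.
import time
import openturns as ot

distribution = ot.Normal()
for n in [10**3, 10**4, 10**5, 10**6]:
    sample = distribution.getSample(n)
    t0 = time.time()
    ot.KernelSmoothing().build(sample)
    print(n, time.time() - t0)
```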