mbaudin47 opened this issue 4 years ago
The doc part is managed in PR #1484. The implementation part remains to be done.
Thanks for the careful reading of the doc. Some additional remarks are in order here:
If an LGPL-compatible implementation of the fast method is available, it could be integrated into OT.
I do not know if this is LGPL, but it claims to be faster:
Thanks for the link. The FFT-based implementation needs the data points to lie on a lattice, which involves a binning procedure: you replace your unevenly distributed points with uniform weights by evenly distributed points with nonuniform weights. This step is already tricky to perform, not only because you have to bin points in dimension d (which is already implemented in OT for d=1 or 2), but also because you have to control the error resulting from this binning, which is precisely one of the advantages claimed by the authors of "Very Fast optimal bandwidth selection for univariate kernel density estimation". In addition to the speed, the comparison should also include an accuracy measure with respect to the naive procedure, which was done graphically here but not in the new comparison. But I agree with you: there must be a way to accelerate things here, in many directions (bandwidth selection, evaluation at a single point, evaluation over a set of points, over the nodes of a regular grid...).
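For reference, here is a minimal sketch of the linear binning step in 1D, written with numpy (this is not OT's implementation, only an illustration of the weight-spreading idea; the grid size is arbitrary):

```python
import numpy as np

def linear_binning_1d(x, n_bins=512):
    """Map unevenly spaced points with uniform weights onto a regular grid
    of n_bins nodes with nonuniform weights (the total weight is preserved)."""
    xmin, xmax = x.min(), x.max()
    delta = (xmax - xmin) / (n_bins - 1)   # grid spacing
    pos = (x - xmin) / delta               # fractional grid coordinates
    left = np.floor(pos).astype(int)       # index of the left neighbouring node
    frac = pos - left                      # distance to the left node, in [0, 1)
    weights = np.zeros(n_bins)
    np.add.at(weights, left, (1.0 - frac) / x.size)                      # share given to the left node
    np.add.at(weights, np.minimum(left + 1, n_bins - 1), frac / x.size)  # share given to the right node
    grid = xmin + delta * np.arange(n_bins)
    return grid, weights

rng = np.random.default_rng(0)
grid, weights = linear_binning_1d(rng.normal(size=10_000))
print(weights.sum())  # ~1.0: the binning preserves the total mass
```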
I have many doubts about the criteria used in the benchmark:
The "number of kernels" criteria does not seem very important in practice, even if is detailed in each benchmark I read. With a MISE difference never larger than 20% depending on the kernel, this sounds like analysing a detail. However, some implementations allow only 1 kernel while OpenTURNS provide them all (since getRoughness is now implemented)- a fact which is rarely mentioned.
Looking for speed might be important in some cases (e.g. computing minimum volume level sets of a kernel density estimate). However, this might overlook other interesting topics:
Confidence intervals and bands are presented in Müller M., XploRe — Learning Guide, Springer, Berlin, Heidelberg, 2000, chapter 6, page 181. This allows one to compute, e.g., a pointwise confidence interval (see the sketch below). A related topic is kernel regression.
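To illustrate both points (an arbitrary kernel thanks to getRoughness, and a pointwise confidence interval), here is a rough sketch; the interval uses the textbook asymptotic variance Var f_hat(x) ~ f(x) R(K) / (n h), ignores the bias term, and the sample, kernel and evaluation point are only illustrative:

```python
import numpy as np
import openturns as ot

sample = ot.Normal().getSample(1000)
n = sample.getSize()

# Any 1-d distribution can be used as a kernel; its roughness R(K) = int k(u)^2 du
# is available through getRoughness().
kernel = ot.Triangular(-1.0, 0.0, 1.0)
factory = ot.KernelSmoothing(kernel)
fitted = factory.build(sample)
h = factory.getBandwidth()[0]
roughness = kernel.getRoughness()

# Asymptotic pointwise 95% confidence interval at x, ignoring the bias term:
#   Var f_hat(x) ~ f(x) * R(K) / (n * h)
x = 0.5
f_hat = fitted.computePDF([x])
half_width = 1.96 * np.sqrt(f_hat * roughness / (n * h))
print(f"f_hat({x}) = {f_hat:.4f} +/- {half_width:.4f}")
```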
I added OT to the benchmark:
The results are interesting:
When binning kicks in, the timing stops increasing, then increases again for very large sample sizes. I suggested that the author include OT in the benchmark: https://github.com/tommyod/KDEpy/pull/52
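To see this effect directly in OT, one can time the build with and without binning. The sketch below assumes the (kernel, binned, binNumber, boundaryCorrection) form of the KernelSmoothing constructor, which may differ between versions:

```python
import time
import openturns as ot

for n in [10_000, 50_000, 200_000]:
    sample = ot.Normal().getSample(n)
    for binned in [False, True]:
        # Assumed constructor: KernelSmoothing(kernel, binned, binNumber, boundaryCorrection)
        factory = ot.KernelSmoothing(ot.Normal(), binned, 1024, False)
        t0 = time.time()
        factory.build(sample)
        print(f"n={n}, binned={binned}: {time.time() - t0:.3f} s")
```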
The `computePluginBandwidth` algorithm: http://openturns.github.io/openturns/master/theory/data_analysis/kernel_smoothing.html
This is the algorithm from [^Sheather1991]. It is described in [^WandJones1994], page 74, under the name "solve-the-equation rule". The implementation uses ideas from [^Raykar2006], but the fast selection is not implemented.
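For reference, the selector is exposed on the factory itself; a small sketch, with an arbitrary sample:

```python
import openturns as ot

sample = ot.Normal().getSample(500)
factory = ot.KernelSmoothing()
# Sheather-Jones "solve-the-equation" plug-in rule (the non-fast version discussed above)
h_plugin = factory.computePluginBandwidth(sample)
# Silverman-type rule, for comparison
h_silverman = factory.computeSilvermanBandwidth(sample)
print(h_plugin, h_silverman)
```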
`computeSilvermanBandwidth` does not implement Silverman's rule. To see this, I implemented Silverman's rule and its robust variant (a sketch of such a comparison follows below). The difference is easiest to see with a very small sample, n=10, generated from a Normal distribution in dimension d=1.
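Here is a sketch of such a comparison, using the textbook formulas h = 1.06 * sigma * n^(-1/5) (classic rule) and h = 0.9 * min(sigma, IQR/1.34) * n^(-1/5) (robust variant); the seed and sample are arbitrary:

```python
import numpy as np
import openturns as ot

ot.RandomGenerator.SetSeed(0)
sample = ot.Normal().getSample(10)
x = np.array(sample).ravel()
n = x.size

sigma = x.std(ddof=1)
iqr = np.subtract(*np.percentile(x, [75, 25]))

h_classic = 1.06 * sigma * n ** (-1.0 / 5.0)                  # classic rule of thumb
h_robust = 0.9 * min(sigma, iqr / 1.34) * n ** (-1.0 / 5.0)   # robust variant
h_ot = ot.KernelSmoothing().computeSilvermanBandwidth(sample)[0]

print(f"classic={h_classic:.4f}, robust={h_robust:.4f}, OT={h_ot:.4f}")
```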
This might not make a big difference in practice. However, the implementation does not match the name. Moreover, the theory page presents this method for dimension d=1, which lets the user think that this method is used when d=1. One option is to rename the method `computeBandwidth`, which would better fit its content. Another option would be to implement the special case d=1.

`computeSilvermanBandwidth` always uses a robust estimate of the bandwidth. This does not match the histogram counterpart, where the robust flag (which is enabled by default) can be disabled.

[^WandJones1994]: M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman and Hall/CRC, 1994.
[^Sheather1991]: S. J. Sheather and M. C. Jones. A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B (Methodological), 53(3):683–690, 1991.
[^Raykar2006]: V. C. Raykar and R. Duraiswami. Very fast optimal bandwidth selection for univariate kernel density estimation. 2006.