mbaudin47 opened this issue 4 years ago
The doc part is managed in PR #1484. The implementation part remains to be done.
Thanks for the careful reading of the doc. Some additional remarks are in order here:
If an LGPL-compatible implementation of the fast method is available, it could be integrated into OT.
I do not know if this is LGPL, but it claims to be faster:
Thanks for the link. The FFT-based implementation needs the data points to lie on a lattice, which involves a binning procedure: you replace your unevenly distributed points with uniform weights by evenly distributed points with nonuniform weights. This step is already tricky to perform, not only because you have to bin points in dimension d (which is already implemented in OT for d=1 or 2), but also because you have to control the error resulting from this binning, which is precisely one of the advantages claimed by the authors of "Very Fast optimal bandwidth selection for univariate kernel density estimation". In addition to the speed, the comparison should also include an accuracy measure with respect to the naive procedure, which was done graphically here but not in the new comparison. But I agree with you: there must be a way to accelerate things here, in many directions (bandwidth selection, evaluation at a single point, evaluation over a set of points, over the nodes of a regular grid...).
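For reference, here is a minimal sketch of the linear binning step in 1D, written with numpy (this is not OT's implementation, only an illustration of the weight-spreading idea; the grid size is arbitrary):

```python
import numpy as np

def linear_binning_1d(x, n_bins=512):
    """Map unevenly spaced points with uniform weights onto a regular grid
    of n_bins nodes with nonuniform weights (the total weight is preserved)."""
    xmin, xmax = x.min(), x.max()
    delta = (xmax - xmin) / (n_bins - 1)   # grid spacing
    pos = (x - xmin) / delta               # fractional grid coordinates
    left = np.floor(pos).astype(int)       # index of the left neighbouring node
    frac = pos - left                      # distance to the left node, in [0, 1)
    weights = np.zeros(n_bins)
    np.add.at(weights, left, (1.0 - frac) / x.size)                      # share given to the left node
    np.add.at(weights, np.minimum(left + 1, n_bins - 1), frac / x.size)  # share given to the right node
    grid = xmin + delta * np.arange(n_bins)
    return grid, weights

rng = np.random.default_rng(0)
grid, weights = linear_binning_1d(rng.normal(size=10_000))
print(weights.sum())  # ~1.0: the binning preserves the total mass
```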
I have many doubts about the criteria used in the benchmark:
The "number of kernels" criteria does not seem very important in practice, even if is detailed in each benchmark I read. With a MISE difference never larger than 20% depending on the kernel, this sounds like analysing a detail. However, some implementations allow only 1 kernel while OpenTURNS provide them all (since getRoughness is now implemented)- a fact which is rarely mentioned.
Looking for speed might be important in some cases (e.g. computing minimum volume level sets of a kernel density estimate). However, this might overlook other interesting topics:
Confidence intervals and bands are presented in Müller M., XploRe — Learning Guide, Springer, Berlin, Heidelberg, 2000, chapter 6, page 181. This allows one to compute, e.g., a pointwise confidence interval (see the sketch below). A related topic is kernel regression.
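To illustrate both points (an arbitrary kernel thanks to getRoughness, and a pointwise confidence interval), here is a rough sketch; the interval uses the textbook asymptotic variance Var f_hat(x) ~ f(x) R(K) / (n h), ignores the bias term, and the sample, kernel and evaluation point are only illustrative:

```python
import numpy as np
import openturns as ot

sample = ot.Normal().getSample(1000)
n = sample.getSize()

# Any 1-d distribution can be used as a kernel; its roughness R(K) = int k(u)^2 du
# is available through getRoughness().
kernel = ot.Triangular(-1.0, 0.0, 1.0)
factory = ot.KernelSmoothing(kernel)
fitted = factory.build(sample)
h = factory.getBandwidth()[0]
roughness = kernel.getRoughness()

# Asymptotic pointwise 95% confidence interval at x, ignoring the bias term:
#   Var f_hat(x) ~ f(x) * R(K) / (n * h)
x = 0.5
f_hat = fitted.computePDF([x])
half_width = 1.96 * np.sqrt(f_hat * roughness / (n * h))
print(f"f_hat({x}) = {f_hat:.4f} +/- {half_width:.4f}")
```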
I added OT to the benchmark:
The results are interesting:
When binning kicks in, the timing stops increasing, then increases again for very large sample sizes. I suggested that the author include OT in the benchmark: https://github.com/tommyod/KDEpy/pull/52
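To see this effect directly in OT, one can time the build with and without binning. The sketch below assumes the (kernel, binned, binNumber, boundaryCorrection) form of the KernelSmoothing constructor, which may differ between versions:

```python
import time
import openturns as ot

for n in [10_000, 50_000, 200_000]:
    sample = ot.Normal().getSample(n)
    for binned in [False, True]:
        # Assumed constructor: KernelSmoothing(kernel, binned, binNumber, boundaryCorrection)
        factory = ot.KernelSmoothing(ot.Normal(), binned, 1024, False)
        t0 = time.time()
        factory.build(sample)
        print(f"n={n}, binned={binned}: {time.time() - t0:.3f} s")
```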
The `computePluginBandwidth` algorithm: http://openturns.github.io/openturns/master/theory/data_analysis/kernel_smoothing.html
This is the algorithm from [^Sheather1991]. It is described in [^WandJones1994], page 74, under the name "solve-the-equation rule". The implementation uses ideas from [^Raykar2006], but the fast selection is not implemented.
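For reference, the selector is exposed on the factory itself; a small sketch, with an arbitrary sample:

```python
import openturns as ot

sample = ot.Normal().getSample(500)
factory = ot.KernelSmoothing()
# Sheather-Jones "solve-the-equation" plug-in rule (the non-fast version discussed above)
h_plugin = factory.computePluginBandwidth(sample)
# Silverman-type rule, for comparison
h_silverman = factory.computeSilvermanBandwidth(sample)
print(h_plugin, h_silverman)
```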
`computeSilvermanBandwidth` does not implement Silverman's rule. To see this, I implemented Silverman's rule and its robust variant (a sketch of such a comparison follows below). The difference is easiest to see with a very small sample, n=10, generated from a Normal distribution in dimension d=1.
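Here is a sketch of such a comparison, using the textbook formulas h = 1.06 * sigma * n^(-1/5) (classic rule) and h = 0.9 * min(sigma, IQR/1.34) * n^(-1/5) (robust variant); the seed and sample are arbitrary:

```python
import numpy as np
import openturns as ot

ot.RandomGenerator.SetSeed(0)
sample = ot.Normal().getSample(10)
x = np.array(sample).ravel()
n = x.size

sigma = x.std(ddof=1)
iqr = np.subtract(*np.percentile(x, [75, 25]))

h_classic = 1.06 * sigma * n ** (-1.0 / 5.0)                  # classic rule of thumb
h_robust = 0.9 * min(sigma, iqr / 1.34) * n ** (-1.0 / 5.0)   # robust variant
h_ot = ot.KernelSmoothing().computeSilvermanBandwidth(sample)[0]

print(f"classic={h_classic:.4f}, robust={h_robust:.4f}, OT={h_ot:.4f}")
```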
This might not make a big difference in practice. However, the implementation does not match the name. Moreover, the theory page presents this method for dimension d=1, which lets the user think that this method is used when d=1. One option is to rename the method `computeBandwidth`, which would better fit its content. Another option would be to implement the special case d=1.

`computeSilvermanBandwidth` always uses a robust estimate of the bandwidth. This does not match the histogram counterpart, where the robust flag (which is enabled by default) can be disabled.

[^WandJones1994]: M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman and Hall/CRC, 1994.
[^Sheather1991]: S. J. Sheather and M. C. Jones. A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, Series B (Methodological), 53(3):683–690, 1991.
[^Raykar2006]: V. C. Raykar and R. Duraiswami. Very fast optimal bandwidth selection for univariate kernel density estimation. 2006.