haskell / statistics

A fast, high quality library for computing with statistics in Haskell.
http://hackage.haskell.org/package/statistics
BSD 2-Clause "Simplified" License
299 stars 67 forks source link

Suboptimal bandwidth estimation in Statistics.KernelDensity #4

Closed Shimuuar closed 13 years ago

Shimuuar commented 13 years ago

Function bandwidth use naive method of bandwidth estimation which provides reasonable results for unimodal distributions but systematically oversmoothes for distributions which significantly differ from normal especially bimodal ones. There are number of better performing method. Overview is given in paper "Brief survey of bandwidth selection for density estimation"

Attachments illustrate point. Black line is true distribution and red one is kernel density estimate.

P.S. Personally I don't need this at moment.

Migrated from bitbucket: https://bitbucket.org/bos/statistics/issue/3

bos commented 13 years ago

I'll implement that technique when I have time, as I could benefit from it.

Shimuuar commented 13 years ago

There is another use case which is not covered so far. If distribution have finite support (for example cosine must lie in [-1,1] range) then estimated PDF will spill some probability over the boundary. One way to deal with this situation is to reflect data around boundary then spilled probability will be compensated by spill from the other side. [1] discuss this briefly in §1.2.3

[1] http://arxiv.org/abs/hep-ex/0011057

Also it might be a good idea to made kernel a type class. Estimates require kernel function, its variance and more. Since all these functions and values are monomorphic they have to be wrapped in newtypes.

newtype KDE kernel a = KDE a
class Kernel k where
  kernel :: KDE (Double → Double)
  kernelVariance :: KDE Double
  ...

Thus kernel-specific calculations could be separated from algorithms. And just as important it could remove this paragraph. There are good chances that programmer will either skip it entirely or forget about it.

If you are using a Gaussian kernel, multiply the sample's bandwidth
by 3 before passing it to this function.
bos commented 13 years ago

I've implemented Botev's algorithm, and replaced the old KDE code in c9c3531a553733ca257b5ca1bebd205ee65a4a12.