I would like to help clean up the code related to kernel density estimation and, if possible, also improve it. Is this of any interest, @josef-pkt? If so, some help getting started would be great!

General observations

Some general observations.

The statsmodels/nonparametric package has not gotten much attention recently.
There is a statsmodels/sandbox/nonparametric package too, which has also received little attention.
According to this 2013 blog post by Vanderplas, the statsmodels KDEs are slower than scipy/sklearn.
No popular/maintained Python package appears to implement variable kernel density estimation yet. This seems like a great opportunity.
The general quality of the code in statsmodels/nonparametric is mixed. Some modules appear OK, but some appear unused. There are many PEP8 violations. There are 22 TODOs in the code, which might indicate starting points for improvements (and 11 more in statsmodels/sandbox/nonparametric).

Suggested improvements

I would like to help improve the quality of code associated with KDE. My ideas are, in order of least-to-most ambitious:

Remove files which are not used (it has been years; if something has not been completed by now, it probably never will be).
Read over the files that are in use. Conform to PEP8, write more comments. Refactor the code if needed. Refactor tests if needed.
Increase speed so that we match the sklearn/scipy implementations.
Read up on the literature and implement more advanced KDE algorithms (e.g. variable bandwidth).
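
To make the last item a bit more concrete, here is a minimal sketch of one variable-bandwidth approach: a fixed-bandwidth pilot estimate sets a local bandwidth per data point, so sparse regions are smoothed more than dense ones. This is not statsmodels code. The function names, the exponent `alpha=0.5` (the common square-root rule) and the normal-reference choice of `h` are my own assumptions for illustration.

```python
import numpy as np


def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)


def fixed_kde(x_eval, data, h):
    """Ordinary KDE with one global bandwidth h (direct summation)."""
    u = (x_eval[:, None] - data[None, :]) / h
    return gaussian_kernel(u).mean(axis=1) / h


def adaptive_kde(x_eval, data, h, alpha=0.5):
    """Sample-point adaptive KDE: each data point gets its own bandwidth.

    The pilot density at each data point rescales h, so points in
    low-density regions get wider kernels (hypothetical helper, not an
    existing statsmodels/scipy/sklearn function).
    """
    pilot = fixed_kde(data, data, h)          # pilot density at the data
    g = np.exp(np.mean(np.log(pilot)))        # geometric mean of the pilot
    local_h = h * (pilot / g) ** (-alpha)     # one bandwidth per data point
    u = (x_eval[:, None] - data[None, :]) / local_h[None, :]
    return (gaussian_kernel(u) / local_h[None, :]).mean(axis=1)


rng = np.random.default_rng(0)
# A narrow and a wide component: a single global bandwidth fits neither well.
data = np.concatenate([rng.normal(0, 0.3, 300), rng.normal(4, 1.5, 300)])
h = 1.06 * data.std() * data.size ** (-1 / 5)  # normal reference rule
grid = np.linspace(-10, 15, 2001)
density = adaptive_kde(grid, data, h)
```

The direct summation is O(n * m), so this only shows the idea; a fast version would need a different evaluation strategy.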

Questions and advice

General thoughts and comments on the above are greatly appreciated.
The main question: How open are you to these suggested changes in general, @josef-pkt? (or others!)
Is deleting unused files OK? (especially in the sandbox)
If I start implementing some algorithms/code, what is the threshold for uploading my work to the sandbox?

I feel like I should add a few sentences about myself. I've been programming in Python for 3-4 years, have a degree in math, and I am interested in learning more about statistics/numerics/programming through a project of suitable difficulty. Contributing to statsmodels seems like a great place to start. I hope you will be receptive to my ideas and suggestions, and I would appreciate any feedback on the above.

-Tommy

The kernel stuff is difficult. I already lost two contributors because I wasn't able to keep up or decide on design and backwards compatibility decisions.

Essentially, attempts to refactor the current code need considerable backwards compatibility breaks, and the current design is not well suited to providing both slow exact and fast approximate/interpolating results. (I have largely moved away from trying to squeeze everything into a single class.)
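
As a rough illustration of the exact versus approximate/interpolating split, here is a minimal sketch that keeps the two paths as separate functions instead of one class. It is not the current statsmodels design and not the code in any PR; the names `kde_exact` and `make_interpolated_kde` and the defaults `n_grid=512`, `cut=4.0` are assumptions made up for this example.

```python
import numpy as np


def kde_exact(x_eval, data, h):
    """Slow exact path: direct O(n * m) Gaussian kernel sum."""
    u = (np.asarray(x_eval)[:, None] - data[None, :]) / h
    norm = data.size * h * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * u**2).sum(axis=1) / norm


def make_interpolated_kde(data, h, n_grid=512, cut=4.0):
    """Fast approximate path: tabulate once, then interpolate linearly.

    The grid extends `cut` bandwidths past the data; outside the grid the
    density is treated as zero.
    """
    grid = np.linspace(data.min() - cut * h, data.max() + cut * h, n_grid)
    values = kde_exact(grid, data, h)  # one-off cost, paid at build time

    def evaluate(x_eval):
        return np.interp(x_eval, grid, values, left=0.0, right=0.0)

    return evaluate


rng = np.random.default_rng(1)
data = rng.normal(size=500)
h = 0.3
x_new = rng.uniform(-3, 3, size=10_000)

fast = make_interpolated_kde(data, h)
max_abs_err = np.max(np.abs(fast(x_new) - kde_exact(x_new, data, h)))
```

In this sketch the grid values are still computed exactly; only repeated evaluation becomes cheap. A binned or FFT-based step could replace the tabulation without changing the interface of `evaluate`.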

If you would like to work on this beyond "cosmetic" changes, then I would suggest starting with #2318. It can be merged after a rebase and a brief check for 0.10 (after 0.9 is branched). It would get an experimental label for maybe two years, and that time can be used to incorporate changes for things that don't work well enough. We wouldn't have to look out for backwards compatibility with existing code during this time.

#2318 is a huge improvement over the current version, except for maybe MultivariateKDE, and has many enhancements, new methods, more fast paths, ...

It would be great if you could pick this up, because it has been a lingering sore point for quite some time.

The new code does not cover kernel regression. I started some separate work on using binning, similar to the FFT version for KDE, in kernel regression as well: #3492. (I ended up at that PR because of some partially related work on variance and variance function estimation.)
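
For readers unfamiliar with the idea, binning in kernel regression can be sketched roughly as follows: bin the data once, then smooth the per-bin sums of y and the per-bin counts with the same kernel and take the ratio (a binned Nadaraya-Watson estimate). This is only an illustration under my own assumptions and not the code in #3492; the function name, simple (nearest-bin) binning, and the defaults are hypothetical, and `np.convolve` stands in for an FFT-based convolution.

```python
import numpy as np


def binned_nw_regression(x, y, h, n_grid=401, cut=4.0):
    """Binned Nadaraya-Watson regression with a Gaussian kernel.

    Returns the bin centers and the fitted regression values there.
    """
    edges = np.linspace(x.min(), x.max(), n_grid + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    delta = centers[1] - centers[0]

    # Simple binning: number of points and sum of responses per bin.
    counts, _ = np.histogram(x, bins=edges)
    sums, _ = np.histogram(x, bins=edges, weights=y)

    # Kernel weights at multiples of the grid spacing, truncated at
    # cut * h and kept no longer than the grid itself.
    half = min(int(np.ceil(cut * h / delta)), (n_grid - 1) // 2)
    offsets = np.arange(-half, half + 1) * delta
    kernel = np.exp(-0.5 * (offsets / h) ** 2)

    # The same smoothing is applied to numerator and denominator, so the
    # kernel's normalizing constant cancels in the ratio.
    num = np.convolve(sums, kernel, mode="same")
    den = np.convolve(counts, kernel, mode="same")
    return centers, num / den


rng = np.random.default_rng(2)
x = rng.uniform(0.0, 2.0 * np.pi, size=2000)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)
centers, fitted = binned_nw_regression(x, y, h=0.2)
```

The cost after binning depends on the grid size instead of the sample size, which is the same trade-off as in the FFT version for KDE. The estimate is less reliable near the edges of the data, as with any unadjusted Nadaraya-Watson fit.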

REF: Kernel density estimation cleanup and improvement #4220
