statsmodels / statsmodels

Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License

REF: Kernel density estimation cleanup and improvement #4220

Open tommyod opened 6 years ago

tommyod commented 6 years ago

I would like to help clean up the code related to kernel density estimation and, if possible, improve it. Is this of any interest, @josef-pkt? If so, some help getting started would be great!

General observations

Some general observations.

Suggested improvements

I would like to help improve the quality of code associated with KDE. My ideas are, in order of least-to-most ambitious:

Questions and advice

I feel like I should add a few sentences about myself. I've been programming in Python for 3-4 years, have a degree in math, and am interested in learning more about statistics/numerics/programming through a project of suitable difficulty. Contributing to statsmodels seems like a great place to start. I hope you will be receptive to my ideas and suggestions, and I would appreciate any feedback on the above.

-Tommy

josef-pkt commented 6 years ago

The kernel stuff is difficult. I have already lost two contributors because I wasn't able to keep up or to decide on design and backwards compatibility questions.

Essentially, any attempt to refactor the current code needs considerable backwards compatibility breaks, and the current design is not very good at providing both slow exact and fast approximate/interpolating results. (I have largely moved away from trying to squeeze everything into a single class.)
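To illustrate the exact versus approximate trade-off, here is a minimal pure-Python sketch. It is not statsmodels code: the names `kde_exact` and `kde_binned` are made up for illustration, and a Gaussian kernel with a fixed bandwidth is assumed. The exact version sums the kernel over every observation at every grid point, while the approximate version first bins the data linearly onto the grid and then convolves the bin weights with the kernel.

```python
import math
import random


def gauss(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde_exact(data, grid, bw):
    """Direct evaluation: len(data) * len(grid) kernel calls."""
    n = len(data)
    return [sum(gauss((g - x) / bw) for x in data) / (n * bw) for g in grid]


def kde_binned(data, grid, bw):
    """Approximation on a regular grid: linearly bin the data, then
    convolve the bin weights with the kernel. The convolution is done
    directly here; an FFT could replace it for large grids."""
    n, m = len(data), len(grid)
    lo, step = grid[0], grid[1] - grid[0]
    weights = [0.0] * m
    for x in data:
        pos = (x - lo) / step
        j = int(math.floor(pos))
        frac = pos - j
        # Split each observation between its two neighbouring grid points.
        if 0 <= j < m:
            weights[j] += 1.0 - frac
        if 0 <= j + 1 < m:
            weights[j + 1] += frac
    # Kernel values depend only on the grid offset, so compute them once.
    kern = [gauss(k * step / bw) for k in range(m)]
    return [
        sum(weights[j] * kern[abs(i - j)] for j in range(m)) / (n * bw)
        for i in range(m)
    ]


random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(500)]
grid = [-5.0 + 10.0 * i / 200 for i in range(201)]
bw = 0.3  # arbitrary bandwidth chosen for the illustration

exact = kde_exact(data, grid, bw)
approx = kde_binned(data, grid, bw)
max_abs_err = max(abs(a - b) for a, b in zip(exact, approx))
```

The binned version only returns values on the grid, so evaluating at arbitrary points needs an extra interpolation step. That is one reason the two code paths are awkward to hide behind a single class interface.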

If you would like to work on more than "cosmetic" changes, then I would suggest starting with #2318. It can be merged after a rebase and a brief check once 0.10 is branched (after 0.9). It would get an experimental label for maybe two years, and that time could be used to incorporate changes for things that don't work well enough. We wouldn't have to look out for backwards compatibility with existing code during this time.

#2318 is a huge improvement over the current version, except maybe for MultivariateKDE, and has many enhancements, new methods, more fast paths, ...

It would be great if you could pick this up, because it has been a lingering sore point for quite some time.

The new code does not cover kernel regression. I started some separate work in #3492 on using binning, similar to the fft version for kde, in kernel regression as well. (I ended up at that PR because of some partially related work on variance and variance function estimation.)
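As a rough illustration of what binning in kernel regression can look like, here is a pure-Python sketch of a binned Nadaraya-Watson estimator. This is an assumption about the general technique, not the code in #3492: the names `nw_exact`, `nw_binned` and `linear_bin` are hypothetical, and a Gaussian kernel with a fixed bandwidth on a regular grid is assumed. Both the observation counts and the responses are binned linearly, and the fit is the ratio of the two kernel-smoothed bin sequences.

```python
import math
import random


def gauss(u):
    """Unnormalised Gaussian kernel; the constant cancels in the ratio."""
    return math.exp(-0.5 * u * u)


def nw_exact(x, y, grid, bw):
    """Direct Nadaraya-Watson estimate at each grid point."""
    out = []
    for g in grid:
        w = [gauss((g - xi) / bw) for xi in x]
        out.append(sum(wi * yi for wi, yi in zip(w, y)) / sum(w))
    return out


def linear_bin(x, values, grid):
    """Spread each value over the two grid points surrounding its x."""
    m = len(grid)
    lo, step = grid[0], grid[1] - grid[0]
    binned = [0.0] * m
    for xi, v in zip(x, values):
        pos = (xi - lo) / step
        j = int(math.floor(pos))
        frac = pos - j
        if 0 <= j < m:
            binned[j] += (1.0 - frac) * v
        if 0 <= j + 1 < m:
            binned[j + 1] += frac * v
    return binned


def nw_binned(x, y, grid, bw):
    """Binned approximation: smooth binned sums and binned counts with
    the same kernel, then take their ratio."""
    m = len(grid)
    step = grid[1] - grid[0]
    counts = linear_bin(x, [1.0] * len(x), grid)
    sums = linear_bin(x, y, grid)
    kern = [gauss(k * step / bw) for k in range(m)]
    out = []
    for i in range(m):
        num = sum(sums[j] * kern[abs(i - j)] for j in range(m))
        den = sum(counts[j] * kern[abs(i - j)] for j in range(m))
        out.append(num / den)
    return out


random.seed(1)
x = [random.uniform(0.0, 1.0) for _ in range(400)]
y = [math.sin(2.0 * math.pi * xi) + random.gauss(0.0, 0.2) for xi in x]
grid = [i / 100 for i in range(101)]
bw = 0.05  # arbitrary bandwidth chosen for the illustration

fit_exact = nw_exact(x, y, grid, bw)
fit_binned = nw_binned(x, y, grid, bw)
max_abs_diff = max(abs(a - b) for a, b in zip(fit_exact, fit_binned))
```

After binning, the work per grid point depends on the grid size rather than on the number of observations, and the two convolutions could be replaced by FFTs on larger grids.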