rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Feature request for sequential feature selector: random subsets at each step #47

Open sergeyf opened 8 years ago

sergeyf commented 8 years ago

Hello,

First of all: thanks for the great package! I have gotten a lot of good use out of it, especially the sequential feature selection.

SFS becomes problematic as the number of features d increases, since the complexity grows as O(d^2). I have found that one way to deal with this is to take a random subset of the remaining dimensions to check at each step instead of trying all of them. If the random subset has size k then the complexity goes down to O(dk).

Take an example of sequential forward selection with d=1000 and k=25.

During the first step, we can either try all 1000 univariate models or pick a random subset of 25 univariate models, and then take the best of them. It makes sense to try them all so as to start with a good baseline.

During the second step, instead of trying 999 bivariate models, we try only 25 of them.

Then 25 instead of 998 trivariate models. And so on until we have 25 left, at which point we revert to trying them all.
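The procedure above can be sketched in a few lines. This is a minimal illustration of the proposed variant, not mlxtend's actual API; `random_subset_sfs` and its parameters are hypothetical names, and scikit-learn is used here only for cross-validated scoring:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def random_subset_sfs(estimator, X, y, n_features, k=25, cv=3, seed=0):
    """Greedy forward selection that scores only a random subset of k
    candidate features at each step (hypothetical helper, not mlxtend API)."""
    rng = np.random.default_rng(seed)
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        # Try every remaining feature on the first step (good baseline) and
        # once k or fewer remain; otherwise sample k candidates at random.
        if not selected or len(remaining) <= k:
            candidates = remaining
        else:
            candidates = rng.choice(remaining, size=k, replace=False).tolist()
        best_score, best_f = -np.inf, None
        for f in candidates:
            cols = selected + [f]
            score = cross_val_score(clone(estimator), X[:, cols], y, cv=cv).mean()
            if score > best_score:
                best_score, best_f = score, f
        selected.append(best_f)
        remaining.remove(best_f)
    return selected


X, y = make_regression(n_samples=100, n_features=50, n_informative=5, random_state=0)
subset = random_subset_sfs(LinearRegression(), X, y, n_features=5, k=10)
```

Each step costs at most k model fits instead of O(d), which is where the O(dk) total comes from.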

If you're interested in some empirical results, I wrote a blog post about this a while back: http://blog.explainmydata.com/2012/07/speeding-up-greedy-feature-selection.html

This would be a great feature to have!

rasbt commented 8 years ago

Hi, Sergey, thanks for the suggestion and I totally agree with you here: SFS is darn expensive (O(d^2)) when it comes to high-dimensional datasets. The random subsampling sounds like a neat idea, and it shouldn't be too hard to add to the existing implementation. I am a bit busy at the moment, but I will add it to my todo list for when I am back from SciPy :)!

Btw. it reminds me a bit of Bergstra & Bengio's paper on random hyperparameter search (Random Search for Hyper-Parameter Optimization; http://www.jmlr.org/papers/v13/bergstra12a.html), i.e., that with 95% probability at least one of 60 random points falls within the interval around the true maximum: if each draw has a 5% chance of landing in that interval, then 1 - (1 - 0.05)^n > 0.95 gives n >= 60.
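That back-of-the-envelope bound is easy to sanity-check (the exact solution of the inequality works out to n >= 59; 60 is the round figure usually quoted):

```python
import math

# Probability that at least one of n independent draws, each with a 5%
# chance of landing in the top-5% interval, actually lands there:
#   1 - 0.95**n > 0.95  <=>  n > log(0.05) / log(0.95)
n_min = math.ceil(math.log(0.05) / math.log(0.95))  # smallest n meeting the bound
coverage_60 = 1 - 0.95 ** 60                        # coverage for the quoted n = 60
```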

sergeyf commented 8 years ago

Sounds great! I've been using this in my own mini-implementation of SFS and it seems to work well.