scikit-learn-contrib / boruta_py

Python implementations of the Boruta all-relevant feature selection method.

Make boruta_py suitable for GridSearches #20

Closed MaxBenChrist closed 7 years ago

MaxBenChrist commented 7 years ago

When using boruta_py in a sklearn grid search, the error `object has no attribute 'get_params'` occurs. It would be interesting if one could also optimize the parameters of the Boruta feature selection.

MaxBenChrist commented 7 years ago

To solve this, one needs to implement a .get_params() method.
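A minimal sketch of what that could look like, assuming the standard scikit-learn pattern (the `BorutaLike` class and its parameter names below are illustrative, not boruta_py's actual code):

```python
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical stand-in for BorutaPy. Inheriting from BaseEstimator supplies
# get_params()/set_params() automatically, provided __init__ only stores its
# arguments as attributes of the same name.
class BorutaLike(BaseEstimator, TransformerMixin):
    def __init__(self, estimator=None, n_estimators=1000, perc=100):
        self.estimator = estimator
        self.n_estimators = n_estimators
        self.perc = perc

print(BorutaLike(perc=90).get_params())
# {'estimator': None, 'n_estimators': 1000, 'perc': 90}
```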

danielhomola commented 7 years ago

If you submit a PR for this I'm more than happy to accept it, but I'll be very busy in the coming months unfortunately. Cheers

MaxBenChrist commented 7 years ago

Okay, let's have a deal: I will submit a PR for this and you upload it to PyPI? ;)

danielhomola commented 7 years ago

Alright, deal :) (but only in the 2nd or 3rd week of Jan)

mbq commented 7 years ago

Sorry to interrupt, but this seems like a pretty bad idea. The point of Boruta is that it is an all-relevant method, so it should be optimised for robust selection rather than for the lowest post-selection error (which is what GridSearch optimises). I am afraid that adding such a pursuit will simply degenerate the method into an incredibly inefficient random sampler.

danielhomola commented 7 years ago

Since Miron is the author of the Boruta algorithm, I'll trust him on this one. Unless you can convince him @MaxBenChrist :)

MaxBenChrist commented 7 years ago

First, merry Christmas to you all! :)

Hi @mbq: nice paper + algorithm. Regarding your doubts about using Boruta in a grid search:

Let's say we have a simple pipeline: first Boruta, then a classifier C. A grid search over many different folds optimizes the parameters of Boruta for the best performance of classifier C. Now you fear that Boruta will become a random sampler. I am not sure why.

How else should one determine the parameters of Boruta (e.g. the max_depth of the random forest classifier, or the number of estimators) if not by the final classifier performance? On real-world data sets, we don't know which features are relevant and which are not. This is a general problem: if I have a pipeline with a feature selection algorithm, I have to optimize all the parameters, including those of the feature selection algorithm, at the same time. I can't optimize the parameters of the feature selection step alone because there is no score / loss function for it.
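In sklearn terms, the setup described here would look roughly like the following, assuming boruta_py exposes get_params()/set_params() (which is exactly what this issue asks for); the grid values are illustrative only:

```python
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# First Boruta, then a classifier C, as described above.
pipe = Pipeline([
    ("boruta", BorutaPy(RandomForestClassifier(n_jobs=-1), n_estimators="auto")),
    ("clf", RandomForestClassifier()),
])

# Tune Boruta's parameters jointly with the classifier's; the grid search
# scores every combination by the cross-validated performance of C.
param_grid = {
    "boruta__perc": [90, 100],
    "boruta__estimator__max_depth": [3, 5, 7],
    "clf__n_estimators": [100, 300],
}
search = GridSearchCV(pipe, param_grid, cv=5)
# search.fit(X, y)  # X, y: the training data
```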

mbq commented 7 years ago

Happy holidays!

The problem here is that you assume the classification error is minimal on the relevant set of features, which is false in general (because of redundant features, classifier characteristics, noise, overfitting, etc.). That's why there are two classes of FS methods, "minimal optimal" and "all relevant" (there is a nice paper about this, with Bayesian-net definitions).

It is obviously true that optimising an all-relevant method is basically next to hopeless. Still, you may aim at robustness (like stability of the selection under perturbations of the input set; though it is only an upper bound, since there may be a perfectly stable method which selects junk), use some domain knowledge, or take some generally robust methods and hope for the best. The latter is what Boruta does: for RF classification, max_depth is infinity by design (sic, otherwise this is just some random CART ensemble), the default m is rarely significantly suboptimal, and finally RF is expected to converge with the number of trees, so overshooting n only costs time.
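Read as code, the "generally robust defaults" point amounts to something like this sketch (the concrete numbers are illustrative, not prescribed):

```python
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    max_depth=None,       # unlimited depth: the RF Boruta expects by design
    max_features="sqrt",  # the default m; rarely significantly suboptimal
    n_jobs=-1,
)
# RF converges with the number of trees, so overshooting only costs time;
# 500 is an illustrative, generous value.
boruta = BorutaPy(rf, n_estimators=500)
# boruta.fit(X, y)  # X, y as numpy arrays
```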

MaxBenChrist commented 7 years ago

@mbq The paper is great. I studied it in detail and learned a lot. Thank you for that!

> The problem here is that you assume the classification error is minimal on the relevant set of features, which is false in general (because of redundant features, classifier characteristics, noise, overfitting, etc.).

Actually, I deploy complex machine learning pipelines that contain both "all relevant" and "minimal optimal" or heavily regularized classifiers together. I create a huge number of features and then use multiple layers of filtering/regularization/feature selection.

When you say that

> optimising an all-relevant method is basically next to hopeless.

Are you referring to Corollary 14 from the Nilsson paper here? ("The all-relevant problem requires exhaustive subset search.")

mbq commented 7 years ago

Even then, I think your pipeline would benefit more from Boruta with some sane default params as a filter of irrelevant attributes than from Boruta with parameters tuned to yield the best accuracy (because I think that tuning would mostly degenerate Boruta into returning a few of the most obvious features, or even pure noise); but I may be wrong, with such a meta-meta approach anything is possible.

> Are you referring to Corollary 14 from the Nilsson paper here? ("The all-relevant problem requires exhaustive subset search.")

Well, no; rather that an all-relevant method does not optimise error, thus it is hard to assess how all-relevant some selection really is. Also, Nilsson et al. consider the asymptotic, perfect case where you have near-perfect conditional probability estimates -- the whole Boruta mess is motivated by the fact that this is really hard to achieve in problems that need feature selection.

MaxBenChrist commented 7 years ago

So what is your final position on the .get_params() method? :) Would you merge a PR containing such a method (triggering a warning about the all-relevant issues when used)?
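For concreteness, the warning part could be as simple as this sketch (the class, wording, and placement are illustrative, assuming a BaseEstimator-style get_params()):

```python
import warnings

from sklearn.base import BaseEstimator

class BorutaLike(BaseEstimator):
    """Illustrative stand-in for BorutaPy, not the actual implementation."""

    def __init__(self, perc=100):
        self.perc = perc

    def get_params(self, deep=True):
        # Warn whenever the parameters are inspected for tuning, e.g. by
        # GridSearchCV via clone(): the all-relevant caveat from this thread.
        warnings.warn(
            "Boruta is an all-relevant method; tuning its parameters to "
            "minimise post-selection error may degrade the selection.",
            UserWarning,
        )
        return super().get_params(deep=deep)
```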