scikit-learn-contrib / boruta_py

Python implementations of the Boruta all-relevant feature selection method.
BSD 3-Clause "New" or "Revised" License

No n_features_to_select parameter #92

Open bgalvao opened 3 years ago

bgalvao commented 3 years ago

Although I understand that Boruta is, by design, an all-relevant feature selection method, it would be nice to have the option to select a specified number of features.

As of right now, BorutaPy only reports ranks 1 through 3 (confirmed, tentative, rejected).

I am thinking of looking through the statistical tests and returning a ranking by p-value. If you like this issue and have a clear idea of how to implement it, let me know.
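As a rough sketch of what a p-value ranking could look like: Boruta's statistical test is a one-sided binomial test on how often each feature beats the best shadow feature across iterations. Given those hit counts, you can compute a p-value per feature and sort. The hit counts below are made up for illustration, and this is only a minimal stdlib sketch of the idea, not BorutaPy's actual internals:

```python
from math import comb

def boruta_pvalues(hits, n_trials):
    """One-sided binomial p-value P(X >= hits) under p = 0.5,
    i.e. the chance of beating the best shadow feature this often
    if the feature were irrelevant."""
    return [
        sum(comb(n_trials, k) for k in range(h, n_trials + 1)) / 2 ** n_trials
        for h in hits
    ]

def rank_by_pvalue(hits, n_trials):
    """Rank features 1..n by ascending p-value (1 = most relevant)."""
    pvals = boruta_pvalues(hits, n_trials)
    order = sorted(range(len(hits)), key=lambda i: pvals[i])
    ranks = [0] * len(hits)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks, pvals

# Hypothetical hit counts over 20 Boruta iterations:
ranks, pvals = rank_by_pvalue([20, 10, 2], 20)
```

With a continuous ranking like this, an `n_features_to_select` parameter would just take the top-k features by p-value instead of thresholding at confirmed/tentative.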

I am trying to work on it on my fork.

DreHar commented 3 years ago

I know this doesn't directly answer your question. When I want to minimize the number of features, I often run a feature-reduction step after the all-relevant selection: forward or backward stepwise feature elimination, depending on whether you want to keep only a few features or drop only a few, respectively. I have also found that some simulated annealing helps a lot in practice.

This might help in practice because highly correlated features will all get similar p-values, so a pure p-value cut might throw out features that are less statistically significant on their own but carry more orthogonal information.
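The two-stage approach described above can be sketched with scikit-learn's `RFE` (recursive backward elimination) applied to the columns Boruta kept. The `confirmed` index list here is a hypothetical Boruta result on synthetic data, just to show the wiring:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data standing in for a real problem.
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)

# Pretend these are the columns Boruta confirmed (hypothetical result).
confirmed = [0, 1, 2, 3, 4, 5]

# Stage 2: backward stepwise elimination down to a fixed budget of 3 features.
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
          n_features_to_select=3)
rfe.fit(X[:, confirmed], y)

# Map the RFE mask back to the original column indices.
selected = [confirmed[i] for i, keep in enumerate(rfe.support_) if keep]
```

Because the second stage drops features one at a time based on model performance, it tends to keep only one representative from a group of correlated features, which is the orthogonality benefit mentioned above.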

Sorry for the tangent but thought it might help