scikit-learn-contrib / boruta_py

Python implementations of the Boruta all-relevant feature selection method.
BSD 3-Clause "New" or "Revised" License

ENH: Add option for minimum number of confirmed features #96

Open MarkPundurs opened 3 years ago

MarkPundurs commented 3 years ago

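While running sklearn.model_selection.GridSearchCV on a BorutaPy-based estimator (code below), I got the non-fatal error ValueError: Found array with 0 feature(s) (shape=(694, 0)) while a minimum of 1 is required (full error message below). Judging by the traceback, BorutaPy confirmed no features on that train-test partition, so transform() returned an array with zero columns and the wrapped estimator could not be fit. The grid search keeps running, but the score for that partition is set to nan. In this context, it would be useful to have an option telling BorutaPy() to select at least some nonzero minimum number of features.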

Full error message

Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "C:\Users\pundumx\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "<ipython-input-315-84e5437b8711>", line 19, in fit
    self.estimator_.fit(X_filt, y)
  File "C:\Users\pundumx\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py", line 304, in fit
    accept_sparse="csc", dtype=DTYPE)
  File "C:\Users\pundumx\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\base.py", line 432, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "C:\Users\pundumx\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 72, in inner_f
    return f(**kwargs)
  File "C:\Users\pundumx\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 802, in check_X_y
    estimator=estimator)
  File "C:\Users\pundumx\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 72, in inner_f
    return f(**kwargs)
  File "C:\Users\pundumx\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 661, in check_array
    context))
ValueError: Found array with 0 feature(s) (shape=(694, 0)) while a minimum of 1 is required.

Code for grid search

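from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.base import BaseEstimator, clone
# newer scikit-learn releases replace if_delegate_has_method with available_if
from sklearn.utils.metaestimators import if_delegate_has_method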

# estimator class based on example at https://scikit-learn.org/0.23/auto_examples/cluster/plot_inductive_clustering.html
class BorutaPy_Estimator(BaseEstimator):
    def __init__(self, estimator, n_estimators=1000, perc=0):
        self.estimator = estimator
        self.n_estimators = n_estimators
        self.perc = perc

    def fit(self, X, y):
        self.estimator_ = clone(self.estimator)
        self.feat_selector = BorutaPy(self.estimator_, n_estimators=self.n_estimators, random_state=1, perc=self.perc)
        self.feat_selector.fit(X, y)
        X_filt = self.feat_selector.transform(X)
        self.estimator_.fit(X_filt, y)
        return self

    @if_delegate_has_method(delegate='estimator_')
    def predict_proba(self, X):
        return self.estimator_.predict_proba(self.feat_selector.transform(X))

    @if_delegate_has_method(delegate='estimator_')
    def predict(self, X):
        return self.estimator_.predict(self.feat_selector.transform(X))

rf = RandomForestClassifier(n_jobs=-1, ccp_alpha=0.000005, max_features=0.05)
feat_selector = BorutaPy_Estimator(rf, n_estimators=1000)
param_grid = {'perc': [100, 90, 80, 70, 60]}
gs = GridSearchCV(feat_selector, param_grid, scoring='accuracy')
# X_train and y_train are the training data (not shown in this report)
gs.fit(X_train, y_train)
gs.cv_results_['mean_test_score'], gs.best_score_, gs.best_params_
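
Until such an option exists, one possible stopgap is to pad the selection inside the wrapper's fit. The sketch below is a hypothetical helper, not part of BorutaPy: min_feature_indices and its min_features argument are names made up for illustration. It assumes the fitted selector exposes a boolean support_ mask and an integer ranking_ array in which lower values mean more relevant features, and it falls back to the best-ranked unconfirmed columns when fewer than min_features were confirmed.

```python
def min_feature_indices(support, ranking, min_features=1):
    """Return sorted column indices containing at least `min_features` entries.

    support: boolean sequence, True for confirmed features
             (assumed to come from a fitted selector's support_).
    ranking: integer sequence, lower means more relevant
             (assumed to come from a fitted selector's ranking_).
    Hypothetical helper for illustration; not part of BorutaPy.
    """
    confirmed = [i for i, keep in enumerate(support) if keep]
    if len(confirmed) >= min_features:
        return confirmed

    # Not enough confirmed features: fill up with the best-ranked
    # unconfirmed ones (ties are broken by column order).
    unconfirmed = sorted(
        (i for i, keep in enumerate(support) if not keep),
        key=lambda i: ranking[i],
    )
    missing = min_features - len(confirmed)
    return sorted(confirmed + unconfirmed[:missing])


# Example: nothing confirmed, so the two best-ranked columns are kept.
print(min_feature_indices([False, False, False, False], [3, 2, 5, 4], 2))
```

In the wrapper above, X_filt = X[:, min_feature_indices(self.feat_selector.support_, self.feat_selector.ranking_)] could stand in for the transform call (with the same indexing in predict and predict_proba), assuming X is a NumPy array. Whether padding with unconfirmed features is statistically sensible is a separate question; the sketch only keeps the grid search from producing nan scores.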