scikit-learn-contrib / boruta_py

Python implementations of the Boruta all-relevant feature selection method.
BSD 3-Clause "New" or "Revised" License

[BUG/FEATURE] Categorical Features in LightGBM Fail #81

Closed rohan-gt closed 4 years ago

rohan-gt commented 4 years ago

In the following code example, the input pandas DataFrame has categorical columns stored with the category dtype:

from boruta import BorutaPy
from lightgbm import LGBMRegressor

# define Boruta feature selection method
lgbm = LGBMRegressor(n_jobs=-1)
feat_selector = BorutaPy(lgbm, n_estimators='auto', verbose=2, random_state=1)

# find all relevant features - 5 features should be selected
feat_selector.fit(X, y)

It throws the following error. Is there a way to add support for categorical variables?

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-46-58193b453da2> in <module>()
    25 
    26 # find all relevant features - 5 features should be selected
---> 27 feat_selector.fit(df_data_x, feu.df_data_y)
    28 
    29 # check selected features - first 5 features are selected

4 frames
/usr/local/lib/python3.6/dist-packages/boruta/boruta_py.py in fit(self, X, y)
    199         """
    200 
--> 201         return self._fit(X, y)
    202 
    203     def transform(self, X, weak=False):

/usr/local/lib/python3.6/dist-packages/boruta/boruta_py.py in _fit(self, X, y)
    249     def _fit(self, X, y):
    250         # check input params
--> 251         self._check_params(X, y)
    252         self.random_state = check_random_state(self.random_state)
    253         # setup variables for Boruta

/usr/local/lib/python3.6/dist-packages/boruta/boruta_py.py in _check_params(self, X, y)
    515         """
    516         # check X and y are consistent len, X is Array and y is column
--> 517         X, y = check_X_y(X, y)
    518         if self.perc <= 0 or self.perc > 100:
    519             raise ValueError('The percentile should be between 0 and 100.')

/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    753                     ensure_min_features=ensure_min_features,
    754                     warn_on_dtype=warn_on_dtype,
--> 755                     estimator=estimator)
    756     if multi_output:
    757         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    569         # make sure we actually converted to numeric:
    570         if dtype_numeric and array.dtype.kind == "O":
--> 571             array = array.astype(np.float64)
    572         if not allow_nd and array.ndim >= 3:
    573             raise ValueError("Found array with dim %d. %s expected <= 2."

ValueError: could not convert string to float: 'Autauga'

danielhomola commented 4 years ago

You need to preprocess your categorical data before passing it into Boruta, just as you would before using any other scikit-learn functionality. Strings are not numbers, and ML algorithms work on numbers.
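
A minimal sketch of that kind of preprocessing, assuming X is a pandas DataFrame whose categorical columns use the category dtype (the encode_categories helper below is hypothetical, not part of boruta_py): replace each categorical column with its integer codes so check_X_y can convert the frame to floats.

import pandas as pd

def encode_categories(X):
    # hypothetical helper: replace each `category` column with its integer codes
    X = X.copy()
    for col in X.select_dtypes(include="category").columns:
        X[col] = X[col].cat.codes  # missing values become -1
    return X

# BorutaPy can now validate and fit on a purely numeric array
feat_selector.fit(encode_categories(X).values, y)

One-hot encoding (e.g. pd.get_dummies) is an alternative if integer codes are not a sensible representation for your estimator.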

rohan-gt commented 3 years ago

@danielhomola True, but LightGBM natively supports category-type variables, so this would be a useful feature to have.
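
For context, a sketch of the behaviour being referred to, assuming the same DataFrame X with category-dtype columns: LightGBM's scikit-learn wrapper handles those columns natively when fitted directly, whereas BorutaPy's check_X_y call first converts X to a float array, which is where the conversion fails.

from lightgbm import LGBMRegressor

lgbm = LGBMRegressor(n_jobs=-1)

# works: LightGBM detects pandas `category` columns and uses categorical splits
lgbm.fit(X, y)

# fails: BorutaPy calls check_X_y(X, y), which attempts astype(np.float64) on the strings
# BorutaPy(lgbm, n_estimators='auto').fit(X, y)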

yehjames commented 3 years ago

Same situation here. Hoping support for LightGBM's categorical features gets added.