hermidalc opened 5 years ago
Yeah, the implementation was not strictly mimicking sklearn's, just meant to be compatible with it. Adding a _get_support_mask method could easily be done via
def _get_support_mask(self):
    check_is_fitted(self, 'support_')
    # np.bool is a deprecated alias; use the builtin bool dtype
    mask = np.zeros((self._n_features,), dtype=bool)
    # list to avoid IndexError in old NumPy versions
    mask[list(self.support_)] = True
    return mask
It could also inherit from sklearn.feature_selection.base.SelectorMixin, probably without breaking anything. Do you know, though, what currently missing functionality SelectorMixin would bring?
SelectorMixin exposes the get_support() method, which calls _get_support_mask(), so that is needed. Other than that, I guess you don't need SelectorMixin's inverse_transform() method. Though IMO, in general you should try to use as much of the sklearn API and design as possible instead of grabbing and modifying only what you need.
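To make the get_support()/_get_support_mask() relationship concrete, here is a minimal pure-NumPy sketch (not sklearn's actual implementation; the Mini* class names are invented for illustration) of how the mixin turns the boolean mask into integer indices:

```python
import numpy as np

class MiniSelectorMixin:
    # Sketch of SelectorMixin.get_support(): it is a thin wrapper that
    # delegates to the subclass's _get_support_mask().
    def get_support(self, indices=False):
        mask = self._get_support_mask()
        return mask if not indices else np.where(mask)[0]

class MiniSelector(MiniSelectorMixin):
    def __init__(self, support, n_features):
        self.support_ = support
        self._n_features = n_features

    def _get_support_mask(self):
        # Build a boolean mask with True at the selected feature indices
        mask = np.zeros(self._n_features, dtype=bool)
        mask[list(self.support_)] = True
        return mask

sel = MiniSelector(support=[0, 2], n_features=4)
print(sel.get_support())              # [ True False  True False]
print(sel.get_support(indices=True))  # [0 2]
```

So a class only has to implement _get_support_mask(); both user-facing forms (mask and indices) then come for free from the mixin.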
Here's what I wrote for ColumnSelector() locally today. It doesn't support pandas DataFrames like yours does, since I didn't need that for now, but I will look into merging how your transform() method works with the SelectorMixin transform() method, and also write an inverse_transform(). If you want, I will then make a pull request.
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.feature_selection.base import SelectorMixin
from sklearn.utils import check_X_y
from sklearn.utils.validation import check_is_fitted


class ColumnSelector(BaseEstimator, SelectorMixin):
    """Manual column feature selector

    Parameters
    ----------
    cols : array-like (default=None)
        A list specifying the feature indices to be selected.
    """
    def __init__(self, cols=None):
        self.cols = cols

    def fit(self, X, y):
        """
        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            The training input samples.

        y : array-like, shape = [n_samples]
            The target values (class labels in classification, real numbers
            in regression).

        Returns
        -------
        self : object
            Returns self.
        """
        X, y = check_X_y(X, y)
        self._check_params(X, y)
        self._n_features = X.shape[1]
        return self

    def _check_params(self, X, y):
        if self.cols is not None:
            for col in self.cols:
                # valid indices are 0 <= col < n_features
                if not 0 <= col < X.shape[1]:
                    raise ValueError(
                        "cols should be 0 <= col < n_features; got %r. "
                        "Use cols=None to return all features." % col
                    )

    def _get_support_mask(self):
        check_is_fitted(self, '_n_features')
        if self.cols is None:
            mask = np.ones(self._n_features, dtype=bool)
        else:
            mask = np.zeros(self._n_features, dtype=bool)
            mask[list(self.cols)] = True
        return mask
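As a usage sketch, here is a compact variant of the class above together with the methods SelectorMixin contributes. Note the import fallback is an assumption for newer scikit-learn releases, where SelectorMixin was moved from sklearn.feature_selection.base to the package top level:

```python
import numpy as np
from sklearn.base import BaseEstimator
try:
    # newer scikit-learn: SelectorMixin lives at the package top level
    from sklearn.feature_selection import SelectorMixin
except ImportError:
    from sklearn.feature_selection.base import SelectorMixin
from sklearn.utils import check_X_y


class ColumnSelector(BaseEstimator, SelectorMixin):
    """Compact column selector: only fit() and _get_support_mask()
    are implemented; get_support() and transform() come from the mixin."""
    def __init__(self, cols=None):
        self.cols = cols

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        self._n_features = X.shape[1]
        return self

    def _get_support_mask(self):
        mask = np.zeros(self._n_features, dtype=bool)
        if self.cols is None:
            mask[:] = True
        else:
            mask[list(self.cols)] = True
        return mask


X = np.arange(12.0).reshape(4, 3)
y = np.array([0, 1, 0, 1])
sel = ColumnSelector(cols=[0, 2]).fit(X, y)

print(sel.get_support())              # boolean mask from the mixin
print(sel.get_support(indices=True))  # integer indices
print(sel.transform(X).shape)         # (4, 2): columns 0 and 2 kept
```

The point of inheriting the mixin is exactly this: transform() and get_support() don't need to be reimplemented, only the mask.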
> Though IMO in general I think you should try to use as much from sklearn API and design as possible instead of grabbing and modifying only what you need.
Yeah. It was initially intentional to keep it as simple as possible, which makes testing easier; additional functionality and methods would then be added over time.
I think in this case it would be fine to add both the SelectorMixin (for the get_support method) as well as _get_support_mask for both the SequentialFeatureSelector and the ColumnSelector, and I would welcome PRs.
Btw, do you know use cases for the get_support method via the scikit-learn API, e.g., a transformer or so that uses it? Because then it would be good to add this as a unit test.
> Btw, do you know use cases for the get_support method via the scikit-learn API, e.g., a transformer or so that uses it? Because then it would be good to add this as a unit test.
get_support is an essential method used in user space; every sklearn feature selection method provides it. You use it to get the mask, or integer indices, of the features selected from your starting X pandas DataFrame or NumPy array.
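For example, every built-in sklearn selector exposes it; VarianceThreshold is one of the simplest to demonstrate with:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# First column is constant, so the default threshold of 0.0 drops it
X = np.array([[0.0, 1.0, 2.0],
              [0.0, 3.0, 4.0],
              [0.0, 5.0, 6.0]])

sel = VarianceThreshold().fit(X)
print(sel.get_support())              # [False  True  True]
print(sel.get_support(indices=True))  # [1 2]
print(sel.transform(X).shape)         # (3, 2)
```

The mask form lines up with the original columns, while indices=True gives you the positions directly, e.g. for slicing a DataFrame's columns.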
Yeah, sure. I was more specifically wondering about scenarios where you need the mask over the feature indices directly. I.e., if you have a feature selection transformer in a pipeline, it wouldn't matter how the transform method is implemented. But I was wondering if there is a specific class in scikit-learn that relies on get_support, so that we could add that to the unit tests to check the feature selector's general compatibility.
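One generic check that could go into such a unit test (a sketch; check_get_support_contract is a hypothetical helper name, not an existing sklearn test) is that transform() output must equal indexing X with get_support(), in both mask and index form; that is essentially the contract SelectorMixin guarantees:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

def check_get_support_contract(selector, X, y):
    """Fit the selector, then verify transform() agrees with both the
    boolean mask and the integer indices returned by get_support()."""
    selector.fit(X, y)
    mask = selector.get_support()
    idx = selector.get_support(indices=True)
    Xt = selector.transform(X)
    assert np.array_equal(Xt, X[:, mask])
    assert np.array_equal(Xt, X[:, idx])
    return True

rng = np.random.RandomState(0)
X = rng.rand(20, 5)
y = rng.randint(0, 2, size=20)
print(check_get_support_contract(SelectKBest(f_classif, k=2), X, y))
```

Running the same helper over any SelectorMixin subclass (including ColumnSelector above) would cover the compatibility question without depending on one specific sklearn consumer of get_support.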
mlxtend feature selection methods do not follow the design pattern of sklearn feature selection methods: they should also inherit from sklearn.feature_selection.base.SelectorMixin and define a _get_support_mask method. Not doing this causes issues on the user side where this is expected.