rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Feature selection methods should also inherit from sklearn.feature_selection.base.SelectorMixin and define a _get_support_mask method #468

Open hermidalc opened 5 years ago

hermidalc commented 5 years ago

mlxtend's feature selection methods do not follow the design pattern of sklearn's feature selection methods; they should also inherit from sklearn.feature_selection.base.SelectorMixin and define a _get_support_mask method. Not doing this causes issues on the user side, where this behavior is expected.
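
To illustrate the user-side expectation, here is a small sketch (SelectKBest stands in for any built-in sklearn selector; these are the calls users would expect mlxtend's selectors to support as well):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest

    X, y = load_iris(return_X_y=True)
    selector = SelectKBest(k=2).fit(X, y)
    mask = selector.get_support()              # boolean mask over the input features
    idx = selector.get_support(indices=True)   # or integer indices
    assert X[:, mask].shape[1] == 2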

rasbt commented 5 years ago

Yeah, the implementation was not strictly mimicking sklearn's, just being compatible with it. Adding a _get_support_mask method could easily be done via

    def _get_support_mask(self):
        check_is_fitted(self, 'support_')
        mask = np.zeros((self._n_features,), dtype=bool)
        # list to avoid IndexError in old NumPy versions
        mask[list(self.support_)] = True
        return mask

It could also inherit from sklearn.feature_selection.base.SelectorMixin, probably without breaking anything. Do you know, though, what currently missing functionality SelectorMixin would bring?

hermidalc commented 5 years ago

SelectorMixin exposes the get_support() method, which calls _get_support_mask(), so that is needed. Other than that, I guess you don't need SelectorMixin's inverse_transform() method. Though IMO, in general you should try to use as much of the sklearn API and design as possible instead of grabbing and modifying only what you need.
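
For reference, this is roughly what SelectorMixin provides once _get_support_mask() is defined (a simplified sketch of sklearn's logic, not a verbatim copy):

    import numpy as np

    class SelectorMixinSketch:
        def get_support(self, indices=False):
            mask = self._get_support_mask()
            return np.where(mask)[0] if indices else mask

        def transform(self, X):
            # keep only the columns flagged by the mask
            return X[:, self.get_support()]

        def inverse_transform(self, X):
            # scatter the selected columns back to the original width,
            # filling the dropped columns with zeros
            support = self.get_support()
            Xt = np.zeros((X.shape[0], support.size), dtype=X.dtype)
            Xt[:, support] = X
            return Xt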

Here's what I wrote for ColumnSelector() locally today. It doesn't support pandas DataFrames like yours, since I didn't need that for now, but I will look into merging how your transform() method works with SelectorMixin's transform() method, and also write an inverse_transform(). If you want, I will then make a pull request.

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.feature_selection.base import SelectorMixin
from sklearn.utils import check_X_y
from sklearn.utils.validation import check_is_fitted

class ColumnSelector(BaseEstimator, SelectorMixin):
    """Manual column feature selector

    Parameters
    ----------
    cols : array-like (default=None)
        A list specifying the feature indices to be selected.
    """
    def __init__(self, cols=None):
        self.cols = cols

    def fit(self, X, y):
        """
        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            The training input samples.

        y : array-like, shape = [n_samples]
            The target values (class labels in classification, real numbers in
            regression).

        Returns
        ---------
        self : object
            Returns self.
        """
        X, y = check_X_y(X, y)
        self._check_params(X, y)  # validate cols against the input width
        self._n_features = X.shape[1]
        return self

    def _check_params(self, X, y):
        if self.cols is not None:
            for col in self.cols:
                if not 0 <= col < X.shape[1]:
                    raise ValueError(
                        "cols should be 0 <= col < n_features; got %r. "
                        "Use cols=None to return all features."
                        % col
                    )

    def _get_support_mask(self):
        check_is_fitted(self, '_n_features')
        if self.cols is None:
            mask = np.ones(self._n_features, dtype=bool)
        else:
            mask = np.zeros(self._n_features, dtype=bool)
            mask[list(self.cols)] = True
        return mask
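
Quick usage sketch for the class above (assuming the fit behavior shown):

    import numpy as np

    X = np.arange(12).reshape(4, 3)
    y = np.array([0, 1, 0, 1])

    sel = ColumnSelector(cols=[0, 2]).fit(X, y)
    print(sel.get_support())               # [ True False  True]
    print(sel.get_support(indices=True))   # [0 2]
    print(sel.transform(X).shape)          # (4, 2), via SelectorMixin.transform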

rasbt commented 5 years ago

Though IMO, in general you should try to use as much of the sklearn API and design as possible instead of grabbing and modifying only what you need.

Yeah. It was initially intentional to keep it as simple as possible, which makes testing easier; additional functionality and methods would then be added over time.

I think in this case it would be fine to add both the SelectorMixin (for the get_support method) and a _get_support_mask method to both the SequentialFeatureSelector and the ColumnSelector, and I would welcome PRs.
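
For the SequentialFeatureSelector, the mask could be derived from the fitted k_feature_idx_ attribute; a rough sketch (it assumes fit() also records the input width as _n_features, which it would need to do; not tested against the current codebase):

    def _get_support_mask(self):
        check_is_fitted(self, 'k_feature_idx_')
        # assumes fit() stored the number of input features as _n_features
        mask = np.zeros(self._n_features, dtype=bool)
        mask[list(self.k_feature_idx_)] = True
        return mask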

Btw, do you know of use cases for the get_support method within the scikit-learn API, e.g., a transformer or similar that uses it? It would then be good to add this as a unit test.

hermidalc commented 5 years ago

Btw, do you know of use cases for the get_support method within the scikit-learn API, e.g., a transformer or similar that uses it? It would then be good to add this as a unit test.

get_support is an essential method used in user space. Every sklearn feature selection method provides it. You use it to get the boolean mask, or the integer indices, of the features selected from your starting X pandas DataFrame or NumPy array.
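
For example (a small sketch; SelectKBest stands in for any fitted selector):

    import pandas as pd
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest

    data = load_iris()
    X = pd.DataFrame(data.data, columns=data.feature_names)
    selector = SelectKBest(k=2).fit(X, data.target)
    # the boolean mask indexes straight into the DataFrame's columns
    print(list(X.columns[selector.get_support()]))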

rasbt commented 5 years ago

Yeah, sure. I was more specifically wondering about scenarios where you need the mask over the feature indices directly. I.e., if you have a feature selection transformer in a pipeline, it wouldn't matter how the transform method is implemented. But I was wondering if there is a specific class in scikit-learn that relies on get_support, so that we can add it to the unit tests to check general compatibility of the feature selectors with those.
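
One option would be a consistency check like the following (a hedged sketch, not an existing scikit-learn test): the transform output must agree with the mask and indices returned by get_support.

    import numpy as np

    def check_get_support_consistency(selector, X, y):
        selector.fit(X, y)
        mask = selector.get_support()
        idx = selector.get_support(indices=True)
        np.testing.assert_array_equal(np.where(mask)[0], idx)
        np.testing.assert_array_equal(selector.transform(X), X[:, mask])

Running sklearn.utils.estimator_checks.check_estimator against the selectors could also cover broader API compatibility.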