
Feature selector instances are too heavy-weight #17816

Open DomHudson opened 4 years ago

DomHudson commented 4 years ago

Definitions

By "selectors", I'm referring to the set of classes that implement sklearn.feature_selection._base.SelectorMixin.

Examples include VarianceThreshold and SelectKBest.

Summary

I'd like to suggest either:

  1. modifying these "selectors" to save less information in their state or
  2. providing a more minimal set of "selectors" with less information saved in their state

These classes store large arrays in their state which, in the most common cases, are never consumed after fitting.

Motivation

Consider a fairly typical NLP classifier such as:

from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

Pipeline([
    ('feature-hashing', FeatureHasher(n_features=2**20)),
    ('variance', VarianceThreshold(threshold=0.0)),
    ('clf', LogisticRegression())
])

The FeatureHasher is described as "a low-memory alternative to DictVectorizer and CountVectorizer". The disadvantage of the FeatureHasher is that it always produces a matrix with n_features columns regardless of the input size; therefore, classes like VarianceThreshold and SelectKBest are commonly used afterwards to reduce the size of the feature matrix.
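
As a minimal illustration of that fixed output width (a sketch, not taken from the original report):

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=2**20)

# A single, tiny document still produces a row of width n_features.
X = hasher.transform([{'token': 1}])
print(X.shape)
>>> (1, 1048576)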

The current implementation of these classes negates much of the FeatureHasher's low-memory benefit because of the large amount of information saved in their state.

For example, VarianceThreshold stores variances_ and SelectKBest stores scores_ and pvalues_, each a full array of shape (n_features,), even though transform() only needs to know which columns to keep.
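
A rough stand-in sketch (not the original report's numbers) of how large the stored attribute is relative to what transform() actually needs:

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# One informative column amongst a million constant ones.
X = np.zeros((100, int(1e6)))
X[:, 0] = np.random.rand(100)

vt = VarianceThreshold(threshold=0.0).fit(X)

# variances_ is a float64 array of shape (n_features,) kept on the estimator.
print(vt.variances_.nbytes)
>>> ~ 8 megabytes

# All that transform() actually needs: the single retained column index.
print(vt.get_support(indices=True))
>>> [0]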

Describe your proposed solution

I suggest that, for this type of transformer, the only state saved be a single numpy array of the column indices to retain. An array in this form can be retrieved by calling get_support(indices=True) on a fitted selector instance.
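
For concreteness, a minimal sketch showing that the indices alone are enough to reproduce transform() on dense input:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest

X, y = make_classification(n_samples=100, n_features=50)

selector = SelectKBest(k=10).fit(X, y)
indices = selector.get_support(indices=True)  # 1D array of retained column indices

# Plain column indexing reproduces the fitted selector's transform().
assert np.array_equal(selector.transform(X), X[:, indices])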

The disadvantage of this approach is a decrease in model explainability.

To examine the impact of this approach, I designed the following implementation:


class NumpyColumnFilter:

    def __init__(self, relevant_columns):
        """ Constructor.

        :param numpy.ndarray relevant_columns: 1D numpy array containing indices of columns to
            retain.
        :return void:
        """
        self._relevant_columns = relevant_columns

    @classmethod
    def from_sklearn_selector(cls, selector):
        """ Produce a NumpyColumnFilter from an implementation of sklearn's SelectorMixin.

        :param mixed selector:
        :return NumpyColumnFilter:
        """
        return cls(selector.get_support(indices=True))

    def apply(self, X):
        """ Select just the relevant columns.

        :param numpy.ndarray X:
        :return numpy.ndarray:
        """
        return X[:, self._relevant_columns]

class FeatureSelector:

    def __init__(self, selector_class, **selector_kwargs):
        """ Constructor.

        :param mixed selector_class: Selector class used to determine the feature indices to retain.
        :param dict selector_kwargs: Keyword arguments to pass when instantiating the selector class
        :return void:
        """
        self._selector_class = selector_class
        self._selector_kwargs = selector_kwargs
        self._column_filter = None

    def _is_fitted(self):
        """ Is the filter fitted?

        :return bool:
        """
        return self._column_filter is not None

    def fit_transform(self, *args, **kwargs):
        """ Fit and transform.

        :return np.ndarray:
        """
        return self.fit(*args, **kwargs).transform(*args, **kwargs)

    def fit(self, *args, **kwargs):
        """ Fit the algorithm.

        :param np.ndarray X:
        :raises Exception:
        :return self:
        """
        fitted_selector = self._selector_class(**self._selector_kwargs).fit(*args, **kwargs)

        self._column_filter = NumpyColumnFilter.from_sklearn_selector(fitted_selector)
        return self

    def transform(self, X, *args, **kwargs):
        """ Select just the relevant columns.

        :param np.ndarray X:
        :raises Exception:
        :return np.ndarray:
        """
        if not self._is_fitted():
            raise Exception('Not fitted!')

        return self._column_filter.apply(X)
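
For reference, this wrapper can slot into the motivating pipeline in place of the selector itself (a sketch; FeatureSelector is the class defined above, not part of scikit-learn):

from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

Pipeline([
    ('feature-hashing', FeatureHasher(n_features=2**20)),
    ('variance', FeatureSelector(selector_class=VarianceThreshold, threshold=0.0)),
    ('clf', LogisticRegression())
])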

Comparing the sizes of the pickled objects shows a dramatic decrease:

Setup code

import pickle
import sys

import numpy as np
from sklearn import datasets
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import VarianceThreshold

def pickled_size(item):
    return sys.getsizeof(pickle.dumps(item))

X, y = datasets.make_classification(n_samples = 100, n_features = 10)

# Add one-million features without any variance.
X = np.concatenate((X, np.zeros((100, int(1e6)))), axis = 1)

SelectKBest

sklearn_selector_class = SelectKBest().fit(X, y)
print(pickled_size(sklearn_selector_class))
>>> ~ 16 megabytes

feature_selector = FeatureSelector(selector_class = SelectKBest).fit(X, y)
print(pickled_size(feature_selector))
>>> ~ 500 bytes

VarianceThreshold

sklearn_selector_class = VarianceThreshold().fit(X, y)
print(pickled_size(sklearn_selector_class))
>>> ~ 8 megabytes

feature_selector = FeatureSelector(selector_class = VarianceThreshold).fit(X, y)
print(pickled_size(feature_selector))
>>> ~ 400 bytes

NicolasHug commented 4 years ago

I'm not sure about the name of the parameter, but I think we could consider a flag for having lighter selectors. We should also document that the corresponding attributes aren't available when the flag is True.

If we start storing the selected indices instead of an array of shape (n_features,), we'll also need to reconsider how get_support() and _get_support_mask() interact with each other.
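
To illustrate the coupling, here is a simplified sketch of the pattern (not the actual scikit-learn code): get_support() and transform() are both derived from the boolean mask returned by _get_support_mask(), which is in turn computed from a stored full-length array.

import numpy as np

class ToySelector:
    """Simplified sketch of the SelectorMixin pattern (illustrative only)."""

    def fit(self, X):
        # A full-length array of per-feature statistics is stored at fit time.
        self.variances_ = np.var(X, axis=0)
        return self

    def _get_support_mask(self):
        # The boolean mask is recomputed from that stored array...
        return self.variances_ > 0.0

    def get_support(self, indices=False):
        # ...and get_support() / transform() are defined in terms of the mask.
        mask = self._get_support_mask()
        return np.where(mask)[0] if indices else mask

    def transform(self, X):
        return X[:, self.get_support()]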

amueller commented 4 years ago

Thanks for opening the issue. Why is using indices a decrease in model interpretability? This is just storing a dense vs. a sparse vector, right? [edit] I now realize that you also need to drop the existing attributes for this to actually help, which does remove some information. [/edit]

In principle, using a selector here is not really needed, and you'll get a smaller memory footprint if you just don't do selection.

Still, it would be nice to support a sparse mask / indices. The linear models have a "sparsify" method: https://github.com/scikit-learn/scikit-learn/blob/7cc0177f8e8e958b6291433274a07cc67f933985/sklearn/linear_model/_base.py#L357

We could basically do the same here. This method is never called automatically, so it would be up to the user to call it to reduce the memory footprint. We might make pvalues_ etc. sparse and only keep the values that are not masked out; that would probably require only minimal changes to the logic.
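
A rough sketch of the kind of saving that could give, reusing the setup from the issue above (the numbers are approximate, and a real sparsify() would also need the selector's own methods adapted to the sparse attributes):

import pickle
import sys

import numpy as np
from scipy import sparse
from sklearn import datasets
from sklearn.feature_selection import SelectKBest

X, y = datasets.make_classification(n_samples=100, n_features=10)
X = np.concatenate((X, np.zeros((100, int(1e6)))), axis=1)

selector = SelectKBest().fit(X, y)
mask = selector.get_support()

# Zero out the entries for dropped features and store them sparsely,
# analogous to what LinearModel.sparsify() does for coef_.
sparse_scores = sparse.csr_matrix(np.where(mask, selector.scores_, 0.0))

print(sys.getsizeof(pickle.dumps(selector.scores_)))
>>> ~ 8 megabytes

print(sys.getsizeof(pickle.dumps(sparse_scores)))
>>> ~ a few hundred bytes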

NicolasHug commented 4 years ago

This method is never called automatically, so it would be up to the user to call it to reduce the memory footprint.

@amueller you'd be fine with est.sparsify() basically deleting the variances_ attribute? I think I'd still prefer having an __init__ parameter for this, because it would also allow us to avoid computing e.g. pvalues_ (which isn't needed to know the selected indices). With a call to sparsify(), the attribute would still be computed even though it would never be used.

amueller commented 4 years ago

How would you not compute the pvalues_? They are returned from score_func.

NicolasHug commented 4 years ago

Ah indeed.

I still find that deleting an attribute would be a surprising consequence of calling a method named sparsify, though. sparsify() for the linear models only converts coef_ to a sparse matrix but does not delete anything.

amueller commented 4 years ago

Yes, I think I wouldn't delete the attributes but rather replace the values corresponding to dropped features with zeros and make the arrays sparse.

jnothman commented 4 years ago

Are we sure this is not just a case where the user really just wants a safe way to export a compact predictive model, in which case ONNX might be a better pick?

DomHudson commented 4 years ago

Thank you very much for all the engagement in this ticket!

@jnothman Thanks for your response! Agreed that exporting to a compact format like ONNX would probably satisfy the need. I suppose it comes down to a design decision about how much of a focus there is on model size and memory use. I do think these pipeline components in particular are surprisingly heavy.