DomHudson opened this issue 4 years ago
I'm not sure about the name of the parameter, but I think we could consider a flag for having lighter selectors. We should also document that the corresponding attributes aren't available when the flag is True.
If we start storing the selected indices instead of an array of shape n_features, we'll also need to re-consider how get_support() and _get_support_mask interact with each other.
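For context, a simplified sketch of how those two currently relate (the boolean mask returned by _get_support_mask is the source of truth, and get_support derives indices from it); this is a paraphrase, not the exact scikit-learn source:

```python
import numpy as np

def get_support(selector, indices=False):
    # Simplified: SelectorMixin.get_support() asks the subclass for a boolean
    # mask of shape (n_features,) and, if requested, converts it to indices.
    mask = selector._get_support_mask()
    return np.where(mask)[0] if indices else mask
```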
Thanks for opening the issue. Why is using indices a decrease in model interpretability? This is just storing a dense vs a sparse vector, right? Edit: I now realize that you also need to drop the existing attributes for this to actually help, which does remove some information.
In principle, using a selector here is not really needed, and you'll get a smaller memory footprint if you just don't do selection.
Still, it would be nice to support a sparse mask / indices. The linear models have a "sparsify" method: https://github.com/scikit-learn/scikit-learn/blob/7cc0177f8e8e958b6291433274a07cc67f933985/sklearn/linear_model/_base.py#L357
We could do the same here basically. This method is never called automatically, so it would be up to the user to call it to reduce the memory footprint. We might make pvalues_ etc. sparse and only keep the ones that are not masked out; that would probably only require minimal changes in the logic.
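For reference, a small illustration of calling sparsify() on a fitted linear model (the data here is made up):

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 1_000)
y = (X[:, 0] > 0.5).astype(int)

clf = SGDClassifier(penalty="l1", random_state=0).fit(X, y)
clf.sparsify()                     # converts coef_ in place to a scipy.sparse matrix
print(sparse.issparse(clf.coef_))  # True
```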
> This method is never called automatically, so it would be up to the user to call it to reduce the memory footprint.

@amueller you'd be fine with est.sparsify basically deleting the variances_ attribute? I think I'd still prefer having an __init__ parameter for this, because it would also allow us to avoid computing e.g. pvalues_ (which isn't needed to know the selected indices). With a call to sparsify(), the attribute will still be computed even though it will never be used.
How would you not compute the pvalues_? They are returned from score_func.
Ah indeed.
I still find that deleting an attribute would be a surprising consequence of calling a method called sparsify, though. sparsify() for the linear models only converts coef_ to a sparse matrix but does not delete anything.
Yes, I think I wouldn't delete the attributes but rather replace the values that correspond to dropped features with zeros and make the arrays sparse.
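A tiny sketch of that idea, assuming scipy.sparse as the storage format (the numbers are illustrative):

```python
import numpy as np
from scipy import sparse

variances = np.array([0.0, 1.3, 0.0, 2.1, 0.0])        # dense fitted attribute
support = np.array([False, True, False, True, False])   # selection mask

# Zero out the entries of dropped features and keep the array as a sparse
# matrix, so only the retained features' values consume memory.
sparse_variances = sparse.csr_matrix(np.where(support, variances, 0.0))
print(sparse_variances.nnz)  # 2
```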
Are we sure this is not just a case where the user really just wants a safe way to export a compact predictive model, i.e. ONNX might be a better pick?
Thank you very much for all the engagement in this ticket!
@jnothman Thanks for your response! Agreed that exporting to a compact format like ONNX would probably satisfy the need. I suppose it comes down to a design decision on how much of a focus there is on model size and memory use. I do think these pipeline components in particular are surprisingly heavy.
Definitions
By "selectors", I'm referring to the set of classes that implement sklearn.feature_selection._base.SelectorMixin. Examples are SelectKBest and VarianceThreshold.

Summary
I'd like to suggest either:
These classes save large matrices in their state which, for the most common cases, are not consumed.
Motivation
Consider a fairly typical NLP classifier, like so:
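A minimal sketch of such a pipeline (the hashing width, scorer and classifier are illustrative choices, not the author's original snippet):

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hash token-count dicts into a very wide sparse matrix, keep only the most
# informative columns, then fit a linear classifier on the reduced matrix.
pipeline = Pipeline([
    ("hasher", FeatureHasher(n_features=2 ** 20, alternate_sign=False)),
    ("select", SelectKBest(chi2, k=10_000)),
    ("clf", LogisticRegression()),
])
```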
The FeatureHasher is described as "a low-memory alternative to DictVectorizer and CountVectorizer". The disadvantage of the FeatureHasher is that it will always produce a matrix with n_features columns regardless of the input size; therefore, classes like VarianceThreshold and SelectKBest are commonly used to reduce the size of the feature matrix. The current implementation of these classes mitigates much of the low-memory benefit of the FeatureHasher due to the large amount of information saved in their state.
For example:

- VarianceThreshold saves the array variances_ of shape (n_features,), where n_features is the size of the input.
- SelectKBest saves two arrays of this shape: scores_ and pvalues_.
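To make the cost concrete, an illustrative check (the input width of 2 ** 20 columns is an arbitrary example):

```python
from scipy import sparse
from sklearn.feature_selection import VarianceThreshold

X = sparse.random(1_000, 2 ** 20, density=1e-4, format="csr", random_state=0)
vt = VarianceThreshold().fit(X)

# variances_ is a dense float64 array with one entry per input column,
# i.e. about 8 MB here, no matter how few features are actually retained.
print(vt.variances_.nbytes)
```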
Describe your proposed solution
I suggest that for this type of transformer, the state saved is a single numpy array of only the column indices to retain. An array in this form can be retrieved by calling get_support(indices=True) on a fitted selector instance. The disadvantage of this approach is a decrease in model explainability.
To examine the difference this approach makes, I designed the following implementation:
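A minimal sketch of what such an implementation could look like (the LightSelectKBest name and details here are hypothetical, not the original code):

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import SelectKBest, f_classif


class LightSelectKBest(BaseEstimator, TransformerMixin):
    """Hypothetical selector whose only fitted state is the retained indices."""

    def __init__(self, score_func=f_classif, k=10):
        self.score_func = score_func
        self.k = k

    def fit(self, X, y):
        # Score with a throw-away SelectKBest, then keep only the column
        # indices instead of the full scores_/pvalues_ arrays.
        selector = SelectKBest(self.score_func, k=self.k).fit(X, y)
        self.indices_ = selector.get_support(indices=True)
        return self

    def transform(self, X):
        return X[:, self.indices_]
```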
Comparing the sizes of the pickled objects shows a dramatic decrease:
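One way to reproduce such a comparison (illustrative only; the exact sizes depend on the number of features and the scorer):

```python
import pickle

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
X = rng.rand(200, 20_000)
y = rng.randint(2, size=200)

full = SelectKBest(f_classif, k=100).fit(X, y)
indices_only = full.get_support(indices=True)  # all a "light" selector would need to keep

print(len(pickle.dumps(full)))          # includes scores_ and pvalues_ for all 20,000 columns
print(len(pickle.dumps(indices_only)))  # only the 100 retained column indices
```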