Sycor4x opened this issue 5 years ago
That `inverse_transform` is shared with other feature selectors, but I'm okay with this being specialised. To maintain backwards compatibility this might need to be controlled by a parameter which would change from "zero" to "mean" over a deprecation period. PR welcome.
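For illustration only, a temporary parameter of the kind described above might look roughly like this; the subclass, the `fill` parameter name, and the warning text are assumptions here, not scikit-learn's actual API:

```python
# Hypothetical sketch of a temporary backwards-compatibility parameter;
# the subclass and the "fill" parameter are illustrative, not sklearn's API.
import warnings

from sklearn.feature_selection import VarianceThreshold


class VarianceThresholdCompat(VarianceThreshold):
    def __init__(self, threshold=0.0, fill="warn"):
        super().__init__(threshold=threshold)
        self.fill = fill

    def inverse_transform(self, X):
        fill = self.fill
        if fill == "warn":
            # During the deprecation period the default stays "zero" but
            # warns that it will change to "mean" in a future release.
            warnings.warn(
                "The default fill strategy of inverse_transform will change "
                "from 'zero' to 'mean'; set fill explicitly to silence this "
                "warning.",
                FutureWarning,
            )
            fill = "zero"
        if fill == "zero":
            # Current behaviour: removed columns come back as zeros.
            return super().inverse_transform(X)
        # fill == "mean": fill removed columns with the per-column means
        # stored at fit time (see the sketch further down in this thread).
        raise NotImplementedError("mean filling is sketched later in this thread")
```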
@jnothman I'd like to work on this if that is okay.
@theoptips and I want to work on this one
Working with @MDouriez adding myself as a participant. #scikitlearnsprint
So I'm not sure if this change will be beneficial. Why should `VarianceThreshold` have different behavior from other feature selection methods? You could argue the same for other feature selectors, right?
I'm also not sure I understand the semantics of looking at reconstruction error in feature selection. @Sycor4x can you explain why you're doing this or what the interpretation of this is?
@jnothman did you mean adding a parameter to the method or the class? Do we want to later deprecate the parameter or keep it?
I meant a temporary parameter, in the class I suppose. Not exactly elegant. Still, we need to confirm it will solve the issue.
Somehow I missed the notifications for this. My recollection from several years ago is that I expected `transform` and `inverse_transform` to be inverses. To be explicit, if `Z = obj.fit_transform(X)`, then my expectation is that `np.abs(X - obj.inverse_transform(Z)).mean()` should be small, for instance.
The reason I cared, as far as I can recall, is that if I'm applying a sequence of transformations to `X` (e.g. a `Pipeline`) and I want to assess how much information is discarded end-to-end, this has a sharp corner: the reconstruction error from "un-doing" the zero-variance screen can be surprisingly large, and the reason it is so large has more to do with how the back-transformation is performed than with the data or the `Pipeline`.
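A small reproduction of that sharp corner, using the constant-valued-column example from this issue and the current zero-filling `inverse_transform`:

```python
# Reproduces the reconstruction-error concern with current behaviour,
# where inverse_transform fills the removed (zero-variance) column with zeros.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[1.0, 1e6],
              [2.0, 1e6],
              [3.0, 1e6]])  # second column is constant, hence zero variance

vt = VarianceThreshold()            # default threshold=0.0 drops the constant column
Z = vt.fit_transform(X)             # shape (3, 1): only the first column survives
X_back = vt.inverse_transform(Z)    # the dropped column is filled with zeros

print(np.abs(X - X_back).mean())    # ~5e5, dominated by the zero-filled column
```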
Today, the `sklearn.feature_selection.VarianceThreshold.inverse_transform` method fills in zeros for features which were removed for having too-small variance. This is certainly predictable, easy to implement and easy to explain. However, filling in zeros without respect to the data passed to `fit` means that the reconstruction error can become arbitrarily large. For example, suppose that one of the features in your data always takes the value `10**6`. This clearly has zero variance, since it always takes the same value; however, filling in zeros for that feature when the data is round-tripped via `transform` and `inverse_transform` will result in an output which dramatically differs from the input.

Instead, I think it would make sense to fill in the mean of the columns when using `inverse_transform`. The means can be computed and stored from the data passed to `fit`. This will make the reconstruction of the transformed data via `inverse_transform` more closely reflect the data that was passed to `fit`, because any columns which are removed for having variance less than `threshold` must, by definition, be tightly grouped about the sample mean.

Naturally, in the special case where the sample means of the input features passed to `fit` are already zero, the proposed `inverse_transform` method will function the same way it does today, as it will fill in zero values for those features.

In terms of code, this just means keeping an array of column means in addition to the indices of the removed columns.
Of course, it's possible that I have missed an important subtlety, or there is a competing concern which outweighs the argument that I've outlined here. If that's the case, I'd like to know what I've missed!