scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Standard "Total Variance" Scaler #27957

Open rebeccaherman1 opened 9 months ago

rebeccaherman1 commented 9 months ago

Desired feature

A preprocessor that removes the mean for each feature, and then scales the total variance of the dataset, rather than the variance of each feature, to 1.

Proposed Solution

A new preprocessor that operates like StandardScaler but scales the total variance instead of the variance of each individual feature

Possible Alternatives

A new input parameter for StandardScaler that allows the user to set the variance of each feature, or to identify groups of features to be treated as individual "macro" features

Additional context

The intended use case is situations where more than one feature (column) is associated with the same data concept (e.g., multiple points in space for sea surface temperature in the Pacific alongside multiple points in space for sea level pressure in the Atlantic).
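
For illustration, the behavior I have in mind is roughly the following (just a sketch; the function name is only for illustration):

import numpy as np

def total_variance_scale(X):
    # Remove each feature's mean, then divide by the square root of the
    # summed per-feature variances so the dataset's total variance becomes 1.
    Xc = X - X.mean(axis=0)
    return Xc / np.sqrt(Xc.var(axis=0).sum())

X = np.random.default_rng(0).normal(size=(100, 5)) * [1, 2, 3, 4, 5]
Xs = total_variance_scale(X)
print(Xs.var(axis=0).sum())  # ~1.0, while the features keep their relative variances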

adrinjalali commented 9 months ago

That's an interesting use case. What do others think? @scikit-learn/core-devs

fkdosilovic commented 9 months ago

I'm interested in implementing this feature if @scikit-learn/core-devs decide to go in this direction.

fkdosilovic commented 9 months ago

To add my two cents to the discussion:

A new preprocessor that operates like StandardScaler but scales the total variance instead of the variance of each individual feature

I'm guessing that we should also consider adding the "robust" analog to this new TotalVarianceScaler, i.e. RobustTotalVarianceScaler. And, of course, the functional implementations, total_variance_scale and robust_total_variance_scale.

A new input parameter for StandardScaler that allows the user to set the variance of each feature, or to identify groups of features to be treated as individual "macro" features

I'm not a fan of this alternative approach since it seems like a rare use case. Additionally, this would require adding the same parameter to other preprocessing functions and transformers, e.g. MinMaxScaler, MaxAbsScaler, Normalizer, etc., to keep the API consistent.

jnothman commented 9 months ago

From an implementation angle we are just reshaping the data to one column before fitting the scaler, etc., and this might make sense as a parameter (single_distribution?) rather than a new estimator.
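
Roughly, something like this (just a sketch; note it pools the mean across features as well as the scale, which differs slightly from per-feature centering):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(100, 5))

# One mean and one scale are estimated over all entries of X ...
scaler = StandardScaler().fit(X.reshape(-1, 1))
# ... and applied uniformly to every feature.
X_scaled = scaler.transform(X.reshape(-1, 1)).reshape(X.shape)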

But I do think it introduces complexity that would need to be justified not just through a theoretical use case, but through empirical demonstration that this actually makes a difference for learning from a dataset of more than a few points.

fkdosilovic commented 8 months ago

From an implementation angle we are just reshaping the data to one column before fitting the scaler, etc., and this might make sense as a parameter (single_distribution?) rather than a new estimator.

From an implementation viewpoint, yes, but the semantics of StandardScaler and the proposed new scaler are different; not by much, but still different enough. It seems to me that explicit classes and functions would be more visible, in the API documentation for instance, than a new parameter. Also, as I mentioned above, I believe that adding a new parameter would require updating the other scalers (e.g. MinMaxScaler, MaxAbsScaler, Normalizer, etc.) to keep the API consistent across scalers. Finally, adding a new parameter would clutter the current interface of the __init__ method, which is very intuitive and clean as is.

But I do think it introduces complexity that would need to be justified not just through a theoretical use case, but through empirical demonstration that this actually makes a difference for learning from a dataset of more than a few points.

Agree.

rebeccaherman1 commented 8 months ago

@jnothman @fkdosilovic Thank you for thinking about this!

But I do think it introduces complexity that would need to be justified not just through a theoretical use case, but through empirical demonstration that this actually makes a difference for learning from a dataset of more than a few points.

The use case is not so theoretical! In fact, it was inspired by my current project, and I believe I already have such an empirical demonstration. I work in causal learning and use tigramite. I am working on methods that reduce the information lost to dimension reduction prior to time-series causal learning, which typically takes only a single time series per conceptual node. Predictably, I found that after applying StandardScaler to my complete dataset, the relative dimension of the different nodes dramatically affects the learned relationships between them.

I phrased the use case in a general way because I believe the problem is not specific to my current project -- I believe it must also arise for any machine-learning method applied to multi-dimensional variables, and, as a climate scientist, I assert that this is a problem for all analysis in my field.

From an implementation angle we are just reshaping the data to one column before fitting the scaler, etc., and this might make sense as a parameter (single_distribution?) rather than a new estimator.

That makes sense @jnothman. I happened to do something slightly different after posting in order to see the empirical difference my requested estimator would make in my current analysis. Here's what I did:

import numpy as np
from sklearn.preprocessing import StandardScaler

class StandardTotalVarianceScaler(StandardScaler):
    '''As in sklearn's StandardScaler, but the total variance is scaled rather
    than each feature's variance. The total variance of a centered array a_ij
    with I samples and J features is defined as sum_ij(a_ij**2)/I, i.e. the
    sum of the per-feature variances.'''

    def fit(self, X, y=None, sample_weight=None):
        # Let StandardScaler compute the per-feature means and variances ...
        T = super().fit(X, y, sample_weight=sample_weight)
        # ... then replace the per-feature scale with a single common scale:
        # the square root of the summed per-feature variances.
        T.scale_ = np.sqrt(np.sum(T.var_)) * np.ones(T.scale_.shape)
        return T
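
For example, on synthetic data the total variance comes out to 1:

X = np.random.default_rng(0).normal(size=(200, 4)) * [1.0, 2.0, 3.0, 4.0]
Xs = StandardTotalVarianceScaler().fit_transform(X)
print(Xs.var(axis=0).sum())  # ~1.0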

Tigramite already has code that divides datasets into query-related groups and fits a deepcopy of the chosen sklearn transformation to each part, because we regularly want to later transform only a part of the data. I've been working on a modification that would divide the dataset into the conceptual variables instead of query-related groups, and then apply deepcopies of the chosen sklearn transformation to each part. This is the context in which I find the total-variance scaler useful, whether it is implemented as I have done above or as @jnothman suggests.

But I do imagine that this new scaler would be more widely useful if there were an sklearn-supported way of dividing datasets into groups. To this end, I had originally imagined an alternate implementation employing an additional input parameter:

A new input parameter for StandardScaler that allows the user to set the variance of each feature, or to identify groups of features to be treated as individual "macro" features

I suppose I misspoke above; I was actually imagining an additional input parameter to the fit method, somewhat parallel to sample_weight.

I'm not a fan of this alternative approach since it seems like a rare use case. Additionally, this would require adding the same parameter to other preprocessing functions and transformers, e.g. MinMaxScaler, MaxAbsScaler, Normalizer, etc., to keep the API consistent.

I actually strongly disagree with your perception @fkdosilovic that the use case would be rare, and would love to see dataset-dividing logic for these other Scalers as well. But perhaps, as you say, this additional input parameter is not the best approach...

It might be nice instead to have a DatasetDivider object, which would be sort of like a pipeline, but instead of iteratively applying a list of transformations, it would divide the data into user-designated groups and then apply a separate instance of the chosen transformation (which could itself be a pipeline) to each group. In this scenario, the functionality would be available for all transformations without further modifications to the other scalers. For some preprocessing transformations (like StandardScaler), using a dataset divider would make no difference; however, it would make a difference for any implementation of the total-variance scaler, and for all sklearn decompositions. If this new object could accept data from a subset of the data groups for transform and inverse_transform, then we wouldn't need the logic I've been implementing in tigramite, the logic that's already there to deal with multiple parallel sklearn transformations, or any similar redundant code implemented by other sklearn users.

jnothman commented 8 months ago

It might be nice instead to have a DatasetDivider object, which would be sort of like a pipeline, but instead of iteratively applying a list of transformations, it would divide the data into user-designated groups and then apply a separate instance of the chosen transformation (which could itself be a pipeline) to each group.

This sounds a lot like a ColumnTransformer, but maybe I'm missing something.

rebeccaherman1 commented 8 months ago

Indeed, it is a lot like ColumnTransformer (sorry I didn't see it before!), though ColumnTransformer appears to be missing some key functionalities. I would love it if they were added.

Most importantly, according to the linked documentation, ColumnTransformer seems to have no inverse_transform functionality. Why is that? Can it be remedied? It doesn't seem to me like that should be too difficult.

Secondly, it would be really wonderful if it was possible to subset a fitted ColumnTransformer by name of the column groups or by column index, thus allowing the researcher to transform and inverse_transform a subset of the column data. Could that be implemented within the code for ColumnTransformer?

With these changes, ColumnTransformer would fulfill my needs here in combination with the StandardTotalVarianceScaler, and use of sklearn in tigramite (and other similar applications) would become as elegant as possible.
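
Regarding the inverse, I imagine something roughly like this would already work (only a sketch; it assumes numeric data, column indices rather than names, no remainder or passthrough transformers, one-to-one feature mappings, and the output_indices_ attribute of recent scikit-learn; the helper name is hypothetical):

import numpy as np

def column_transformer_inverse(ct, Xt, n_features_in):
    # Undo each fitted transformer on its slice of the transformed output
    # and scatter the result back into the original column positions.
    X = np.empty((Xt.shape[0], n_features_in))
    for name, fitted, columns in ct.transformers_:
        if name == "remainder":
            continue
        out = ct.output_indices_[name]  # slice of Xt produced by this transformer
        X[:, columns] = fitted.inverse_transform(Xt[:, out])
    return X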

adrinjalali commented 8 months ago

I wouldn't mind seeing a PR adding an argument to StandardScaler and RobustScaler, to see what kind of complexity it adds to the estimators.

fkdosilovic commented 8 months ago

It seems much easier to add an additional class to the sklearn.preprocessing package than to change the current scalers. Something similar to ColumnTransformer, but specific to the use case proposed by @rebeccaherman1, seems more sensible. Here is an example implementation (I did not check the code; I'm pasting it as an idea at the moment):

from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, Normalizer, StandardScaler

class GroupScaler:
    def __init__(self, base_scaler: str, *, groups: list[list[int]] = None):
        scaler_mapping = {
            'standard': StandardScaler,
            'normalize': Normalizer,
            'minmax': MinMaxScaler,
            'maxabs': MaxAbsScaler,
        }
        # One independent scaler instance per group of columns.
        self.scalers = {i: scaler_mapping[base_scaler]() for i, group in enumerate(groups)}
        self.groups = groups

    def fit(self, X, y=None):
        # Fit each group's scaler on the group's values flattened to one column,
        # so a single set of statistics is estimated for the whole group.
        for i, group in enumerate(self.groups):
            self.scalers[i].fit(X[:, group].reshape(-1, 1))
        return self

    def transform(self, X):
        X_transformed = X.copy()
        for i, group in enumerate(self.groups):
            X_group = X[:, group]
            X_transformed[:, group] = self.scalers[i].transform(X_group.reshape(-1, 1)).reshape(X_group.shape)
        return X_transformed

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)

    def inverse_transform(self, X):
        X_inversed = X.copy()
        for i, group in enumerate(self.groups):
            X_group = X[:, group]
            X_inversed[:, group] = self.scalers[i].inverse_transform(X_group.reshape(-1, 1)).reshape(X_group.shape)
        return X_inversed

If groups is None, the GroupScaler should behave like a single base scaler (not implemented yet).

Of course the class needs to be modified to match scikit-learn's guidelines, but it illustrates the basic idea.
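
Hypothetical usage, matching the sketch above:

import numpy as np

X = np.random.default_rng(0).normal(size=(100, 6))
# Columns 0-2 form one "macro" feature, columns 3-5 another.
scaler = GroupScaler('standard', groups=[[0, 1, 2], [3, 4, 5]])
Xt = scaler.fit_transform(X)
X_back = scaler.inverse_transform(Xt)  # recovers the original data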

GaelVaroquaux commented 8 months ago

Something similar to ColumnTransformer, but specific to the use case proposed by @rebeccaherman1

The ColumnTransformer is a complex beast; typically it needs to evolve to cater for all types of DataFrames. I'd rather minimize such classes. I would be more comfortable with something like the standard scaler, to be combined with the ColumnTransformer.

fkdosilovic commented 8 months ago

The ColumnTransformer is a complex beast; typically it needs to evolve to cater for all types of DataFrames. I'd rather minimize such classes. I would be more comfortable with something like the standard scaler, to be combined with the ColumnTransformer.

Yes, I agree that the number of complex classes/transformers should be minimized and avoided if possible. I did not phrase my comment precisely enough, my bad.

But I believe that, to avoid adding additional parameters to the current scalers, an additional "meta-scaler" as proposed above would be the best compromise between additional API and implementation complexity and additional maintenance effort.

rebeccaherman1 commented 8 months ago

I think I agree with both of you that adding an additional parameter to the scalers is not the best approach. I think I also agree with @GaelVaroquaux that it would be better to make use of the existing ColumnTransformer (which would require some modifications) in combination with the TotalVarianceScaler than to make a new similar object.

I would prefer the generality of ColumnTransformer over the GroupScaler proposed by @fkdosilovic because I would like to be able to use such functionality for sklearn decomposition objects as well, like PCA. I also actually like the ability to define different sklearn transformations for different columns, which could be useful when some of the data have different types (though it would be nice to have a simple helper function that creates a ColumnTransformer from a list of index groups, like the syntax proposed by @fkdosilovic, under the assumption that all the transformations are the same).

This morning I was working on familiarizing myself with the code for ColumnTransformer to see if I might be able to propose something, but I haven't yet finished wrapping my head around exactly what's going on in there.

adrinjalali commented 8 months ago

We definitely don't want to add more ColumnTransformer-like objects. And the parameter added to StandardScaler wouldn't support different groups; the user would then combine a ColumnTransformer with a StandardScaler that uses the whole data to calculate statistics.

rebeccaherman1 commented 8 months ago

(You would still need a scaler that does total variance for this to work, @GaelVaroquaux and @adrinjalali. Otherwise, I see that ColumnTransformer has a weight input, but as a user I don't want to have to calculate those weights from the sizes of my groups.)
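
That is, I'd rather not have to write something like the following by hand (a sketch; the group names are only illustrative; since per-feature StandardScaler gives each group a total variance equal to its number of columns, dividing by the square root of the group size brings the total back to 1):

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

groups = {"sst_pacific": [0, 1, 2, 3], "slp_atlantic": [4, 5]}
col_trans = ColumnTransformer(
    [(name, StandardScaler(), cols) for name, cols in groups.items()],
    # Weights the user would have to compute themselves:
    transformer_weights={name: 1 / np.sqrt(len(cols)) for name, cols in groups.items()},
)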

adrinjalali commented 8 months ago

I'm suggesting to add a constructor argument to StandardScaler to calculate statistics from the whole data, that would solve the issue.

lorentzenchr commented 8 months ago

What @adrinjalali means, I guess, is

  1. Add a parameter (naming to be defined) to StandardScaler.
  2. Use it in combination with a ColumnTransformer like this:

    col_trans = ColumnTransformer(
        [
            ("group1", StandardScaler(total_variance=True), ["col_1", "col_2", "col_3"]),
            ("group2", StandardScaler(total_variance=True), ["col_4", "col_5", "col_6"]),
            ("std_scaled", StandardScaler(total_variance=False), ["col_7", "col_8", "col_9"]),
        ]
    )

rebeccaherman1 commented 8 months ago

That would definitely work, if ColumnTransformer is also given inverse_transform functionality and, I hope, also a way to either pass in a subset of the columns for transform and inverse_transform, or to subset the ColumnTransformer object itself by column/transformer name. I see that, if a dataframe with variable names is used, one can pass in data for transformation with more columns than were used in the fit.

It would be really wonderful to be able to do the same with fewer columns, and even when a dataframe is not used. In this case, the user could pass in the names of the columns, or identify which named transformations will be provided with data.

rebeccaherman1 commented 8 months ago

Should I make a new thread for my proposed changes to ColumnTransformer, or is it fine here?

lorentzenchr commented 8 months ago

Different topic => different/new issue.

rebeccaherman1 commented 8 months ago

OK! Then, in this issue, it should be decided whether to add a new scaler or to add an argument total_variance to the existing StandardScaler (and perhaps other similar preprocessors as well). For continued conversation on expanded ColumnTransformer functionality, see #28130.

lorentzenchr commented 8 months ago

Already answered in https://github.com/scikit-learn/scikit-learn/issues/27957#issuecomment-1877068489:

I'm suggesting to add a constructor argument to StandardScaler to calculate statistics from the whole data, that would solve the issue.

rebeccaherman1 commented 8 months ago

OK! Sorry, I wasn't sure whether that was a suggestion or a decision. Do we want me to make a pull request for it, or is someone else interested in implementing it?

lorentzenchr commented 8 months ago

I understand; it is not so easy to tell whether there is a decision. @adrinjalali suggested it and I gave my 👍, which makes two core devs agreeing. That is already a strong hint. Similarly for the question about ColumnTransformer: several core devs had concerns about that.

@rebeccaherman1 If you want to contribute a PR that would be nice. I'm not aware of anybody else planning to do so.

rebeccaherman1 commented 8 months ago

OK, @lorentzenchr. For which other preprocessors would we also want this functionality?

lorentzenchr commented 8 months ago

@rebeccaherman1 I don't follow. If you want an option for total variance in StandardScaler, then a PR for that is welcome. No other preprocessors / estimators are discussed here.

rebeccaherman1 commented 8 months ago

@lorentzenchr for instance, @adrinjalali mentioned RobustScaler above

adrinjalali commented 7 months ago

@rebeccaherman1 you can start a PR with StandardScaler and when that gets to a conclusion, we can do the same for RobustScaler

mozgit commented 3 weeks ago

Hey @rebeccaherman1, thanks for proposing this! It's a useful feature, and I'm also interested in it. I see that there is a "green light" to implement it, but I don't see any development branches linked to it. So I wonder if there is some code already; maybe you could wrap it up as a PR? In the meantime, I'll check whether there is a simple way to get it done -- please reach out to me if you would like to contribute to the solution.