Fit method at preprocessing file

tslearn-team / tslearn

The machine learning toolkit for time series analysis in Python

https://tslearn.readthedocs.io

BSD 2-Clause "Simplified" License


Fit method at preprocessing file #280

Open sebastianpinedaar opened 4 years ago

sebastianpinedaar commented 4 years ago

In the preprocessing file, there are two scalers defined. However, the fit method should have an actual implementation. What I mean is: if I have train and test time-series datasets, I compute the mean and std on the train set and use them to scale the data. Afterward, I apply those previously computed values to the test data. I do not recompute the mean and std on the test data. With the current implementation, this is not possible. It would be great if the fit (or fit_transform) method saved the mean and standard deviation, while transform only used them (instead of recomputing them).
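A minimal sketch of the requested behaviour, written with NumPy only. The class name `FittedMeanVarianceScaler` is hypothetical and is not part of tslearn; pooling the statistics per channel over all series and time steps is also an assumption, since the request does not say along which axes the mean and std should be computed.

```python
import numpy as np


class FittedMeanVarianceScaler:
    """Hypothetical scaler (not tslearn API): fit stores statistics, transform reuses them."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)  # expected shape: (n_ts, sz, d)
        # Assumption: one mean/std per channel, pooled over series and time steps.
        self.mean_ = X.mean(axis=(0, 1))
        self.std_ = X.std(axis=(0, 1))
        self.std_[self.std_ == 0.0] = 1.0  # avoid dividing by zero on constant channels
        return self

    def transform(self, X):
        # No statistics are recomputed here; only the stored ones are applied.
        return (np.asarray(X, dtype=float) - self.mean_) / self.std_

    def fit_transform(self, X):
        return self.fit(X).transform(X)


rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(20, 30, 2))
X_test = rng.normal(loc=5.0, scale=2.0, size=(5, 30, 2))

scaler = FittedMeanVarianceScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # scaled with the train statistics
```

With this split between `fit` and `transform`, the test set never influences the scaling parameters.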

johannfaouzi commented 4 years ago

It is common to scale each time series independently, that is, the mean and standard deviation are computed for each time series separately, and the transformation is applied to each time series separately. For this kind of transformation, it does not really make sense to save the mean and standard deviation, because the training and test sets contain different time series.
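To illustrate why nothing needs to be stored in that case, here is a NumPy sketch of per-series scaling. The function name `scale_each_series` and the `(n_ts, sz, d)` array layout are assumptions made for the example; this is not the tslearn implementation.

```python
import numpy as np


def scale_each_series(X, eps=1e-12):
    """Scale every series (and channel) with its own mean and std."""
    X = np.asarray(X, dtype=float)          # shape: (n_ts, sz, d)
    mean = X.mean(axis=1, keepdims=True)    # shape: (n_ts, 1, d)
    std = X.std(axis=1, keepdims=True)
    std = np.where(std < eps, 1.0, std)     # leave constant series unscaled
    return (X - mean) / std


rng = np.random.default_rng(0)
X_train = rng.normal(size=(4, 50, 1))
X_test = 10.0 + 3.0 * rng.normal(size=(2, 50, 1))

# The test series are scaled without any reference to the training set.
X_test_scaled = scale_each_series(X_test)
```

Each output series has zero mean and unit variance regardless of what the training data looked like, so there is no state for `fit` to keep.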

If you want to scale each time point independently, the easiest solution would be to use the scalers from scikit-learn, but you would have to create one instance per dimension / feature if you have multivariate time series, since their estimators expect 2D input of shape (n_samples, n_features).
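A sketch of that workaround, assuming the data is a NumPy array of shape `(n_ts, sz, d)` with equal-length series. It uses scikit-learn's `StandardScaler`, one instance per dimension; the helper name `transform_per_dimension` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=3.0, size=(20, 30, 2))  # (n_ts, sz, d)
X_test = rng.normal(loc=3.0, size=(5, 30, 2))

# One scaler per dimension. Each slice X[:, :, k] is 2D with shape (n_ts, sz),
# so scikit-learn treats every time point as a feature (a column).
scalers = [StandardScaler().fit(X_train[:, :, k]) for k in range(X_train.shape[2])]


def transform_per_dimension(X):
    """Apply the fitted scalers and stack the results back to (n_ts, sz, d)."""
    return np.stack(
        [scaler.transform(X[:, :, k]) for k, scaler in enumerate(scalers)], axis=2
    )


X_train_scaled = transform_per_dimension(X_train)
X_test_scaled = transform_per_dimension(X_test)  # reuses the train statistics
```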

trewaite commented 4 years ago

Hi Johann. I agree, it is common to scale time series independently, especially for global models. However, I still think that Sebastian has a point. What if the same time series appears in both the train and test sets, just split at a given date? For example, suppose we are training a global model to forecast multiple time series (retail forecasting, say) and want to validate its performance on unseen data for the next n time steps. I think you would still use the mean and std from the training data to scale this test data.
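A small NumPy sketch of that scenario, with made-up data and an arbitrary split index: each series gets its own mean and std, but they are estimated only on the part before the split and then reused for the part after it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic series of length 120 with very different levels (illustrative data).
levels = np.array([100.0, 5.0])[:, None]
scales = np.array([10.0, 1.0])[:, None]
series = levels + scales * rng.normal(size=(2, 120))

split = 100  # hypothetical split date, expressed as a time index
train, test = series[:, :split], series[:, split:]

# Per-series statistics, computed on the training window only.
mean = train.mean(axis=1, keepdims=True)
std = train.std(axis=1, keepdims=True)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # the future window is never used to fit
```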

I would be willing to work on this if I understand the problem correctly.

johannfaouzi commented 4 years ago

I agree that, in the case of forecasting, standardization should use the mean and standard deviation computed on the training set and apply them to both the training and test sets. However, for the moment, tslearn has little to no support for forecasting. I will let @rtavenar say whether it's in the scope of this package or not.

Thank you for being willing to work on this!

GillesVandewiele commented 4 years ago

I would like to second that. Even for shapelet extraction, I often calculate one global mean and stddev per dimension/channel and use these to normalize all of the time series. This is because the magnitude of the signal is often important as well. So I think there are indeed useful use cases. And the current MinMaxScaler from sklearn would scale per column, which is not what we want either.
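The point about magnitude can be checked on a toy example. In the sketch below (synthetic data, NumPy only), two sine waves differ only in amplitude: per-series scaling makes them indistinguishable, while one global mean and std per channel keeps their amplitude ratio.

```python
import numpy as np

t = np.linspace(0.0, 2.0 * np.pi, 100)
# Two series with the same shape but a 5x difference in amplitude: (n_ts, sz, d) = (2, 100, 1).
X = np.stack([np.sin(t), 5.0 * np.sin(t)])[:, :, None]

# Per-series scaling: each series uses its own mean and std.
per_series = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# Global scaling: one mean and std per channel, pooled over all series and time steps.
global_scaled = (X - X.mean(axis=(0, 1))) / X.std(axis=(0, 1))

same_after_per_series = np.allclose(per_series[0], per_series[1])
amplitude_ratio = np.ptp(global_scaled[1]) / np.ptp(global_scaled[0])
```

Here `same_after_per_series` is true, so the amplitude information is gone, whereas `amplitude_ratio` stays at the original factor of 5.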