Parameter 'overlapping' in GramianAngularField clarification

Sharan123 commented 5 years ago

This is a question, and I hope it is ok to ask it here. While looking at the pyts.image.gaf module I've come across the parameter 'overlapping' that says "If True, reduce the size of each time series using PAA with possible overlapping windows."

However, I am not clear on what effect this parameter has on the GAF image calculation. By how much does it reduce the size of the time series using Piecewise Aggregate Approximation, and how does it determine the possible overlapping windows (what are those windows)?

Also, in the example of GramianAngularField (https://pyts.readthedocs.io/en/latest/auto_examples/image/plot_gaf.html) I see that fit_transform is not passed the y values. Or are they encoded in the variable X somehow?

johannfaouzi commented 5 years ago

Hi,

It's perfectly fine to ask questions here, as there is nothing else set up to do so (I may create a gitter channel in the future if it becomes necessary).

However, I am not clear on what effect this parameter has on the GAF image calculation. By how much does it reduce the size of the time series using Piecewise Aggregate Approximation, and how does it determine the possible overlapping windows (what are those windows)?

Quoting the article used for the implementation, "Imaging Time-Series to Improve Classification and Imputation", at the end of section 2.1:

"To reduce the size of the GAFs, we apply Piecewise Aggregation Approximation (PAA) to smooth the time series while preserving the trends."

PAA is a simple technique to decrease the number of points in a time series: it splits the time series into consecutive windows and takes the mean of the values in each window. For instance, if you have a time series with 6 time points and you want a time series with 3 time points, PAA takes the mean value of time points 0 and 1, 2 and 3, 4 and 5.
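The averaging described above can be sketched in a few lines of plain Python. This is only an illustration, not the pyts implementation: the function name `paa` is made up here, and the sketch assumes the number of time points is an exact multiple of the number of output points.

```python
def paa(series, n_segments):
    """Piecewise aggregate approximation for the evenly divisible case.

    Hypothetical helper for illustration; assumes len(series) is an
    exact multiple of n_segments.
    """
    size, remainder = divmod(len(series), n_segments)
    if remainder:
        raise ValueError("this sketch only handles evenly divisible lengths")
    # Mean of each consecutive, non-overlapping window of `size` points.
    return [sum(series[i * size:(i + 1) * size]) / size
            for i in range(n_segments)]


# 6 time points reduced to 3: means of points (0, 1), (2, 3), (4, 5).
print(paa([1.0, 3.0, 2.0, 4.0, 10.0, 20.0], 3))  # [2.0, 3.0, 15.0]
```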

What determines the use of PAA is the image_size parameter (see the documentation of pyts.image.GramianAngularField). By default, image_size=1., which means that PAA is not applied. If you set image_size=0.5, the size of the time series is divided by 2. If you set image_size=10, PAA is applied to produce time series with 10 time points, regardless of the number of time points in the original time series.
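To make the float-versus-integer convention concrete, here is a small sketch of how image_size could be turned into a number of output time points. The helper name `resolve_image_size` is hypothetical, and rounding the fractional case up is an assumption of this sketch; check the library source for the exact rule.

```python
import math


def resolve_image_size(image_size, n_timestamps):
    """Hypothetical helper: number of time points after PAA.

    A float is read as a fraction of the original length, an int as an
    absolute number of time points. Rounding up is an assumption.
    """
    if isinstance(image_size, float):
        return math.ceil(image_size * n_timestamps)
    return image_size


print(resolve_image_size(1., 150))   # 150 (PAA not applied)
print(resolve_image_size(0.5, 150))  # 75
print(resolve_image_size(10, 150))   # 10
```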

The overlapping parameter is used when the final number of time points does not evenly divide the original number of time points. For instance, if you have a time series with 10 time points, and you want images to be 3x3, there are two possibilities:

with overlapping windows, the windows have the same size (4) and will be [0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9] (the numbers indicate the indices for each window);
with non-overlapping windows, the windows will have different sizes and will be [0, 1, 2], [3, 4, 5, 6], [7, 8, 9].

When the final number of time points does divide the original number of time points, overlapping is not used (the windows are always non-overlapping). For instance, if you have a time series with 10 time points, and you want images to be 5x5, the windows will always be [0, 1], [2, 3], [4, 5], [6, 7], [8, 9].
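The three cases above can be reproduced with the following plain-Python sketch. The helper `paa_windows` is hypothetical and is not the pyts code: the way the overlap is spread across windows and the rounding of the non-overlapping boundaries are assumptions chosen so that the output matches the examples in this answer. It assumes 1 <= n_segments <= n_timestamps.

```python
import math


def paa_windows(n_timestamps, n_segments, overlapping):
    """Hypothetical helper: (start, end) pairs per window, end exclusive."""
    if n_timestamps % n_segments == 0:
        # Evenly divisible: `overlapping` has no effect.
        size = n_timestamps // n_segments
        return [(i * size, (i + 1) * size) for i in range(n_segments)]
    if overlapping:
        # Equal-size windows; the excess is absorbed by shifting each
        # window back so that neighbours share some indices.
        size = math.ceil(n_timestamps / n_segments)
        excess = n_segments * size - n_timestamps
        windows = []
        for i in range(n_segments):
            shift = (i * excess) // (n_segments - 1)
            windows.append((i * size - shift, (i + 1) * size - shift))
        return windows
    # Non-overlapping: unequal sizes, boundaries rounded to the nearest
    # integer (an assumption of this sketch).
    bounds = [round(i * n_timestamps / n_segments)
              for i in range(n_segments + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def as_indices(windows):
    """Expand (start, end) pairs into explicit index lists."""
    return [list(range(start, end)) for start, end in windows]


print(as_indices(paa_windows(10, 3, overlapping=True)))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
print(as_indices(paa_windows(10, 3, overlapping=False)))
# [[0, 1, 2], [3, 4, 5, 6], [7, 8, 9]]
print(as_indices(paa_windows(10, 5, overlapping=True)))
# [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

The value for each output time point is then the mean of the input values at the indices of its window.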

Also, in the example of GramianAngularField (https://pyts.readthedocs.io/en/latest/auto_examples/image/plot_gaf.html) I see that fit_transform is not passed the y values. Or are they encoded in the variable X somehow?

The GAF transformation is totally independent of the labels (y). However, this parameter is still needed for building pipelines. For instance, sklearn.preprocessing.StandardScaler also has a y=None parameter, even though it just standardizes features independently of the labels.

For instance, if you want to transform time series into images using GAF and then classify these images using an SVM, you can build a pipeline:

>>> from pyts.image import GramianAngularField
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> from pyts.datasets import load_gunpoint

>>> # Load a dataset
>>> X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)

>>> # GAF-SVM classifier
>>> gaf = GramianAngularField()
>>> svc = SVC()
>>> gaf_svm = Pipeline([('gaf', gaf), ('svc', svc)])

>>> # Train the classifier on the training set
>>> gaf_svm.fit(X_train, y_train)

>>> # Evaluate the classifier on the test set
>>> gaf_svm.score(X_test, y_test)

Without the y parameter, it would not be possible to fit a pipeline like this (even though it is ignored for creating the images).
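To illustrate why the signature matters, here is a toy, dependency-free sketch of what fitting a pipeline does: every step receives both X and y, so each transformer must accept y even if it ignores it. The classes `Doubler` and `MajorityClassifier` and the function `fit_pipeline` are made up for this illustration and are not part of scikit-learn or pyts.

```python
class Doubler:
    """Toy label-independent transformer (hypothetical)."""

    def fit(self, X, y=None):
        # y is accepted so the pipeline can call fit(X, y), but unused.
        return self

    def transform(self, X):
        return [[2 * value for value in row] for row in X]


class MajorityClassifier:
    """Toy classifier that always predicts the most frequent label."""

    def fit(self, X, y):
        labels = list(y)
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def fit_pipeline(transformers, classifier, X, y):
    """Fit each transformer with (X, y), then fit the final classifier."""
    for step in transformers:
        X = step.fit(X, y).transform(X)
    return classifier.fit(X, y)


model = fit_pipeline([Doubler()], MajorityClassifier(),
                     [[1, 2], [3, 4], [5, 6]], [0, 1, 1])
print(model.predict([[0, 0]]))  # [1]
```

If `Doubler.fit` only accepted X, the call `step.fit(X, y)` inside the loop would raise a TypeError.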

I hope that it answers your question. Feel free to ask more if I was not clear :)

Johann

Sharan123 commented 5 years ago

First, I would like to thank you, not just for this answer, but for this whole library that you have kindly provided open-source.

I already knew how PAA worked, but the parameter 'overlapping' was not clear to me. Now it is totally clear.

As for the 'y' value: I mistakenly assumed that it holds the time series values (that is, that the X values are the x-coordinates, i.e. time instants, and the y values are the y-coordinates of a univariate time series). I see now that 'y' actually represents the class label, whereas the time series values are presumably encoded in the variable X. Is this correct?

Thank you once again, Daniel

johannfaouzi commented 5 years ago

Thank you for your kind words.

As for the 'y' value: I mistakenly assumed that it holds the time series values (that is, that the X values are the x-coordinates, i.e. time instants, and the y values are the y-coordinates of a univariate time series). I see now that 'y' actually represents the class label, whereas the time series values are presumably encoded in the variable X. Is this correct?

Indeed. The notation comes from the machine learning field, where X stands for the input data and y for the target vector (labels for classification, numbers for regression). For time series classification, my convention is that X is a dataset with shape:

(n_samples, n_timestamps) for univariate time series,
(n_samples, n_features, n_timestamps) for multivariate time series.

Note that a feature corresponds to a dimension in a multivariate time series. This is quite different from the definition used in standard machine learning, where a feature is a column of a two-dimensional dataset.

Most of the functionalities made available are for univariate time series. The Gramian Angular Fields can only be used for univariate time series. If you want to use it for multivariate time series, you can use pyts.multivariate.transformation.MultivariateTransformer to transform each feature independently. The result will be a dataset with shape (n_samples, n_features, image_size, image_size) (use flatten=False to get this shape).
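As a conceptual sketch of "transform each feature independently", the plain-Python code below applies a univariate image transform to every feature of every sample and checks the resulting shape. It does not use pyts: `gasf`, `transform_multivariate` and `shape` are hypothetical helpers, and `gasf` is a bare-bones summation GAF (rescale to [-1, 1], take the arccosine, then the cosine of pairwise angle sums) without PAA.

```python
import math


def gasf(series):
    """Bare-bones Gramian angular summation field (hypothetical helper).

    Assumes the series is not constant, so min-max scaling is defined.
    """
    low, high = min(series), max(series)
    scaled = [2 * (value - low) / (high - low) - 1 for value in series]
    # Clip before arccos to guard against floating-point drift.
    angles = [math.acos(max(-1.0, min(1.0, value))) for value in scaled]
    return [[math.cos(a + b) for b in angles] for a in angles]


def transform_multivariate(X):
    """Apply gasf to each feature of each sample independently."""
    return [[gasf(feature) for feature in sample] for sample in X]


def shape(nested):
    """Shape of a regular nested list, as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# 2 samples, 3 features, 5 time points each.
X = [[[float(t + f + s) ** 2 for t in range(5)] for f in range(3)]
     for s in range(2)]
print(shape(X))                          # (2, 3, 5)
print(shape(transform_multivariate(X)))  # (2, 3, 5, 5)
```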

The few functionalities for multivariate time series are in pyts.multivariate (see the documentation of the module for more details).

Johann

johannfaouzi / pyts

Parameter 'overlapping' in GramianAngularField clarification #30