johannfaouzi / pyts

A Python package for time series classification
https://pyts.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Dimensionality Reduction (approximation) along columns/time axis? #8

Open legout opened 6 years ago

legout commented 6 years ago

Hi,

I wonder why the approximation functions PAA and DFT are applied to the rows. Based on what I found in Patrick Schäfer's papers and dissertation, they should be applied to the columns (along the time axis). Am I wrong?

For example, the code below raises an error:

import numpy as np
import pyts.approximation as pya

x = np.random.randn(100,2)

paa = pya.PAA(window_size=10)
x_paa = paa.fit_transform(x)

print('Shape of x {}'.format(x.shape))
print('Shape of x_paa {}'.format(x_paa.shape))

ValueError: 'window_size' must be lower or equal than the size of each time series.

However, what I was expecting is the following:

import numpy as np
import pyts.approximation as pya

x = np.random.randn(100,2)

paa = pya.PAA(window_size=10)
x_paa = paa.fit_transform(x)

print('Shape of x {}'.format(x.shape))
print('Shape of x_paa {}'.format(x_paa.shape))

Shape of x (100, 2)
Shape of x_paa (10, 2)

Regards, legout

johannfaouzi commented 6 years ago

Hi legout,

My convention is that each row is a time series, which means that the time axis is the second axis. For instance, x = np.random.randn(100,2) means that you have 100 time series of length 2. If your data is not in this format, you can simply transpose your numpy array using x.T (or the transpose method).

If you're familiar with scikit-learn, you can think of the timestamps as the features of your data.
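
For your snippet above, that just means transposing x before calling fit_transform. A quick sketch, assuming the same PAA API as in your snippet:

import numpy as np
import pyts.approximation as pya

# After transposing, each row is one time series:
# x.T has shape (2, 100), i.e. 2 time series of length 100.
x = np.random.randn(100, 2)

paa = pya.PAA(window_size=10)
x_paa = paa.fit_transform(x.T)

print('Shape of x.T {}'.format(x.T.shape))      # (2, 100)
print('Shape of x_paa {}'.format(x_paa.shape))  # (2, 10)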

Best regards, Johann

legout commented 6 years ago

Hi Johann,

My mistake was to think of every timestamp as a new sample and every feature as a different measurement (e.g. temperature and pressure). But this is only true if there is also one label/output (or multiple labels/outputs) at each timestamp.**

However, if I want to map one (multivariate) time series to one label/output (or multiple labels/outputs), then every timestamp is a feature.

Btw, do you plan to implement WEASEL+MUSE in pyts?

Best regards, Legout

**That was the case for me in every previous project.

johannfaouzi commented 6 years ago

Hi legout,

Multivariate time series are currently not supported in pyts. Adding specific algorithms for multivariate time series would definitely be a great idea. However, pyts is not under very active development at the moment, and I can't make any promises about a release date for such algorithms.

My on-the-fly thought for classifying multivariate time series would be to fit a classifier for each dimension and then use a voting classifier to predict a single label. The issue is that you lose the dependencies between the dimensions, though. You could also reduce the number of dimensions and use a single classifier, but that may be a bad idea if the time series in the different dimensions are really different from each other.
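
As a rough sketch of that idea (hypothetical helpers, not the pyts API), with X shaped (n_samples, n_dimensions, n_timestamps):

import numpy as np

def fit_per_dimension(make_clf, X, y):
    # Fit one univariate classifier per dimension, independently.
    return [make_clf().fit(X[:, d, :], y) for d in range(X.shape[1])]

def predict_by_vote(clfs, X):
    # Per-dimension predictions, shape (n_dimensions, n_samples).
    preds = np.stack([clf.predict(X[:, d, :]) for d, clf in enumerate(clfs)])
    # Majority vote: the most frequent label in each column wins.
    votes = []
    for column in preds.T:
        labels, counts = np.unique(column, return_counts=True)
        votes.append(labels[np.argmax(counts)])
    return np.array(votes)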

Best regards, Johann

Sandy4321 commented 4 years ago

It would be great to add support for multivariate time series, like WEASEL+MUSE from https://github.com/patrickzib/SFA.

johannfaouzi commented 4 years ago

Tools for multivariate time series are provided in the pyts.multivariate module. WEASEL+MUSE is implemented as pyts.multivariate.transformation.WEASELMUSE.

The literature on multivariate time series classification is quite shallow (probably due to the lack of datasets for a very long time). Nonetheless, if you consider each feature of a multivariate time series independently, you can use the utility classes pyts.multivariate.transformation.MultivariateTransformer and pyts.multivariate.classification.MultivariateClassifier to apply a univariate time series algorithm to each feature of a multivariate time series dataset independently.
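
For example (assuming a recent pyts release that ships the pyts.datasets module), wrapping a univariate classifier such as BOSSVS looks like this:

from pyts.classification import BOSSVS
from pyts.datasets import load_basic_motions
from pyts.multivariate.classification import MultivariateClassifier

# BasicMotions: multivariate series of shape (n_samples, 6, 100).
X_train, X_test, y_train, y_test = load_basic_motions(return_X_y=True)

# One BOSSVS classifier is fitted per dimension; labels are majority-voted.
clf = MultivariateClassifier(BOSSVS())
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))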

Hope this helps you a little.

Sandy4321 commented 4 years ago

Really great news! You are the first to implement multivariate time series classification in Python. Only one important question: does your code support a mixture of categorical and continuous features?

johannfaouzi commented 4 years ago

Do you mean time series with categorical values? I don't think that I have ever seen any algorithm in the time series classification literature that can deal with that. Maybe Markov chains would be more suited for such features.
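
As an illustration of the Markov chain idea (a hypothetical helper, nothing from pyts), one could estimate transition probabilities from a categorical sequence and use them as features:

import numpy as np

def transition_matrix(seq, states):
    # Estimate first-order transition probabilities from one sequence.
    idx = {s: i for i, s in enumerate(states)}
    counts = np.zeros((len(states), len(states)))
    for a, b in zip(seq[:-1], seq[1:]):
        counts[idx[a], idx[b]] += 1
    totals = counts.sum(axis=1, keepdims=True)
    # Rows with no observed transitions stay at zero.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

print(transition_matrix(list('ababbac'), states=list('abc')))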

I think a few other Python packages like tslearn and sktime can also deal with multivariate time series.

Sandy4321 commented 4 years ago

They do not; see for example https://github.com/alan-turing-institute/sktime/issues/235.

What about data that is a mixture of continuous and categorical variables at each time sample? For example, the data samples are:

t1: red, 0.4, big, low, 234
t2: green, 0.8, big, high, 12
t3: green, 0.1, small, low, 34
etc.

For example, https://github.com/alan-turing-institute/sktime/blob/master/examples/03_classification_multivariate.ipynb only has simulated data with continuous features.

They do have an idea in https://github.com/tslearn-team/tslearn/issues/172:

"@Sandy4321 it's kind of a late reply, but is it possible to do some kind of initial preprocessing of your categorical variables? e.g. one hot encoding & apply the standard methods should be okay.

You can also apply one of the kernel methods & choose an appropriate kernel which can handle the categorical features... I think ARD kernel is one example, but I forget the details. You can see what the popular bayesian hyperparameter opt. packages do in this case."

"ARD kernel is one example, but I forget the details." Do you have an idea what they mean?

From https://www.cs.toronto.edu/~duvenaud/cookbook/ on discrete data:

"Kernels can be defined over all types of data structures: text, images, matrices, and even kernels. Coming up with a kernel on a new type of data used to be an easy way to get a NIPS paper.

How to use categorical variables in a Gaussian Process regression: there is a simple way to do GP regression over categorical variables. Simply represent your categorical variable with a one-of-k encoding. This means that if your number ranges from 1 to 5, represent that as 5 different data dimensions, only one of which is on at a time.

Then, simply put a product of SE kernels on those dimensions. This is the same as putting one SE ARD kernel on all of them. The lengthscale hyperparameter will now encode whether, when that coding is active, the rest of the function changes. If you notice that the estimated lengthscales for your categorical variables are short, your model is saying that it's not sharing any information between data of different categories."

There is even code (https://github.com/Lkxz/categorical-kernels) and a thesis (https://upcommons.upc.edu/bitstream/handle/2099.1/24508/99930.pdf?sequence=1). See also https://www.researchgate.net/post/What_kernel_functions_can_be_applied_to_categorical_features: "However, if you would like to use kernel functions for categorical data, I think this package [1] might be helpful. In particular, for categorical data, you could use the Aitchison-Aitken kernel [2]."

[1] http://socserv.mcmaster.ca/racine/Rjournal.pdf
[2] http://biomet.oxfordjournals.org/content/63/3/413.abstract (also at https://academic.oup.com/biomet/article-abstract/63/3/413/270829)
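
To make the one-of-k idea from the quote concrete for time series, here is a minimal sketch (hypothetical colour values, numpy only): each category becomes its own binary channel, turning a categorical channel into an ordinary multivariate time series.

import numpy as np

# One categorical channel of length 5 (hypothetical values).
colours = np.array(['red', 'green', 'green', 'red', 'blue'])
categories = np.unique(colours)  # ['blue' 'green' 'red']

# One-of-k encoding: one binary time series per category.
one_hot = (colours[None, :] == categories[:, None]).astype(float)
print(one_hot)  # shape (n_categories, n_timestamps) = (3, 5)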

johannfaouzi commented 4 years ago

I'm a bit annoyed by the lack of literature on this topic, but at this stage there is no real way to deal with categorical time series in pyts.

I will consider adding a kernel module in a future release. It would contain popular kernels for continuous time series, such as GAK, and it would be an opportunity to add kernels for categorical time series.
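
For reference, the core of GAK is a short dynamic program. Below is a minimal sketch of the recursion from Cuturi's fast global alignment kernels paper (not the pyts API; real implementations work in log-space to avoid underflow on long series):

import numpy as np

def gak(x, y, sigma=1.0):
    # Global Alignment Kernel between two univariate time series.
    x, y = np.asarray(x, float), np.asarray(y, float)
    # Gaussian local kernel, made suitable for GAK via k / (2 - k).
    k = np.exp(-(x[:, None] - y[None, :]) ** 2 / (2 * sigma ** 2))
    k = k / (2 - k)
    # Sum over all alignments by dynamic programming.
    M = np.zeros((len(x) + 1, len(y) + 1))
    M[0, 0] = 1.0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            M[i, j] = k[i - 1, j - 1] * (M[i, j - 1] + M[i - 1, j] + M[i - 1, j - 1])
    return M[-1, -1]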