tslearn-team / tslearn

The machine learning toolkit for time series analysis in Python
https://tslearn.readthedocs.io
BSD 2-Clause "Simplified" License

Questions about documentation #143

Closed: Amelie10 closed this issue 4 years ago

Amelie10 commented 4 years ago

I have some trouble understanding the documentation. We can read: "Then, if we want to manipulate sets of time series, we can cast them to three-dimensional arrays, using to_time_series_dataset.

If time series from the set are not equal-sized, NaN values are appended to the shorter ones and the shape of the resulting array is (n_ts, max_sz, d) where max_sz is the maximum of sizes for time series in the set."

I think n_ts = number of time series and d = dimension, but what exactly do they mean? I mean, d is the number of variables or features, and n_ts is the number of samples, right?

Secondly, I tried to use TimeSeriesKMeans on a 3-dimensional array of shape (2, 53218, 331), but I get a MemoryError, even though my server has 16 GB of RAM and max_iter = 5. Is that normal?

Thanks for your answer,

BR,

Amélie

rtavenar commented 4 years ago

Dear @Amelie10

The meaning is the following:

n_ts is the number of time series in the dataset
sz is the length (number of timestamps) of the time series
d is the dimensionality (number of modalities) of the time series

Hence, a dataset of shape (3, 10, 2) is a dataset made of 3 bivariate time series of length 10, for example.
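
For illustration, here is a minimal sketch of that convention, including the NaN padding mentioned in the documentation (the series values are made up for the example):

>>> from tslearn.utils import to_time_series_dataset
>>> short = [[1.0, 10.0], [2.0, 20.0]]                # 2 timestamps, 2 dimensions
>>> longer = [[3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]  # 3 timestamps, 2 dimensions
>>> X = to_time_series_dataset([short, longer])
>>> X.shape  # (n_ts, max_sz, d): 2 series, padded to length 3, 2 dimensions
(2, 3, 2)
>>> X[0, -1]  # the shorter series was padded with NaN
array([nan, nan])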

rtavenar commented 4 years ago

Secondly, I tried to use TimeSeriesKMeans on a 3-dimensional array of shape (2, 53218, 331), but I get a MemoryError, even though my server has 16 GB of RAM and max_iter = 5. Is that normal?

There is probably a misunderstanding about the dimensions here: the shape you give corresponds to a dataset of 2 time series, so I do not see why you would want to perform clustering on such data.

Amelie10 commented 4 years ago

Dear @Amelie10 The meaning is the following:

n_ts is the number of time series in the dataset
sz is the length (number of timestamps) of the time series
d is the dimensionality (number of modalities) of the time series

Hence, a dataset of shape (3, 10, 2) is a dataset made of 3 bivariate time series of length 10, for example.

You mean: you have 2 samples (I mean 2 numpy arrays of dimension 2), where each column represents a variable/feature/time series (here, you have only 3 columns), and each row represents a time step (here, you have only 10 rows), right?

A concrete example would be one sample about people in Europe and another one about people in Africa, during an athletics event like running. The 3 variables are SPEED, HEART RATE and BODY TEMPERATURE, and you have a value for each of them every 10 s, for instance, right?

Secondly, it's just a test, but it doesn't work.

rtavenar commented 4 years ago

In the example above, we have 3 runners with 2 variables recorded (speed, heart rate) at 10 different timestamps (t = 1, ..., 9, 10).

And this is stored in a 3-dimensional array:

>>> data.shape
(3, 10, 2)
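
For concreteness, a minimal sketch of how such an array could be assembled (the values here are random placeholders, not real measurements):

>>> import numpy as np
>>> # one (10, 2) array per runner; column 0 = speed, column 1 = heart rate
>>> runners = [np.random.rand(10, 2) for _ in range(3)]
>>> data = np.stack(runners)  # gives the (3, 10, 2) shape shown above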
Amelie10 commented 4 years ago

Yes, I said the opposite of what I meant.

And do you find it normal that my test doesn't work? Isn't 16 GB enough?

rtavenar commented 4 years ago

If you do clustering, that means you want to group your time series into clusters, but here you have only two time series, so there is no need for clustering.

Amelie10 commented 4 years ago

Yes, of course, but if you try it with a very small dataset, you do obtain a result.

For example:

from tslearn.utils import to_time_series_dataset
from tslearn.clustering import TimeSeriesKMeans

sample_with_two_channel_1 = [[4, 4], [5, 9], [6, 2], [2, 1], [1, 7], [9, 3], [4, 6]]
sample_with_two_channel_2 = [[4, 5], [9, 9], [2, 6], [1, 6], [8, 8], [3, 1], [6, 1]]
sample_with_two_channel_3 = [[7, 5], [5, 7], [1, 6], [6, 6], [5, 8], [5, 2], [2, 1]]
sample_with_two_channel_4 = [[5, 5], [9, 9], [6, 6], [6, 8], [8, 8], [2, 4], [1, 1]]
sample_with_two_channel_5 = [[1, 7], [2, 8], [1, 9], [1, 3], [2, 4], [3, 1], [2, 3]]
sample_with_two_channel_6 = [[7, 4], [8, 9], [9, 5], [5, 9], [4, 7], [1, 8], [3, 1]]

test = to_time_series_dataset([sample_with_two_channel_1, sample_with_two_channel_2, sample_with_two_channel_3, sample_with_two_channel_4, sample_with_two_channel_5, sample_with_two_channel_6])
test.shape  # (6, 7, 2)

kmeans_clusters = TimeSeriesKMeans(n_clusters=2, metric="dtw", max_iter=5, max_iter_barycenter=5, verbose=False, random_state=0, init="k-means++").fit(test)
kmeans_clusters.labels_  # array([0, 1, 1, 1, 0, 1], dtype=int64)

rtavenar commented 4 years ago

Yes, that seems normal, no? You are trying to cluster 6 time series into 2 groups, which is feasible.

I don't get your point here.

Amelie10 commented 4 years ago

Yes, it seems normal, but if you try with a dataset of shape (2, 53000, 331), 16 GB of RAM isn't enough. Is that normal?

rtavenar commented 4 years ago

Time complexity for a single DTW computation is O(sz^2 * d). Space complexity for a single DTW computation is O(sz^2). In practice, one has to fill an sz x sz matrix. For float32 matrices, a (53000, 53000) matrix should then take 53000 × 53000 × 32 bits = 53000 × 53000 × 4 bytes, i.e. more or less 10 GB (if I did not mess up my computations).
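
As a quick sanity check on that figure (plain arithmetic, nothing tslearn-specific):

>>> sz = 53000
>>> round(sz * sz * 4 / 1024 ** 3, 2)  # one float32 sz x sz cost matrix, in GiB
10.46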

Then, for TimeSeriesKMeans, many such computations are performed (in parallel if n_jobs is neither None nor 1): n_ts * n_barycenters * n_iterations of them. So it is not surprising that it takes very long, imo.
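
To make that count concrete for the (2, 53218, 331) dataset (assuming n_clusters=2 and max_iter=5, as in the small test above; those values were not stated for the large run):

>>> n_ts, n_barycenters, n_iterations = 2, 2, 5
>>> n_ts * n_barycenters * n_iterations  # DTW computations, each filling a ~10 GiB matrix
20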

Amelie10 commented 4 years ago

OK, I see. And that computation is just for d = 1, so I really need a lot more RAM!

rtavenar commented 4 years ago

Memory usage for DTW is independent of the dimension d (apart from the memory used to store the time series themselves, of course).
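
A minimal sketch of why (one standard way to build DTW's pairwise cost matrix is scipy's cdist; the sizes here are arbitrary):

>>> import numpy as np
>>> from scipy.spatial.distance import cdist
>>> for d in (1, 331):
...     s1, s2 = np.random.rand(500, d), np.random.rand(500, d)
...     print(d, cdist(s1, s2, "sqeuclidean").shape)  # sz x sz, whatever d is
...
1 (500, 500)
331 (500, 500)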