Closed Amelie10 closed 4 years ago
Dear @Amelie10
The meaning is the following:
n_ts is the number of time series in the dataset
sz is the length (number of timestamps) of the time series
d is the dimensionality (number of modalities) of the time series
Hence, a dataset of shape (3, 10, 2) is a dataset made of 3 bivariate time series of length 10, for example.
Secondly, I tried to use TimeSeriesKMeans on an array of shape (2, 53218, 331), but I get a MemoryError even though my server has 16 GB of RAM and max_iter = 5. Is that normal?
There is probably a misunderstanding about the dimensions here: the shape you give corresponds to a dataset of only 2 time series, so I do not see why you would want to perform clustering on such data.
Dear @Amelie10
The meaning is the following:
n_ts is the number of time series in the dataset
sz is the length (number of timestamps) of the time series
d is the dimensionality (number of modalities) of the time series
Hence, a dataset of shape (3, 10, 2) is a dataset made of 3 bivariate time series of length 10, for example.
You mean: you have 2 samples (that is, 2 two-dimensional numpy arrays), where each column represents a variable/feature/time series (here, only 3 columns) and each row represents a time step (here, only 10 rows), right?
A concrete example would be one sample about people in Europe and another about people in Africa during an athletics event such as running. The 3 variables are SPEED, HEART RATE and BODY TEMPERATURE, and you have a value for each of them every 10 s, for instance, right?
Secondly, it's just a test. But it doesn't work.
In the example above, we have 3 runners with 2 variables recorded (Speed, heart rate) at 10 different timestamps (t=1, ..., 9, 10).
And this is stored in a 3-dimensional array:
>>> data.shape
(3, 10, 2)
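To make the axis order concrete, here is a minimal sketch using plain numpy. The values are random placeholders, not real measurements, and the variable names are only illustrative:

```python
import numpy as np

# Axis order is (n_ts, sz, d): 3 runners, 10 timestamps, 2 variables.
n_ts, sz, d = 3, 10, 2

# Placeholder values only; a real dataset would hold recorded measurements.
rng = np.random.default_rng(0)
data = rng.random((n_ts, sz, d))

print(data.shape)  # (3, 10, 2)

first_runner = data[0]               # one time series, shape (sz, d) = (10, 2)
first_runner_speed = data[0, :, 0]   # one variable over time, shape (sz,) = (10,)
all_runners_at_t0 = data[:, 0, :]    # one timestamp for every runner, shape (n_ts, d) = (3, 2)
```

Indexing the first axis selects a runner, the second a timestamp, and the third a variable.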
Yes, I said the opposite of what I meant.
And do you find it normal that my test doesn't work? Isn't 16 GB enough?
If you do clustering, that means you want to group your time series in clusters, but here you have only two time series, so there is no need for clustering.
Yes, of course, but if you try it with a very small dataset, you do get a result.
For example:
from tslearn.clustering import TimeSeriesKMeans
from tslearn.utils import to_time_series_dataset

# Six toy time series, each with 7 timestamps and 2 channels
sample_with_two_channel_1 = [[4,4], [5,9], [6,2], [2,1], [1,7], [9,3], [4,6]]
sample_with_two_channel_2 = [[4,5], [9,9], [2,6], [1,6], [8,8], [3,1], [6,1]]
sample_with_two_channel_3 = [[7,5], [5,7], [1,6], [6,6], [5,8], [5,2], [2,1]]
sample_with_two_channel_4 = [[5,5], [9,9], [6,6], [6,8], [8,8], [2,4], [1,1]]
sample_with_two_channel_5 = [[1,7], [2,8], [1,9], [1,3], [2,4], [3,1], [2,3]]
sample_with_two_channel_6 = [[7,4], [8,9], [9,5], [5,9], [4,7], [1,8], [3,1]]
test = to_time_series_dataset([
    sample_with_two_channel_1, sample_with_two_channel_2, sample_with_two_channel_3,
    sample_with_two_channel_4, sample_with_two_channel_5, sample_with_two_channel_6,
])
test.shape
(6, 7, 2)
kmeans_clusters = TimeSeriesKMeans(n_clusters=2, metric="dtw", max_iter=5, max_iter_barycenter=5, verbose=False, random_state=0, init="k-means++").fit(test)
kmeans_clusters.labels_
array([0, 1, 1, 1, 0, 1], dtype=int64)
Yes, that seems normal, no? You try to cluster 6 time series into 2 groups, that is feasible.
I don't get your point here.
Yes, it seems normal, but with a dataset of shape (2, 53000, 331), 16 GB of RAM is not enough. Is that normal?
Time complexity for a single DTW computation is O(sz^2 * d).
Space complexity for a single DTW computation is O(sz^2). In practice, one has to fill a sz x sz matrix. For float32 matrices, a (53000, 53000) matrix should then take:
53000 * 53000 * 32 bits = 53000 * 53000 * 4 bytes = more or less 10GB (if I did not mess up my computations)
Then, for TimeSeriesKMeans, many such computations are performed (in parallel if n_jobs is not None or 1): n_ts * n_barycenters * n_iterations. So it is not surprising that it takes very long imo.
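The arithmetic above can be checked with a few lines of plain Python. This is only a back-of-the-envelope sketch: the helper names are made up for illustration, the float64 line is an extra case for comparison, and n_barycenters=2 is an assumed value (max_iter = 5 and n_ts = 2 come from the question).

```python
def dtw_matrix_bytes(sz, itemsize=4):
    """Bytes needed for one sz x sz accumulated-cost matrix (itemsize=4 for float32)."""
    return sz * sz * itemsize


def n_dtw_computations(n_ts, n_barycenters, n_iterations):
    """Number of DTW computations, following the n_ts * n_barycenters * n_iterations count."""
    return n_ts * n_barycenters * n_iterations


sz = 53000
for name, itemsize in (("float32", 4), ("float64", 8)):
    n_bytes = dtw_matrix_bytes(sz, itemsize)
    print(f"{name}: {n_bytes / 1e9:.1f} GB ({n_bytes / 2**30:.1f} GiB)")

# n_barycenters=2 is an assumption made for this illustration only.
print(n_dtw_computations(n_ts=2, n_barycenters=2, n_iterations=5))
```

A single float32 matrix comes to about 11.2 GB (10.5 GiB), which already leaves little room on a 16 GB machine, and a float64 matrix would need twice that.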
OK, I see. And this computation is only for d = 1, so I really need more GB!
Memory usage for DTW is independent of the dimension (apart from the memory used to store the time series themselves, of course).
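To see why, here is a plain-Python sketch of a textbook DTW. It is not tslearn's implementation, and the function name dtw_sketch is made up for this example. The dimension d only enters the local cost between two points, while the accumulated-cost matrix has one cell per pair of timestamps:

```python
import math


def dtw_sketch(a, b):
    """Textbook DTW between two series given as lists of d-dimensional points.

    Returns the DTW distance and the number of cells in the accumulated-cost matrix.
    """
    n, m = len(a), len(b)
    # The matrix is (n + 1) x (m + 1): its size depends on the lengths only, not on d.
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: squared Euclidean distance between two points.
            # This is the only place where d matters (O(d) time per cell).
            cost = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1]))
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return math.sqrt(acc[n][m]), (n + 1) * (m + 1)


# Same length (4 timestamps), very different dimensionality (1 vs 331).
univariate = [[float(t)] for t in range(4)]
multivariate = [[float(t)] * 331 for t in range(4)]

_, cells_d1 = dtw_sketch(univariate, univariate)
_, cells_d331 = dtw_sketch(multivariate, multivariate)
print(cells_d1, cells_d331)  # same number of cells in both cases
```

Going from d = 1 to d = 331 makes each cell slower to compute but does not add any cells to the matrix.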
I have some trouble understanding the documentation, where we can read: "Then, if we want to manipulate sets of time series, we can cast them to three-dimensional arrays, using to_time_series_dataset.
If time series from the set are not equal-sized, NaN values are appended to the shorter ones and the shape of the resulting array is (n_ts, max_sz, d) where max_sz is the maximum of sizes for time series in the set." I think n_ts = number of time series and d = dimension, but what do they mean exactly? I mean, d is the number of variables or features, and n_ts is the number of samples, right?
Secondly, I tried to use TimeSeriesKMeans on an array of shape (2, 53218, 331), but I get a MemoryError even though my server has 16 GB of RAM and max_iter = 5. Is that normal?
Thanks for your answer,
BR,
Amélie