tslearn-team / tslearn

The machine learning toolkit for time series analysis in Python
https://tslearn.readthedocs.io
BSD 2-Clause "Simplified" License

[BUG] `silhouette_score` crashes with large dataset #507

Closed: sheraztariq007 closed this issue 7 months ago

sheraztariq007 commented 9 months ago

Describe the bug

I am a beginner at time series analysis. I have used TimeSeriesKMeans, which worked perfectly for my case, but now I want to compute the silhouette score to determine K. Computing the score on a smaller dataset worked fine, but when computing silhouette_score for a dataset with 87389 rows, the process starts to exhaust memory and is then terminated.
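For context, the K-selection loop looks roughly like this (a simplified sketch; the preprocessing and the actual range of K are omitted):

from tslearn.clustering import TimeSeriesKMeans
from tslearn.clustering.utils import silhouette_score

# Simplified sketch: fit k-means for each candidate K, then score the labels.
# The clustering step itself works fine even on the full dataset; only the
# silhouette computation runs out of memory.
for k in range(2, 10):
    labels = TimeSeriesKMeans(n_clusters=k, metric="dtw").fit_predict(X)
    score = silhouette_score(X, labels, sample_size=1000, n_jobs=-1)
    print(k, score)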

To Reproduce

from tslearn.clustering.utils import silhouette_score
score = silhouette_score(X, labels, sample_size=1000, n_jobs=-1, verbose=True)


Additional context

Since I come from a programming background, I did some reverse engineering of the code and have the following findings:

Not related to the issue, but the following code in the silhouette_score function is not reachable:

else:
    X_ = to_time_series_dataset(X)
    n, sz, d = X_.shape
    sklearn_X = X_.reshape((n, -1))

    def sklearn_metric(x, y):
        return metric(to_time_series(x.reshape((sz, d)),
                                     remove_nans=True),
                      to_time_series(y.reshape((sz, d)),
                                     remove_nans=True))

Reason: as mentioned in the documentation and also reflected in the code, if metric=None then DTW is computed, which makes the else condition unreachable.

In reference to this comment: https://github.com/tslearn-team/tslearn/pull/128#discussion_r1492417533

YannCabanes commented 8 months ago

Hello @sheraztariq007, thanks for the issue. The memory issue is not a bug in tslearn; it is due to the fact that a large matrix of shape (n_ts, n_ts) is created to store the pairwise distances.

I also have a memory issue on my local computer when I run the following code:

import numpy as np
n_ts = 66000
a = np.zeros((n_ts, n_ts))
print(a.shape)

Then I have the following error message:

numpy.core._exceptions.MemoryError: Unable to allocate 32.5 GiB for an array with shape (66000, 66000) and data type float64
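The reported size matches a quick back-of-the-envelope check (8 bytes per float64 entry):

n_ts = 66000
print(n_ts * n_ts * 8 / 2**30)  # ~32.46, i.e. the 32.5 GiB from the error message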

Everything works fine, however, when I use slightly fewer time series:

import numpy as np
n_ts = 65000
a = np.zeros((n_ts, n_ts))
print(a.shape)

Then the following message is printed:

(65000, 65000)

When I look at my computer's specifications, I see that its memory is 31.0 GiB.

A first option to solve your issue is to use a computer or server with more memory.
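For reference, the same back-of-the-envelope estimate for the 87389 time series of the original report gives:

n_ts = 87389
print(n_ts * n_ts * 8 / 2**30)  # ~56.9 GiB for the full float64 distance matrix

So roughly 57 GiB of RAM would be needed just to store that matrix.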

YannCabanes commented 8 months ago

Also, the else condition of the silhouette_score function is reachable when the input parameter metric is a callable, which allows the user to provide customized metrics. However, I found an error in this else condition that I fixed in PR https://github.com/tslearn-team/tslearn/pull/508.

In the next tslearn version (the current version is 0.6.3), the following code:

import numpy as np
from tslearn.clustering.utils import silhouette_score
from tslearn.metrics import dtw
from tslearn.metrics import cdist_dtw

np.random.seed(0)

n_ts = 200
sz = 3
d = 2

X = np.random.randn(n_ts, sz, d)
labels = [0] * (n_ts // 2) + [1] * (n_ts // 2)

"""First method"""
score = silhouette_score(
    X=cdist_dtw(X),
    labels=labels,
    metric="precomputed")

print(score)

"""Second method"""
score = silhouette_score(
    X=X,
    labels=labels,
    metric=dtw)

print(score)

will print two equal scores:

0.0018015434161644396
0.0018015434161644396

In the current tslearn version, the second method raises a RecursionError.

This second method of computing the silhouette score might solve your memory issue, since tslearn's silhouette_score function then calls scikit-learn's silhouette_score function, which does not create a distance matrix of shape (n_ts, n_ts) when n_ts is too large. Instead, it creates several partial distance matrices of shape (n_chunk_samples, n_samples), computes the scores for each chunk, and then concatenates the scores.
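For illustration, scikit-learn exposes this chunking mechanism through pairwise_distances_chunked (a minimal sketch with arbitrary sizes; working_memory is a soft limit in MiB):

import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X2d = np.random.randn(1000, 6)  # time series flattened to shape (n_ts, sz * d)

# Each chunk is a partial distance matrix of shape (n_chunk_samples, n_samples),
# sized so that roughly working_memory MiB is allocated at a time.
for chunk in pairwise_distances_chunked(X2d, working_memory=1):
    print(chunk.shape)  # e.g. (131, 1000) per chunk instead of (1000, 1000) at once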