Closed sheraztariq007 closed 7 months ago
Hello @sheraztariq007, thanks for the issue.
The memory issue is not a bug from tslearn
, it is due the fact that a large matrix of shape (n_ts, n_ts)
is created to store the pairwise distances.
I also have a memory issue on my local computer when I run the following code:
import numpy as np
n_ts = 66000
a = np.zeros((n_ts, n_ts))
print(a.shape)
Then I have the following error message:
numpy.core._exceptions.MemoryError: Unable to allocate 32.5 GiB for an array with shape (66000, 66000) and data type float64
While everything is working fine for when I am using a bit less time series:
import numpy as np
n_ts = 65000
a = np.zeros((n_ts, n_ts))
print(a.shape)
Then the following message is printed:
(65000, 65000)
When I look into my computer parameters, I see that the Memory is equal to 31.0 GiB.
A first possibility to solve your issue is to use a computer or server with more memory.
Also, the else
condition of the function silhouette_score
is reachable when the input parameter metric
is callable.
It allows the user to use customized metrics.
However, I found an error in this else
condition that I fixed in PR https://github.com/tslearn-team/tslearn/pull/508.
In next tslearn
version (current version is 0.6.3), the following code:
import numpy as np
from tslearn.clustering.utils import silhouette_score
from tslearn.metrics import dtw
from tslearn.metrics import cdist_dtw
np.random.seed(0)
n_ts = 200
sz = 3
d = 2
X = np.random.randn(n_ts, sz, d)
labels = [0] * (n_ts // 2) + [1] * (n_ts // 2)
"""First method"""
score = silhouette_score(
X=cdist_dtw(X),
labels=labels,
metric="precomputed")
print(score)
"""Second method"""
score = silhouette_score(
X=X,
labels=labels,
metric=dtw)
print(score)
will print two equal scores:
0.0018015434161644396
0.0018015434161644396
In the current tslearn
version, the second function raises a recursion error.
This second method to compute the silhouette score might solve your memory issue since tslearn
's silhouette_score
function is then calling scikit-learn
's sihouette_score
function which does not create a score matrix of shape (n_ts, n_ts)
when n_ts
is too large. Instead, it creates several partial score matrices of shape (n_chunk_samples, n_samples)
, computes the scores for each chunked matrix and then concatenates the scores.
Describe the bug I am a beginner at Time series analysis and i have used TimeSeriesKmeans, which worked perfectly for my case. But now i want to compute the Silhouette score for determining the K. I used a smaller dataset for the computation of the score and it worked fine, but when computing
silhouette_score
for a data with 87389 row, the process starts to explode the memory and then it terminates.To Reproduce
Environment (please complete the following information):
Additional context As i am from programming background. which is why i did some reverse engineering to the code and i have following findings:
_cdist_generic
indices = be.triu_indices( len(dataset1), k=0 if compute_diagonal else 1, m=len(dataset1) )
Not related to the issue, but the following code in
silhouette_score
function is not reachable:Reason: as you mention in documentation and also depicted in code that, if
metric=None
then it would computedtw
. which makes theelse
condition unreachable.in reference to comment: https://github.com/tslearn-team/tslearn/pull/128#discussion_r1492417533