Silhouette Score takes close to exponential time when using dtw/soft-dtw

meghjoshii commented 2 years ago

Hi, I am trying to train my model on 25,000 rows of time series data with each row having data for over 60 time-intervals. It is taking 2+ hours to fit and predict using kMeans & soft-dtw as distance metric. However, since I need to find optimal number of clusters, I need to run the algorithm for clusters 3-10 and calculate silhouette score each time.

Calculation of silhouette score is taking 3+ hours. Is there a way to speed this up? Also, how can I speed up this entire process? How can I parallelize both soft-dtw and silhouette score?

Any suggestions would be much appreciated.

Thanks!

NimaSarajpoor commented 2 years ago

@meghjoshii I was wandering and saw your question. Have you tried to calculate distance matrix first, and then fed it to silhouttescore function? The number of observations is huge, so I am not sure even providing pre-computed distance matrix help. So, you can do try a randomly-generated (symmetric) distance matrix of that size (i.e. 25000-by-25000) and give it to silhoutte score. And, see how much it takes. If you can get it in reasonable amount of time, then go ahead and store pairwise DTW distances in distance matrix.

Again, your time series is huge. So, you may get into trouble in calculating dtw distance. you may speed up things by using smaller window size (radius), e.g. a good $r$ is usually less than 10% of length of your time series sequence.

NimaSarajpoor commented 2 years ago

@meghjoshii I just noticed you said K-Means. You may want to take a look at K-Medoids as it accepts distance matrix for clustering!

meghjoshii commented 2 years ago

Thanks, Nima. Will try these out.

I was wondering if there is a way that I can preserve computations (distance matrix) carried out during clustering (i.e., when I run y_pred = sdtw_km.fit_predict(df)). If so, can I pass this to silhouette score in tslearn? Thanks!

rtavenar commented 2 years ago

@meghjoshii check the metric argument in the docs for silhouette_score, one valid value is "precomputed" which corresponds to your use case.

NimaSarajpoor commented 2 years ago

@meghjoshii (@rtavenar: please correct me if I am wrong)

I think sdtw_km needs to calculate the centroid of each cluster in each iteration. So, it takes time when the metric is NOT Euclidean distance. That is why I suggested using K-Medoids. (I had the same issue myself) K-Medoids (in scikit-learn-extra) accepts distance matrix and the centroid is one of the observations (so, no need to calculate new centroid).

So, use tslearn to get distance matrix with sdtw metric, and then feed it to K-Medoids for clustering.

If you can get your clusters in reasonable time with y_pred = sdtw_km.fit_predict(df)), then please ignore my suggestion.

meghjoshii commented 2 years ago

Thanks for your responses. Amazing job with this package!

@rtavenar - yes, I saw that metric argument, however, that would require me to recompute cdist_dtw(new_df) again, which is time consuming.

I am currently running this:

dba = TimeSeriesKMeans( n_clusters = 6, n_jobs = -1, metric="dtw", verbose=True, random_state=seed ) y_pred = dba.fit_predict( new_df )

This takes close to an hour to compute, which is reasonable given my use case.

After this, I would like to calculate the silhouette score, which means I would have to run this, which means I would have to run: silhouette_score(cdist_dtw(new_df), y_pred, metric="precomputed") or silhouette_score(new_df, y_pred, metric = "dtw"). Correct me if I am wrong, but in both these cases the distance matrix is re-computed.

I would like to retrieve the distance matrix computed during fit_predict and pass it to silhouette score to speed up the process. Kindly let me know if this is possible and how to go about it.

Thank you so much! I appreciate your help & suggestions.

NimaSarajpoor commented 2 years ago

@meghjoshii Just sharing my thoughts: To the best of my knowledge, K-Means updates its cluster centroids throughout the fitting process, and, in each iteration, it calculates the distance between observations and the cluster centroids. so, it does NOT calculate the distance BETWEEN OBSERVATIONS.

So, if you want to use silhouette score, you need distance matrix, which means you need to calculate the pairwise dtw distance for C(25000, 2) number of cases, where C stands for combination. I do not think K-Means calculate that throughout the process.

Advice: You might be better off if you check for other clustering validation metrics that work with the distance between observation and centroids. Then, you just need to do pairwise calculation for 25000 * 6 cases (when n_clusters=6).

meghjoshii commented 2 years ago

Thanks @Nima. Maybe I could use Inertia & Elbow Curves instead since that uses the distance between observations and cluster centroids. Even then, how can I extract that distance matrix from kmeans in tslearn?

NimaSarajpoor commented 2 years ago

I believe you can get that from your fitted model. You may want to take a look at the document:

https://tslearn.readthedocs.io/en/stable/gen_modules/clustering/tslearn.clustering.TimeSeriesKMeans.html

And look for inertia

rtavenar commented 2 years ago

Thanks for your responses. Amazing job with this package!

@rtavenar - yes, I saw that metric argument, however, that would require me to recompute cdist_dtw(new_df) again, which is time consuming.

I am currently running this:

dba = TimeSeriesKMeans( n_clusters = 6, n_jobs = -1, metric="dtw", verbose=True, random_state=seed ) y_pred = dba.fit_predict( new_df )

This takes close to an hour to compute, which is reasonable given my use case.

After this, I would like to calculate the silhouette score, which means I would have to run this, which means I would have to run: silhouette_score(cdist_dtw(new_df), y_pred, metric="precomputed") or silhouette_score(new_df, y_pred, metric = "dtw"). Correct me if I am wrong, but in both these cases the distance matrix is re-computed.

I would like to retrieve the distance matrix computed during fit_predict and pass it to silhouette score to speed up the process. Kindly let me know if this is possible and how to go about it.

Thank you so much! I appreciate your help & suggestions.

Well in this case y_pred is your matrix of distances between the series and the centroids so there is no need to re compute this

meghjoshii commented 2 years ago

y_pred just gives me the cluster numbers, how can I extract the distance matrix? (distance between each time series and its cluster center)?

Thanks!

NimaSarajpoor commented 2 years ago

@meghjoshii

Approach 1: create a git branch and modify the module so it can return the last distance matrix at the end of process, which is the distance matrix D that contains the distances between observations (n_samples) and cluster centers (n_cluster): It would be a n_sample-by-n_cluster matrix (2D numpy array). D[ij] is distance between i-th observation and j-th center.
Approach 2: Just do it yourself! So, you have observations data set (n_samples-by-L) and centroids (n_clusters-by-L), where L is the length of each time series. Then, if you are using dtw, you can use cdist_dtw.

P.S. You should usually start with a few points (e.g. 5) and maybe two centroids on a piece of paper, and think about the thing that you want to achieve. If you can understand the problem at a smaller scale, you can then think about bigger cases like yours. So, what you want to calculate is simply the pair-wise distance between two sets of data points (one set is observations and one set is centroids). Then, you need to think what information you need to calculate it: So, you need the centroids data (not just the labels but the centroid time series data). Then, you need to read the docs to see what information you can obtain from the fitted model (e.g. check out attributes section in kmeans Then, you need to do some search on google or library to see if there is a module that can help you get what you want. Or, vice versa: first look for a module that does what you want and then see what input it requires. Then, try to see if you can find those information from the module you are using.

I provided hyper links in my repsonse so you can click on them.

tslearn-team / tslearn

Silhouette Score takes close to exponential time when using dtw/soft-dtw #408