tslearn-team / tslearn

The machine learning toolkit for time series analysis in Python
https://tslearn.readthedocs.io
BSD 2-Clause "Simplified" License

inconsistent / ambiguous values for silhouette_score for identical configuration of TimeSeriesKMeans #278

Open bkolb249 opened 4 years ago

bkolb249 commented 4 years ago

Describe the bug

When clustering with TimeSeriesKMeans, silhouette_score yields different results from run to run even though the configuration is identical (apart from the random state, obviously).

The problem is that the silhouette score for n=2 is sometimes higher than the score for n=3 and sometimes lower, so the silhouette score cannot be used to determine the optimal number of clusters.

Is this expected behavior?

To Reproduce

from tslearn.datasets import UCR_UEA_datasets
from tslearn.clustering import TimeSeriesKMeans, silhouette_score

# Load the CBF dataset through the UCR/UEA loader
ds = UCR_UEA_datasets()
X_train, y_train, X_test, y_test = ds.load_dataset("CBF")

Repeat the same clustering process (n=2 clusters) 9 times, each with a different random state, and print the silhouette score:

for i in range(2,11):
    km = TimeSeriesKMeans(n_clusters=2, metric="dtw", max_iter=500, random_state=i*2)
    km.fit(X_train)
    print(silhouette_score(X_train,
                           km.predict(X_train),
                           metric="dtw"))

Results in

0.1379816483983626
0.12501642312525266 # lowest
0.19600175472468492
0.21242253672651584 # highest
0.16945987450892622
0.1696772853353848
0.1924580995443414
0.17290448200720934
0.14888132304396023

Repeat the above procedure with n=3 clusters and print silhouette score:

for i in range(2,11):
    km = TimeSeriesKMeans(n_clusters=3, metric="dtw", max_iter=500, random_state=i*2)
    km.fit(X_train)
    print(silhouette_score(X_train,
                           km.predict(X_train),
                           metric="dtw"))

Results in

0.20512134769791557
0.23869052731302476 # highest
0.1988968698176969
0.210215409064512
0.21938781768333507
0.20512134769791557
0.23869052731302476 # highest
0.17423182636957274
0.10851683338892922 # lowest

So depending on which run is used, either n=2 or n=3 would be selected; the choice is ambiguous.
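To make the overlap concrete, here is a small sketch that summarizes the two lists of scores printed above using only the Python standard library. The variable names (`scores_k2`, `scores_k3`, `ranges_overlap`) are just illustrative; the numbers are copied from the runs above.

```python
from statistics import mean, stdev

# Silhouette scores copied from the runs above (9 random states each).
scores_k2 = [
    0.1379816483983626, 0.12501642312525266, 0.19600175472468492,
    0.21242253672651584, 0.16945987450892622, 0.1696772853353848,
    0.1924580995443414, 0.17290448200720934, 0.14888132304396023,
]
scores_k3 = [
    0.20512134769791557, 0.23869052731302476, 0.1988968698176969,
    0.210215409064512, 0.21938781768333507, 0.20512134769791557,
    0.23869052731302476, 0.17423182636957274, 0.10851683338892922,
]

# Print a per-n summary: mean, sample standard deviation, and range.
for k, scores in ((2, scores_k2), (3, scores_k3)):
    print(f"n={k}: mean={mean(scores):.3f} std={stdev(scores):.3f} "
          f"min={min(scores):.3f} max={max(scores):.3f}")

# The two score ranges overlap if each one starts below the other's maximum.
ranges_overlap = (min(scores_k3) < max(scores_k2)
                  and min(scores_k2) < max(scores_k3))
print("ranges overlap:", ranges_overlap)
```

A single run per n is therefore not enough to rank n=2 against n=3 here, although comparing the means over several random states gives a steadier picture than any individual pair of runs.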

Expected behavior

I would have expected a more or less consistent silhouette score (maybe within ±0.02), or at least for the relative levels of n=2 and n=3 to stay the same.

Environment (please complete the following information):

Additional context

I detected the problem in a different data set, which I did not share here. There, the differences were even larger: for n=2, the score ranged from 0.17 (lowest) to 0.52 (highest).

rtavenar commented 4 years ago

Hi @bkolb249

k-means is known to be highly dependent on initialization, and it seems this is what you are experiencing here. A "reasonable" approach might be to use TimeSeriesKMeans with n_init>1 (e.g. n_init=10 if you can afford it), so that for each number of clusters you compare a good clustering estimator (at least the best one across 10 initializations).

bkolb249 commented 4 years ago

Hi @rtavenar

Thank you for your quick reply. This solves it for the CBF dataset from above... but for my original data, even with n_init=20, I get a min value of 0.233 and a max value of 0.523 for n=2...