Open bkolb249 opened 4 years ago
Hi @bkolb249,
k-means is known to be highly dependent on initialization, and that seems to be what you are experiencing here.
A "reasonable" approach would be to use TimeSeriesKMeans with n_init > 1 (e.g. n_init=10, if you can afford it), so that you compare good clustering estimators (at least the best one across 10 initializations).
Hi @rtavenar,
thank you for your quick reply. This solves it for the CBR dataset from above, but for my original data, even with n_init=20, I get a min value of 0.233 and a max value of 0.523 for n=2.
...
Describe the bug
When clustering with TimeSeriesKMeans, silhouette_score yields different results even though the configuration (except for the random state, obviously) is identical. The problem is that sometimes the silhouette score for n=2 is higher than the score for n=3, and sometimes it is the other way around. So it is not possible to use the silhouette score to determine the optimal number of clusters. Is this expected behavior?
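For reference, the sketch below shows what a silhouette score measures, using only the Python standard library and Euclidean distance. It is an illustrative reimplementation of the standard definition, not tslearn's code, and the toy data and the function name `silhouette` are made up for the example. Because the score depends on the labels as well as on the data, two different labelings of the same series can score very differently.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance.

    For each sample i:
      a = mean distance to the other members of its own cluster,
      b = smallest mean distance to the members of any other cluster,
      s(i) = (b - a) / max(a, b).
    Samples in singleton clusters contribute s(i) = 0.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: contributes 0
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:  # guard against all-identical samples
            total += (b - a) / denom
    return total / len(labels)


# Four toy "series" of length 2 that form two obvious groups.
series = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(series, [0, 0, 1, 1])  # labels follow the two groups
bad = silhouette(series, [0, 1, 0, 1])   # labels cut across the groups
print(f"good={good:.3f} bad={bad:.3f}")
```

On this toy data the first labeling scores about 0.90 and the second about -0.45, which is why a run that converges to a poor local optimum can drag the score down even though the data and the number of clusters are unchanged.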
To Reproduce
Repeat the same clustering process (n=2 clusters) 10 times and print the silhouette score:
Results in
Repeat the above procedure with n=3 clusters and print the silhouette score:
Results in
So depending on the result, either n=2 or n=3 would be selected, but the choice is not unambiguous.
Expected behavior
I would have expected a more or less consistent silhouette score (maybe within ±0.02), or at least that the levels stay the same.
Environment (please complete the following information):
Additional context
I detected the problem in a different data set, which I did not share here. There, the differences were even larger: n=2 sometimes returned 0.17 (lowest) and sometimes 0.52 (highest).