tslearn-team / tslearn

The machine learning toolkit for time series analysis in Python
https://tslearn.readthedocs.io
BSD 2-Clause "Simplified" License

KMeans DTW: Inertia increases with more clusters #306

Open WhiteLin3s opened 3 years ago

WhiteLin3s commented 3 years ago

Hey there,

I've got the following problem: I'm trying to identify the optimal number of clusters for my data. To do so, I run the KMeans (DTW) algorithm with different values for "n_clusters", while keeping all other parameters and the data unchanged.

I focus on the inertia value and use the "elbow method" to identify the optimal number of clusters. I would expect inertia to decrease every time I increase "n_clusters", ceteris paribus. But at some steps, inertia increases even though I use more clusters. I have already tried playing around with the parameters, e.g. setting a random seed or varying the initialization of the clusters, but I can't manage to get a strictly decreasing inertia value.
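
Roughly, what I do looks like this (simplified; the Trace dataset here is just a placeholder for my own data):

from tslearn.clustering import TimeSeriesKMeans
from tslearn.datasets import CachedDatasets

X = CachedDatasets().load_dataset("Trace")[0]  # stand-in for my own dataset

inertias = []
for k in range(2, 10):
    km = TimeSeriesKMeans(n_clusters=k, metric="dtw", random_state=0)
    km.fit(X)
    inertias.append(km.inertia_)
# I then plot inertias against k and look for an "elbow"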

Is there an explanation for this behaviour? Does it have something to do with the DTW metric and barycenter-based clustering? I'm new to time series clustering and would appreciate any help, since I use this package for my ongoing thesis.

Thanks in advance & Greetings!

rtavenar commented 3 years ago

Hi @WhiteLin3s and thanks for your question,

One possible explanation is related to initialization: the methods involved are known to converge to local minima, hence it is possible that this causes the non-decreasing behaviour you observe. To alleviate this, you could define the initial barycenters for n_clusters=k+1 as the final ones from n_clusters=k plus one randomly positioned centroid: in this case, you should always observe that a larger number of clusters decreases the k-means cost.

Best regards, Romain

WhiteLin3s commented 3 years ago

Hey @rtavenar,

thank you very much for the clear explanation and the suggestion on how to fix the problem. Sounds like this should work, I will definitely try it out later!

Best regards

tatianameleshko commented 3 years ago

Having the same issue. Could you please provide a code example of how to initialize the barycenters for n_clusters=k+1 as the final ones from n_clusters=k plus one randomly positioned centroid? Should we use the "init" parameter with the "cluster_centers_" attribute? Thank you

rtavenar commented 3 years ago

I would do something like this:

from tslearn.clustering import TimeSeriesKMeans
from tslearn.datasets import CachedDatasets
import numpy

X = CachedDatasets().load_dataset("Trace")[0]

models = []
# Fit a first model with 2 clusters
km = TimeSeriesKMeans(n_clusters=2, metric="dtw", max_iter=10)
km.fit(X)
models.append(km)

for k in range(3, 6):
    # Warm start: reuse the final barycenters from the previous model and add
    # one randomly picked series from the dataset as the extra centroid
    idx_new_center = numpy.random.choice(X.shape[0])
    barycenters_init = numpy.vstack(
        (models[-1].cluster_centers_, X[idx_new_center:idx_new_center + 1])
    )
    km = TimeSeriesKMeans(n_clusters=k, metric="dtw", max_iter=10, init=barycenters_init)
    km.fit(X)
    models.append(km)

# With this warm start, inertia should decrease as k grows
for km in models:
    print(km.inertia_)

tatianameleshko commented 3 years ago

Thank you! It works with "dtw"; with "softdtw" I still see an increasing trend, but that's fine for me - I just switched to "dtw".

zhouchengcom commented 3 years ago

Is there a code example for identifying the optimal number of clusters? Thanks @tatianameleshko

GillesVandewiele commented 3 years ago

You can use this, but replace the sklearn KMeans & silhouette_score with their tslearn counterparts.

https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#sphx-glr-auto-examples-cluster-plot-kmeans-silhouette-analysis-py

Basically: try different k's and measure the silhouette_score. Take the best one.
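
A minimal sketch of that loop with tslearn's TimeSeriesKMeans and silhouette_score (the Trace dataset and the range of k values are just placeholders for your own data and candidate cluster counts):

from tslearn.clustering import TimeSeriesKMeans, silhouette_score
from tslearn.datasets import CachedDatasets

X = CachedDatasets().load_dataset("Trace")[0]

scores = {}
for k in range(2, 7):
    km = TimeSeriesKMeans(n_clusters=k, metric="dtw", max_iter=10, random_state=0)
    labels = km.fit_predict(X)
    # Use the same DTW metric for the silhouette as for the clustering itself
    scores[k] = silhouette_score(X, labels, metric="dtw")

best_k = max(scores, key=scores.get)
print(scores)
print("Best k according to silhouette:", best_k)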