tslearn-team / tslearn

The machine learning toolkit for time series analysis in Python
https://tslearn.readthedocs.io
BSD 2-Clause "Simplified" License

A general question on TimeSeriesKMeans (TSK) #286

Open NimaSarajpoor opened 4 years ago

NimaSarajpoor commented 4 years ago

Hello tslearn community,

I was wondering how TSK works with the DTW distance. With Euclidean distance, we know that the goal is to minimize the sum of squared distances to the cluster centers. Right?

However, how does it work with DTW? After we initialize the starting centroids for clustering, we assign each sample "x" to its closest centroid.

The distance is: d = DTW(x, centroid)

So, for the assignment step it doesn't matter whether we use d or d**2, since squaring is monotone for non-negative distances. Right? But what happens in the final result?
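To make that concrete, here is a minimal sketch of the assignment step (made-up data and centroids; only tslearn.metrics.dtw is assumed):

```python
import numpy as np
from tslearn.metrics import dtw

rng = np.random.RandomState(0)
X = rng.randn(10, 50)         # 10 toy series of length 50
centroids = rng.randn(3, 50)  # 3 made-up centroids

# DTW distance from every sample to every centroid
d = np.array([[dtw(x, c) for c in centroids] for x in X])

# squaring a non-negative distance is monotone, so the closest
# centroid is the same whether we rank by d or by d**2
assert (d.argmin(axis=1) == (d ** 2).argmin(axis=1)).all()
```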

Do the final clusters minimize the sum of squares OR the sum of absolute distances? The doc says: inertia_: Sum of distances of samples to their closest cluster center.

However, the source code says: def _compute_inertia(distances, assignments, squared=True): "Derive inertia (average of squared distances) from pre-computed distances and assignments."
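Based on that docstring, I would expect the helper to do something like the following (my own paraphrase, not the actual tslearn source):

```python
import numpy as np

def _compute_inertia(distances, assignments, squared=True):
    """Average of (optionally squared) distances of each sample to its
    assigned center, given a precomputed (n_samples, n_clusters)
    distance matrix."""
    d = distances[np.arange(distances.shape[0]), assignments]
    return np.mean(d ** 2) if squared else np.mean(d)
```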

The reason behind my question is that I am trying to implement hierarchical clustering using the DTW distance, and I am trying to understand whether I should use "the sum of absolute distances" or "the sum of squared distances" as the cost for "ward" linkage. It should make a difference in the result, right?

Maybe I am missing something here. It would be great if someone could help me understand this or point me to some references.

Best, Nima

NimaSarajpoor commented 4 years ago

I think I found my answer. Since DBA is computed by minimizing a sum of squared DTW distances, we should use the squared distance in Ward linkage for DTW. However, I still don't understand why inertia_ is documented as the (unsquared) sum of distances to the centers.
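For reference, the quantity DBA drives down can be checked directly on toy data (only dtw and dtw_barycenter_averaging from tslearn are assumed):

```python
import numpy as np
from tslearn.metrics import dtw
from tslearn.barycenters import dtw_barycenter_averaging

rng = np.random.RandomState(0)
X = rng.randn(15, 50)  # 15 toy series of length 50

center = dtw_barycenter_averaging(X)

# DBA iterates to reduce this sum of squared DTW distances to the center
print(sum(dtw(x, center) ** 2 for x in X))
```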

GillesVandewiele commented 4 years ago

I don't think it matters whether you take the squared or absolute value. When calculating Manhattan or Euclidean distances, you take point-wise differences; squaring there gives more weight to large point-wise differences.

DTW is already an aggregated distance, so squaring it should have no effect: the most similar pair (which you merge in hierarchical clustering) will stay the most similar after squaring the distances.

You could try both and see whether there's an actual difference in the results?
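For instance, something along these lines (a toy sketch; it only assumes tslearn's cdist_dtw and scipy's hierarchical clustering):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform
from tslearn.metrics import cdist_dtw

rng = np.random.RandomState(0)
X = rng.randn(20, 50)  # 20 toy series of length 50

D = cdist_dtw(X)  # symmetric pairwise DTW distance matrix

# same precomputed distances, raw vs. squared; with average linkage the
# merge heights (and possibly the merges) can differ, whereas single or
# complete linkage would give the same tree, since squaring preserves order
Z_abs = linkage(squareform(D), method="average")
Z_sq = linkage(squareform(D ** 2), method="average")
```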

The _compute_inertia you mention is used within the clustering module to check for convergence.

NimaSarajpoor commented 4 years ago

Thanks for the response. The results are different. I can provide my code if you think it is necessary, to check whether I am making a mistake.

Please let me elaborate a little bit.

I want to use the HC Ward "concept", but with the DTW distance instead of the Euclidean distance. So the linkage distance between two groups of time series, G1 = {ts_1, ts_2, ..., ts_nG1} with DBA center ts_g1, and G2 = {ts_1, ts_2, ..., ts_nG2} with DBA center ts_g2, is:

linkage distance = cost(G1 ∪ G2) - cost(G1) - cost(G2)

where either

cost(G) = dtw(ts_1, dba_center) + ... + dtw(ts_nG, dba_center)

or

cost(G) = dtw(ts_1, dba_center)^2 + ... + dtw(ts_nG, dba_center)^2

As you can see, we can use either the DTW distance of each member to the centroid, or the square of that distance, and the results differ. Given the standard formula for "ward" linkage with Euclidean distance (see http://www.stat.cmu.edu/~cshalizi/350/lectures/08/lecture-08.pdf) and the way DBA is computed in tslearn (by minimizing squared DTW distances), I am saying that IF I want to implement Ward for DTW, I had better use dtw^2.
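In code, my formulation above is roughly this (a sketch only; cost and ward_like_linkage are my own helper names, and I assume tslearn's dtw and dtw_barycenter_averaging):

```python
import numpy as np
from tslearn.metrics import dtw
from tslearn.barycenters import dtw_barycenter_averaging

def cost(group, squared=True):
    """Sum of (optionally squared) DTW distances of a group's members
    to the group's DBA center."""
    center = dtw_barycenter_averaging(group)
    d = np.array([dtw(ts, center) for ts in group])
    return np.sum(d ** 2) if squared else np.sum(d)

def ward_like_linkage(G1, G2, squared=True):
    """linkage distance = cost(G1 ∪ G2) - cost(G1) - cost(G2)."""
    merged = np.concatenate([np.asarray(G1), np.asarray(G2)])
    return cost(merged, squared) - cost(G1, squared) - cost(G2, squared)
```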

GillesVandewiele commented 4 years ago

OK, that could indeed lead to different results. I think the choice is arbitrary (and how the DBA was found doesn't matter that much). The sum of squares will be more sensitive to outliers.

So in short, taking squares seems OK to me!

NimaSarajpoor commented 4 years ago

Thanks, Gilles, for your input. It's good to get some sort of confirmation :)