tslearn-team / tslearn

The machine learning toolkit for time series analysis in Python
https://tslearn.readthedocs.io

Unexpected result for KShape clustering multivariate timeseries #288

Open · DanielDondorp opened this issue 4 years ago

DanielDondorp commented 4 years ago

Describe the bug: When clustering multivariate timeseries, KShape returns the same cluster center for each dimension. When I generate 3-dimensional timeseries of 8 categories, TimeSeriesKMeans finds these nicely:

[figure: TimeSeriesKMeans(max_iter=100, n_clusters=8, n_init=5), 3 dims, 8 clusters]

KShape, however, returns the same cluster center for each dimension, and generally fails to separate the different combinations of motifs:

[figure: KShape(n_clusters=8, n_init=5), 3 dims, 8 clusters]

To Reproduce

import numpy as np
import matplotlib.pyplot as plt
from tslearn.clustering import KShape, TimeSeriesKMeans

def make_test_data(motif_length=100):
    # Build one 3-dimensional series: uniform noise plus, per dimension,
    # one of two sine motifs chosen at random.
    n_dims = 3
    motif1 = np.sin(np.linspace(0, np.pi, motif_length))
    motif2 = np.sin(np.linspace(0, 2 * np.pi, motif_length))
    data = np.random.uniform(size=(motif_length, n_dims))
    for dim in range(n_dims):
        if np.random.random() <= 0.5:
            data[:, dim] += motif1
        else:
            data[:, dim] += motif2
    return data.T

def plot_output(model, name):
    # create a figure (note: reads the global `motifs` defined below)
    fig, ax = plt.subplots(3, 8, sharex=True, sharey=True, figsize=(10, 10))
    fig.subplots_adjust(hspace=0, wspace=0)
    fig.suptitle(name)

    # set ylabels
    for i in range(3):
        ax[i, 0].set_yticks([])
        ax[i, 0].set_ylabel(f"dim {i}")

    # plot each series on the axes associated with its cluster label
    for motif, label in zip(motifs, model.labels_):
        for i, trace in enumerate(motif.T):
            ax[i, label].plot(trace, alpha=0.05, lw=0.2, c="gray")

    # plot the cluster centers on top in red; each center has shape (sz, d)
    for i, center in enumerate(model.cluster_centers_):
        ax[0, i].set_title(f"cluster {i}")
        for j in range(center.shape[1]):
            ax[j, i].plot(center[:, j], c="r", alpha=0.8, lw=0.5)

# generate test data, shape (1000, 100, 3) = (n_ts, sz, d)
motifs = np.dstack([make_test_data() for _ in range(1000)]).T

# fit TimeSeriesKMeans model for comparison
kmeans = TimeSeriesKMeans(n_clusters=8, n_init=5, max_iter=100)
kmeans.fit(motifs)

# fit KShape model
kshapes = KShape(n_clusters=8, n_init=5, max_iter=100)
kshapes.fit(motifs)

# visualize the models
models = {"kmeans": kmeans, "kshapes": kshapes}
for name, model in models.items():
    plot_output(model, name)
plt.show()

Expected behavior: I expect KShape to approximate representative shapes in each dimension, and each dimension within a cluster to have its own cluster center.


kushalkolar commented 4 years ago

This may not fully solve your issue but might be relevant to try/look at: #131

DanielDondorp commented 4 years ago

Thank you. That seems to confirm my intuition that the KShape algorithm does not easily extend to multivariate data. I had assumed it could somehow work, since the fit method accepts multidimensional data and returns results without any errors.

DanielDondorp commented 4 years ago

If someone could confirm that this is in fact just me not understanding the algorithm, rather than an unexpected result, I can close the issue.

oustella commented 3 years ago

I just ran into the same issue:

KShapes, however, returns the same cluster center for each dimension.

The original paper by Paparrizos et al. seems to have focused on 1d time series only.

To answer the question of how KShape handles multivariate time series inputs, I think the following code (line 118 in the tslearn source) may be relevant:

def _shape_extraction(self, X, k):
    sz = X.shape[1]
    Xp = y_shifted_sbd_vec(self.cluster_centers_[k], X[self.labels_ == k],
                           norm_ref=-1,
                           norms_dataset=self.norms_[self.labels_ == k])
    S = numpy.dot(Xp[:, :, 0].T, Xp[:, :, 0])
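    # Annotation (not part of the tslearn source): Xp has shape
    # (n_samples_in_cluster, sz, d), so only feature 0 is used to build S.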

Here S takes only the first feature along the dimension axis, assuming Xp has shape [num_samples, num_timesteps, num_features]. To understand how Xp is calculated, see the implementation of y_shifted_sbd_vec in the tslearn source, and by extension the core function normalized_cc. On line 37 of the latter:

  cc = numpy.real(numpy.fft.ifft(numpy.fft.fft(s1, fft_sz, axis=0) *
                                 numpy.conj(numpy.fft.fft(s2, fft_sz, axis=0)), axis=0))

It seems the discrete Fourier transform is taken along axis=0, and s1 has shape [num_timesteps, num_features], so each individual feature is transformed separately. Next, the function returns

  return numpy.real(cc).sum(axis=-1) / denom

which means the calculated cross-correlation cc is summed over all features (axis=-1). Going back to the beginning of the discussion, I think this is why the cluster center is the same across all features.
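To make the effect concrete, here is a minimal standalone sketch in plain numpy, mirroring the two lines quoted above rather than calling tslearn itself: each feature gets its own cross-correlation profile, but the sum over axis=-1 collapses them, so a single best shift is chosen for the whole multivariate series.

import numpy as np

rng = np.random.default_rng(0)
sz, d = 100, 3
s1 = rng.normal(size=(sz, d))
s2 = np.roll(s1, 10, axis=0)  # the same series, shifted in time

# FFT-based cross-correlation along axis=0: one profile per feature
fft_sz = 2 * sz - 1
cc = np.real(np.fft.ifft(np.fft.fft(s1, fft_sz, axis=0) *
                         np.conj(np.fft.fft(s2, fft_sz, axis=0)), axis=0))
print(cc.shape)         # (199, 3): a separate profile per feature

# The sum over features collapses the per-feature information:
# one best shift is selected for all dimensions at once.
summed = cc.sum(axis=-1)
print(summed.shape)     # (199,)
print(summed.argmax())  # a single shift shared by every feature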

Just my two cents... Would love to hear what the author @rtavenar thinks.

DanielDondorp commented 3 years ago

Thanks for your thorough comment, Oustella.

Perhaps KShape should not accept multidimensional timeseries as input, then?

mstickle commented 3 years ago

I am curious if there has been a resolution to this issue, or plans for how to handle this case?

Naively, is it possible to take the cross-correlation over multiple sources? Or to apply cross-correlation independently to each source, so that each source can weigh into the shape-based detection for each cluster?

If not, it seems like the code should throw a warning when someone passes a 3D matrix into the k-shape code, so that users know the output might not be what they expect; a sketch of such a guard follows.
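A minimal sketch of the kind of guard suggested above; check_kshape_input is a hypothetical helper, not something that exists in tslearn:

import warnings
import numpy as np

def check_kshape_input(X):
    # Hypothetical guard: warn when a multivariate dataset is passed,
    # since all dimensions will end up sharing one cluster-center shape.
    X = np.asarray(X)
    if X.ndim == 3 and X.shape[2] > 1:
        warnings.warn(
            "KShape was designed for univariate time series; with d > 1 "
            "features, every dimension of a cluster center will share "
            "the same shape.",
            UserWarning,
        )
    return X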

Love using the code, thank you for all the hard work!

ceydaakbulut95 commented 2 years ago

Hello. I ran the K-Shape, K-Means, and Kernel K-Means methods on a multivariate ts dataset, but K-Shape doesn't seem very effective, at least for this data structure. Any help will be appreciated!

Last but not least, thank you for all your effort!

ArithmeticError commented 5 months ago

I tested solving for the eigenvector mu_k corresponding to the maximum eigenvalue of each dimension separately, and the results are as follows:

[figure: per-dimension cluster centers]

The specific change is to modify the _shape_extraction function as follows:

    def _shape_extraction(self, X, k):
        sz = X.shape[1]
        d = X.shape[2]
        Xp = y_shifted_sbd_vec(self.cluster_centers_[k], X[self.labels_ == k],
                               norm_ref=-1,
                               norms_dataset=self.norms_[self.labels_ == k])
        mu_k_list = []
        for i in range(d):
            S = numpy.dot(Xp[:, :, i].T, Xp[:, :, i])
            Q = numpy.eye(sz) - numpy.ones((sz, sz)) / sz
            M = numpy.dot(Q.T, numpy.dot(S, Q))
            _, vec = numpy.linalg.eigh(M)
            mu_k = vec[:, -1].reshape((sz, 1))
            # The way the optimization problem is (ill-)formulated, both mu_k and
            # -mu_k are candidates for barycenters
            # In the following, we check which one is best candidate
            dist_plus_mu = numpy.sum(numpy.linalg.norm(Xp[:, :, i].reshape(-1, sz, 1) - mu_k, axis=(1, 2)))
            dist_minus_mu = numpy.sum(numpy.linalg.norm(Xp[:, :, i].reshape(-1, sz, 1) + mu_k, axis=(1, 2)))
            if dist_minus_mu < dist_plus_mu:
                mu_k *= -1
            mu_k_list.append(mu_k)
        mu_k = numpy.hstack(mu_k_list)

        return mu_k

Hope that helps!
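For anyone who wants to experiment with this without editing their installed copy of tslearn, a minimal, hedged sketch of patching the method in at runtime. It assumes the modified function above has been copied to module level, and that y_shifted_sbd_vec is importable from tslearn.metrics (the exact import path can vary between tslearn versions):

import numpy
from tslearn.clustering import KShape
from tslearn.metrics import y_shifted_sbd_vec  # assumption: path may differ by version

# Replace the stock method with the per-dimension version defined above.
KShape._shape_extraction = _shape_extraction

model = KShape(n_clusters=8, n_init=5, max_iter=100)
model.fit(motifs)  # `motifs` from the reproduction script earlier in the thread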