tslearn-team / tslearn

The machine learning toolkit for time series analysis in Python
https://tslearn.readthedocs.io

N-dimensional features issue in the method #496

Open · zandarina1 opened this issue 1 year ago

zandarina1 commented 1 year ago

Hello all,

I want to use two dimensions, i.e. two time series for each participant. I transform the data into the shape expected by the library:

(6431, 5, 2)
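
For context, here is roughly how I build that array (a minimal sketch; series_a and series_b are hypothetical stand-ins for my two signals, one row of length 5 per participant):

import numpy as np

# Two univariate signal sets, one row per participant: shape (6431, 5) each
series_a = np.random.rand(6431, 5)  # placeholder for signal A
series_b = np.random.rand(6431, 5)  # placeholder for signal B

# Stack along a new last axis to get the (n_ts, sz, d) layout tslearn expects
X_train = np.stack([series_a, series_b], axis=-1)
print(X_train.shape)  # (6431, 5, 2)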

However, if I plot the result, both signals are put together in a single plot, and I am not sure the features are considered separately. That is what I want: for example, participant 1 with series A increasing and series B decreasing should be cluster 1. But what I get does not make sense; it is the same as if the data were one-dimensional. If I plot the dimensions separately with X_train[y_pred == yi, :, 0] or X_train[y_pred == yi, :, 1], the result does not make sense either, and the cluster centers are the same for both series/dimensions. How can I plot my data when there are two dimensions, and make the clusters differentiate between dimensions? It would be great to have an example with multiple dimensions in addition to the nice examples in the tutorial. Thanks! Here is the plotting code I am using:

for yi in range(N_CLUSTERS):
    plt.subplot(2, 3, 1 + yi)
    for xx in X_train[y_pred == yi,:,1]:
        plt.plot(xx.ravel(), "k-", alpha=.2)
    plt.plot(km.cluster_centers_[yi,:,1].ravel(), "r-")
    plt.xlim(0, sz)
    plt.ylim(-4, 4)
    plt.text(0.55, 0.85,'Cluster %d' % (yi + 1),
             transform=plt.gca().transAxes)
    if yi == 1:
        plt.title("DTW $k$-means")
YannCabanes commented 11 months ago

Hello @zandarina1, I think your problem comes from a misuse of numpy.ravel, which flattens NumPy arrays: https://numpy.org/doc/stable/reference/generated/numpy.ravel.html#numpy.ravel
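
To make the flattening concrete, here is a minimal toy example (a made-up 3x2 array, not taken from your data) showing how ravel interleaves the two dimensions of a multivariate series:

import numpy as np

# Toy multivariate series: 3 time steps, 2 dimensions
xx = np.array([[1., 10.],
               [2., 20.],
               [3., 30.]])

print(xx.ravel())  # [ 1. 10.  2. 20.  3. 30.] -- both dimensions interleaved

# Plotting xx.ravel() draws a single zig-zag curve of length 6 instead of two
# curves of length 3; plot xx[:, 0] and xx[:, 1] to keep the dimensions apart.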

YannCabanes commented 11 months ago

Taking inspiration from https://tslearn.readthedocs.io/en/stable/auto_examples/clustering/plot_kmeans.html#sphx-glr-auto-examples-clustering-plot-kmeans-py, I have written the following code:

import matplotlib.pyplot as plt
import numpy as np

from tslearn.clustering import TimeSeriesKMeans
from tslearn.datasets import CachedDatasets
from tslearn.preprocessing import TimeSeriesScalerMeanVariance, \
    TimeSeriesResampler

seed = 0
np.random.seed(seed)
X_train, y_train, X_test, y_test = CachedDatasets().load_dataset("Trace")
print(X_train.shape)  # (100, 275, 1)
X_train = np.concatenate([X_train, -X_train], axis=2)  # build a 2nd dimension as the negated series
print(X_train.shape)  # (100, 275, 2)
X_train = X_train[y_train < 4]  # Keep first 3 classes
np.random.shuffle(X_train)
# Keep only 50 time series
X_train = TimeSeriesScalerMeanVariance().fit_transform(X_train[:50])
# Make time series shorter
X_train = TimeSeriesResampler(sz=40).fit_transform(X_train)
sz = X_train.shape[1]
print(sz)  # 40

# Soft-DTW k-means
print("Soft-DTW k-means")
sdtw_km = TimeSeriesKMeans(n_clusters=3,
                           metric="softdtw",
                           metric_params={"gamma": .01},
                           verbose=True,
                           random_state=seed)
y_pred = sdtw_km.fit_predict(X_train)

# One subplot per (cluster, dimension) pair: columns = clusters, rows = dimensions
for yi in range(3):
    for di in range(2):
        plt.subplot(2, 3, 1 + yi + 3 * di)
        for xx in X_train[y_pred == yi]:
            plt.plot(xx[:, di], "k-", alpha=.2)
        plt.plot(sdtw_km.cluster_centers_[yi, :, di], "r-")
        plt.xlim(0, sz)
        plt.ylim(-4, 4)
        plt.text(0.05, 0.85, f"Cluster {yi + 1}, dim {di + 1}",
                 transform=plt.gca().transAxes)
        if yi == 1 and di == 0:
            plt.title("Soft-DTW $k$-means")

plt.tight_layout()
plt.show()
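
Note that with this multivariate input, sdtw_km.cluster_centers_ has shape (3, 40, 2), i.e. (n_clusters, sz, d): indexing it as cluster_centers_[yi, :, di] extracts the center of cluster yi along dimension di, so no ravel is needed. In the resulting figure, each column shows one cluster and each row shows one dimension.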

Does it correspond to what you would like to do?