tslearn-team / tslearn

The machine learning toolkit for time series analysis in Python
https://tslearn.readthedocs.io
BSD 2-Clause "Simplified" License
2.91k stars 339 forks source link

Is it possible to limit the range of similarity computation? (Dynamic Time Warping) [K-Shape] #370

Open StefanR44 opened 2 years ago

StefanR44 commented 2 years ago

I would like to have DTW only move within a certain range, so that a point will not be matched to another point that is more than x steps away.

I hope this visualization helps: https://i.imgur.com/hlw9iOF.png

Thanks in advance

Picture edited from Paparrizos and Gravano, "k-Shape: Efficient and Accurate Clustering of Time Series" (2016)

DTW range limit !

GillesVandewiele commented 2 years ago

That is most definitely possible. Please check our user guide: https://tslearn.readthedocs.io/en/stable/user_guide/dtw.html

The things you are looking for are called constraints. We support two types: itakura and sakoe_chiba

StefanR44 commented 2 years ago

That is most definitely possible. Please check our user guide: https://tslearn.readthedocs.io/en/stable/user_guide/dtw.html

The things you are looking for are called constraints. We support two types: itakura and sakoe_chiba

Ah, thanks a bunch @GillesVandewiele! I did not realize this was still k-shapes related. I will read into it!

GillesVandewiele commented 2 years ago

Most welcome. I am not sure about how to use it for K-shape myself at the moment. It might require some patching. In case you manage to make it work, please comment here on how you did it!

StefanR44 commented 2 years ago

Will do! I will do some fiddling with K-shapes in some ways. The biggest issue with K-shape for me is, that it requires a dataset of equal-sized time series. For my use-case I am clustering time series of energy consumption and sometimes I do not have them for the whole year but only for a few months. Do you know by chance, if there is a way to do the clustering based on actual dates or months (e.g. summer series only matched with other summer series)?

GillesVandewiele commented 2 years ago

Not immediately from the top of my head... You could cut the longer timeseries into smaller windows as a preprocessing step or pad the shorter ones with noise perhaps? These are rather rudimentary and probably not the ideal solution though.

StefanR44 commented 2 years ago

Yes, those are possible workarounds if everything else fails, I am actually on it right now 👍

StefanR44 commented 2 years ago

Did not really succeed on that, as I realized that K-Shape does not use DTW but SBD. Now the questions is, how can that be adjusted? I was checking the functions for SBD usage, I guess this is where it is used / created. I tried to backtrack y_shifted_sbd_vec. It is a function from tslearn: from tslearn.metrics import cdist_normalized_cc, y_shifted_sbd_vec

Is there a way to find out more about it? When i wanted to go to the definition I just got this: from .cycc import cdist_normalized_cc, y_shifted_sbd_vec

What does it exactly do?


def _shape_extraction(self, X, k):
        sz = X.shape[1]
        Xp = y_shifted_sbd_vec(self.cluster_centers_[k], X[self.labels_ == k],
                               norm_ref=-1,
                               norms_dataset=self.norms_[self.labels_ == k])
        S = numpy.dot(Xp[:, :, 0].T, Xp[:, :, 0])
        Q = numpy.eye(sz) - numpy.ones((sz, sz)) / sz
        M = numpy.dot(Q.T, numpy.dot(S, Q))
        _, vec = numpy.linalg.eigh(M)
        mu_k = vec[:, -1].reshape((sz, 1))

        # The way the optimization problem is (ill-)formulated, both mu_k and
        # -mu_k are candidates for barycenters
        # In the following, we check which one is best candidate
        dist_plus_mu = numpy.sum(numpy.linalg.norm(Xp - mu_k, axis=(1, 2)))
        dist_minus_mu = numpy.sum(numpy.linalg.norm(Xp + mu_k, axis=(1, 2)))
        if dist_minus_mu < dist_plus_mu:
            mu_k *= -1

        return mu_k