Open DanielDondorp opened 4 years ago
This may not fully solve your issue but might be relevant to try/look at: #131
Thank you. That seems to confirm my intuition that the KShape algorithm does not easily extend to multivariate data. I kind of assumed that it somehow could work, since the fit method accepts the multidimensional data and returns something without any errors.
If someone could confirm that this is in fact just me not understanding the algorithm, rather than an unexpected result, I can close the issue.
I just ran into the same issue:
KShapes, however, returns the same cluster center for each dimension.
The original paper by Paparrizos et al. seems to have focused on 1d time series only.
To answer the question how KShape handles multivariate ts inputs, I think the following codes (here line 118) may be relevant:
def _shape_extraction(self, X, k):
sz = X.shape[1]
Xp = y_shifted_sbd_vec(self.cluster_centers_[k], X[self.labels_ == k],
norm_ref=-1,
norms_dataset=self.norms_[self.labels_ == k])
S = numpy.dot(Xp[:, :, 0].T, Xp[:, :, 0])
, in which S
takes the first feature from the dimension axis, assuming Xp
has the shape of [num_samples, num_timesteps, num_features]. To understand how Xp
is calculated, the implementation of y_shifted_sbd_vec
can be found here, and by extension the core function normalized_cc
. On line 37:
`cc = numpy.real(numpy.fft.ifft(numpy.fft.fft(s1, fft_sz, axis=0) *
numpy.conj(numpy.fft.fft(s2, fft_sz, axis=0)), axis=0))`
It seems the discrete Fourier transform was done by axis=0, and s1
has the shape [num_timesteps, num_features]. Therefore, each individual feature is transformed individually. Next, the function returns
`return numpy.real(cc).sum(axis=-1) / denom`
, which seems to me that the calculated cc is summed up over all features (axis=-1
). Going back to the beginning of the discussion, I think this is why the cluster center is the same across all features.
Just my two cents... Would love to hear what the author @rtavenar thinks.
Thanks for your thorough comment, Oustella.
Perhaps KShapes should not accept multidimensional timeseries as input then?
I am curious if there has been a resolution over this issue or plans on how to handle this case?
Naively, is it possible to take the cross-correlation over multiple sources? Or even apply cross-correlation independently for each source so that each source can weigh into the shape based detection for each cluster?
If not, it seems like the code should throw a warning if someone passes a 3D matrix into the k-shape code so that users know that the output might not be what they expect.
Love using the code, thank you for all the hard work!
Hello. I executed K-shape, Kmeans and Kernel K-Means methods for multivariate ts dataset. But, K-shape doesn`t seem so efficient, at least for this data structure. Any help will be appreciated!
Last but not least, thank you for your all effort!
I tested solving the eigenvector mu_k
corresponding to the maximum eigenvalue for each dimension separately, and the results are as follows.
The specific code changes are to modify the content of the _shape_extraction
function to the following:
def _shape_extraction(self, X, k):
sz = X.shape[1]
d = X.shape[2]
Xp = y_shifted_sbd_vec(self.cluster_centers_[k], X[self.labels_ == k],
norm_ref=-1,
norms_dataset=self.norms_[self.labels_ == k])
mu_k_list = []
for i in range(d):
S = numpy.dot(Xp[:, :, i].T, Xp[:, :, i])
Q = numpy.eye(sz) - numpy.ones((sz, sz)) / sz
M = numpy.dot(Q.T, numpy.dot(S, Q))
_, vec = numpy.linalg.eigh(M)
mu_k = vec[:, -1].reshape((sz, 1))
# The way the optimization problem is (ill-)formulated, both mu_k and
# -mu_k are candidates for barycenters
# In the following, we check which one is best candidate
dist_plus_mu = numpy.sum(numpy.linalg.norm(Xp[:,:,i].reshape(-1,sz,1) - mu_k, axis=(1, 2)))
dist_minus_mu = numpy.sum(numpy.linalg.norm(Xp[:,:,i].reshape(-1,sz,1) + mu_k, axis=(1, 2)))
if dist_minus_mu < dist_plus_mu:
mu_k *= -1
mu_k_list.append(mu_k)
mu_k = numpy.hstack(mu_k_list)
return mu_k
Hope that helps!
Describe the bug When clustering multivariate timeseries, KShapes returns the same cluster center for each dimension. When I generate 3-dimensional timeseries of 8 catogeries, TimeSeriesKMeans finds these nicely:
KShapes, however, returns the same cluster center for each dimension, and generally fails to separate between the different combinations of motifs:
To Reproduce
Expected behavior I expect KShapes to approximate representative shapes in each dimension, and that each dimension within a cluster can have its own unique cluster center.
Environment (please complete the following information):