wannesm / dtaidistance

Time series distances: Dynamic Time Warping (fast DTW implementation in C)

ndim KMeans bug when use_c=True #210

Closed haggihaggi closed 2 months ago

haggihaggi commented 3 months ago

When I run the following:

import numpy as np
from dtaidistance.clustering import KMeans

arr = np.random.random((10, 10, 3))

mod = KMeans(k=2)
cl, p = mod.fit(arr, use_c=True)
print(cl)

I get:

{}
1.715506867936288
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[55], line 7
      4 arr = np.random.random((10, 10, 3))
      6 mod = KMeans(k=2)
----> 7 cl, p = mod.fit(arr, use_c=True)
      8 print(cl)

File ~\anaconda3\Lib\site-packages\dtaidistance\clustering\kmeans.py:287, in KMeans.fit(self, series, use_c, use_parallel, monitor_distances)
    285 # Initialisations
    286 if self.initialize_with_kmeanspp:
--> 287     self.means = self.kmeansplusplus_centers(self.series, use_c=use_c)
    288 elif self.initialize_with_kmedoids:
    289     self.means = self.kmedoids_centers(self.series, use_c=use_c)

File ~\anaconda3\Lib\site-packages\dtaidistance\clustering\kmeans.py:208, in KMeans.kmeansplusplus_centers(self, series, use_c)
    206 # First center is chosen randomly
    207 idx = np.random.randint(0, len(series))
--> 208 min_dists = np.power(fn(series, block=((idx, idx + 1), (0, len(series)), False),
    209                         compact=True, **self.dists_options), 2)
    210 indices.append(idx)
    212 for k_idx in range(1, self.k):
    213     # Compute the distance between each series and the nearest center that has already been chosen.
    214     # (Choose one new series at random as a new center, using a weighted probability distribution)
    215     # Select several new centers and then greedily chose the one that decreases pot as much as possible

File ~\anaconda3\Lib\site-packages\dtaidistance\dtw_ndim.py:449, in distance_matrix_fast(s, ndim, max_dist, max_length_diff, window, max_step, penalty, psi, block, compact, parallel, only_triu)
    447 """Fast C version of :meth:`distance_matrix`."""
    448 _check_library(raise_exception=True, include_omp=parallel)
--> 449 return distance_matrix(s, ndim=ndim, max_dist=max_dist, max_length_diff=max_length_diff,
    450                        window=window, max_step=max_step, penalty=penalty, psi=psi,
    451                        block=block, compact=compact, parallel=parallel,
    452                        use_c=True, show_progress=False, only_triu=only_triu)

File ~\anaconda3\Lib\site-packages\dtaidistance\dtw_ndim.py:434, in distance_matrix(s, ndim, max_dist, use_pruning, max_length_diff, window, max_step, penalty, psi, block, compact, parallel, use_c, use_mp, show_progress, only_triu)
    430     raise Exception(f'Unsupported combination of: parallel={parallel}, '
    431                     f'use_c={use_c}, dtw_cc_omp={dtw_cc_omp}, use_mp={use_mp}')
    433 exp_length = _distance_matrix_length(block, len(s))
--> 434 assert len(dists) == exp_length, "len(dists)={} != {}".format(len(dists), exp_length)
    435 if compact:
    436     return dists

AssertionError: len(dists)=9 != 10

If I set use_c=False I do not get an error.
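
For readers wondering where the expected 10 in the assertion comes from: the k-means++ initialisation in the traceback requests the block `((idx, idx + 1), (0, len(series)), False)`, that is, one row against all 10 series. The sketch below counts the pairs such a block covers. The helper name `expected_block_length` is hypothetical and not part of dtaidistance, and reading the third block element as an "upper triangle only" flag is an assumption based on the `only_triu` parameter visible in the traceback.

```python
def expected_block_length(block, only_upper_triangle=False):
    """Count the (row, column) pairs covered by a block.

    `block` is ((row_begin, row_end), (col_begin, col_end)) with
    exclusive end indices. This is an illustrative helper, not the
    library's own `_distance_matrix_length`.
    """
    (r0, r1), (c0, c1) = block
    if not only_upper_triangle:
        # Full rectangle: every row is paired with every column.
        return (r1 - r0) * (c1 - c0)
    # Upper triangle only: keep pairs where column > row.
    return sum(1 for r in range(r0, r1) for c in range(max(c0, r + 1), c1))


n_series = 10
idx = 3  # any randomly chosen first center
block = ((idx, idx + 1), (0, n_series))
print(expected_block_length(block))
```

One row against 10 series gives 10 distances, which matches `exp_length` in the assertion. The C code path returned 9, one short of that.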

Amirparsa-Sal commented 2 months ago

The same issue happens for me with the exact same traceback. Have you figured out the cause of the problem?

wannesm commented 2 months ago

A bug in the parallelization triggered this error. It is fixed in commit 7ff5ef96b7b23dc5f19550f8b6957bb6e85d3459. You can test with the current master branch (this requires installing from git with Cython compilation). The fix will also be part of the next release, which we are preparing.
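
For anyone who cannot install from master yet, a minimal sketch of a stopgap, based only on the report above that `use_c=False` does not fail: try the C path first and retry in pure Python if the length assertion fires. The helper name `fit_with_fallback` is hypothetical, and the sketch assumes that calling `fit` a second time on the same model is safe.

```python
def fit_with_fallback(model, series):
    """Fit with the C implementation, falling back to pure Python.

    `model` is any object whose `fit(series, use_c=...)` behaves like
    the `KMeans.fit` call in the report. Only the AssertionError from
    this issue is caught; other exceptions propagate unchanged.
    """
    try:
        return model.fit(series, use_c=True)
    except AssertionError:
        # The affected versions fail the distance-matrix length check
        # on the C path; the pure Python path was reported to work.
        return model.fit(series, use_c=False)


# Usage, mirroring the original example:
#   import numpy as np
#   from dtaidistance.clustering import KMeans
#   arr = np.random.random((10, 10, 3))
#   cl, p = fit_with_fallback(KMeans(k=2), arr)
```

The pure Python path is slower, so this is only worth keeping until a release containing the fix is installed.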

Amirparsa-Sal commented 2 months ago

Thanks for your diligent follow-up. The issue is solved now.

wannesm commented 2 months ago

@Amirparsa-Sal Thanks for the feedback. @haggihaggi Thanks for the minimal, reproducible example.