tslearn-team / tslearn

The machine learning toolkit for time series analysis in Python
https://tslearn.readthedocs.io
BSD 2-Clause "Simplified" License

KMeans question #269

Open ssimontacchi opened 4 years ago

ssimontacchi commented 4 years ago

Hi, Thanks for the awesome library!

So I am running a Kmeans on lots of different datasets, which all have roughly four shapes, so I initialize with those shapes and it works well, except for just a few times. There are a few datasets that look different enough that I end up with empty clusters and the algorithm just hangs ("Resumed because of empty cluster" again and again).

I conceptually understand why this happens, but is there any way you know to avoid it, or finish at least? I'm not sure I understand what's going on behind the scenes well enough to debug any further. Thank you!

GillesVandewiele commented 4 years ago

Hi @ssimontacchi,

What you are describing is a common problem with KMeans (not only in the custom variant for time series; the scikit-learn variant has the same issue). For that reason, KMeans is often run multiple times with different random initializations, and a score such as silhouette_score is then used to decide which of those random restarts gave the best clustering.

On the other hand, the algorithm definitely should not hang indeed... I'll label this with "bug" for now!
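A minimal sketch of the restart strategy described above, assuming plain Euclidean k-means on equal-length series flattened to vectors rather than tslearn's own implementation. The helpers `lloyd_kmeans`, `silhouette`, and `best_of_restarts` are hypothetical names introduced for illustration only. An initialization that produces an empty cluster is simply skipped.

```python
import numpy as np


def lloyd_kmeans(X, k, rng, max_iter=50):
    """Plain Euclidean k-means; returns labels, or None if a cluster empties."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if len(np.unique(new_labels)) < k:
            return None  # empty cluster: give up on this initialization
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        centers = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
    return labels


def silhouette(X, labels):
    """Mean silhouette coefficient using Euclidean distances."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # singleton clusters score 0 by convention
        a = dists[i, same].sum() / (same.sum() - 1)
        b = min(
            dists[i, labels == c].mean()
            for c in np.unique(labels)
            if c != labels[i]
        )
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores.mean()


def best_of_restarts(X, k, n_restarts=10, seed=0):
    """Run several random initializations and keep the best-scoring labels."""
    best_score, best_labels = -np.inf, None
    for r in range(n_restarts):
        labels = lloyd_kmeans(X, k, np.random.default_rng(seed + r))
        if labels is None:
            continue  # this restart hit an empty cluster
        score = silhouette(X, labels)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_score, best_labels


# Toy data: two well-separated groups of length-8 series.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 8)),
    rng.normal(5.0, 0.1, size=(20, 8)),
])
score, labels = best_of_restarts(X, k=2)
```

With tslearn itself, the analogous loop would fit `TimeSeriesKMeans` with different `random_state` values and compare the results with `tslearn.clustering.silhouette_score`.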

GillesVandewiele commented 4 years ago

Would it be possible to construct some minimal example with a small dataset? This would help a lot to debug

rtavenar commented 4 years ago

One important question is: which init method are you using? Then, Gilles, what would you suggest doing when a cluster is empty? Does anyone know what sklearn does in this case?

GillesVandewiele commented 4 years ago

Good question... I think we need a check that the total number of unique clusters is equal to the specified number of clusters. If that is not the case, a warning should be raised that the number of clusters is probably not set well (and perhaps it should also display the number of clusters with the highest silhouette that we found over the random initializations).
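The check proposed above could look roughly like the following sketch. `check_cluster_count` is a hypothetical helper, not part of tslearn, and the warning text is only a suggestion.

```python
import warnings


def check_cluster_count(labels, n_clusters):
    """Warn if fewer than n_clusters distinct labels were assigned."""
    n_found = len(set(labels))
    if n_found < n_clusters:
        warnings.warn(
            f"Only {n_found} of {n_clusters} clusters are non-empty; "
            "n_clusters may be set too high for this dataset.",
            RuntimeWarning,
        )
    return n_found


# Example: three clusters requested, but label 2 was never assigned.
n_found = check_cluster_count([0, 1, 1, 0, 1], n_clusters=3)
```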

Sklearn will just assign some random values to the cluster in case there is an empty cluster apparently. (source)

rtavenar commented 4 years ago

> Sklearn will just assign some random values to the cluster in case there is an empty cluster apparently. (source)

In the link you provide, they state that it's not chosen at random btw:

> A problem with k-means is that one or more clusters can be empty. However, this problem is accounted for in the current k-means implementation in scikit-learn. If a cluster is empty, the algorithm will search for the sample that is farthest away from the centroid of the empty cluster. Then it will reassign the centroid to be this farthest point.

We'll have to check.
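For reference, here is a sketch of the strategy exactly as the quoted passage words it. It has not been checked against scikit-learn's source, so it should not be read as what sklearn actually does. `relocate_empty_clusters` is a hypothetical helper. It also does not guard against the moved sample having been the only member of its previous cluster.

```python
import numpy as np


def relocate_empty_clusters(X, centers, labels):
    """Move the centroid of each empty cluster onto the sample farthest from it."""
    centers = centers.copy()
    labels = labels.copy()
    for j in range(len(centers)):
        if np.any(labels == j):
            continue  # cluster j still has members
        # Distance of every sample to the stale centroid of the empty cluster.
        dists = np.linalg.norm(X - centers[j], axis=1)
        farthest = int(dists.argmax())
        centers[j] = X[farthest]
        labels[farthest] = j
    return centers, labels


X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers = np.array([[0.0, 0.5], [10.0, 10.5], [5.0, 5.0]])
labels = np.array([0, 0, 1, 1])  # cluster 2 is empty
new_centers, new_labels = relocate_empty_clusters(X, centers, labels)
```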

ssimontacchi commented 4 years ago

Hi, I have tried to make a minimal example but am having trouble recreating the problem (and unfortunately I can't share the datasets that trigger it). I believe the example in the colab shows the general situation, though.

Again, it should be pretty fast on these datasets but just seems to hang. My guess of what is happening is that it's reassigning empty clusters indefinitely, but I'm not sure. Here is what the log looks like:

WARNING: QApplication was not created in the main() thread.
Resumed because of empty cluster
Resumed because of empty cluster
Resumed because of empty cluster
Resumed because of empty cluster
Resumed because of empty cluster
Resumed because of empty cluster
Resumed because of empty cluster
Resumed because of empty cluster
Resumed because of empty cluster
Resumed because of empty cluster

It only shows "Resumed because of empty cluster" 10 times though, and I'm not sure why that would happen either if it were failing to reassign clusters infinitely.

I'm trying to figure out what's different about this data and I'll let you know when I figure it out. Thanks!

ssimontacchi commented 4 years ago

I'm really having a hard time getting it to hang, but I found that many, though not all, of the datasets that had this problem contain tiny amounts of data. If you run it with only a few data points you will definitely get lots of empty cluster messages.

I guess the question is figuring out if there is some condition that could cause it not to converge?
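One condition that can be ruled out before fitting: identical series always land in the same cluster under hard assignment, so the number of distinct series is an upper bound on the number of non-empty clusters. A small sketch, assuming equal-length series without NaN values; `max_nonempty_clusters` is a hypothetical helper:

```python
import numpy as np


def max_nonempty_clusters(X):
    """Number of distinct series: an upper bound on non-empty clusters."""
    flat = np.asarray(X, dtype=float).reshape(len(X), -1)
    return len(np.unique(flat, axis=0))


# Three series of length 4, two of them identical.
X = np.array([
    [0.0, 1.0, 2.0, 3.0],
    [0.0, 1.0, 2.0, 3.0],
    [5.0, 5.0, 5.0, 5.0],
])[:, :, None]  # shape (n_series, length, 1)
upper_bound = max_nonempty_clusters(X)
```

If `upper_bound` is smaller than `n_clusters`, at least one cluster must end up empty regardless of the initialization.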

felipeffm commented 2 years ago

Columns with np.nan values produce the same message ("Resumed because of empty cluster").
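A quick way to find the affected series before fitting is sketched below; `series_with_nans` is a hypothetical helper. Note that tslearn pads variable-length series with NaN, so trailing NaN values may be intentional. This sketch flags every series that contains any NaN.

```python
import numpy as np


def series_with_nans(X):
    """Return indices of series that contain at least one NaN value."""
    X = np.asarray(X, dtype=float)
    flat = X.reshape(len(X), -1)
    return np.flatnonzero(np.isnan(flat).any(axis=1))


X = np.array([
    [1.0, 2.0, 3.0],
    [1.0, np.nan, 3.0],
    [4.0, 5.0, 6.0],
])
bad = series_with_nans(X)
```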

jccarrascog commented 1 year ago

Hi everyone! First of all, thanks for creating such an amazing package. I am struggling with an error on my dataset and I hope that someone can help me out with this.

I am getting the following error: "ValueError: cannot reshape array of size 0 into shape (0,newaxis)". The full traceback can be found below.

Code that generates the output:

`from tslearn.clustering import TimeSeriesKMeans, KShape, silhouette_score

km_euc = TimeSeriesKMeans(n_clusters=2, max_iter=5, metric="dtw", random_state=0).fit(tserie)
labels_euc = km_euc.labels_`

Error:

`---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_19116/338135866.py in <module>
      3 start_time = time.time()
      4
----> 5 km_euc = TimeSeriesKMeans(n_clusters=2, max_iter=5,metric="dtw", random_state=0).fit(tserie)
      6 labels_euc = km_euc.labels_

~\Anaconda3\lib\site-packages\tslearn\clustering\kmeans.py in fit(self, X, y)
    778 print("Init %d" % (n_successful + 1))
    779 n_attempts += 1
--> 780 self._fit_one_init(X_, x_squared_norms, rs)
    781 if self.inertia_ < min_inertia:
    782 best_correct_centroids = self.cluster_centers_.copy()

~\Anaconda3\lib\site-packages\tslearn\clustering\kmeans.py in _fit_one_init(self, X, x_squared_norms, rs)
    656 raise ValueError("Value %r for parameter 'init'"
    657 "is invalid" % self.init)
--> 658 self.cluster_centers_ = _check_full_length(self.cluster_centers_)
    659 old_inertia = numpy.inf
    660

~\Anaconda3\lib\site-packages\tslearn\clustering\utils.py in _check_full_length(centroids)
     42 """
     43 resampler = TimeSeriesResampler(sz=centroids.shape[1])
---> 44 return resampler.fit_transform(centroids)
     45
     46

~\Anaconda3\lib\site-packages\tslearn\preprocessing\preprocessing.py in fit_transform(self, X, y, **kwargs)
     69 Resampled time series dataset.
     70 """
---> 71 return self.fit(X).transform(X)
     72
     73 def transform(self, X, y=None, **kwargs):

~\Anaconda3\lib\site-packages\tslearn\preprocessing\preprocessing.py in transform(self, X, y, **kwargs)
     95 sz = ts_size(X_[i])
     96 for di in range(d):
---> 97 f = interp1d(numpy.linspace(0, 1, sz), X_[i, :sz, di],
     98 kind="slinear")
     99 X_out[i, :, di] = f(xnew)

~\Anaconda3\lib\site-packages\scipy\interpolate\interpolate.py in __init__(self, x, y, kind, axis, copy, bounds_error, fill_value, assume_sorted)
    472 # Interpolation goes internally along the first axis
    473 self.y = y
--> 474 self._y = self._reshape_yi(self.y)
    475 self.x = x
    476 del y, x # clean up namespace to prevent misuse; use attributes

~\Anaconda3\lib\site-packages\scipy\interpolate\polyint.py in _reshape_yi(self, yi, check)
    108 self._y_extra_shape[:-self._y_axis])
    109 raise ValueError("Data must be of shape %s" % ok_shape)
--> 110 return yi.reshape((yi.shape[0], -1))
    111
    112 def _set_yi(self, yi, xi=None, axis=None):

ValueError: cannot reshape array of size 0 into shape (0,newaxis)`
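The traceback ends inside `interp1d` on the slice `X_[i, :sz, di]`, and the error message says that slice had size 0, which suggests `sz` was 0 for one of the initial centroids. One possible cause, which is an assumption and not confirmed in this thread, is an input series with no valid (non-NaN) timesteps. The sketch below reports the number of timesteps per series that remain once trailing all-NaN timesteps are dropped; `valid_lengths` is a hypothetical helper written with NumPy only.

```python
import numpy as np


def valid_lengths(X):
    """Per-series length after dropping trailing timesteps that are all NaN."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 2:
        X = X[:, :, None]  # treat as univariate series
    n_series, sz = X.shape[0], X.shape[1]
    if sz == 0:
        return np.zeros(n_series, dtype=int)
    has_value = ~np.isnan(X).all(axis=2)  # shape (n_series, sz)
    # Position just past the last timestep holding any value.
    last = sz - np.argmax(has_value[:, ::-1], axis=1)
    return np.where(has_value.any(axis=1), last, 0)


X = np.array([
    [1.0, 2.0, np.nan],
    [np.nan, np.nan, np.nan],
    [1.0, 2.0, 3.0],
])
lengths = valid_lengths(X)
empty_series = np.flatnonzero(lengths == 0)
```

Any index listed in `empty_series` would be worth inspecting or removing before calling `fit`.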