tslearn-team / tslearn

The machine learning toolkit for time series analysis in Python
https://tslearn.readthedocs.io
BSD 2-Clause "Simplified" License

Error when using the Time Series K-means with data of variable length #283

Open falknerdominik opened 4 years ago

falknerdominik commented 4 years ago

I am using the TimeSeriesKMeans class to cluster simple time series data. The series have variable length, and I wanted to cluster them first:

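from tslearn.clustering import TimeSeriesKMeans
from tslearn.utils import to_time_series_dataset

# load data as pd.DataFrame
X = get_ts(...)
# convert to a tslearn dataset of shape (n_ts, sz, d)
data = to_time_series_dataset(X.values)

km = TimeSeriesKMeans(n_clusters=4, n_init=10, init='k-means++', metric='dtw')
km.fit(data)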

After running this, I get the following error (the same happens with other metrics, e.g. softdtw):

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

When I resample the data using TimeSeriesResampler(sz=80), it works.
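
One plausible reason resampling helps: to_time_series_dataset pads shorter series with NaN so that the whole dataset fits in one array, and any later step that works on the raw padded values then sees those NaN entries. The sketch below mimics this with plain NumPy. Note that `pad_with_nan` and `resample_linear` are hypothetical helpers written for illustration, not tslearn functions, and the linear interpolation only approximates what TimeSeriesResampler does.

```python
import numpy as np


def pad_with_nan(series_list):
    """Stack variable-length 1-D series into one 2-D array, padding with NaN."""
    max_len = max(len(s) for s in series_list)
    out = np.full((len(series_list), max_len), np.nan)
    for i, s in enumerate(series_list):
        out[i, :len(s)] = s
    return out


def resample_linear(series, sz):
    """Linearly interpolate one 1-D series onto sz evenly spaced points."""
    series = np.asarray(series, dtype=float)
    old_grid = np.linspace(0.0, 1.0, len(series))
    new_grid = np.linspace(0.0, 1.0, sz)
    return np.interp(new_grid, old_grid, series)


# Three toy series of different lengths (illustrative data only).
raw = [[1.0, 2.0, 3.0, 4.0], [1.0, 3.0], [0.0, 1.0, 2.0]]

# Padding keeps the original samples but introduces NaN entries.
padded = pad_with_nan(raw)
has_nan_padded = bool(np.isnan(padded).any())

# Resampling gives every series the same length, so no padding is needed.
resampled = np.stack([resample_linear(s, sz=5) for s in raw])
has_nan_resampled = bool(np.isnan(resampled).any())

print(padded.shape, has_nan_padded)
print(resampled.shape, has_nan_resampled)
```

After resampling, every series has the same number of samples, so code that rejects NaN input no longer has anything to complain about. The trade-off is that resampling changes the time axis of each series.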

rtavenar commented 4 years ago

Thanks for the bug report.

Could you please update to the latest tslearn version and let us know if you still experience the bug?

Dominik Falkner wrote:

  • OS: Windows 10
  • tslearn version: 0.3.1


falknerdominik commented 4 years ago

Thanks for the quick reply. Updated to version '0.4.1' and the problem still persists.

rtavenar commented 4 years ago

Could you please provide the full error message so that we can spot in which step the problem is happening?

falknerdominik commented 4 years ago

Sure. I used **** to mask the parts of the stack trace that contain project-specific code I am not allowed to share.

Traceback (most recent call last):
  File "****", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "****", line 85, in _run_code
    exec(code, run_globals)
  File "****", line 213, in <module>
    run()
  File "****\__main__.py", line 209, in run
    cluster_with_sequences(****)
  File "****\__main__.py", line 199, in cluster_with_sequences
    ****
  File "****\__main__.py", line 97, in compute_and_evaluate_model
    value = calc(data, estimator.labels_, **cvi.kwargs)
  File "****\lib\site-packages\tslearn\clustering.py", line 237, in silhouette_score
    **kwds)
  File "****\lib\site-packages\sklearn\metrics\cluster\_unsupervised.py", line 117, in silhouette_score
    return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
  File "****\lib\site-packages\sklearn\metrics\cluster\_unsupervised.py", line 213, in silhouette_samples
    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
  File "****\lib\site-packages\sklearn\utils\validation.py", line 755, in check_X_y
    estimator=estimator)
  File "****\lib\site-packages\sklearn\utils\validation.py", line 578, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "****\lib\site-packages\sklearn\utils\validation.py", line 60, in _assert_all_finite
    msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

rtavenar commented 4 years ago

It seems the error occurs during a call to silhouette_score.

If you could build a minimum working example, I could probably help further.



falknerdominik commented 4 years ago

Found a bug in my code.

Issue can be closed.

Huanle commented 3 years ago

Hi @falknerdominik, how did you fix the issue? Thanks.

falknerdominik commented 3 years ago

Hi @Huanle, I calculated the silhouette score using the Euclidean distance, which raises the error above because the time series did not have equal length. I was using a generic pipeline that started the process, so the stack trace did not really help.
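
To illustrate the explanation above: once a shorter series is padded with NaN, a point-wise Euclidean distance turns into NaN, while an elastic distance such as DTW can compare the unpadded series directly. The sketch below uses a textbook DTW written for illustration only. It is not tslearn's implementation, and `euclidean` and `dtw` are hypothetical helper names.

```python
import numpy as np


def euclidean(a, b):
    """Euclidean distance between two equal-length 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


def dtw(a, b):
    """Textbook DTW distance (squared-error local cost, no constraints)."""
    n, m = len(a), len(b)
    # acc[i, j] holds the best accumulated cost aligning a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(np.sqrt(acc[n, m]))


long_ts = np.array([1.0, 2.0, 3.0, 4.0])
short_ts = np.array([1.0, 3.0])

# Pad the shorter series with NaN so both have the same length.
short_padded = np.concatenate([short_ts, [np.nan, np.nan]])

# The NaN padding propagates through the sum of squared differences.
d_euclid = euclidean(long_ts, short_padded)

# DTW aligns the raw series, so their lengths may differ.
d_dtw = dtw(long_ts, short_ts)

print(np.isnan(d_euclid), d_dtw)
```

In tslearn itself, the analogous choice would presumably be passing metric="dtw" to tslearn.clustering.silhouette_score instead of metric="euclidean" when the series have different lengths.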

@rtavenar, maybe throw a better warning when the silhouette score is used with the Euclidean distance?
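
A minimal sketch of what such a check could look like, assuming the dataset is a NumPy array in which shorter series are padded with NaN. The function name `check_euclidean_compatible` and the error wording are hypothetical. This is not part of tslearn and not necessarily how the maintainers would implement it.

```python
import numpy as np


def check_euclidean_compatible(X, metric):
    """Raise a descriptive error if Euclidean distance is requested on NaN-padded data."""
    X = np.asarray(X, dtype=float)
    if metric == "euclidean" and np.isnan(X).any():
        raise ValueError(
            "metric='euclidean' requires equal-length time series, but the "
            "dataset contains NaN values (probably padding); resample to a "
            "common length or use an elastic metric such as 'dtw'."
        )
    return X


# Two series of different lengths, the second one padded with NaN.
padded = np.array([[1.0, 2.0, 3.0], [1.0, 3.0, np.nan]])

try:
    check_euclidean_compatible(padded, metric="euclidean")
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

Running the check before the distance computation would surface the real cause at the call site instead of deep inside scikit-learn's input validation.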

Huanle commented 3 years ago

Thanks @falknerdominik, this makes sense to me.