tslearn-team / tslearn

The machine learning toolkit for time series analysis in Python
https://tslearn.readthedocs.io
BSD 2-Clause "Simplified" License

Memory issue with TimeSeriesKMeans and DTW for large dataset #350

Open bfocassio opened 3 years ago

bfocassio commented 3 years ago

Describe the bug
I'm trying to cluster a large dataset of time series using TimeSeriesKMeans with the dtw metric. However, the fitting process is killed due to memory issues.

Inspired by this post, I decided to track the memory consumption of the clustering. In the MWE, I'm using the track_memory decorator (here).

The data itself occupies roughly 0.6 MB and the trained model about 20 MB, yet the fitting process peaks at more than 500 MB of memory. For my real dataset of roughly 400 MB, training is impracticable.

Is there any workaround? Can someone help me to reduce this peak in memory usage?

The number of memory peaks is proportional to the number of k-means iterations. Using the Euclidean metric instead of dtw avoids the memory problem, but that metric is not appropriate for my original dataset.
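One possible mitigation (a sketch only, not something proposed in this thread): each pairwise DTW computation allocates an accumulated-cost matrix that grows quadratically with series length, so shortening the series before clustering shrinks that per-pair allocation. A minimal version using tslearn's TimeSeriesResampler, assuming those quadratic cost matrices are indeed the dominant contributor to the peak:

# Hypothetical workaround sketch: downsample before clustering so that each
# DTW accumulated-cost matrix (roughly sz x sz floats) stays small.
from tslearn.generators import random_walks
from tslearn.preprocessing import TimeSeriesResampler
from tslearn.clustering import TimeSeriesKMeans

X = random_walks(n_ts=42, sz=1600, d=1)

# Resampling from 1600 to 400 points cuts each cost matrix by a factor of ~16.
X_short = TimeSeriesResampler(sz=400).fit_transform(X)

km = TimeSeriesKMeans(n_clusters=5, metric="dtw", n_init=1, random_state=0)
labels = km.fit_predict(X_short)

This trades temporal resolution for memory, so it may or may not be acceptable for the original dataset.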

To Reproduce

MWE:

# basic
import sys
import numpy as np

# matplotlib
import matplotlib.pyplot as plt

# better plots
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('retina')
plt.style.use('bmh')

# importing custom memory tracking decorator
from track_memory import track_memory_use, plot_memory_use

from tslearn.clustering import TimeSeriesKMeans
from tslearn.generators import random_walks

import time

import pandas as pd

@track_memory_use(close=False, return_history=True)
def train_data():

    # reading train data
    df_train = random_walks(n_ts=42, sz=1600, d=1)
    time.sleep(0.001)

    return df_train

# executing
df_train, mem_history_1 = train_data()
plt.show()

@track_memory_use(close=False, return_history=True)
def fit_model(train_data):

    # fitting model
    est = TimeSeriesKMeans(n_clusters=5, metric="dtw", n_init=1)
    est.fit(train_data)
    time.sleep(0.01)

    return est

model, mem_history_2 = fit_model(df_train)
plt.show()

print('Number of iterations: ',model.n_iter_)

@track_memory_use(close=True, return_history=True)
def predict_labels(test_data,model):

    y_test = model.predict(test_data)
    cluster_centers = model.cluster_centers_
    inertia = model.inertia_
    nclusters = cluster_centers.shape[0]
    time.sleep(0.01)

    return {'y_pred': y_test, 'cluster_centers': cluster_centers, 'inertia': inertia}

model_results, mem_history_3 = predict_labels(df_train,model)
plt.show()

print('Inertia: ',model_results['inertia'])

# putting memory usage together
total_mem_use = pd.concat([
    pd.DataFrame({'history': mem_history_1, 'step': 'train_data', 'color': 'red', 'offset':0}),
    pd.DataFrame({'history': mem_history_2, 'step': 'fit_model', 'color': 'green', 'offset':len(mem_history_1)}),
    pd.DataFrame({'history': mem_history_3, 'step': 'predict_labels', 'color': 'blue', 'offset':len(mem_history_1)+len(mem_history_2)})
])

# plotting
plt.figure(figsize=(10,3), dpi=120)
for step, group in total_mem_use.groupby('step'):
    plot_memory_use(history=group['history'].values, 
                    fn_name='Complete pipeline', 
                    open_figure=False, 
                    offset=group['offset'].unique(),
                    color=group['color'], 
                    label=step)

plt.show()

Environment:

rtavenar commented 3 years ago

Hi @bfocassio

This should definitely be investigated. I am already aware that our implementation causes an increase in memory usage when time series of different lengths are involved (since we cast them to a single numpy array whose size is that of the longest time series), but there might be other issues.

If anyone has time to work on that, I think it would be highly valuable for tslearn.
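
A minimal illustration of the casting behaviour described above, assuming tslearn.utils.to_time_series_dataset is the step in question: variable-length series are stacked into a single array whose second dimension is the length of the longest series, with NaN padding, so one very long series inflates the storage of every other series.

# Illustration of padding to the longest series length.
import numpy as np
from tslearn.utils import to_time_series_dataset

short = np.arange(10, dtype=float)       # length 10
long_ = np.arange(10_000, dtype=float)   # length 10 000

X = to_time_series_dataset([short, long_])
print(X.shape)                 # (2, 10000, 1): the short series is padded up to 10 000
print(np.isnan(X[0]).sum())    # 9990 NaN padding values allocated for the short series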

JoaquinDF-UniLU commented 2 years ago

Hello @rtavenar and @bfocassio, I have the same problem. It looks like memory usage explodes when I use dtw. I have 300 univariate time series with close to 9,000 observations each, and from my experiments I would need more than 256 GB of RAM, possibly much more.

Best regards,
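
For scale, a rough back-of-envelope for the 300 x 9000 case above, assuming each DTW call allocates a full float64 accumulated-cost matrix (an assumption about the implementation, not something confirmed in this thread):

# Rough estimate: memory for one DTW accumulated-cost matrix between two
# series of length 9000, stored as float64.
sz = 9000
bytes_per_matrix = (sz + 1) ** 2 * 8
print(f"{bytes_per_matrix / 1024**3:.2f} GiB per pairwise DTW")  # ~0.60 GiB

With parallel pairwise distance computations and DBA barycenter updates each holding their own matrices, many such allocations can be live at once, which is one plausible route to the very large peaks reported here.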

CMagnoDFB commented 1 year ago

Same issue here. Large dataset, TimeSeriesKMeans and dtw.