tslearn-team / tslearn

The machine learning toolkit for time series analysis in Python
https://tslearn.readthedocs.io
BSD 2-Clause "Simplified" License

TimeSeriesSVC prediction is very slow #289

Open maz369 opened 4 years ago

maz369 commented 4 years ago

I have been working with the library and recently found that TimeSeriesSVC().predict runs very slowly and requires a huge amount of memory. Can you please let me know if there is a way around this issue? I am trying to make 100K predictions on 1D time series (each time series is shorter than 100 values), and it requires more than 100 GB of memory and takes multiple days to get the result.

Thank you

GillesVandewiele commented 4 years ago

Hi @maz369

SVMs often work with kernel functions internally, which calculate the similarity between each pair of training samples. In your case, this comes down to a 100_000 x 100_000 matrix.

Nevertheless, we should probably look into a more memory-efficient representation of these pairwise similarities (e.g. by using sparse matrices).
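For a rough sense of scale, storing that dense kernel matrix in double precision alone is on the order of 80 GB. A back-of-the-envelope estimate (not a measurement of tslearn's internals):

```python
# Back-of-the-envelope memory estimate for a dense 100_000 x 100_000
# kernel matrix stored as float64 (8 bytes per entry).
n_samples = 100_000
bytes_per_entry = 8
matrix_bytes = n_samples * n_samples * bytes_per_entry
print(f"{matrix_bytes / 1e9:.0f} GB")  # -> 80 GB
```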

maz369 commented 4 years ago

Thank you for the explanation. It makes sense now.

GillesVandewiele commented 4 years ago

Most welcome. I will reopen the issue for now, as I do believe it should be possible (in the future) to fit a dataset of 100K time series by using more memory-efficient data structures (perhaps take a look at how sklearn handles larger datasets with their SVMs).

EDIT: Based on the doc page of sklearn.svm.SVC, it seems they advise either using no kernel (LinearSVC) or the Nystroem kernel approximation for large datasets:

The implementation is based on libsvm. The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples. For large datasets consider using sklearn.svm.LinearSVC or sklearn.linear_model.SGDClassifier instead, possibly after a sklearn.kernel_approximation.Nystroem transformer.
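As a rough sketch of what that advice could look like (not tslearn's API): for fixed-length series one could flatten each series into an ordinary feature vector and feed it through sklearn's Nystroem + LinearSVC pipeline instead of TimeSeriesSVC. The data shapes and parameters below are illustrative assumptions, and this drops the GAK kernel entirely in favour of an approximated RBF kernel:

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Illustrative data: 100_000 univariate series of length 100, flattened
# into plain feature vectors (hypothetical, just to show the shapes).
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 100))
y = rng.integers(0, 2, size=100_000)

# Nystroem approximates the kernel with n_components landmark samples,
# so the kernel matrix never grows beyond n_samples x n_components.
clf = make_pipeline(
    Nystroem(kernel="rbf", n_components=300, random_state=0),
    LinearSVC(),
)
clf.fit(X, y)
print(clf.score(X, y))
```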
PercyLau commented 4 years ago

BTW, not only does the SVM need to be improved; DTW also runs out of memory if the input data has more than 100k samples. A sparse matrix is a direct solution.

StankovskiA commented 2 years ago

Have you maybe found a way to speed up the training? I need to train on a dataset with over 100k samples and it is taking forever. Even the silhouette score for 10k samples takes forever...

GillesVandewiele commented 2 years ago

Perhaps Kernel Approximation could speed it up: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.kernel_approximation (Nystroem seems most relevant)
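If the silhouette score specifically is the bottleneck, one workaround is to score only a random subsample. This is a sketch using sklearn's silhouette_score and its sample_size argument on flattened, fixed-length series (the data and labels below are hypothetical, and tslearn's own silhouette helper may differ):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Illustrative data: 10_000 flattened series of length 100 with
# hypothetical cluster labels obtained elsewhere.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 100))
labels = rng.integers(0, 5, size=10_000)

# sample_size restricts the pairwise-distance computation to a random
# subsample instead of all 10_000 x 10_000 pairs.
score = silhouette_score(X, labels, sample_size=1_000, random_state=0)
print(score)
```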