maz369 opened this issue 4 years ago
I have been working with the library and recently found that TimeSeriesSVC().predict runs very slowly and requires a huge amount of memory. Is there a way around this issue? I am trying to make 100K predictions on 1D time series (each shorter than 100 values), and it requires more than 100 GB of memory and takes multiple days to produce a result.
Thank you
Hi @maz369
SVMs often work with kernel functions internally, which calculate the similarity between each pair of training samples. In your case, this comes down to a 100,000 x 100,000 kernel matrix: stored densely in float64, that alone is roughly 80 GB.
Nevertheless, we should probably look into a more memory-efficient representation of these pairwise similarities (e.g. by using sparse matrices).
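In the meantime, one way to keep predict-time memory bounded is to classify the test set in batches, so the kernel matrix built at prediction time has shape (batch, n_support_vectors) rather than (100_000, n_support_vectors). A minimal sketch (not part of the library), assuming `model` is an already fitted TimeSeriesSVC and the chunk size is an illustrative choice:

```python
import numpy as np

def predict_in_chunks(model, X, chunk=1_000):
    # Kernel evaluations at predict time involve (n_queries x n_support_vectors)
    # similarities, so batching the queries keeps that matrix small.
    # `model` is assumed to be a fitted tslearn TimeSeriesSVC.
    preds = [model.predict(X[i:i + chunk]) for i in range(0, len(X), chunk)]
    return np.concatenate(preds)
```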
Thank you for the explanation. It makes sense now.
Most welcome. I will reopen the issue for now, as I do believe it should be possible (in the future) to fit a dataset of 100K time series using more memory-efficient data structures (perhaps by looking at how sklearn handles larger datasets with its SVMs).
EDIT: Based on the documentation page of sklearn.svm.SVC, it seems they advise either using a linear model (LinearSVC) or the Nystroem kernel approximation for large datasets:
The implementation is based on libsvm. The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples. For large datasets consider using sklearn.svm.LinearSVC or sklearn.linear_model.SGDClassifier instead, possibly after a sklearn.kernel_approximation.Nystroem transformer.
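For illustration, here is a rough sketch of that advice in plain scikit-learn. It assumes the time series are flattened into fixed-length feature vectors, which swaps tslearn's GAK kernel for a generic RBF approximation, so it is not an equivalent model; the data below is a random stand-in:

```python
# Sketch of the sklearn advice quoted above, not tslearn's own API.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.kernel_approximation import Nystroem
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 100))   # stand-in for 100k series of length 100
y = rng.integers(0, 2, size=100_000)  # stand-in labels

clf = make_pipeline(
    Nystroem(kernel="rbf", n_components=300, random_state=0),  # low-rank kernel map
    LinearSVC(),                                               # linear SVM on the mapped features
)
clf.fit(X, y)  # memory stays O(n * n_components) instead of O(n^2)
print(clf.predict(X[:5]))
```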
BTW, it is not only the SVMs that need to be improved: DTW also runs out of memory if the input data has more than 100k samples. A sparse matrix would be a direct solution.
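A full pairwise DTW matrix for n = 100k is unavoidably n x n values (~40 GB even in float32), but computing it block by block at least bounds the working memory and lets blocks be streamed to disk. A rough sketch using tslearn's cdist_dtw; the block size, float32 storage, and the optional memory-mapped output are assumptions:

```python
import numpy as np
from tslearn.metrics import cdist_dtw

def blockwise_dtw(X, block_size=1_000, out=None):
    # For very large n, `out` would typically be an np.memmap on disk
    # rather than an in-memory array.
    n = len(X)
    if out is None:
        out = np.zeros((n, n), dtype=np.float32)
    for i in range(0, n, block_size):
        for j in range(i, n, block_size):
            block = cdist_dtw(X[i:i + block_size], X[j:j + block_size])
            out[i:i + block_size, j:j + block_size] = block
            out[j:j + block_size, i:i + block_size] = block.T  # DTW is symmetric
    return out
```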
Have you maybe found a way to speed up the training? I need to train on a dataset with over 100k samples and it is taking forever. Even computing the silhouette score for 10k samples takes forever.
Perhaps Kernel Approximation could speed it up: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.kernel_approximation (Nystroem seems most relevant)
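As for the silhouette score, a common workaround is to estimate it on a random subsample rather than the full dataset (sklearn's silhouette_score exposes the same idea via its sample_size parameter). A rough sketch using tslearn's DTW-based silhouette_score; the sample size is an assumption, and X is assumed to be a numpy time-series dataset:

```python
import numpy as np
from tslearn.clustering import silhouette_score

def approx_silhouette(X, labels, sample_size=2_000, seed=0):
    # Score a random subsample; cost drops from O(n^2) to O(sample_size^2)
    # DTW computations, at the price of some estimation noise.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    return silhouette_score(X[idx], np.asarray(labels)[idx], metric="dtw")
```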