tslearn-team / tslearn

The machine learning toolkit for time series analysis in Python
https://tslearn.readthedocs.io
BSD 2-Clause "Simplified" License
2.85k stars 336 forks

Can we use GPU and PySpark to improve clustering time for TimeSeriesKMeans? #414

Open sukhejai opened 2 years ago

sukhejai commented 2 years ago

Dear Dev Team,

@ecederstrand @rth @rflamary @apachaves @felixdivo

Can we use a GPU and PySpark to reduce the clustering time for TimeSeriesKMeans? I tried using n_jobs for parallel processing in Databricks, but clustering takes the same time on an 8-CPU machine as on a 32-CPU machine, so it clearly doesn't help.

Can you please suggest the best approach to reduce the clustering time?

Thanks, Ishwar Sukheja

sukhejai commented 2 years ago

Hi Team,

Any updates?

Thanks, Ishwar

apachaves commented 2 years ago

Hi @sukhejai, I'm not part of the dev team, but thank you for calling me here.

Have you tried monitoring the CPU usage? I know that in Databricks it might not be that simple, and I also know there is overhead in the JVM layers underneath, so I'm always cautious about what happens inside it.

May I suggest that you run a test with n_jobs outside Databricks, perhaps in a Jupyter notebook running locally on your machine? Then open the resource monitor to double-check that the parallelization is indeed happening and that all CPUs are being used.
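Something like the following could serve as a starting point for that test. It is only a sketch under assumptions: the synthetic random-walk data, the `metric="dtw"` setting, the cluster count, and the helper names `time_fits` and `run_tslearn_demo` are all made up for illustration and are not taken from the original report. As far as I understand, `n_jobs` in tslearn only affects the cross-distance computations, so it may have no visible effect with other metrics.

```python
import time


def time_fits(make_model, X, n_jobs_values=(1, -1)):
    """Fit a fresh model per n_jobs value and return {n_jobs: seconds}.

    make_model is a hypothetical factory: it takes n_jobs and returns an
    unfitted estimator with a fit(X) method.
    """
    timings = {}
    for n_jobs in n_jobs_values:
        model = make_model(n_jobs)
        start = time.perf_counter()
        model.fit(X)
        timings[n_jobs] = time.perf_counter() - start
    return timings


def run_tslearn_demo():
    """Compare wall-clock fit time for n_jobs=1 and n_jobs=-1 (all cores)."""
    # Imported here so time_fits stays usable without tslearn installed.
    from tslearn.clustering import TimeSeriesKMeans
    from tslearn.generators import random_walks

    # Assumed toy data: 200 univariate random walks of length 128.
    X = random_walks(n_ts=200, sz=128, d=1)

    def make_model(n_jobs):
        # Assumed settings; replace with the ones from the real workload.
        return TimeSeriesKMeans(
            n_clusters=3, metric="dtw", max_iter=5,
            n_jobs=n_jobs, random_state=0,
        )

    for n_jobs, seconds in time_fits(make_model, X).items():
        print(f"n_jobs={n_jobs}: {seconds:.1f} s")
```

Calling `run_tslearn_demo()` in a local notebook while watching the resource monitor should show whether more than one core is busy during the `n_jobs=-1` run.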

Finally, it would be nice to have the CPU usage information copied here, plus a small code snippet with the example you tried to run. I'm sure that with that, the dev team will be able to better narrow down the best solutions for you.

Hope it helps. I'm curious too.

Best, Anderson

justkrismanohar commented 1 year ago

Any update on this? I am currently trying to get this running on PySpark. @apachaves, can we get TimeSeriesKMeans to run with a PySpark DataFrame? The data I am working with is too big for pandas. Thanks.