rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] Add TimeSeriesSplit #6002

Open ZeroCool2u opened 3 months ago

ZeroCool2u commented 3 months ago

Is your feature request related to a problem? Please describe.

I would like to use cuML for time series problems, especially ones that require time-aware train/test splits. Today that means relying on the sklearn TimeSeriesSplit object.

Describe the solution you'd like

I would like a cuML equivalent of sklearn's TimeSeriesSplit class that can be used directly as part of a cuML Pipeline object and with the cross_val_score method.

Describe alternatives you've considered

I could reimplement this myself from scratch, but that would be error-prone and risky: incorrect time series splitting is a common source of data leakage in ML problems.
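To make the leakage concern concrete, here is a minimal sketch of the expanding-window scheme in plain Python. It is not the sklearn implementation: the helper name `expanding_window_splits` is hypothetical, and it assumes sklearn's default sizing as I understand it (test size of `n_samples // (n_splits + 1)`, no gap, no cap on training size). The property that matters is that every training index precedes every test index.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(
    n_samples: int, n_splits: int = 5
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for an expanding-window split.

    Hypothetical helper for illustration; not part of cuML or sklearn.
    Assumes rows are already sorted in time order.
    """
    if n_splits < 1:
        raise ValueError("n_splits must be at least 1")
    # Assumed default sizing: each test fold gets an equal share of the data.
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    # The test folds occupy the tail of the data, back to back.
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        # Train on everything strictly before the test fold (no look-ahead).
        train = list(range(0, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


if __name__ == "__main__":
    for train, test in expanding_window_splits(n_samples=12, n_splits=3):
        print(f"train={train[0]}..{train[-1]}  test={test[0]}..{test[-1]}")
```

The index lists could then be used to slice time-ordered data before fitting an estimator on each fold. Edge cases such as gaps between train and test, or a capped training window, are not handled here, which is part of why a maintained implementation is preferable.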

Additional context

I have an SVR model that takes ~8 minutes to train per split. My dataset has ~1 million observations, and I need to train across thousands of splits, so my total runtime is roughly 8 min * N, where N is large. On a system with a 7950X3D running 32 processes and 32 GB of 6000 MT/s RAM, cross-validation using TimeSeriesSplit ran for more than 5 days. Running the SVR model on the same system using cuML with an RTX 3090 (via WSL2) cut per-split training to less than 30 seconds. However, I cannot fully migrate to cuML without a TimeSeriesSplit implementation.

dantegd commented 3 months ago

Thanks for the issue! We will look into it. SVMs in general are an area of personal interest, so I would love to see this here.