h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Purging and embargoing to deal with unintended data leaks in cross validation. #6532

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

These approaches are often used in financial ML, but they can benefit a wide variety of ML tasks.

In short: add a safety gap between the k folds, or between the train, validation and test splits, so that observations close to a split boundary cannot leak information across it.
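As a rough illustration of the idea (not an existing H2O feature), here is a minimal sketch of k-fold splitting with a gap on both sides of each test fold. It assumes the rows are already sorted by time and simplifies purging to a fixed number of rows; the function name `purged_kfold` and the parameters `purge` and `embargo` are made up for this example.

```python
from typing import Iterator, List, Tuple


def purged_kfold(
    n_samples: int, n_splits: int = 5, purge: int = 0, embargo: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for time-ordered rows 0..n_samples-1.

    Each test fold is a contiguous block. The `purge` rows directly before
    the block and the `embargo` rows directly after it are left out of the
    training set (hypothetical, simplified row-count version of the idea).
    """
    if n_splits < 2 or n_splits > n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    start = 0
    for fold in range(n_splits):
        # Spread the remainder over the first folds so sizes differ by at most 1.
        size = n_samples // n_splits + (1 if fold < n_samples % n_splits else 0)
        stop = start + size
        test = list(range(start, stop))
        train = [
            i for i in range(n_samples)
            if i < start - purge or i >= stop + embargo
        ]
        yield train, test
        start = stop


if __name__ == "__main__":
    for train, test in purged_kfold(20, n_splits=4, purge=1, embargo=2):
        print("test", test[0], "-", test[-1], "| train size", len(train))
```

In the financial ML literature, purging is usually defined through the time span of each label rather than a fixed row count, so a real implementation would need the label start and end times as well.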

These articles explain it in detail:

https://medium.com/mlearning-ai/why-k-fold-cross-validation-is-failing-in-finance-65c895e83fdf

https://blog.quantinsti.com/cross-validation-embargo-purging-combinatorial/

The Combinatorial Purged Cross Validation mentioned there (it is explained a little better here: https://towardsai.net/p/l/the-combinatorial-purged-cross-validation-method) helps create more walk-forward paths that are purely out-of-sample, for increased statistical significance. It was proposed by Marcos Lopez de Prado in “Advances in Financial Machine Learning”.
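To make the combinatorial variant more concrete, here is a hedged sketch of how the splits could be enumerated: the time-ordered rows are cut into N contiguous groups, every combination of k groups serves as a test set once, and rows near each test group are dropped from training. The function names and the row-count `purge` / `embargo` parameters are assumptions made for this example, not part of H2O or of any particular library.

```python
from itertools import combinations
from math import comb
from typing import Iterator, List, Tuple


def combinatorial_purged_splits(
    n_samples: int,
    n_groups: int = 6,
    n_test_groups: int = 2,
    purge: int = 0,
    embargo: int = 0,
) -> Iterator[Tuple[List[int], List[int], Tuple[int, ...]]]:
    """Yield (train, test, test_groups) for every choice of test groups."""
    # Contiguous group boundaries over the time-ordered rows.
    bounds = [
        (g * n_samples // n_groups, (g + 1) * n_samples // n_groups)
        for g in range(n_groups)
    ]
    for test_groups in combinations(range(n_groups), n_test_groups):
        test: List[int] = []
        excluded = set()
        for g in test_groups:
            start, stop = bounds[g]
            test.extend(range(start, stop))
            # Exclude the test rows plus the purge/embargo buffer around them.
            excluded.update(
                range(max(0, start - purge), min(n_samples, stop + embargo))
            )
        train = [i for i in range(n_samples) if i not in excluded]
        yield train, test, test_groups


def n_backtest_paths(n_groups: int, n_test_groups: int) -> int:
    """Number of full out-of-sample paths: C(N, k) * k / N."""
    return comb(n_groups, n_test_groups) * n_test_groups // n_groups


if __name__ == "__main__":
    splits = list(combinatorial_purged_splits(60, 6, 2, purge=1, embargo=2))
    print(len(splits), "splits,", n_backtest_paths(6, 2), "backtest paths")
```

With N = 6 groups and k = 2 test groups this gives 15 train/test splits, which can be recombined into 5 complete out-of-sample paths, compared with the single path of a plain walk-forward split.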

It would be great to have this out of the box, or to be able to pass in the cross-validation folds / indices with gaps.
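Until something like that exists, one possible workaround is sketched below: build a per-row fold ID for time-ordered data and mark the rows at each fold boundary as a gap. The helper name `fold_ids_with_gaps` and the `-1` gap marker are invented for this example. The assumption is that the gap rows would then be filtered out of the frame and the remaining IDs passed through H2O's existing `fold_column` parameter.

```python
from typing import List


def fold_ids_with_gaps(n_samples: int, n_splits: int = 5, gap: int = 1) -> List[int]:
    """Return one fold ID per time-ordered row, with -1 marking gap rows.

    The last `gap` rows of every fold except the final one are treated as a
    buffer between neighbouring folds (hypothetical convention).
    """
    ids: List[int] = []
    for fold in range(n_splits):
        start = fold * n_samples // n_splits
        stop = (fold + 1) * n_samples // n_splits
        for i in range(start, stop):
            in_gap = fold < n_splits - 1 and i >= stop - gap
            ids.append(-1 if in_gap else fold)
    return ids


if __name__ == "__main__":
    print(fold_ids_with_gaps(12, n_splits=3, gap=1))
```

This is only an approximation: the gap rows are removed from training and testing alike, whereas true purging and embargoing drop them only from the training set of the neighbouring test fold. That difference is why native support, or the ability to pass explicit train/test index pairs, would still be useful.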

h2o-ops commented 1 year ago

JIRA Issue Details

Jira Issue: PUBDEV-8858
Assignee: New H2O Bugs
Reporter: N/A
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A