h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Add `id_column` to all algos, grid & AutoML for proper CV fold stratification #11327

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Let's add an id_column argument to all our algorithms, grid search and AutoML functions. Right now, if you have pooled repeated-measures data (one ID/person/cluster contributes multiple rows to the training set), the only way to guarantee that all rows belonging to a single ID land in a single fold is to use the fold_column argument. If the fold partitioning does not keep each ID's rows together, we get data leakage across folds. The user-specified fold_column method requires users to code the stratification-by-ID themselves, which is a pain.
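To illustrate the manual workaround described above, here is a minimal sketch of building a fold assignment by hand so that every row with the same ID gets the same fold. The helper name `assign_folds_by_id` is hypothetical and not part of H2O; it uses a simple round-robin over IDs in order of first appearance rather than a randomized split.

```python
def assign_folds_by_id(ids, nfolds=5):
    """Return one fold index per row; rows sharing an ID share a fold.

    Hypothetical helper for illustration only (not an H2O function).
    IDs are assigned to folds round-robin in order of first appearance,
    which balances the number of IDs per fold, not the number of rows.
    """
    fold_of_id = {}
    for row_id in ids:
        if row_id not in fold_of_id:
            fold_of_id[row_id] = len(fold_of_id) % nfolds
    return [fold_of_id[row_id] for row_id in ids]


# Example: person "a" contributes rows 0 and 2, person "b" rows 1 and 4.
person_ids = ["a", "b", "a", "c", "b", "d"]
folds = assign_folds_by_id(person_ids, nfolds=3)
```

The resulting values would then be added to the training frame as a new column and its name passed through the existing fold_column argument, which is the extra step an id_column argument would remove.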

Currently, there is a "Stratified" option in fold_assignment, but it only stratifies by the response column (classification only), so that each fold gets roughly the same class distribution.

When id_column is specified, it will automatically trigger stratification-by-ID whenever cross-validation is used. Let's think about whether we want to force the user to also specify fold_assignment = "Stratified", or whether specifying id_column should be enough. We will need to handle the case where id_column is specified and fold_assignment is set to something other than "AUTO" or "Stratified".
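One possible way to handle the argument combinations is sketched below. This is only an assumption about a policy the issue leaves open: the function `resolve_fold_strategy`, its return labels, and the choice to raise errors on conflicts are all hypothetical and do not reflect existing H2O behavior.

```python
def resolve_fold_strategy(id_column=None, fold_column=None, fold_assignment="AUTO"):
    """Decide how folds are built from the three arguments.

    Hypothetical policy for illustration only:
    - no id_column: keep the existing behavior (fold_column wins if given)
    - id_column plus fold_column: ambiguous, so reject
    - id_column plus "Random"/"Modulo": contradicts grouping by ID, so reject
    - id_column plus "AUTO"/"Stratified": group rows by ID
    """
    if id_column is None:
        return "fold_column" if fold_column is not None else fold_assignment
    if fold_column is not None:
        raise ValueError("id_column and fold_column cannot both be specified")
    if fold_assignment not in ("AUTO", "Stratified"):
        raise ValueError(
            "fold_assignment=%r is incompatible with id_column" % fold_assignment
        )
    return "group_by_id"
```

Under this sketch, specifying id_column alone would be enough; whether to require fold_assignment = "Stratified" as well, or to warn instead of raising, remains a design decision.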

Notes:

A request for a more generic version of this (stratify on any column) exists here: https://0xdata.atlassian.net/browse/PUBDEV-1848

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-4442
Assignee: UNASSIGNED
Reporter: Erin LeDell
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A