H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Let's add an id_column argument to all of our algorithms, grid search, and AutoML functions. Right now, if you have pooled repeated-measures data (one ID/person/cluster contributes multiple rows to the training set), the only way to guarantee that all rows belonging to a single ID are assigned to a single fold is to use the fold_column argument. If the fold partitioning is not stratified by ID, we get data leakage across folds. The fold_column approach requires users to code the stratification-by-ID themselves, which is a pain.
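For reference, the manual workaround looks roughly like the sketch below. It is plain Python with no H2O calls. The helper name `assign_folds_by_id` and the sample IDs are made up for illustration, and the round-robin assignment is just one simple way to spread IDs over folds.

```python
def assign_folds_by_id(ids, nfolds):
    """Return one fold index per row so that all rows sharing an ID get the same fold.

    Hypothetical helper for illustration only; not part of the H2O API.
    """
    fold_of_id = {}
    for row_id in ids:
        if row_id not in fold_of_id:
            # Distinct IDs are dealt round-robin into folds in order of first appearance.
            fold_of_id[row_id] = len(fold_of_id) % nfolds
    return [fold_of_id[row_id] for row_id in ids]


# Made-up repeated-measures data: p1 contributes 2 rows, p3 contributes 3 rows.
ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4"]
folds = assign_folds_by_id(ids, nfolds=3)
print(folds)
```

The resulting list would then have to be attached to the training frame as an extra column and passed by name through fold_column, which is the boilerplate this request wants to remove.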
Currently, there is a "Stratified" option in fold_assignment, but that only stratifies by the response column (classification only) to ensure that each fold gets a similar share of each class.
When the id_column is specified, this will automatically trigger stratification-by-ID whenever cross-validation is used. Let's think about whether we want to force the user to also specify fold_assignment = "Stratified", or whether specifying the id_column should be enough. We will need to handle the case where id_column is specified and fold_assignment is set to something other than "AUTO" or "Stratified".
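One possible way to resolve these argument combinations is sketched below. Everything here is an assumption for discussion: the function `resolve_fold_strategy`, the `"by_id"` return value, and the choice to raise an error on conflicts are hypothetical and do not describe existing H2O behavior. The sketch takes the "id_column alone is enough" option and also rejects id_column combined with a user-supplied fold_column.

```python
def resolve_fold_strategy(id_column=None, fold_column=None, fold_assignment="AUTO"):
    """Decide how folds are built from the three arguments (hypothetical policy)."""
    if id_column is None:
        # No ID given: keep the existing behavior untouched.
        return "fold_column" if fold_column is not None else fold_assignment
    if fold_column is not None:
        # Both arguments claim to define the folds, so ask the user to pick one.
        raise ValueError("Specify either id_column or fold_column, not both.")
    if fold_assignment not in ("AUTO", "Stratified"):
        # e.g. "Random" or "Modulo" would split rows of one ID across folds.
        raise ValueError(
            "fold_assignment=%r is incompatible with id_column" % fold_assignment
        )
    return "by_id"


print(resolve_fold_strategy(id_column="person_id"))
print(resolve_fold_strategy(fold_assignment="Random"))
```

Under this policy the user never has to set fold_assignment = "Stratified" explicitly; a warning instead of an error would be the more lenient alternative.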
Notes:
- id_column defaults to NULL/None.
- This column should be automatically excluded from the set of predictors, even if it's included in the x argument.
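The exclusion rule in the notes could be as small as the sketch below. The helper name `effective_predictors` and the column names are hypothetical and only illustrate the intended behavior.

```python
def effective_predictors(x, id_column=None):
    """Return the predictor list with the ID column removed (hypothetical helper)."""
    if id_column is None:
        # Default case: nothing to exclude.
        return list(x)
    return [col for col in x if col != id_column]


# The user accidentally lists the ID column among the predictors.
predictors = effective_predictors(["age", "dose", "person_id"], id_column="person_id")
print(predictors)
```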
A request for a more generic version of this (stratify on any column) exists here: https://0xdata.atlassian.net/browse/PUBDEV-1848