h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

GBM grid is stopping too early in AutoML #12189

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

There should be more models here: the top GBM grid models are not within stopping_tolerance of each other in performance, so the grid should have kept training more models. This is on master (the stable release is actually still skipping the GBM grid entirely due to a bug).

{code}
library(h2o)

h2o.init()

train <- h2o.importFile("https://s3.amazonaws.com/erin-data/jira/automl_gbm_missing_train.csv")
y <- 9  # response column
train[, y] <- as.factor(train[, y])

ss <- h2o.splitFrame(train, ratios = 0.95, seed = 1)
train <- ss[[1]]
valid <- ss[[2]]

Sys.time()
aml <- h2o.automl(y = y,
                  training_frame = train,
                  validation_frame = valid,
                  max_models = 50,
                  stopping_tolerance = 0.0001,
                  stopping_rounds = 5,
                  seed = 1)
Sys.time()
print("Done!")  # this took an hour to run on my laptop
{code}
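
One quick, hedged way to quantify "too few models" (not part of the original report; the `GBM_grid` model-ID prefix is taken from the leaderboard shown below):

{code}
# Count the GBM grid models AutoML actually kept on the leaderboard
lb <- as.data.frame(aml@leaderboard)
sum(grepl("^GBM_grid", lb$model_id))  # 10 here, far short of max_models = 50
{code}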

Leaderboard. By default, classification uses logloss for early stopping. However, the top two GBM models in the grid differ in logloss in the third decimal place; since we set stopping_tolerance = 0.0001, the search should only have stopped once the top models were within 0.0001 of each other: {code}

print(aml@leaderboard, n = nrow(aml@leaderboard))
                                                 model_id      auc  logloss
1      StackedEnsemble_AllModels_0_AutoML_20180211_105621 0.958695 0.266594
2               GBM_grid_0_AutoML_20180211_105621_model_0 0.958169 0.263463
3               GBM_grid_0_AutoML_20180211_105621_model_2 0.957924 0.269633
4               GBM_grid_0_AutoML_20180211_105621_model_7 0.957645 0.262431
5               GBM_grid_0_AutoML_20180211_105621_model_1 0.957187 0.266378
6               GBM_grid_0_AutoML_20180211_105621_model_4 0.956572 0.265144
7               GBM_grid_0_AutoML_20180211_105621_model_3 0.956515 0.277945
8   StackedEnsemble_BestOfFamily_0_AutoML_20180211_105621 0.955979 0.275831
9               GBM_grid_0_AutoML_20180211_105621_model_8 0.954675 0.272402
10              GBM_grid_0_AutoML_20180211_105621_model_6 0.942030 0.374466
11              GBM_grid_0_AutoML_20180211_105621_model_5 0.940216 0.406717
12              GBM_grid_0_AutoML_20180211_105621_model_9 0.917756 0.557688
13                           XRT_0_AutoML_20180211_105621 0.862226 0.511022
14                           DRF_0_AutoML_20180211_105621 0.858959 0.509545
15                  DeepLearning_0_AutoML_20180211_105621 0.767724 1.258798

[15 rows x 3 columns]
{code}
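
As a rough illustration of that point (values copied from the leaderboard above; this computation is not part of the original report):

{code}
# Gap between the two best GBM logloss scores (model_7 and model_0 above)
abs(0.262431 - 0.263463)  # 0.001032, roughly 10x the stopping_tolerance of 1e-4
{code}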

exalate-issue-sync[bot] commented 1 year ago

Erin LeDell commented: I have confirmed that this is a problem that's specific to AutoML. If you run the GBM grid manually via R with the same early stopping params, you'll get 50 models.

This might have something to do with the fact that we first train 5 "special" GBMs and then extend an existing grid? Otherwise it probably means that early stopping params are not being piped through to the GBM grid in AutoML properly.

{code}
library(h2o)

h2o.init()

train <- h2o.importFile("https://s3.amazonaws.com/erin-data/jira/automl_gbm_missing_train.csv")
y <- 9  # response column
train[, y] <- as.factor(train[, y])
x <- setdiff(seq_len(ncol(train)), y)  # predictor columns (x was not defined in the original snippet)

ss <- h2o.splitFrame(train, ratios = 0.95, seed = 1)
train <- ss[[1]]
valid <- ss[[2]]

# GBM hyperparameters
gbm_params <- list(max_depth = seq(3, 17, 1),
                   min_rows = c(1, 5, 10, 15, 30, 100),
                   learn_rate = c(0.001, 0.005, 0.008, 0.01, 0.05, 0.08, 0.1, 0.5, 0.8),
                   sample_rate = seq(0.5, 1.0, 0.1),
                   col_sample_rate = c(0.4, 0.7, 1.0),  # c(), not seq(): seq(0.4, 0.7, 1.0) yields only 0.4
                   col_sample_rate_per_tree = c(0.4, 0.7, 1.0),
                   min_split_improvement = c(1e-4, 1e-5))
search_criteria <- list(strategy = "RandomDiscrete",
                        max_models = 50,
                        seed = 1,
                        stopping_metric = "AUC",
                        stopping_rounds = 5,
                        stopping_tolerance = 0.0001)

# Train and validate a grid of GBMs
starttime <- Sys.time()
gbm_grid <- h2o.grid("gbm", x = x, y = y,
                     grid_id = "automl_gbm_grid",
                     training_frame = train,
                     validation_frame = valid,
                     ntrees = 10000,
                     seed = 1,
                     nfolds = 5,
                     keep_cross_validation_predictions = TRUE,
                     score_tree_interval = 5,
                     hyper_params = gbm_params,
                     search_criteria = search_criteria,
                     stopping_metric = "AUC",
                     stopping_tolerance = 0.0001,
                     stopping_rounds = 5)
endtime <- Sys.time()
print(endtime)
print(endtime - starttime)

gbm_gridperf <- h2o.getGrid(grid_id = "automl_gbm_grid",
                            sort_by = "auc",
                            decreasing = TRUE)
print(gbm_gridperf)
{code}
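
A simple sanity check on the manual run (a small addition, not in the original comment): the returned grid object lists its model IDs, so the model count can be verified directly:

{code}
# The manual grid should contain all 50 requested models
length(gbm_grid@model_ids)  # 50, per the confirmation above
{code}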

exalate-issue-sync[bot] commented 1 year ago

Erin LeDell commented: Here's the line of code in AutoML.java where we clone the search criteria from the build_control params and pass that to Grid: https://github.com/h2oai/h2o-3/blob/master/h2o-automl/src/main/java/ai/h2o/automl/AutoML.java#L656 So in theory, these params should be getting passed through to the grid properly (though there's still something wrong).
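
One way to test the pass-through hypothesis from the R side is to pull one of the AutoML grid models and inspect the stopping parameters it actually received (a hedged sketch; the model ID is copied from the leaderboard above):

{code}
# Inspect the stopping params one of the AutoML GBM grid models actually used
m <- h2o.getModel("GBM_grid_0_AutoML_20180211_105621_model_0")
m@allparameters$stopping_metric
m@allparameters$stopping_rounds
m@allparameters$stopping_tolerance  # should be 0.0001 if the params were piped through
{code}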

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5318
Assignee: Erin LeDell
Reporter: Erin LeDell
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A