Open exalate-issue-sync[bot] opened 1 year ago
Erin LeDell commented: I have confirmed that this is a problem specific to AutoML. If you run the GBM grid manually via R with the same early-stopping params, you'll get 50 models.
This might have something to do with the fact that we first train 5 "special" GBMs and then extend an existing grid. Otherwise, it probably means that the early-stopping params are not being piped through to the GBM grid properly in AutoML.
{code}
library(h2o)
h2o.init()

train <- h2o.importFile("https://s3.amazonaws.com/erin-data/jira/automl_gbm_missing_train.csv")
y <- 9  # response
train[, y] <- as.factor(train[, y])
x <- setdiff(seq_len(ncol(train)), y)  # predictor columns (x was not defined in the original snippet)

ss <- h2o.splitFrame(train, ratios = 0.95, seed = 1)
train <- ss[[1]]
valid <- ss[[2]]

gbm_params <- list(max_depth = seq(3, 17, 1),
                   min_rows = c(1, 5, 10, 15, 30, 100),
                   learn_rate = c(0.001, 0.005, 0.008, 0.01, 0.05, 0.08, 0.1, 0.5, 0.8),
                   sample_rate = seq(0.5, 1.0, 0.1),
                   col_sample_rate = seq(0.4, 0.7, 1.0),
                   col_sample_rate_per_tree = seq(0.4, 0.7, 1.0),
                   min_split_improvement = c(1e-4, 1e-5))
search_criteria <- list(strategy = "RandomDiscrete",
                        max_models = 50,
                        seed = 1,
                        stopping_metric = "AUC",
                        stopping_rounds = 5,
                        stopping_tolerance = 0.0001)

starttime <- Sys.time()
gbm_grid <- h2o.grid("gbm", x = x, y = y,
                     grid_id = "automl_gbm_grid",
                     training_frame = train,
                     validation_frame = valid,
                     ntrees = 10000,
                     seed = 1,
                     nfolds = 5,
                     keep_cross_validation_predictions = TRUE,
                     score_tree_interval = 5,
                     hyper_params = gbm_params,
                     search_criteria = search_criteria,
                     stopping_metric = "AUC",
                     stopping_tolerance = 0.0001,
                     stopping_rounds = 5)
endtime <- Sys.time()
print(endtime)
print(endtime - starttime)

gbm_gridperf <- h2o.getGrid(grid_id = "automl_gbm_grid", sort_by = "auc", decreasing = TRUE)
print(gbm_gridperf)
{code}
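For reference, H2O's metric-based early stopping (per the docs) fires when the simple moving average of the stopping metric fails to improve by more than `stopping_tolerance` over `stopping_rounds` scoring events. A minimal Python sketch of that check, illustrating the documented semantics rather than H2O's actual implementation (which also handles lower-is-better metrics and relative tolerances):

```python
def should_stop(scores, stopping_rounds, stopping_tolerance):
    """Return True when the moving average of a higher-is-better metric
    (e.g. AUC) has not improved by more than `stopping_tolerance` over
    the last `stopping_rounds` scoring events.

    Illustrative sketch only, not H2O's real stopping logic.
    """
    k = stopping_rounds
    if len(scores) < 2 * k:
        return False  # not enough scoring events yet
    recent = sum(scores[-k:]) / k            # moving average of latest window
    reference = sum(scores[-2 * k:-k]) / k   # moving average of previous window
    return (recent - reference) < stopping_tolerance

# A plateauing AUC history trips the check...
plateau = [0.88, 0.889, 0.8891, 0.88912, 0.88913, 0.88914]
print(should_stop(plateau, stopping_rounds=2, stopping_tolerance=1e-4))    # True

# ...while a still-improving one does not.
improving = [0.70, 0.75, 0.80, 0.85, 0.88, 0.90]
print(should_stop(improving, stopping_rounds=2, stopping_tolerance=1e-4))  # False
```

With `stopping_rounds = 5` and `stopping_tolerance = 0.0001` as in the grid above, the search should keep producing models until the best scores are within 0.0001 of each other, which is what makes the premature stop in AutoML suspicious.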
Erin LeDell commented: Here's the line of code in AutoML.java where we clone the search criteria from the build_control params and pass that to Grid: https://github.com/h2oai/h2o-3/blob/master/h2o-automl/src/main/java/ai/h2o/automl/AutoML.java#L656 So in theory, these params should be getting passed through to the grid properly (though there's still something wrong).
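The suspected failure mode is that the user's stopping params set on the build control are not carried into the grid's search criteria. A hypothetical Python sketch of what that hand-off is supposed to do (the function and field names below are illustrative, not H2O's actual API):

```python
import copy

def build_grid_search_criteria(build_control, max_models):
    """Hypothetical sketch of cloning the user's early-stopping settings
    from the AutoML build parameters onto the grid's search criteria,
    mirroring what the linked AutoML.java line is expected to do.
    Field names are illustrative."""
    criteria = {
        "strategy": "RandomDiscrete",
        "max_models": max_models,
    }
    # Copy (not share) the stopping params so later mutation of one side
    # cannot affect the other. If any of these are dropped or overwritten
    # with defaults downstream, the grid stops far earlier than requested.
    for field in ("stopping_metric", "stopping_rounds", "stopping_tolerance"):
        criteria[field] = copy.deepcopy(build_control[field])
    return criteria

user_params = {"stopping_metric": "AUC", "stopping_rounds": 5,
               "stopping_tolerance": 0.0001}
criteria = build_grid_search_criteria(user_params, max_models=50)
print(criteria["stopping_tolerance"])  # 0.0001
```

If the clone step works as sketched, the bug would have to be somewhere after this point, e.g. the grid extension path replacing the criteria with defaults.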
JIRA Issue Migration Info
Jira Issue: PUBDEV-5318
Assignee: Erin LeDell
Reporter: Erin LeDell
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A
There should be more models here... the top models in the GBM grid are not close enough in performance (as specified by stopping_tolerance), so the grid should have trained more models. This is on master (the stable release is actually still skipping the GBM grid entirely due to a bug).
{code}
library(h2o)
h2o.init()

train <- h2o.importFile("https://s3.amazonaws.com/erin-data/jira/automl_gbm_missing_train.csv")
y <- 9  # response
train[, y] <- as.factor(train[, y])

ss <- h2o.splitFrame(train, ratios = 0.95, seed = 1)
train <- ss[[1]]
valid <- ss[[2]]

Sys.time()
aml <- h2o.automl(y = y,
                  training_frame = train,
                  validation_frame = valid,
                  max_models = 50,
                  stopping_tolerance = 0.0001,
                  stopping_rounds = 5,
                  seed = 1)
Sys.time()
print("Done!")  # this took an hour to run on my laptop
{code}
Leaderboard (by default, classification uses logloss for early stopping). However, the top two GBM models in the grid differ in the third decimal place; since we set stopping_tolerance = 0.0001, they should be no more than 0.0001 apart:
{code}
[15 rows x 3 columns]
{code}
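The claim above is easy to check numerically: a gap in the third decimal place is at least 0.001, an order of magnitude larger than the requested tolerance (treating the tolerance as absolute for illustration; the logloss values below are hypothetical, since the leaderboard table was lost in the migration):

```python
# Hypothetical logloss values differing in the third decimal place.
best, second = 0.271, 0.274
gap = abs(second - best)
tolerance = 0.0001

print(gap > tolerance)  # True: the grid stopped while successive models
                        # still differed by far more than the tolerance,
                        # so it should have kept training.
```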