H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
When a model is trained with AutoML and max_runtime_secs is retrieved afterwards, it always returns zero.
While testing models, we also want to compare training durations and, for example, compare varimp.
When we try to retrieve @allparameters$max_runtime_secs, it is always zero, even though max_runtime_secs_per_model was given a value.
We thought of using @model$run_time as a proxy for the max_runtime parameter, because we assume that models are stored (MOJO) and used at a later time.
However, @model$run_time is not robust either: when a MOJO is uploaded, @model$run_time gets the float datetime value of the upload moment, not a runtime the way it does when a model is in memory or saved using h2o.saveModel.
The return of a zero value persists when saving and loading models the normal way. So H2O does seem to take max_runtime into account, but it is not receiving/storing the value.
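To illustrate the MOJO behavior outside of AutoML, here is a minimal sketch. Assumptions: a running H2O cluster, h2o.download_mojo / h2o.import_mojo from the R package, and that a single-algo call (h2o.gbm here) shows the same behavior we observe via AutoML; we have only verified it via AutoML.

{code:R}
library(h2o)
h2o.init()

# Train one GBM with an explicit max_runtime_secs
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
train[, "response"] <- as.factor(train[, "response"])
gbm <- h2o.gbm(y = "response", training_frame = train, max_runtime_secs = 60)

gbm@model$run_time                  # training duration in ms, as expected
gbm@allparameters$max_runtime_secs  # we would expect 60 here

# Round-trip via MOJO: after import, run_time no longer holds a training
# duration but the (float) datetime of the upload moment
mojo_name <- h2o.download_mojo(gbm, path = tempdir())
imported  <- h2o.import_mojo(file.path(tempdir(), mojo_name))
imported@model$run_time
{code}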
{code:R}
library(h2o)
h2o.init()

# Import a sample binary outcome train/test set into H2O
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)

# For binary classification, response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])

# Run AutoML for 20 base models (60 s max runtime per model)
max_models <- 20
max_runtime_secs_per_model <- 60

aml <- h2o.automl(x = x, y = y, training_frame = train,
                  max_models = max_models,
                  max_runtime_secs_per_model = max_runtime_secs_per_model,
                  max_runtime_secs = max_runtime_secs_per_model * max_models,
                  exclude_algos = c("StackedEnsemble", "DeepLearning"), # these make models that are way too big, or not explainable
                  seed = 1537)

model_ids <- as.data.frame(aml@leaderboard$model_id)
for (i in 1:nrow(model_ids)) {
  model <- h2o.getModel(model_ids[i, 1]) # get model object in environment
  print(c(model@model$run_time, model@allparameters$max_runtime_secs))
}
[1] 790 0
[1] 676 0
[1] 8851 0
[1] 656 0
[1] 781 0
[1] 893 0
[1] 1013 0
[1] 3817 0
[1] 370 0
[1] 1716 0
[1] 722 0
[1] 651 0
[1] 536 0
[1] 1401 0
[1] 450 0
[1] 7454 0
[1] 489 0
[1] 1022 0
[1] 541 0
[1] 541 0
max_models <- 20
max_runtime_secs_per_model <- 120

aml <- h2o.automl(x = x, y = y, training_frame = train,
                  max_models = max_models,
                  max_runtime_secs_per_model = max_runtime_secs_per_model,
                  max_runtime_secs = max_runtime_secs_per_model * max_models,
                  exclude_algos = c("StackedEnsemble", "DeepLearning"), # these make models that are way too big, or not explainable
                  seed = 3715)

model_ids <- as.data.frame(aml@leaderboard$model_id)
for (i in 1:nrow(model_ids)) {
  model <- h2o.getModel(model_ids[i, 1]) # get model object in environment
  print(c(model@model$run_time, model@allparameters$max_runtime_secs))
}
[1] 877 0
[1] 575 0
[1] 697 0
[1] 605 0
[1] 1615 0
[1] 488 0
[1] 953 0
[1] 660 0
[1] 653 0
[1] 1667 0
[1] 3273 0
[1] 1791 0
[1] 767 0
[1] 1364 0
[1] 369 0
[1] 394 0
[1] 464 0
[1] 447 0
[1] 661 0
[1] 556 0
{code}
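As a workaround for comparing training durations without relying on @model$run_time, one option is to derive the duration from the model output timestamps. This is a sketch under the assumption that @model$start_time and @model$end_time are populated as epoch milliseconds for the models in question; training_secs is a hypothetical helper, not an H2O API:

{code:R}
# Hypothetical helper: derive a training duration that does not depend on
# @model$run_time, which is unreliable after a MOJO upload
training_secs <- function(model) {
  out <- model@model
  if (!is.null(out$start_time) && !is.null(out$end_time)) {
    (out$end_time - out$start_time) / 1000  # epoch ms -> seconds
  } else {
    NA_real_
  }
}

# e.g. over the leaderboard from the repro above
durations <- sapply(model_ids[, 1], function(id) training_secs(h2o.getModel(id)))
{code}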