h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

AutoML max_runtime_secs model parameter always returns zero #7709

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

When a model is trained with AutoML and `max_runtime_secs` is retrieved afterwards, it always returns zero. While testing models, we also want to compare training durations and, for example, compare varimp.

When we try to retrieve `@allparameters$max_runtime_secs`, it is always zero, even though `max_runtime_secs_per_model` has a given value.

We thought of using `@model$run_time` as a proxy for the `max_runtime` parameter, because we assume models are stored (MOJO) and used at a later time. However, `@model$run_time` is not robust: when a MOJO is uploaded, `@model$run_time` gets the float datetime value of the upload moment, not a runtime in the same way as when a model is in memory or saved using `h2o.saveModel`.

The return of a zero value persists when saving and loading models the normal way. So it seems H2O does take `max_runtime` into account, but does not receive/set the value on the model.
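Until the parameter is populated, one possible workaround is to persist the intended time budget next to the saved model. A minimal sketch, assuming a running H2O cluster, a trained `aml` object, and the `max_runtime_secs_per_model` variable from the reproduction below; the metadata file layout is our own choice, not an H2O API:

{code:R}
# Workaround sketch: record the intended budget alongside the saved model,
# since model@allparameters$max_runtime_secs comes back as 0.
model_path <- h2o.saveModel(aml@leader, path = tempdir())
meta <- list(model_path = model_path,
             max_runtime_secs_per_model = max_runtime_secs_per_model)
saveRDS(meta, paste0(model_path, "_meta.rds"))
{code}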

{code:R}
library(h2o)
h2o.init()

# Import a sample binary outcome train/test set into H2O
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)

# For binary classification, response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])

# Run AutoML for 20 base models (limited to 1 hour max runtime by default)
max_models <- 20
max_runtime_secs_per_model <- 60

aml <- h2o.automl(x = x, y = y,
                  training_frame = train,
                  max_models = max_models,
                  max_runtime_secs_per_model = max_runtime_secs_per_model,
                  max_runtime_secs = max_runtime_secs_per_model * max_models,
                  exclude_algos = c("StackedEnsemble", "DeepLearning"), # those make models that are way too big... or not explainable
                  seed = 1537)

model_ids <- as.data.frame(aml@leaderboard$model_id)

for (i in 1:nrow(model_ids)) {
  model <- h2o.getModel(model_ids[i, 1]) # get model object in environment
  print(c(model@model$run_time, model@allparameters$max_runtime_secs))
}

[1] 790 0
[1] 676 0
[1] 8851 0
[1] 656 0
[1] 781 0
[1] 893 0
[1] 1013 0
[1] 3817 0
[1] 370 0
[1] 1716 0
[1] 722 0
[1] 651 0
[1] 536 0
[1] 1401 0
[1] 450 0
[1] 7454 0
[1] 489 0
[1] 1022 0
[1] 541 0
[1] 541 0

max_models <- 20
max_runtime_secs_per_model <- 120

aml <- h2o.automl(x = x, y = y,
                  training_frame = train,
                  max_models = max_models,
                  max_runtime_secs_per_model = max_runtime_secs_per_model,
                  max_runtime_secs = max_runtime_secs_per_model * max_models,
                  exclude_algos = c("StackedEnsemble", "DeepLearning"), # those make models that are way too big... or not explainable
                  seed = 3715)

model_ids <- as.data.frame(aml@leaderboard$model_id)

for (i in 1:nrow(model_ids)) {
  model <- h2o.getModel(model_ids[i, 1]) # get model object in environment
  print(c(model@model$run_time, model@allparameters$max_runtime_secs))
}

[1] 877 0
[1] 575 0
[1] 697 0
[1] 605 0
[1] 1615 0
[1] 488 0
[1] 953 0
[1] 660 0
[1] 653 0
[1] 1667 0
[1] 3273 0
[1] 1791 0
[1] 767 0
[1] 1364 0
[1] 369 0
[1] 394 0
[1] 464 0
[1] 447 0
[1] 661 0
[1] 556 0

{code}
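As a side note for comparing training durations without touching `@allparameters`: in recent h2o versions, the extended leaderboard can report per-model training time directly. A hedged sketch, assuming a running H2O cluster and the `aml` object from the runs above:

{code:R}
# The extended leaderboard exposes training_time_ms per model,
# which is a more robust duration source than @allparameters$max_runtime_secs.
lb <- h2o.get_leaderboard(aml, extra_columns = "training_time_ms")
print(head(as.data.frame(lb)))
{code}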

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-7936
Assignee: Sebastien Poirier
Reporter: Xiaoming op de Hoek
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A