h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

GLRM stalls when running with Gridsearch #12627

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Megan Kurka reports the following:

It's taking 10x longer for me to train a GLRM model with grid search than outside of grid search.

I'm not sure why. I ran grid search for one model and it took ~10 mins; when I train the same model normally, it only takes 1 min.

Initiate H2O Cluster

library('h2o')
h2o.init()

Import Data

houses_data <- h2o.importFile("/Users/megankurka/Downloads/kc_house_data.csv")

Convert ordinal columns from integer to categorical

houses_data$condition <- as.factor(houses_data$condition)
houses_data$grade <- as.factor(houses_data$grade)

Create training and validation data

houses_data <- h2o.assign(houses_data, "houses_data.hex")
miss_data <- h2o.assign(houses_data, "miss_data.hex")
h2o.insertMissingValues(data = miss_data, fraction = 0.15, seed = 1234)

Initial model

glrm_k <- 5
gamma <- 2
glrm_cols <- which(!(colnames(houses_data) %in% c("id", "date", "zipcode")))
base_model <- h2o.glrm(training_frame = miss_data, cols = glrm_cols, validation_frame = houses_data,
                       model_id = "base_glrm", seed = 1234, k = glrm_k, gamma_x = gamma, gamma_y = gamma,
                       regularization_x = "Quadratic", regularization_y = "Quadratic",
                       transform = "STANDARDIZE", impute_original = TRUE)

print(h2o.performance(base_model, valid = TRUE))

Ordinal Loss Model

losses <- data.frame('index' = which(colnames(houses_data) %in% c("condition", "grade")) - 1,
                     'loss' = rep("Ordinal", 2),
                     stringsAsFactors = FALSE)

loss_specific_model <- h2o.glrm(training_frame = miss_data, cols = glrm_cols, validation_frame = houses_data,
                                model_id = "loss_specific_glrm", seed = 1234, k = glrm_k,
                                gamma_x = gamma, gamma_y = gamma,
                                regularization_x = "Quadratic", regularization_y = "Quadratic",
                                transform = "STANDARDIZE", impute_original = TRUE,
                                loss_by_col = losses$loss, loss_by_col_idx = losses$index)

print(h2o.performance(loss_specific_model, valid = TRUE))
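
The report does not include the grid search call itself. The following is only a sketch of how the slowdown might be timed, not the reporter's actual code. It assumes a running cluster and the objects defined above (miss_data, houses_data, glrm_cols, glrm_k, gamma). The grid ID "glrm_timing_grid" is made up for illustration, and passing the column selection to h2o.grid through x rather than cols is an assumption about how the grid wrapper handles unsupervised models.

```r
library(h2o)

# Reuses miss_data, houses_data, glrm_cols, glrm_k and gamma from the report above.

# Time a direct h2o.glrm call with the base model settings.
direct_time <- system.time(
  direct_model <- h2o.glrm(training_frame = miss_data, cols = glrm_cols,
                           validation_frame = houses_data, seed = 1234,
                           k = glrm_k, gamma_x = gamma, gamma_y = gamma,
                           regularization_x = "Quadratic", regularization_y = "Quadratic",
                           transform = "STANDARDIZE", impute_original = TRUE)
)

# Time a grid search that builds exactly one model with the same settings.
# Assumption: columns are selected with x here; grid_id is a hypothetical name.
grid_time <- system.time(
  glrm_grid <- h2o.grid(algorithm = "glrm", grid_id = "glrm_timing_grid",
                        training_frame = miss_data, x = glrm_cols,
                        validation_frame = houses_data, seed = 1234,
                        gamma_x = gamma, gamma_y = gamma,
                        regularization_x = "Quadratic", regularization_y = "Quadratic",
                        transform = "STANDARDIZE", impute_original = TRUE,
                        hyper_params = list(k = glrm_k))
)

# Compare wall-clock times; a ratio near 10 would match the behaviour described.
print(direct_time[["elapsed"]])
print(grid_time[["elapsed"]])
print(grid_time[["elapsed"]] / direct_time[["elapsed"]])
```

Wall-clock ("elapsed") time is used because the training runs on the H2O cluster, not in the R process, so the user and system CPU times reported by system.time would not reflect it.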

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5772
Assignee: Wendy
Reporter: Wendy
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A