Megan Kurka reports the following:
It's taking 10x longer for me to train a GLRM model with grid search than outside of grid search.
I'm not sure why. I ran grid search for one model and it took ~10 mins; when I train the same model normally it only takes 1 min.
Initiate H2O Cluster
library('h2o')
h2o.init()
Import Data
houses_data <- h2o.importFile("/Users/megankurka/Downloads/kc_house_data.csv")
Convert ordinal columns from integer to categorical
houses_data$condition <- as.factor(houses_data$condition)
houses_data$grade <- as.factor(houses_data$grade)
Create training and validation data
houses_data <- h2o.assign(houses_data, "houses_data.hex")
miss_data <- h2o.assign(houses_data, "miss_data.hex")
h2o.insertMissingValues(data = miss_data, fraction = 0.15, seed = 1234)
Initial model
glrm_k <- 5
gamma <- 2
glrm_cols <- which(!(colnames(houses_data) %in% c("id", "date", "zipcode")))
base_model <- h2o.glrm(training_frame = miss_data,
                       cols = glrm_cols,
                       validation_frame = houses_data,
                       model_id = "base_glrm",
                       seed = 1234,
                       k = glrm_k,
                       gamma_x = gamma,
                       gamma_y = gamma,
                       regularization_x = "Quadratic",
                       regularization_y = "Quadratic",
                       transform = "STANDARDIZE",
                       impute_original = TRUE)
print(h2o.performance(base_model, valid = TRUE))
Ordinal Loss Model
losses <- data.frame('index' = which(colnames(houses_data) %in% c("condition", "grade")) - 1,
                     'loss' = rep("Ordinal", 2),
                     stringsAsFactors = FALSE)
loss_specific_model <- h2o.glrm(training_frame = miss_data,
                                cols = glrm_cols,
                                validation_frame = houses_data,
                                model_id = "loss_specific_glrm",
                                seed = 1234,
                                k = glrm_k,
                                gamma_x = gamma,
                                gamma_y = gamma,
                                regularization_x = "Quadratic",
                                regularization_y = "Quadratic",
                                transform = "STANDARDIZE",
                                impute_original = TRUE,
                                loss_by_col = losses$loss,
                                loss_by_col_idx = losses$index)
print(h2o.performance(loss_specific_model, valid = TRUE))
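For anyone reproducing this: the `- 1` in the `losses` table appears to be there because `which()` returns 1-based positions in R, while `loss_by_col_idx` is documented as zero-indexed. Below is a minimal standalone sketch of that conversion. It needs no H2O cluster, and the column names are hypothetical stand-ins, not the actual layout of kc_house_data.csv.

```r
# Hypothetical column names (an assumption for illustration only).
example_cols <- c("id", "price", "condition", "grade", "zipcode")

# which() gives 1-based positions of the ordinal columns.
r_positions <- which(example_cols %in% c("condition", "grade"))

# Subtract 1 to get zero-based indices, as done for `losses` above.
zero_based <- r_positions - 1

example_losses <- data.frame(index = zero_based,
                             loss = rep("Ordinal", length(zero_based)),
                             stringsAsFactors = FALSE)
print(example_losses)
```

Using `length(zero_based)` instead of a hard-coded `2` keeps the two columns the same length if the set of ordinal columns changes.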