h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Apache License 2.0


Some datasets have a skewed response. If we don't log-transform it before running AutoML, we get bad results, in particular from the Stacked Ensemble GLM metalearner.

For regression problems, we must:

- Decide when to log the response. Here are some options:
  - Run two GLMs (one with a logged response and one without) with CV, and if the logged one is better, proceed.
  - Also consider power and sqrt transformations.
- Internally log the response before running any of the algos.
- Reverse the log (see the SO post below) when making predictions, so that they come back on the original scale of the response.
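The "run one model per transformation with CV and keep the better one" option could be sketched roughly as below. This is only an illustration under loud assumptions: base R `lm()` stands in for an H2O GLM, the data are simulated rather than real, and `cv_rmse()` and `transforms` are hypothetical names that do not exist in H2O. The key point is that predictions are back-transformed before scoring, so every candidate is compared on the original scale of the response.

```r
# Sketch only: base R lm() stands in for an H2O GLM, and the data are simulated.
# cv_rmse() and `transforms` are hypothetical names, not part of the h2o package.

# k-fold CV RMSE for a linear model fit to a transformed response.
# Predictions are back-transformed with `inv`, so every candidate is scored
# on the original scale of the response.
cv_rmse <- function(df, y, fwd, inv, nfolds = 5, seed = 1) {
  set.seed(seed)  # same seed => same fold assignment for every candidate
  folds <- sample(rep(seq_len(nfolds), length.out = nrow(df)))
  sq_err <- numeric(nrow(df))
  for (k in seq_len(nfolds)) {
    tr <- df[folds != k, , drop = FALSE]
    te <- df[folds == k, , drop = FALSE]
    tr[[y]] <- fwd(tr[[y]])                    # transform the training response only
    fit <- lm(as.formula(paste(y, "~ .")), data = tr)
    pred <- inv(predict(fit, newdata = te))    # naive back-transform
    sq_err[folds == k] <- (te[[y]] - pred)^2   # error on the original scale
  }
  sqrt(mean(sq_err))
}

# Simulated right-skewed (log-normal) response
set.seed(42)
n <- 500
x <- runif(n)
df <- data.frame(x = x, price = exp(1 + 2 * x + rnorm(n, sd = 0.5)))

# Candidate transformations, each paired with its inverse
transforms <- list(
  none = list(fwd = identity, inv = identity),
  log  = list(fwd = log,      inv = exp),
  sqrt = list(fwd = sqrt,     inv = function(p) p^2)
)

scores <- sapply(transforms, function(tf) cv_rmse(df, "price", tf$fwd, tf$inv))
best <- names(which.min(scores))
print(scores)
print(best)
```

Which transformation wins depends on the data and on the metric. Note that the plain `exp` back-transform used here ignores retransformation bias, so a real implementation might need a correction step before comparing on the original scale.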

This was brought up on StackOverflow a while ago: https://stackoverflow.com/questions/48330026/how-to-handle-a-skewed-response-in-h2o-algorithms

Here is an example:

```r
install.packages("AmesHousing")
library(AmesHousing)
ames <- make_ames()

library(h2o)
h2o.init()

train <- as.h2o(ames)
y <- "Sale_Price"

aml <- h2o.automl(y = y, training_frame = train, max_models = 10, seed = 1)
print(aml@leaderboard, n = nrow(aml@leaderboard))

model_ids <- as.data.frame(aml@leaderboard$model_id)[,1]
se <- h2o.getModel(grep("StackedEnsemble_AllModels", model_ids, value = TRUE)[1])
metalearner <- h2o.getModel(se@model$metalearner$name)
h2o.varimp_plot(metalearner)
h2o.coef_norm(metalearner)

# response is skewed
library(dplyr)
hist(ames %>% pull(Sale_Price), breaks = 100)

# Try again by logging the response
train[,"log_Sale_Price"] <- h2o.log(train[,y])
log_y <- "log_Sale_Price"
x <- setdiff(names(train), c(y, log_y))

aml2 <- h2o.automl(y = log_y, x = x, training_frame = train, max_models = 10, seed = 1)
print(aml2@leaderboard, n = nrow(aml2@leaderboard))

model_ids <- as.data.frame(aml2@leaderboard$model_id)[,1]
se <- h2o.getModel(grep("StackedEnsemble_AllModels", model_ids, value = TRUE)[1])
metalearner <- h2o.getModel(se@model$metalearner$name)
h2o.varimp_plot(metalearner)
h2o.coef_norm(metalearner)
```

JIRA Issue Migration Info

- Jira Issue: PUBDEV-6536
- Assignee: Tomas Fryda
- Reporter: Erin LeDell
- State: Open
- Fix Version: Backlog
- Attachments: Available (Count: 5)
- Development PRs: N/A

Attachments From Jira

Attachment Name: Screen Shot 2019-05-29 at 10.17.20 AM.png Attached By: Erin LeDell File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6536/Screen Shot 2019-05-29 at 10.17.20 AM.png

Attachment Name: Screen Shot 2019-05-29 at 10.17.27 AM.png Attached By: Erin LeDell File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6536/Screen Shot 2019-05-29 at 10.17.27 AM.png

Attachment Name: Screen Shot 2019-05-29 at 10.27.59 AM.png Attached By: Erin LeDell File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6536/Screen Shot 2019-05-29 at 10.27.59 AM.png

Attachment Name: Screen Shot 2019-05-29 at 9.31.07 AM.png Attached By: Erin LeDell File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6536/Screen Shot 2019-05-29 at 9.31.07 AM.png

Attachment Name: Screen Shot 2019-05-29 at 9.31.24 AM.png Attached By: Erin LeDell File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6536/Screen Shot 2019-05-29 at 9.31.24 AM.png


Automatically log a skewed response variable in AutoML for improved results #9094
