h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Automatically log a skewed response variable in AutoML for improved results #9094

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Some datasets have a skewed response. If we don't log-transform it before running AutoML, we get bad results, in particular from the GLM metalearner of the Stacked Ensemble.
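As a rough illustration of what "automatically" could mean here, the sketch below flags a response for log-transformation when it is strictly positive and its sample skewness exceeds a threshold. The helper `should_log_response` and the threshold of 1 are hypothetical and are not part of h2o or AutoML; the data is synthetic.

```r
# Hypothetical helper, not part of h2o: decide whether a numeric response
# looks skewed enough to be worth log-transforming.
# The threshold of 1 is an assumed rule of thumb, not an AutoML setting.
should_log_response <- function(y, threshold = 1) {
  y <- y[!is.na(y)]
  # log() needs strictly positive values; too few points give no signal
  if (length(y) < 3 || any(y <= 0)) return(FALSE)
  m <- mean(y)
  s <- sqrt(mean((y - m)^2))
  if (s == 0) return(FALSE)  # constant response
  skewness <- mean((y - m)^3) / s^3  # moment-based sample skewness
  skewness > threshold
}

set.seed(1)
skewed    <- rlnorm(1000, meanlog = 12, sdlog = 1)  # synthetic right-skewed data
symmetric <- rnorm(1000, mean = 100, sd = 5)        # synthetic symmetric data

should_log_response(skewed)
should_log_response(symmetric)
```

A real implementation would also need to decide how to handle zeros (for example with `log1p`) and would have to back-transform predictions before reporting metrics.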

For regression problems, we must:

This was brought up on StackOverflow a while ago: https://stackoverflow.com/questions/48330026/how-to-handle-a-skewed-response-in-h2o-algorithms

Here is an example:

{code}
install.packages("AmesHousing")
library(AmesHousing)
ames <- make_ames()

library(h2o)
h2o.init()

train <- as.h2o(ames)
y <- "Sale_Price"

aml <- h2o.automl(y = y, training_frame = train, max_models = 10, seed = 1)
print(aml@leaderboard, n = nrow(aml@leaderboard))

model_ids <- as.data.frame(aml@leaderboard$model_id)[,1]
se <- h2o.getModel(grep("StackedEnsemble_AllModels", model_ids, value = TRUE)[1])
metalearner <- h2o.getModel(se@model$metalearner$name)
h2o.varimp_plot(metalearner)
h2o.coef_norm(metalearner)

# response is skewed
library(dplyr)
hist(ames %>% pull(Sale_Price), breaks = 100)

# Try again by logging the response
train[,"log_Sale_Price"] <- h2o.log(train[,y])
log_y <- "log_Sale_Price"
x <- setdiff(names(train), c(y, log_y))

aml2 <- h2o.automl(y = log_y, x = x, training_frame = train, max_models = 10, seed = 1)
print(aml2@leaderboard, n = nrow(aml2@leaderboard))

model_ids <- as.data.frame(aml2@leaderboard$model_id)[,1]
se <- h2o.getModel(grep("StackedEnsemble_AllModels", model_ids, value = TRUE)[1])
metalearner <- h2o.getModel(se@model$metalearner$name)
h2o.varimp_plot(metalearner)
h2o.coef_norm(metalearner)
{code}
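One caveat when comparing the two runs: the leaderboard metrics of the second run are computed on the log scale, so they are not directly comparable with those of the first run. A minimal sketch of a comparison on the original scale follows. The helper `rmse_original_scale` is hypothetical (not an h2o function) and works on plain numeric vectors; the numbers in the demonstration are made up.

```r
# Hypothetical helper: RMSE on the original response scale.
# `logged = TRUE` means `pred` holds predictions of log(y),
# so they are exponentiated before being compared with `actual`.
rmse_original_scale <- function(actual, pred, logged = FALSE) {
  stopifnot(length(actual) == length(pred))
  if (logged) pred <- exp(pred)
  sqrt(mean((actual - pred)^2))
}

# Made-up numbers, only to show the call shape
actual   <- c(100000, 150000, 300000)
pred_raw <- c(110000, 140000, 280000)       # predictions on the original scale
pred_log <- log(c(105000, 148000, 310000))  # predictions on the log scale

rmse_original_scale(actual, pred_raw)
rmse_original_scale(actual, pred_log, logged = TRUE)
```

With the objects from the example, the inputs could be obtained with something like `as.vector(train[, y])` and `as.vector(h2o.predict(aml2@leader, train)$predict)`, ideally on a held-out frame rather than the training frame. Note that exponentiating a prediction of the log response gives an estimate closer to the conditional median than the conditional mean.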

exalate-issue-sync[bot] commented 1 year ago

Tomas Fryda commented: Wouldn't this be solved by choosing some {{distribution}} with a {{log}} link (e.g. Poisson, Gamma, Tweedie)?
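A sketch of what this suggestion could look like on the Ames data, using individual algorithms rather than AutoML (whether `h2o.automl` itself exposes a `distribution` argument depends on the H2O version, so that is not assumed here). The choice of Gamma is only for illustration; it requires a strictly positive response, which holds for `Sale_Price`.

```r
library(AmesHousing)
library(h2o)
h2o.init()

train <- as.h2o(make_ames())
y <- "Sale_Price"
x <- setdiff(names(train), y)

# GLM with a Gamma family and a log link: the response stays in its
# original units, while the linear predictor is modelled on the log scale.
glm_gamma <- h2o.glm(x = x, y = y, training_frame = train,
                     family = "gamma", link = "log", seed = 1)

# GBM with a Gamma distribution, which H2O fits with a log link.
gbm_gamma <- h2o.gbm(x = x, y = y, training_frame = train,
                     distribution = "gamma", seed = 1)

# Training RMSE, reported on the original scale of Sale_Price
h2o.rmse(glm_gamma)
h2o.rmse(gbm_gamma)
```

Because the response is never transformed, predictions and metrics stay on the original scale and no back-transformation step is needed.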

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-6536
Assignee: Tomas Fryda
Reporter: Erin LeDell
State: Open
Fix Version: Backlog
Attachments: Available (Count: 5)
Development PRs: N/A

Attachments From Jira

Attachment Name: Screen Shot 2019-05-29 at 10.17.20 AM.png
Attached By: Erin LeDell
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6536/Screen Shot 2019-05-29 at 10.17.20 AM.png

Attachment Name: Screen Shot 2019-05-29 at 10.17.27 AM.png
Attached By: Erin LeDell
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6536/Screen Shot 2019-05-29 at 10.17.27 AM.png

Attachment Name: Screen Shot 2019-05-29 at 10.27.59 AM.png
Attached By: Erin LeDell
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6536/Screen Shot 2019-05-29 at 10.27.59 AM.png

Attachment Name: Screen Shot 2019-05-29 at 9.31.07 AM.png
Attached By: Erin LeDell
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6536/Screen Shot 2019-05-29 at 9.31.07 AM.png

Attachment Name: Screen Shot 2019-05-29 at 9.31.24 AM.png
Attached By: Erin LeDell
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6536/Screen Shot 2019-05-29 at 9.31.24 AM.png