Open exalate-issue-sync[bot] opened 1 year ago
Tomas Fryda commented: [~accountid:557058:afd6e9a4-1891-4845-98ea-b5d34a2bc42c] Wouldn’t this be solved by choosing some {{distribution}} with a {{log}} link (e.g. Poisson, Gamma, Tweedie)?
JIRA Issue Migration Info
Jira Issue: PUBDEV-6536
Assignee: Tomas Fryda
Reporter: Erin LeDell
State: Open
Fix Version: Backlog
Attachments: Available (Count: 5)
Development PRs: N/A
Attachments From Jira
Attachment Name: Screen Shot 2019-05-29 at 10.17.20 AM.png Attached By: Erin LeDell File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6536/Screen Shot 2019-05-29 at 10.17.20 AM.png
Attachment Name: Screen Shot 2019-05-29 at 10.17.27 AM.png Attached By: Erin LeDell File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6536/Screen Shot 2019-05-29 at 10.17.27 AM.png
Attachment Name: Screen Shot 2019-05-29 at 10.27.59 AM.png Attached By: Erin LeDell File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6536/Screen Shot 2019-05-29 at 10.27.59 AM.png
Attachment Name: Screen Shot 2019-05-29 at 9.31.07 AM.png Attached By: Erin LeDell File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6536/Screen Shot 2019-05-29 at 9.31.07 AM.png
Attachment Name: Screen Shot 2019-05-29 at 9.31.24 AM.png Attached By: Erin LeDell File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6536/Screen Shot 2019-05-29 at 9.31.24 AM.png
Some datasets have a skewed response, and if we don't log-transform it before running AutoML, we get bad results (in particular from the Stacked Ensemble GLM metalearner).
For regression problems, we must:
This was brought up on StackOverflow a while ago: https://stackoverflow.com/questions/48330026/how-to-handle-a-skewed-response-in-h2o-algorithms
Here is an example:
{code}
install.packages("AmesHousing")
library(AmesHousing)
ames <- make_ames()

library(h2o)
h2o.init()

train <- as.h2o(ames)
y <- "Sale_Price"

aml <- h2o.automl(y = y, training_frame = train, max_models = 10, seed = 1)
print(aml@leaderboard, n = nrow(aml@leaderboard))

model_ids <- as.data.frame(aml@leaderboard$model_id)[,1]
se <- h2o.getModel(grep("StackedEnsemble_AllModels", model_ids, value = TRUE)[1])
metalearner <- h2o.getModel(se@model$metalearner$name)
h2o.varimp_plot(metalearner)
h2o.coef_norm(metalearner)

# response is skewed
library(dplyr)
hist(ames %>% pull(Sale_Price), breaks = 100)

# Try again by logging the response
train[,"log_Sale_Price"] <- h2o.log(train[,y])
log_y <- "log_Sale_Price"
x <- setdiff(names(train), c(y, log_y))

aml2 <- h2o.automl(y = log_y, x = x, training_frame = train, max_models = 10, seed = 1)
print(aml2@leaderboard, n = nrow(aml2@leaderboard))

model_ids <- as.data.frame(aml2@leaderboard$model_id)[,1]
se <- h2o.getModel(grep("StackedEnsemble_AllModels", model_ids, value = TRUE)[1])
metalearner <- h2o.getModel(se@model$metalearner$name)
h2o.varimp_plot(metalearner)
h2o.coef_norm(metalearner)
{code}
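To make the two options discussed in this issue concrete (logging the response versus using a distribution with a log link), here is a minimal sketch in base R. It deliberately does not use H2O, so it runs without a cluster. The simulated data, the `rmse` helper, and the use of `lm` and `glm` with `Gamma(link = "log")` are all assumptions for illustration and stand in for whatever AutoML or the metalearner would do internally.

```r
# Simulated right-skewed response (illustrative data, not the Ames housing set).
set.seed(1)
n <- 500
x <- runif(n, min = 0, max = 4)
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.5))
dat <- data.frame(x = x, y = y)

# Hypothetical helper: root mean squared error on whatever scale it is given.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Option 1: log the response, fit a Gaussian linear model, then back-transform.
# exp() of a log-scale prediction targets the conditional median rather than
# the mean when the errors are log-normal, so this back-transform is naive.
fit_log <- lm(log(y) ~ x, data = dat)
pred_log <- exp(predict(fit_log, newdata = dat))

# Option 2: leave the response untouched and use a log link instead.
# type = "response" returns predictions on the original scale.
fit_gamma <- glm(y ~ x, family = Gamma(link = "log"), data = dat)
pred_gamma <- predict(fit_gamma, newdata = dat, type = "response")

# Both sets of predictions are on the original scale, so the errors are comparable.
rmse_log <- rmse(dat$y, pred_log)
rmse_gamma <- rmse(dat$y, pred_gamma)
print(c(log_response = rmse_log, gamma_log_link = rmse_gamma))
```

Note that metrics reported for a model trained on the logged response are on the log scale, so they are only comparable with the original-scale model after a back-transform like the one in Option 1. This sketch does not show how AutoML or the Stacked Ensemble metalearner would be configured to do either.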