h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

GBM checkpoint fails when categorical_encoding is set to "OneHotExplicit" #12698

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Using a GBM model as a checkpoint fails when that model was trained with categorical_encoding = "OneHotExplicit". The checkpoint restart raises an error because it expects the one-hot-encoded columns to be present in the new training frame.

The checkpoint restart should instead encode the categorical variables automatically, the same way the original model did.
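To make the column mismatch concrete, here is a minimal base-R sketch of what an explicit one-hot expansion does to a frame's column names. The helper `one_hot()` and the `x3.<level>` / `x3.missing` naming are hypothetical and chosen only for illustration; H2O's internal encoding and column names may differ.

```r
# Hypothetical helper (not part of H2O): expand one factor column into
# 0/1 indicator columns, with NA treated as its own level.
one_hot <- function(df, col) {
  # addNA() with ifany = FALSE always appends NA as the last level
  f <- addNA(factor(df[[col]]), ifany = FALSE)
  lev <- levels(f)
  lev[is.na(lev)] <- "missing"
  # One integer indicator vector per level
  indicators <- lapply(seq_along(lev), function(i) as.integer(as.integer(f) == i))
  names(indicators) <- paste(col, lev, sep = ".")
  # Drop the original factor column and append the indicators
  cbind(df[setdiff(names(df), col)], as.data.frame(indicators))
}

raw <- data.frame(x1 = c(0.5, -1.2, 0.3), x3 = factor(c("a", "b", NA)))
encoded <- one_hot(raw, "x3")

names(encoded)                        # "x1" "x3.a" "x3.b" "x3.missing"
identical(names(raw), names(encoded)) # FALSE
```

If the checkpointed model records the expanded names while the new training frame still carries the raw `x3` column, a strict comparison of column names fails in the way the error message in this report describes.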

MRE:

{code:R}
library(h2o)
library(tibble)

set.seed(42)
N <- 10000

# Create dummy data
training_data <- tibble(
  x1 = rnorm(N),
  x2 = rnorm(N)^2,
  x3 = sample(c(1, 2, 3, 4), N, replace = TRUE),
  y = x1 - x2 + sign(x1 + x2) * x3 / 5 + rnorm(N)
)

# Set factor variable
training_data[, "x3"] <- factor(training_data[["x3"]], labels = c("a", "b", "c", "d"))

# Introduce NAs to 1% of predictors
training_data[sample(1:N, N/100), "x1"] <- NA
training_data[sample(1:N, N/100), "x2"] <- NA
training_data[sample(1:N, N/100), "x3"] <- NA

# Split for separate training
first_train_set <- training_data[1:(N * 0.8), ]
second_train_set <- training_data[(N * 0.8 + 1):N, ]

# H2O
h2o.init()

first_train_hex <- as.h2o(first_train_set)
second_train_hex <- as.h2o(second_train_set)

first_model <- h2o.gbm(
  x = c("x1", "x2", "x3"),
  y = "y",
  training_frame = first_train_hex,
  ntrees = 20,
  categorical_encoding = "OneHotExplicit",
  seed = 42
)

# Continue training from the first model on the second data set
second_model <- h2o.gbm(
  checkpoint = first_model@model_id,
  x = c("x1", "x2", "x3"),
  y = "y",
  training_frame = second_train_hex,
  ntrees = 100,
  seed = 24
)
{code}

Gives the error:

{code:R}
ERROR: Unexpected HTTP Status code: 400 Bad Request (url = http://localhost:54321/3/ModelBuilders/gbm)

java.lang.IllegalArgumentException
[1] "java.lang.IllegalArgumentException: The columns of the training data must be the same as for the checkpointed model"
[2] " hex.tree.SharedTree.init(SharedTree.java:130)"
[3] " hex.tree.gbm.GBM.init(GBM.java:57)"
[4] " water.api.ModelBuilderHandler.handle(ModelBuilderHandler.java:60)"
[5] " water.api.ModelBuilderHandler.handle(ModelBuilderHandler.java:17)"
[6] " water.api.RequestServer.serve(RequestServer.java:451)"
[7] " water.api.RequestServer.doGeneric(RequestServer.java:296)"
[8] " water.api.RequestServer.doPost(RequestServer.java:222)"
[9] " javax.servlet.http.HttpServlet.service(HttpServlet.java:755)"
[10] " javax.servlet.http.HttpServlet.service(HttpServlet.java:848)"
[11] " org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)"
[12] " org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)"
[13] " org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)"
[14] " org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:429)"
[15] " org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)"
[16] " org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)"
[17] " org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)"
[18] " org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)"
[19] " water.JettyHTTPD$LoginHandler.handle(JettyHTTPD.java:197)"
[20] " org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)"
[21] " org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)"
[22] " org.eclipse.jetty.server.Server.handle(Server.java:370)"
[23] " org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)"
[24] " org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)"
[25] " org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)"
[26] " org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)"
[27] " org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)"
[28] " org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)"
[29] " org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)"
[30] " org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)"
[31] " org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)"
[32] " org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)"
[33] " java.lang.Thread.run(Thread.java:745)"

Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :

ERROR MESSAGE:

The columns of the training data must be the same as for the checkpointed model
{code}

Version info

{code:R}

sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] tibble_1.4.1 h2o_3.20.0.2 RevoUtils_10.0.8 RevoUtilsMath_10.0.1

loaded via a namespace (and not attached):
[1] compiler_3.4.3 tools_3.4.3 pillar_1.0.1 RCurl_1.95-4.9 yaml_2.1.16 jsonlite_1.5 rlang_0.1.6 bitops_1.0-6

# H2O
h2o.init()

H2O is not running yet, starting it now...

Note: In case of errors look at the following log files:
    C:\Users\JHARTS~1\AppData\Local\Temp\2\RtmpOEbfqU/h2o_JHartshorn_started_from_r.out
    C:\Users\JHARTS~1\AppData\Local\Temp\2\RtmpOEbfqU/h2o_JHartshorn_started_from_r.err

java version "1.8.0_74"
Java(TM) SE Runtime Environment (build 1.8.0_74-b02)
Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)

Starting H2O JVM and connecting: Connection successful!

R is connected to the H2O cluster:
    H2O cluster uptime: 4 seconds 617 milliseconds
    H2O cluster timezone: Europe/London
    H2O data parsing timezone: UTC
    H2O cluster version: 3.20.0.2
    H2O cluster version age: 2 months and 5 days
    H2O cluster name: H2O_started_from_R_JHartshorn_xbu206
    H2O cluster total nodes: 1
    H2O cluster total memory: 26.67 GB
    H2O cluster total cores: 16
    H2O cluster allowed cores: 16
    H2O cluster healthy: TRUE
    H2O Connection ip: localhost
    H2O Connection port: 54321
    H2O Connection proxy: NA
    H2O Internal Security: FALSE
    H2O API Extensions: Algos, AutoML, Core V3, Core V4
    R Version: R version 3.4.3 (2017-11-30)
{code}

exalate-issue-sync[bot] commented 1 year ago

Joe Hart commented: It's worth noting that setting the categorical_encoding parameter in the second call to h2o.gbm() does not help.

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5846
Assignee: New H2O Bugs
Reporter: Joe Hart
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A