h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Apache License 2.0

XGBoost throws AssertionError when categorical_encoding is set to one_hot_explicit #9321

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

If the train and validation datasets contain different categorical levels and a user sets categorical_encoding to "one_hot_explicit" in XGBoost, training fails with an error: java.lang.AssertionError. The assertion that gets violated is here:

{code}
assert (!expensive || _valid==null || Arrays.equals(_train._names, _valid._names) || _parms._categorical_encoding == Model.Parameters.CategoricalEncodingScheme.Binary);
{code}
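One plausible reading of this assertion is that explicit one-hot encoding expands each categorical column into one column per level, so a level present in only one frame leaves the two frames with different column names. The sketch below is a toy model of that idea in plain Python, not H2O's implementation; the function `one_hot_names` and the `column.level` naming scheme are assumptions made for illustration.

```python
def one_hot_names(columns, levels_by_column):
    """Expand each categorical column into one name per level.

    Hypothetical naming scheme ("column.level"); numeric columns,
    which have no entry in levels_by_column, keep their name.
    """
    names = []
    for col in columns:
        levels = levels_by_column.get(col)
        if levels is None:
            names.append(col)
        else:
            names.extend("{}.{}".format(col, lvl) for lvl in sorted(levels))
    return names


columns = ["name", "cylinders"]

# The train frame has an extra level ("lauren") that the validation frame lacks.
train_names = one_hot_names(columns, {"name": {"lauren", "ford pinto"}})
valid_names = one_hot_names(columns, {"name": {"ford pinto"}})

# Analogue of Arrays.equals(_train._names, _valid._names) in the assertion.
names_match = train_names == valid_names

print(train_names)
print(valid_names)
print(names_match)
```

In this toy model the name lists differ as soon as one frame carries a level the other lacks, which is the condition the assertion rejects for every encoding except Binary.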

Code to reproduce the issue can be found here:
{code}
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()

cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")

# convert response column to a factor
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()

# set the predictor names and the response column name
predictors = cars.columns
response = "economy_20mpg"
cars.impute()

# split into train and validation sets
hf_train, hf_test = cars.split_frame(ratios = [.8], seed = 1234)

# create a new level in the train frame but not the validation frame
hf_train['name'] = hf_train['name'].ascharacter()
hf_train[1:30, 'name'] = 'lauren'
hf_train['name'] = hf_train['name'].asfactor()

param = {
    "ntrees" : 500
    , "max_depth" : 10
    , "learn_rate" : 0.1
    , "sample_rate" : 1.0
    , "col_sample_rate_per_tree" : 1.0
    , "min_rows" : 5
    , "seed": 4241
    , "score_tree_interval": 100
    , "categorical_encoding": "one_hot_explicit"
}
from h2o.estimators import H2OXGBoostEstimator
model = H2OXGBoostEstimator(**param)
model.train(x = predictors, y = response, training_frame = hf_train, validation_frame = hf_test)
{code}

Note that if you remove the line "categorical_encoding": "one_hot_explicit" and use the default encoding, XGBoost runs just fine.
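Until the assertion is addressed, it may help to check whether the two frames disagree on categorical levels before training with "one_hot_explicit". The following is a minimal sketch in plain Python that compares per-column level lists; the helper `level_mismatches` and the sample level values are hypothetical. With the h2o Python client, the level lists could presumably be collected from each factor column via `H2OFrame.levels()`, but that wiring is left out here and is an assumption.

```python
def level_mismatches(train_levels, valid_levels):
    """Return columns whose categorical levels differ between two frames.

    Both arguments map a column name to its list of levels.
    """
    report = {}
    for col in sorted(set(train_levels) | set(valid_levels)):
        t = set(train_levels.get(col, ()))
        v = set(valid_levels.get(col, ()))
        if t != v:
            report[col] = {
                "train_only": sorted(t - v),
                "valid_only": sorted(v - t),
            }
    return report


# Illustrative level lists only; not taken from the actual cars dataset split.
train_levels = {"name": ["ford pinto", "lauren"], "economy_20mpg": ["0", "1"]}
valid_levels = {"name": ["ford pinto", "amc gremlin"], "economy_20mpg": ["0", "1"]}

mismatches = level_mismatches(train_levels, valid_levels)
print(mismatches)
```

A non-empty result flags the columns that are likely to trigger the failure described above, so the user can fall back to the default encoding for that run.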

The stack trace:
{code}
xgboost Model Build progress: | (failed)

OSError Traceback (most recent call last)
in ()
     36 from h2o.estimators import H2OXGBoostEstimator
     37 model = H2OXGBoostEstimator(**param)
---> 38 model.train(x = predictors, y = response, training_frame = hf_train, validation_frame = hf_test)

/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/h2o/estimators/estimator_base.py in train(self, x, y, training_frame, offset_column, fold_column, weights_column, validation_frame, max_runtime_secs, ignored_columns, model_id, verbose)
    233 return
    234
--> 235 model.poll(verbose_model_scoring_history=verbose)
    236 model_json = h2o.api("GET /%d/Models/%s" % (rest_ver, model.dest_key))["models"][0]
    237 self._resolve_model(model.dest_key, model_json)

/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/h2o/job.py in poll(self, verbose_model_scoring_history)
     75 if (isinstance(self.job, dict)) and ("stacktrace" in list(self.job)):
     76 raise EnvironmentError("Job with key {} failed with an exception: {}\nstacktrace: "
---> 77 "\n{}".format(self.job_key, self.exception, self.job["stacktrace"]))
     78 else:
     79 raise EnvironmentError("Job with key %s failed with an exception: %s" % (self.job_key, self.exception))

OSError: Job with key $03017f00000132d4ffffffff$_9d062bb3031a6c01df3ec1ac581847d8 failed with an exception: java.lang.AssertionError
stacktrace:
java.lang.AssertionError
    at hex.ModelBuilder.init(ModelBuilder.java:1117)
    at hex.tree.xgboost.XGBoost.init(XGBoost.java:76)
    at hex.tree.xgboost.XGBoost$XGBoostDriver.computeImpl(XGBoost.java:245)
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:214)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1269)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
{code}
This was found in H2O-3 version

exalate-issue-sync[bot] commented 1 year ago

Nidhi Mehta commented: #93805 (https://support.h2o.ai/a/tickets/93805) - Re: H2O proxy issue

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-6299
Assignee: Michal Kurka
Reporter: Lauren DiPerna
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A