h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

XGBoost throws AssertionError when categorical_encoding is set to one_hot_explicit #9321

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

If the train and validation datasets contain different categorical levels and a user sets categorical_encoding to "one_hot_explicit" in XGBoost, they will see an ERRR: java.lang.AssertionError. The assertion that gets violated is:

{code} assert (!expensive || _valid==null || Arrays.equals(_train._names, _valid._names) || _parms._categorical_encoding == Model.Parameters.CategoricalEncodingScheme.Binary); {code}
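To illustrate why the name comparison in that assertion can fail, here is a minimal pure-Python sketch. It assumes that explicit one-hot encoding expands each categorical column into one indicator column per level; the `one_hot_names` helper, the `column.level` naming scheme, and the sample levels are hypothetical and are not H2O's actual implementation.

```python
def one_hot_names(column, levels):
    """Return one indicator-column name per level (hypothetical naming scheme)."""
    return [f"{column}.{level}" for level in sorted(levels)]


# Hypothetical levels: "lauren" exists only in the training frame.
train_levels = {"name": {"amc gremlin", "buick skylark", "lauren"}}
valid_levels = {"name": {"amc gremlin", "buick skylark"}}

train_names = [n for col, lv in train_levels.items() for n in one_hot_names(col, lv)]
valid_names = [n for col, lv in valid_levels.items() for n in one_hot_names(col, lv)]

# Analogous to Arrays.equals(_train._names, _valid._names) in the assertion.
names_match = train_names == valid_names

# Indicator columns present in train but missing from validation.
extra_in_train = sorted(set(train_names) - set(valid_names))
```

Under this assumption the two expanded frames end up with different column names, which is the condition the assertion rejects for every encoding other than Binary.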

Code to reproduce the issue:
{code}
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()

cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")

# convert response column to a factor
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()

# set the predictor names and the response column name
predictors = cars.columns
response = "economy_20mpg"
cars.impute()

# split into train and validation sets
hf_train, hf_test = cars.split_frame(ratios = [.8], seed = 1234)

# create a new level in the train frame but not the validation frame
hf_train['name'] = hf_train['name'].ascharacter()
hf_train[1:30, 'name'] = ('lauren' )
hf_train['name'] = hf_train['name'].asfactor()

param = {
      "ntrees" : 500
    , "max_depth" : 10
    , "learn_rate" : 0.1
    , "sample_rate" : 1.0
    , "col_sample_rate_per_tree" : 1.0
    , "min_rows" : 5
    , "seed": 4241
    , "score_tree_interval": 100
    , "categorical_encoding": "one_hot_explicit"
}
from h2o.estimators import H2OXGBoostEstimator
model = H2OXGBoostEstimator(**param)
model.train(x = predictors, y = response, training_frame = hf_train, validation_frame = hf_test)
{code}

Note that if you remove the line "categorical_encoding": "one_hot_explicit" and use the default encoding, XGBoost runs just fine.
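Before choosing an encoding, it may help to check which categorical columns have levels that differ between the two frames. The sketch below is a hypothetical helper, not part of the H2O API: it takes plain dictionaries mapping column names to level lists and reports the columns whose levels disagree.

```python
def mismatched_levels(train_levels, valid_levels):
    """Map each shared column to (levels only in train, levels only in valid).

    Columns whose level sets agree are omitted from the result.
    """
    report = {}
    for col in train_levels.keys() & valid_levels.keys():
        only_train = sorted(set(train_levels[col]) - set(valid_levels[col]))
        only_valid = sorted(set(valid_levels[col]) - set(train_levels[col]))
        if only_train or only_valid:
            report[col] = (only_train, only_valid)
    return report


# Hypothetical level lists; with H2O frames they could presumably be collected
# per column via something like hf_train[col].levels()[0] (an assumption here,
# not verified against this H2O version).
example = mismatched_levels(
    {"name": ["amc gremlin", "lauren"], "cylinders": ["4", "6", "8"]},
    {"name": ["amc gremlin"], "cylinders": ["4", "6", "8"]},
)
```

A non-empty report for the reproduction above would point at the `name` column, where the `lauren` level exists only in the training frame.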

The stack trace:
{code}
xgboost Model Build progress: | (failed)

OSError Traceback (most recent call last)

in ()
     36 from h2o.estimators import H2OXGBoostEstimator
     37 model = H2OXGBoostEstimator(**param)
---> 38 model.train(x = predictors, y = response, training_frame = hf_train, validation_frame = hf_test)

/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/h2o/estimators/estimator_base.py in train(self, x, y, training_frame, offset_column, fold_column, weights_column, validation_frame, max_runtime_secs, ignored_columns, model_id, verbose)
    233 return
    234
--> 235 model.poll(verbose_model_scoring_history=verbose)
    236 model_json = h2o.api("GET /%d/Models/%s" % (rest_ver, model.dest_key))["models"][0]
    237 self._resolve_model(model.dest_key, model_json)

/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/h2o/job.py in poll(self, verbose_model_scoring_history)
     75 if (isinstance(self.job, dict)) and ("stacktrace" in list(self.job)):
     76 raise EnvironmentError("Job with key {} failed with an exception: {}\nstacktrace: "
---> 77 "\n{}".format(self.job_key, self.exception, self.job["stacktrace"]))
     78 else:
     79 raise EnvironmentError("Job with key %s failed with an exception: %s" % (self.job_key, self.exception))

OSError: Job with key $03017f00000132d4ffffffff$_9d062bb3031a6c01df3ec1ac581847d8 failed with an exception: java.lang.AssertionError
stacktrace:
java.lang.AssertionError
    at hex.ModelBuilder.init(ModelBuilder.java:1117)
    at hex.tree.xgboost.XGBoost.init(XGBoost.java:76)
    at hex.tree.xgboost.XGBoost$XGBoostDriver.computeImpl(XGBoost.java:245)
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:214)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1269)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
{code}

This was found in H2O-3 version 3.20.0.8.

exalate-issue-sync[bot] commented 1 year ago

Nidhi Mehta commented: #93805 (https://support.h2o.ai/a/tickets/93805) - Re: H2O proxy issue

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-6299
Assignee: Michal Kurka
Reporter: Lauren DiPerna
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A