h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

H2OXGBoostEstimator always predicts same value #15445

Closed exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Hi,

the xgboost classifier (via h2o) always predicts the same value on my machine.

I filed this issue on H2O Google Groups and had an initial conversation with Lauren there. I was asked to open a Jira ticket, which you find here, with attached log files. A description of the issue and the previous conversations can be found attached.

Best Chris

------------------------------------------------------------------------------------------------------

I ran the example code from the link below to train an xgboost classifier via h2o. https://blog.h2o.ai/2017/06/xgboost-in-h2o-machine-learning-platform/

H2O version: 3.20.0.7
Python version: 3.5.5
GPU: NVIDIA TESLA V100

The issue is that the resulting predictions are exactly equal to 0.5 for both classes, across all examples.

The issue also occurs when I train on CPU, on other machines (same h2o version, CPU training), and on other data science problems (apart from the higgs data set).

Comment 1: When I change the backend to CPU, the training fails with this error:

OSError: Job with key $03017f00000132d4ffffffff$_b2c76a32092d51d7b30833ac413b44f8 failed with an exception: java.lang.IllegalStateException: Cannot perform booster operation: updater is inactive on node /127.0.0.1:54321

It can be resolved by excluding sample_rate and col_sample_rate_per_tree from the params dictionary, thus reverting these parameters to their defaults. Still, the predictions for both classes are all equal to 0.5.
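For reference, dropping those two parameters can be sketched by filtering them out of the original dictionary (the name param_fixed is illustrative, not from the original report):

```python
# Original parameter dictionary from the reproduction script.
param = {
    "ntrees": 100, "max_depth": 10, "learn_rate": 0.02,
    "sample_rate": 0.7, "col_sample_rate_per_tree": 0.9,
    "min_rows": 5, "seed": 4241, "score_tree_interval": 100,
}

# Drop the two sampling parameters so they fall back to their defaults.
param_fixed = {k: v for k, v in param.items()
               if k not in ("sample_rate", "col_sample_rate_per_tree")}
```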

Comment 2: When I change the learning algorithm to H2OGradientBoostingEstimator (or a different, suitable H2O native algo), the predictions are reasonable.

Can you please advise?

Thanks, Chris

{code:python}
import h2o
h2o.init()
{code}

{noformat}
wget https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/higgs_train_imbalance_100k.csv
wget https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/higgs_test_imbalance_100k.csv
{noformat}

Or use full data:

{noformat}
wget https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/higgs_head_2M.csv
{noformat}

{code:python}
train_path = '/home/chris/github_artellium/higgs_train_imbalance_100k.csv'
test_path = '/home/chris/github_artellium/higgs_test_imbalance_100k.csv'

df_train = h2o.import_file(train_path)
df_test = h2o.import_file(test_path)

# Transform first feature into categorical feature
df_train[0] = df_train[0].asfactor()
df_test[0] = df_test[0].asfactor()

param = {
    "ntrees": 100,
    "max_depth": 10,
    "learn_rate": 0.02,
    "sample_rate": 0.7,
    "col_sample_rate_per_tree": 0.9,
    "min_rows": 5,
    "seed": 4241,
    "score_tree_interval": 100
}

from h2o.estimators import H2OXGBoostEstimator
model = H2OXGBoostEstimator(**param)
model.train(x=list(range(1, df_train.shape[1])), y=0, training_frame=df_train)

prediction = model.predict(df_test)

pred = prediction.as_data_frame()
pred.describe()
{code}


Hi Chris,

I ran your example with python 3.5.1 and H2O 3.20.0.7 and wasn't able to reproduce your results. Could you post your output?

Here is what I got, when running your code:

prediction

predict        p0         p1
      0  0.931853  0.0681466
      0  0.930701  0.0692988
      0  0.931211  0.0687895
      0  0.931097  0.0689028
      0  0.931698  0.0683018
      0  0.931397  0.0686025
      0  0.932725  0.0672752
      0  0.931293  0.0687067
      0  0.929148  0.0708518
      0  0.931588  0.0684123


Hi Lauren,

thank you very much for investigating. Interesting that you get different results - your predictions definitely look more reasonable. The model output I get is attached at the bottom.

Also, I have retrieved the logs with h2o.download_all_logs() as plain text. Can you please provide an e-mail address to which I can send them? I cannot attach files here (at least I do not see an attach-files button) and they are too large to post in a message. Note that the warn, error, and fatal logs are empty, though.

Best Chris

OUTPUTS

pred.describe()
Out[3]:
        predict        p0        p1
count  100000.0  100000.0  100000.0
mean        1.0       0.5       0.5
std         0.0       0.0       0.0
min         1.0       0.5       0.5
25%         1.0       0.5       0.5
50%         1.0       0.5       0.5
75%         1.0       0.5       0.5
max         1.0       0.5       0.5

OR

In [5]: pred
Out[5]:
    predict   p0   p1
0         1  0.5  0.5
1         1  0.5  0.5
2         1  0.5  0.5
3         1  0.5  0.5
4         1  0.5  0.5
5         1  0.5  0.5
6         1  0.5  0.5
7         1  0.5  0.5
8         1  0.5  0.5
9         1  0.5  0.5
10        1  0.5  0.5
11        1  0.5  0.5
12        1  0.5  0.5
13        1  0.5  0.5
14        1  0.5  0.5
15        1  0.5  0.5
16        1  0.5  0.5
17        1  0.5  0.5
18        1  0.5  0.5
19        1  0.5  0.5
20        1  0.5  0.5
21        1  0.5  0.5
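Degenerate output like the above can be flagged programmatically; a minimal sketch using pandas (the helper name and tolerance are illustrative, not part of H2O's API):

```python
import pandas as pd

def is_degenerate(pred: pd.DataFrame, tol: float = 1e-12) -> bool:
    """True when both class-probability columns are constant,
    i.e. the model predicts the same value for every row."""
    return all(pred[c].std() <= tol for c in ("p0", "p1"))

# Tiny frames mirroring the broken output above vs. Lauren's healthy run:
broken = pd.DataFrame({"predict": [1, 1, 1],
                       "p0": [0.5] * 3, "p1": [0.5] * 3})
healthy = pd.DataFrame({"predict": [0, 0, 0],
                        "p0": [0.931853, 0.930701, 0.931211],
                        "p1": [0.0681466, 0.0692988, 0.0687895]})
```

A check like this in the training script would have caught the constant-0.5 predictions immediately after model.predict().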


exalate-issue-sync[bot] commented 1 year ago

Lauren DiPerna commented: [~accountid:5babf5102cbf0669ce91834f] I was not able to open your log files, would it be possible for you to provide the zipped logs you get when you run h2o.download_all_logs(dirname=u'.', filename=None) : http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/h2o.html?#h2o.download_all_logs

Thanks!

exalate-issue-sync[bot] commented 1 year ago

Chris A. commented: Hi Lauren, I added the logs again. I think the issue was just that the filename extension was missing (.zip). Now, it should work. Best Chris

exalate-issue-sync[bot] commented 1 year ago

Lauren DiPerna commented: additional information:

"In case this helps, I've got the same error message (Cannot perform booster operation: updater is inactive on node /127.0.0.1:54321) when training a xgboost model. This started occurring after updating to macOS Mojave. Didn't have it before. "

from marten on original stream post.

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: [~accountid:5a32df017dcf343865c26fa5], please try to reproduce on Tesla V100

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: FYI from the logs:

09-28 18:28:58.764 127.0.0.1:54321 29331 FJ-1-25 INFO: Using GPU backend (gpu_id: 0).
09-28 18:28:58.764 127.0.0.1:54321 29331 FJ-1-25 INFO: Using grow_gpu_hist (approximate) updater.

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: Result on Tesla, latest master.

{code:python}

from h2o.estimators import H2OXGBoostEstimator
model = H2OXGBoostEstimator(**param)
model.train(x = list(range(1, df_train.shape[1])), y = 0, training_frame = df_train, validation_frame = df_valid)
xgboost Model Build progress: |███████████████████████████████████████████████████████████████| 100%
prediction = model.predict(df_valid)[:,2]
xgboost prediction progress: |████████████████████████████████████████████████████████████████| 100%
prediction

p1
0.0677212
0.0694259
0.0690649
0.0711024
0.0697321
0.0700099
0.0672352
0.0693949
0.0717103
0.0679422

[100000 rows x 1 column]

{code}

Proof the GPU was used:

{code:java}
11-19 12:08:24.276 172.31.31.201:54321 5813 #81257-16 INFO: GET /3/Models/XGBoost_model_python_1542628365898_1, parms: {}
11-19 12:08:32.840 172.31.31.201:54321 5813 #81257-17 INFO: POST /4/Predictions/models/XGBoost_model_python_1542628365898_1/frames/py_1_sid_ba5e, parms: {}
11-19 12:08:32.843 172.31.31.201:54321 5813 FJ-1-7 INFO: Using GPU backend (gpu_id: 0).
11-19 12:08:32.843 172.31.31.201:54321 5813 FJ-1-7 INFO: Using grow_gpu_hist (approximate) updater.
11-19 12:08:32.843 172.31.31.201:54321 5813 FJ-1-7 INFO: XGBoost Parameters:
11-19 12:08:32.843 172.31.31.201:54321 5813 FJ-1-7 INFO: colsample_bytree = 0.9
11-19 12:08:32.843 172.31.31.201:54321 5813 FJ-1-7 INFO: silent = true
11-19 12:08:32.843 172.31.31.201:54321 5813 FJ-1-7 INFO: tree_method = exact
11-19 12:08:32.844 172.31.31.201:54321 5813 FJ-1-7 INFO: seed = 4241
11-19 12:08:32.844 172.31.31.201:54321 5813 FJ-1-7 INFO: max_depth = 10
11-19 12:08:32.844 172.31.31.201:54321 5813 FJ-1-7 INFO: gpu_id = 0
11-19 12:08:32.844 172.31.31.201:54321 5813 FJ-1-7 INFO: max_bins = 256
11-19 12:08:32.844 172.31.31.201:54321 5813 FJ-1-7 INFO: booster = gbtree
11-19 12:08:32.844 172.31.31.201:54321 5813 FJ-1-7 INFO: nround = 100
11-19 12:08:32.844 172.31.31.201:54321 5813 FJ-1-7 INFO: updater = grow_gpu_hist
11-19 12:08:32.844 172.31.31.201:54321 5813 FJ-1-7 INFO: objective = binary:logistic
11-19 12:08:32.844 172.31.31.201:54321 5813 FJ-1-7 INFO: lambda = 0.0
11-19 12:08:32.844 172.31.31.201:54321 5813 FJ-1-7 INFO: eta = 0.02
11-19 12:08:32.844 172.31.31.201:54321 5813 FJ-1-7 INFO: grow_policy = depthwise
11-19 12:08:32.844 172.31.31.201:54321 5813 FJ-1-7 INFO: nthread = 8
11-19 12:08:32.844 172.31.31.201:54321 5813 FJ-1-7 INFO: alpha = 0.0
11-19 12:08:32.844 172.31.31.201:54321 5813 FJ-1-7 INFO: subsample = 0.7
11-19 12:08:32.844 172.31.31.201:54321 5813 FJ-1-7 INFO: colsample_bylevel = 1.0
11-19 12:08:32.844 172.31.31.201:54321 5813 FJ-1-7 INFO: max_delta_step = 0.0
11-19 12:08:32.844 172.31.31.201:54321 5813 FJ-1-7 INFO: min_child_weight = 5.0
11-19 12:08:32.844 172.31.31.201:54321 5813 FJ-1-7 INFO: gamma = 0.0
{code}

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: Tried to reproduce on 3.20.0.7.

Clean build from the jenkins-3.20.0.7 tag. Did a clean installation of the Python client as well (see logs) to match the release version.

{code:python}
ubuntu@ip-172-31-31-201:~/h2o-3$ python3
Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import h2o
versionFromGradle='3.20.0',projectVersion='3.20.0.99999',branch='(HEAD detached at jenkins-3.20.0.7)',lastCommitHash='fd010dcc75d22268976d755bcbfa3d119c99d88a',gitDescribe='jenkins-3.20.0.7',compiledOn='2018-11-19 12:16:34',compiledBy='ubuntu'
>>> h2o.init( ... )
Checking whether there is an H2O instance running at http://localhost:54321. connected.
versionFromGradle='3.20.0',projectVersion='3.20.0.99999',branch='(HEAD detached at jenkins-3.20.0.7)',lastCommitHash='fd010dcc75d22268976d755bcbfa3d119c99d88a',gitDescribe='jenkins-3.20.0.7',compiledOn='2018-11-19 12:16:34',compiledBy='ubuntu'

H2O cluster uptime:         40 secs
H2O cluster timezone:       Etc/UTC
H2O data parsing timezone:  UTC
H2O cluster version:        3.20.0.99999
H2O cluster version age:    22 minutes
H2O cluster name:           ubuntu
H2O cluster total nodes:    1
H2O cluster free memory:    13.33 Gb
H2O cluster total cores:    8
H2O cluster allowed cores:  8
H2O cluster status:         accepting new members, healthy
H2O connection url:         http://localhost:54321
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4
Python version:             3.5.2 final

>>> train_path = 'higgs_train_imbalance_100k.csv'
>>> test_path = 'higgs_test_imbalance_100k.csv'
>>> df_train = h2o.import_file(train_path)
Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
>>> df_valid = h2o.import_file(test_path)
Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
>>> df_train[0] = df_train[0].asfactor()
>>> df_valid[0] = df_valid[0].asfactor()
>>> param = {
...     "ntrees" : 100
...     , "max_depth" : 10
...     , "learn_rate" : 0.02
...     , "sample_rate" : 0.7
...     , "col_sample_rate_per_tree" : 0.9
...     , "min_rows" : 5
...     , "seed": 4241
...     , "score_tree_interval": 100
... }
>>> from h2o.estimators import H2OXGBoostEstimator
>>> model = H2OXGBoostEstimator(**param)
>>> model.train(x = list(range(1, df_train.shape[1])), y = 0, training_frame = df_train, validation_frame = df_valid)
xgboost Model Build progress: |███████████████████████████████████████████████████████████████| 100%
>>> prediction = model.predict(df_valid)[:,2]
xgboost prediction progress: |████████████████████████████████████████████████████████████████| 100%
>>> prediction

p1
0.0677212
0.0694259
0.0690649
0.0711024
0.0697321
0.0700099
0.0672352
0.0693949
0.0717103
0.0679422

[100000 rows x 1 column]

{code}

Again, a snippet from the logs to prove GPU has been used:

{code:java}
11-19 12:48:59.454 172.31.31.201:54321 7717 #25387-16 INFO: POST /4/Predictions/models/XGBoost_model_python_1542630994458_1/frames/py_2_sid_8c67, parms: {}
11-19 12:48:59.457 172.31.31.201:54321 7717 FJ-1-1 INFO: Using GPU backend (gpu_id: 0).
11-19 12:48:59.457 172.31.31.201:54321 7717 FJ-1-1 INFO: Using grow_gpu_hist (approximate) updater.
{code}

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: Also tried CPU backend on both master & 3.20.0.7 - could not reproduce.

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: Also tried both variants on a cluster of 3. Could not reproduce.

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented:

{code:java}
ubuntu@xxxxxxx:~/h2o-3$ nvidia-smi
Mon Nov 19 13:40:21 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   40C    P0    41W / 300W |    436MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      8988      C   java                                       426MiB   |
+-----------------------------------------------------------------------------+
{code}

I think the problem might simply be a timeout. Do we know how fast the model was built? The interval to wait is 60 seconds. The following piece of code produced the error. It might just be a timeout and/or a local network issue.

{code:java}
@SuppressWarnings("unchecked")
private <T> T invoke(BoosterCallable<T> callable) throws InterruptedException {
  final SynchronousQueue<BoosterCallable<?>> inQ = _in;
  if (inQ == null)
    throw new IllegalStateException("Updater is inactive on node " + H2O.SELF);
  if (! inQ.offer(callable, WORK_START_TIMEOUT_SECS, TimeUnit.SECONDS))
    throw new IllegalStateException("XGBoostUpdater couldn't start work on task " + callable +
        " in " + WORK_START_TIMEOUT_SECS + "s.");
  SynchronousQueue<?> outQ;
  while ((outQ = _out) != null) {
    T result = (T) outQ.poll(INACTIVE_CHECK_INTERVAL_SECS, TimeUnit.SECONDS);
    if (result != null)
      return result;
  }
  throw new IllegalStateException("Cannot perform booster operation: updater is inactive on node " + H2O.SELF);
}
{code}
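The failure mode in that snippet - both queues are nulled out when the updater thread dies, so any later booster call fails with "updater is inactive" - can be sketched in Python (a toy analogue under assumed semantics, not H2O code; all names here are illustrative):

```python
import queue

class InactiveUpdaterError(RuntimeError):
    """Raised when the background updater's queues are gone."""

class ToyUpdater:
    # Mirrors the _in/_out request-response queues from the Java code above.
    def __init__(self):
        self._in = queue.Queue(maxsize=1)
        self._out = queue.Queue(maxsize=1)

    def shutdown(self):
        # When the updater thread exits, both queues become unusable.
        self._in = None
        self._out = None

    def invoke(self, task, poll_timeout=0.01):
        if self._in is None:
            raise InactiveUpdaterError("Updater is inactive on node")
        self._in.put(task)  # hand the task to the (toy) updater
        while self._out is not None:  # re-check liveness on every poll
            try:
                return self._out.get(timeout=poll_timeout)
            except queue.Empty:
                continue  # updater still alive: keep polling
        raise InactiveUpdaterError(
            "Cannot perform booster operation: updater is inactive")
```

invoke returns a queued result while the updater is alive and raises immediately once shutdown() has run, which matches the error message seen in the reported logs.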

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: Hello [~accountid:5babf5102cbf0669ce91834f], we're unable to reproduce your failure.

Due to the nature of the bug, it might be a problem with your local network/security policy; this can be true even for a single-node setup. There are no issues on an unrestricted system with exactly the same GPU, dataset, and script.

Can you please restart your H2O instance or try it somewhere else? Are you trying to run H2O in a protected environment (e.g. a company server with AppArmor and strict networking rules, or a company laptop with restricted privileges)?

Thank you, Pavel

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: Okay, I think we will close with "Cannot reproduce" for now. [~accountid:5babf5102cbf0669ce91834f], if you encounter the issue again, please do not hesitate to re-open it. We are adding additional logging in the 3.22.0.2 release; please use this release once available and attach your H2O logs.

exalate-issue-sync[bot] commented 1 year ago

Sandro Casagrande commented: Hi,

my attempts to use XGBoost inside of H2O also fail for exactly the same reasons reported by Chris A. I used the same code snippets and Higgs train/test sets from the blog post, and installed H2O in fresh conda environments in several different ways on 3 machines running Ubuntu 16.04 and 18.04, all with the same result.

More precisely, I tried setting up h2o with:

{noformat}
conda create --yes -n h2o_testing python=3.7 jupyter
conda activate h2o_testing
pip install h2o
{noformat}

{noformat}
conda create --yes -n h2o_testing python=3.6 anaconda
conda activate h2o_testing
conda config --append channels conda-forge
conda install -c h2oai h2o
{noformat}

{noformat}
sudo apt-get install libomp-dev
conda create --yes -n h2o_testing python=3.6 jupyter
conda activate h2o_testing
pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
{noformat}

I'll add the logs for all 3 attempts to the attachments.

I'm a bit puzzled, since this seems to be the most mainstream way to test & use this. Thanks for looking into it, or for any advice in case I am doing something wrong.

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: Thanks [~accountid:5c33d81edae8a34f82083b30], we'll look into it.

exalate-issue-sync[bot] commented 1 year ago

Andreas commented: Hi everyone,

we are having exactly the same issue.

Thanks!

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: Might have something to do with the interface H2O binds to and Rabit binding to a different one.

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: I inspected the logs; this happens when the boosting iteration fails (XGBoost produces an XGBoostError). Unfortunately, the current handling of the error is not correct - we effectively throw away the reason why boosting failed.

A first step is to fix the error handling.

exalate-issue-sync[bot] commented 1 year ago

Jan Sterba commented: Hi [~accountid:5babf5102cbf0669ce91834f], [~accountid:5c6b4c65efc9686293aba6c1], we have added some additional logging and I would like to ask you to run your test again and send us the logs. Please use the latest nightly: http://h2o-release.s3.amazonaws.com/h2o/master/latest.html Thank you

exalate-issue-sync[bot] commented 1 year ago

Jan Sterba commented: I have narrowed the issue down to some non-deterministic behaviour in the native predict code. The workaround for now is to run h2o with the Java-based predict implementation; this can be done by setting the system property sys.ai.h2o.xgboost.predict.java.enable to true:

{code}
java -Dsys.ai.h2o.xgboost.predict.java.enable=true ... -jar h2o.jar
{code}


hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5937
Assignee: Jan Sterba
Reporter: Chris A.
State: Resolved
Fix Version: 3.24.0.4
Attachments: Available (Count: 7)
Development PRs: Available

Linked PRs from JIRA

https://github.com/h2oai/h2o-3/pull/3456
https://github.com/h2oai/h2o-3/pull/3457
https://github.com/h2oai/h2o-3/pull/3510
https://github.com/h2oai/h2o-3/pull/3059
https://github.com/h2oai/ml-benchmark/pull/40

Attachments From Jira

Attachment Name: h2ologs_20180928_062924.zip Attached By: Chris A. File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5937/h2ologs_20180928_062924.zip

Attachment Name: h2ologs_20190107_103253.zip Attached By: Sandro Casagrande File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5937/h2ologs_20190107_103253.zip

Attachment Name: h2ologs_20190107_111615.zip Attached By: Sandro Casagrande File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5937/h2ologs_20190107_111615.zip

Attachment Name: h2ologs_20190107_114358.zip Attached By: Sandro Casagrande File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5937/h2ologs_20190107_114358.zip

Attachment Name: test_xgboost_fails_20190107_103253.html Attached By: Sandro Casagrande File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5937/test_xgboost_fails_20190107_103253.html

Attachment Name: test_xgboost_fails_20190107_111615.html Attached By: Sandro Casagrande File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5937/test_xgboost_fails_20190107_111615.html

Attachment Name: test_xgboost_fails_20190107_114358.html Attached By: Sandro Casagrande File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5937/test_xgboost_fails_20190107_114358.html