h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0
6.89k stars 2k forks

Connection reset by peer error when running model with > 100000 data rows #8878

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

The Python script with H2O AutoML runs fine with fewer than 100000 rows of data on my Intel Xeon processor running CentOS 7. The script handled large data fine until July 18, 2019, when H2O AutoML started crashing whenever the training data exceeded 100000 rows, with this message:

{code:python}


/usr/pic1/venv/mlflow_ex/lib/python2.7/site-packages/mlflow/utils/environment.py:26: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  env = yaml.load(_conda_header)
Starting at 2019-08-02T00:00:03.950132-07:00
Starting Execution: just now
Execution Done: in seconds
Starting fetch process: in seconds
Finished fetching: in seconds
Closed connection in seconds
Received 150420 items! Converting to feature list...
Conversion completed just now
Setting up training and test datasets... Finished just now
Initializing H2O...
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_212"; OpenJDK Runtime Environment (build 1.8.0_212-b04); OpenJDK 64-Bit Server VM (build 25.212-b04, mixed mode)
  Starting server from /usr/pic1/venv/mlflow_ex/lib/python2.7/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmps1Fz6h
  JVM stdout: /tmp/tmps1Fz6h/h2o_mjules_started_from_python.out
  JVM stderr: /tmp/tmps1Fz6h/h2o_mjules_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O cluster uptime:         01 secs
H2O cluster timezone:       America/Los_Angeles
H2O data parsing timezone:  UTC
H2O cluster version:        3.26.0.1
H2O cluster version age:    17 days
H2O cluster name:           H2O_from_python_mjules_drsbuz
H2O cluster total nodes:    1
H2O cluster free memory:    35.56 Gb
H2O cluster total cores:    28
H2O cluster allowed cores:  28
H2O cluster status:         accepting new members, healthy
H2O connection url:         http://127.0.0.1:54321
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4
Python version:             2.7.5 final


Done... Running training...
Parse progress: [#########################################################] 100%
Parse progress: [#########################################################] 100%
Using working directory: /tmp/tmp08LbOA
AutoML progress: [#
Traceback (most recent call last):
  File "/usr/pic1/repos/ml-job-prediction-models/python_test/ml-job-predictions.py", line 399, in <module>
    main()
  File "/usr/pic1/repos/ml-job-prediction-models/python_test/ml-job-predictions.py", line 281, in main
    aml.train(x=x, y=y, training_frame=train, leaderboard_frame=test)
  File "/usr/pic1/venv/mlflow_ex/lib/python2.7/site-packages/h2o/automl/autoh2o.py", line 445, in train
    self._job.poll(poll_updates=poll_updates)
  File "/usr/pic1/venv/mlflow_ex/lib/python2.7/site-packages/h2o/job.py", line 57, in poll
    pb.execute(self._refresh_job_status, print_verbose_info=ft.partial(poll_updates, self))
  File "/usr/pic1/venv/mlflow_ex/lib/python2.7/site-packages/h2o/utils/progressbar.py", line 171, in execute
    res = progress_fn()  # may raise StopIteration
  File "/usr/pic1/venv/mlflow_ex/lib/python2.7/site-packages/h2o/job.py", line 94, in _refresh_job_status
    jobs = h2o.api("GET /3/Jobs/%s" % self.job_key)
  File "/usr/pic1/venv/mlflow_ex/lib/python2.7/site-packages/h2o/h2o.py", line 104, in api
    return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
  File "/usr/pic1/venv/mlflow_ex/lib/python2.7/site-packages/h2o/backend/connection.py", line 415, in request
    raise H2OConnectionError("Unexpected HTTP error: %s" % e)
h2o.exceptions.H2OConnectionError: Unexpected HTTP error: ('Connection aborted.', error(104, 'Connection reset by peer'))

{code}

exalate-issue-sync[bot] commented 1 year ago

Michael Jules commented: Additional Information: The JVM error output is:

{code}
terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'
  what():  std::bad_alloc: out of memory
{code}

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: [~accountid:557058:59501c65-23f2-4a22-af15-313526e2c87e] this is coming from XGBoost, does your machine have a GPU?

[~accountid:5b153fb1b0d76456f36daced] I am not sure if there is a way of disabling GPU in AutoML right now. Can you advise whether this is possible?

exalate-issue-sync[bot] commented 1 year ago

Michael Jules commented: Yes, the issue is due to XGBoost. This is resolved by running: {{H2OAutoML(exclude_algos=['XGBoost'])}}
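For reference, the workaround above can be written out as a minimal sketch. This is a hedged illustration, not the reporter's actual script: the file path, column name, `max_models`, and `seed` below are placeholder assumptions.

```python
# Sketch of the workaround reported above: pass exclude_algos=["XGBoost"] to
# H2OAutoML so the GPU-backed XGBoost build is never invoked.
# The path and target column are placeholders, not from the original report.
EXCLUDED_ALGOS = ["XGBoost"]

def run_automl(training_path="train.csv", target_column="target"):
    # Imports live inside the function so the sketch can be read without an
    # h2o install or a running cluster.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    train = h2o.import_file(training_path)
    aml = H2OAutoML(max_models=10, seed=1, exclude_algos=EXCLUDED_ALGOS)
    aml.train(y=target_column, training_frame=train)
    return aml.leaderboard
```

With XGBoost excluded, AutoML still trains its other algorithm families (GLM, DRF, GBM, Deep Learning, Stacked Ensembles), so the run completes on CPU-only code paths.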


exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: [~accountid:557058:04659f86-fbfe-4d01-90c9-146c34df6ee6] no, we don’t have any specific flag for GPU on the AutoML side. On the client side, [~accountid:557058:59501c65-23f2-4a22-af15-313526e2c87e]’s solution is the only one.

Can’t we specify the native library when loading XGB, e.g. using a JVM param? This could be useful: one would then be able to call {{h2o.init}} with additional JVM params, e.g. {{h2o.init(jvm_custom_args=['-Dparam_for_xgb_lib_type=cpu_only'])}}, and still be able to use XGB.
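As a sketch, the proposal above would look like the following. Note the hedge: `-Dparam_for_xgb_lib_type` is the hypothetical system property suggested in this comment, not an existing H2O flag, so this snippet illustrates the proposed API shape only.

```python
# Hypothetical JVM argument from the proposal above; the property name
# "param_for_xgb_lib_type" does not exist in H2O and is only illustrative.
PROPOSED_JVM_ARGS = ["-Dparam_for_xgb_lib_type=cpu_only"]

def init_with_cpu_only_xgboost():
    # Requires the h2o package and a local Java install; jvm_custom_args is
    # only applied when the Python client starts a new local H2O server.
    import h2o
    h2o.init(jvm_custom_args=PROPOSED_JVM_ARGS)
```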

exalate-issue-sync[bot] commented 1 year ago

Gustavo Henrique Orair commented: As a workaround, could one run H2O in a Docker container and not install gcc and gcc-multilib inside it?

PS: I have a very similar problem using a multi-node cluster and XGBoost-OMP (not GPU).

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-6755
Assignee: UNASSIGNED
Reporter: Michael Jules
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A