h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Failed to establish a new connection: [Errno 111] Connection refused #15662

Open yanwun opened 1 year ago

yanwun commented 1 year ago

H2O version, Operating System and Environment

Checking whether there is an H2O instance running at http://127.0.0.1:54321. connected.


H2O_cluster_uptime: 5 mins 08 secs
H2O_cluster_timezone: Etc/UTC
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.42.0.1
H2O_cluster_version_age: 1 month and 7 days
H2O_cluster_name: H2O_from_python_unknownUser_2l8ak7
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 7.004 Gb
H2O_cluster_total_cores: 2
H2O_cluster_allowed_cores: 2
H2O_cluster_status: locked, healthy
H2O_connection_url: http://127.0.0.1:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.9.2 final


Actual behavior

Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 , will retry after 3s.
Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 , will retry after 3s....

Expected behavior

The AutoML model trains to completion.

Steps to reproduce

Steps to reproduce the behavior (with working code on a sample dataset, if possible):

  1. I run H2O in a Docker container, and it sometimes crashes with the connection problem shown above.

Upload logs


H2O_cluster_uptime: 5 mins 08 secs
H2O_cluster_timezone: Etc/UTC
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.42.0.1
H2O_cluster_version_age: 1 month and 7 days
H2O_cluster_name: H2O_from_python_unknownUser_2l8ak7
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 7.004 Gb
H2O_cluster_total_cores: 2
H2O_cluster_allowed_cores: 2
H2O_cluster_status: locked, healthy
H2O_connection_url: http://127.0.0.1:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.9.2 final


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
features [{'name': 'ORP', 'data_type': 'Numerical', 'default': -228.55}, {'name': 'PH', 'data_type': 'Numerical', 'default': 6.73}, {'name': 'MLSS', 'data_type': 'Numerical', 'default': 3930}, {'name': 'DO', 'data_type': 'Numerical', 'default': 0.0}, {'name': 'NO2-N+NO3-N', 'data_type': 'Numerical', 'default': 0.187419355}, {'name': 'PO4-P(mg/l)', 'data_type': 'Numerical', 'default': 8.856210526}, {'name': 'COD(mg/l)', 'data_type': 'Numerical', 'default': 32.63657957}]
targets ['NH4-N(mg/l)']
[2023-07-28 06:55:27,504 - INFO - engine.py:457] --- data preprocessing done!! ---
AutoML progress: |
06:55:27.513: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.

████ 06:55:44.3: _min_rows param, The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 135.0.

███████Job request failed Unexpected HTTP error: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2c421c0ee0>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2c421c0790>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2c421bef40>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2c4215b970>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. 
Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2c42162310>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2c42162c70>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2c421621c0>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2c421be580>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2c4215bb80>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. 
Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2c4215b730>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2c4216c610>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2c4216cf70>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2c4215b730>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2c4215bd30>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. 
Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2c421be250>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2c4216c4c0>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2c42162040>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2c42173910>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 , will retry after 3s. 
Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_b05803844247e0f935954e5db64e4fd4 , will retry after 3s. Closing connection _sid_8304 at exit H2O session _sid_8304 closed.

Please help me. I want to use H2O for training in a Docker container, but I need to call init() each time. This is my situation.

wendycwong commented 1 year ago

Hey:

Here is the problem:

06:55:44.3: _min_rows param, The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 135.0.

Can you use a bigger dataset?

Also, if you can provide the exact code and dataset, that would be great. We need to reproduce the error before we can fix it. Thanks, Wendy

yanwun commented 1 year ago

Hey, thank you for your help. I fixed the problem with the dataset having fewer than 200 rows. But another issue appeared, shown below, and it usually happens when I train in a Docker container.

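My data preprocessing looks like this:

    import pandas as pd
    import h2o

    # Load the selected columns, drop incomplete rows, and convert to an H2OFrame
    df = pd.read_csv(data_file)[columns]
    df = df.dropna()
    df_h2o = h2o.H2OFrame(df)
    # Split into training and validation frames
    train_h2o_df, val_h2o_df = df_h2o.split_frame(ratios=[self.args.data_split_train_size], seed=1)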

==== regression =====

    from h2o.automl import H2OAutoML

    def aml_setting(self):
        self.aml = H2OAutoML(max_models=20,
                             max_runtime_secs=self.args.training_time_limit_in_minutes * 60,
                             seed=1,
                             sort_metric="mse")

    def _train_model(self):
        # Use the test frame for validation metrics when one is provided
        if self.df_test is not None and len(self.df_test) != 0:
            self.aml.train(x=self.features_name, y=self.targets_name[0],
                           training_frame=self._df_train_h2o,
                           validation_frame=self.test_h2o_df)
        else:
            self.aml.train(x=self.features_name, y=self.targets_name[0],
                           training_frame=self._df_train_h2o,
                           validation_frame=self._df_val_h2o)

My training log is below:


H2O_cluster_uptime: 26 secs
H2O_cluster_timezone: Etc/UTC
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.42.0.1
H2O_cluster_version_age: 1 month and 11 days
H2O_cluster_name: H2O_from_python_unknownUser_t12170
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 13.89 Gb
H2O_cluster_total_cores: 8
H2O_cluster_allowed_cores: 8
H2O_cluster_status: locked, healthy
H2O_connection_url: http://127.0.0.1:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.9.2 final


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
features [{'name': 'L2BL30_FAB_TET_AVG', 'data_type': 'Numerical', 'default': 24.66889381}, {'name': 'L2BL30_FAB_MET_AVG', 'data_type': 'Numerical', 'default': 52.27465057}, {'name': 'L2BL20_MAU_OA_TET_01', 'data_type': 'Numerical', 'default': 26.32016754}, {'name': 'L2BL20_MAU_OA_MET_01', 'data_type': 'Numerical', 'default': 77.06163025}, {'name': 'L2BL30_MAU_SC_301_MO', 'data_type': 'Numerical', 'default': 30}, {'name': 'L2BL30_MAU_SC_302_MO', 'data_type': 'Numerical', 'default': 30.0}, {'name': 'L2BL30_MAU_SC_303_MO', 'data_type': 'Numerical', 'default': 30.0}, {'name': 'L2BL30_MAU_PCC_TC_301_FB', 'data_type': 'Numerical', 'default': 23.94386292}, {'name': 'L2BL30_MAU_PCC_TC_303_FB', 'data_type': 'Numerical', 'default': 0.144675925}, {'name': 'L2BL30_MAU_DHC_MC_301_FB', 'data_type': 'Numerical', 'default': 17.29600525}, {'name': 'L2BL30_MAU_DHC_MC_302_FB', 'data_type': 'Numerical', 'default': 15.8926506}, {'name': 'L2BL30_MAU_DHC_MC_303_FB', 'data_type': 'Numerical', 'default': 0.752314806}, {'name': 'L2BL30_MAU_SA_DPT_301', 'data_type': 'Numerical', 'default': 11.55237293}, {'name': 'L2BL30_MAU_SA_DPT_302', 'data_type': 'Numerical', 'default': 11.50897026}, {'name': 'L2BL30_MAU_SA_DPT_303', 'data_type': 'Numerical', 'default': 13.70804501}, {'name': 'CH_KW_TOTAL_14', 'data_type': 'Numerical', 'default': 1501}, {'name': 'CH_KW_RTL_14', 'data_type': 'Numerical', 'default': 0.488341928}, {'name': 'CH_KW_TOTAL_7', 'data_type': 'Numerical', 'default': 453}]
targets ['L2BL30_MAU_PCC_TC_302_FB']
[2023-08-01 03:41:11,525 - INFO - engine.py:445] --- data preprocessing done!! ---
AutoML progress: |
03:41:11.539: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.

█████████████████████████████████████████████████████████████Failed polling AutoML progress log: Unexpected HTTP error: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')) Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0c8dc520a0>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0c90278580>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0c8dc52b80>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0c8dc41520>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. 
Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0c8dc41e80>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0c8dc3d820>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0c8dc587f0>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0c90278e20>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0c8dc52b20>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. 
Job request failed Unexpected HTTP error: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0c8dc52460>: Failed to establish a new connection: [Errno 111] Connection refused')), will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 , will retry after 3s. 
Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 , will retry after 3s. 
Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 , will retry after 3s. Job request failed Server error java.lang.IllegalArgumentException: Error: Job is missing Request: GET /3/Jobs/$03017f00000132d4ffffffff$_944e2145ad1f3fe9d6efe4d1734a4342 , will retry after 3s. Failed polling AutoML progress log: No AutoML instance with id AutoML_2_20230801_34111@@L2BL30_MAU_PCC_TC_302_FB. Closing connection _sid_9d6e at exit H2O session _sid_9d6e closed.

Please help. Thanks.

yanwun commented 1 year ago

Is this a hardware problem? I ask because the same dataset trains successfully when I change the container's instance resource limits.

With 10 CPUs (threads) and 10 GB of memory the training succeeds, but with 8 CPUs and 8 GB of memory it fails as shown above.

wendycwong commented 1 year ago

Yes, if you do not give the hardware enough memory, it can fail. The rule of thumb is to allocate total memory of at least 3 to 5 times the dataset size. To be conservative, I would probably allocate 10 times the dataset size. E.g. if your dataset is 1GB, starting a cluster with a total memory of 10GB should work.
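
The rule of thumb above can be turned into a small sizing helper. This is only a sketch: the function names, the default factor of 10, and the 1 GB floor are illustrative assumptions, not part of H2O. The only H2O call used is `h2o.init(max_mem_size=...)`, which sets the JVM heap size.

```python
def recommended_max_mem_size(dataset_bytes: int, factor: int = 10, floor_gb: int = 1) -> str:
    """Return a heap size string such as "10G" for h2o.init(max_mem_size=...).

    factor and floor_gb are illustrative defaults (assumptions), following
    the conservative "10 times the dataset size" suggestion.
    """
    gib = 1024 ** 3
    # Integer ceiling division avoids floating-point rounding surprises.
    needed_gb = -(-(dataset_bytes * factor) // gib)
    return f"{max(needed_gb, floor_gb)}G"


def start_cluster(dataset_bytes: int) -> None:
    """Start a local H2O cluster sized from the dataset (hypothetical wrapper)."""
    import h2o  # imported lazily so the helper above works without h2o installed

    h2o.init(max_mem_size=recommended_max_mem_size(dataset_bytes))
```

In a container, the heap size passed here presumably needs to stay below the container's memory limit, since the JVM also uses memory outside the heap.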

yanwun commented 1 year ago

Oh, that's weird. My dataset is only 1.62 MB, but the connection still fails sometimes. I always allocate more than 10 times the size of my training and test datasets. Is there any suggestion for this issue? My training process still fails sometimes.

tomasfryda commented 1 year ago

@yanwun This is really unexpected behavior. It looks to me like h2o runs out of memory and gets killed since docker doesn't allow apps to use swap.

But it shouldn't run out of memory with such a small file. The only reason I can think of is that the frame reads the numeric values as categorical. If you have a lot of unique numeric values, interpreting them as categorical could increase the memory usage a lot.

Could you check with df_h2o.types that the numerical columns are really numerical?
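
One way to run that check is sketched below. `H2OFrame.types` is a dict mapping column names to type strings such as "int", "real", "enum", or "string". The helper name and the sample dict are hypothetical; the column names are borrowed from the first log in this thread.

```python
def non_numeric_columns(types: dict, expected_numeric: list) -> dict:
    """Return {column: actual_type} for columns that were not parsed as numbers.

    types is the dict returned by H2OFrame.types; columns absent from it
    are reported with the placeholder type "missing".
    """
    numeric_types = {"int", "real"}
    return {
        col: types.get(col, "missing")
        for col in expected_numeric
        if types.get(col) not in numeric_types
    }


# Hypothetical sample standing in for df_h2o.types
sample_types = {"ORP": "real", "MLSS": "int", "DO": "enum"}
print(non_numeric_columns(sample_types, ["ORP", "MLSS", "DO"]))
```

With a live frame, the call would be `non_numeric_columns(df_h2o.types, feature_names)`; an empty result means every listed column was parsed as numeric.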

If the columns are the same type as you expect and you still encounter this issue, would you be able to provide us logs from the h2o backend? The location of the logs is usually printed during the start. It looks like:

  JVM stdout: /var/folders/yl/cq5nhky53hjcl9wrqxt39kz80000gn/T/tmpN2xfkW/h2o_techwriter_started_from_python.out
  JVM stderr: /var/folders/yl/cq5nhky53hjcl9wrqxt39kz80000gn/T/tmpN2xfkW/h2o_techwriter_started_from_python.err

If you don't see those lines during startup, you can set the log_dir and ice_root params in h2o.init() to tell h2o where to save the logs. (more details here)
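
A minimal sketch of pointing the backend at known directories. In the Python client the `h2o.init()` keyword arguments are `log_dir` and `ice_root`; the base path `/tmp/h2o_debug` and both function names are assumptions chosen for illustration.

```python
from pathlib import PurePosixPath


def init_kwargs(base_dir: str) -> dict:
    """Build log_dir / ice_root keyword arguments under one base directory."""
    base = PurePosixPath(base_dir)
    return {
        "log_dir": str(base / "logs"),   # where the backend writes its logs
        "ice_root": str(base / "ice"),   # H2O's temporary/working directory
    }


def start_with_logs(base_dir: str = "/tmp/h2o_debug") -> None:
    """Start H2O with explicit log locations (hypothetical wrapper)."""
    import h2o  # imported lazily so init_kwargs works without h2o installed

    h2o.init(**init_kwargs(base_dir))
```

In a Docker setup it may help to place the base directory on a mounted volume, so the logs survive if the container is killed.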

wendycwong commented 1 year ago

@yanwun: You can force the column types during parsing by setting the parameter col_types.
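
A sketch of forcing numeric types, under one assumption worth flagging: `col_types` is the parameter name on `h2o.import_file()` and `h2o.upload_file()`, while the `h2o.H2OFrame(...)` constructor used earlier in this thread takes the equivalent `column_types`. The helper names are hypothetical.

```python
def numeric_column_types(feature_names, target_names) -> dict:
    """Map every feature and target column to H2O's "numeric" type."""
    return {name: "numeric" for name in [*feature_names, *target_names]}


def to_h2o_frame(df, feature_names, target_names):
    """Convert a pandas DataFrame to an H2OFrame with forced numeric columns."""
    import h2o  # imported lazily so the helper above works without h2o installed

    return h2o.H2OFrame(
        df, column_types=numeric_column_types(feature_names, target_names)
    )
```

When reading straight from a file, the same dict could be passed as `h2o.import_file(path, col_types=...)`.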