Open magrenimish opened 10 months ago
@magrenimish This is likely a memory issue: XGBoost allocates its memory outside of the JVM, so H2O and XGBoost compete for the same memory. Please see the documentation to find out how to limit H2O's memory so that XGBoost fits in the remaining memory.
@tomasfryda As mentioned in the documentation, I allocated less than 2/3 of the total available RAM to H2O, leaving the rest for XGBoost. The memory available to XGBoost is well beyond 100 GB.
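For readers who want to make the split described above explicit, here is a minimal sketch, assuming the cluster is started from Python with `h2o.init` and its `max_mem_size` argument. `split_memory` and `start_h2o` are hypothetical helper names, not part of H2O, and the 2/3 ratio is simply the figure quoted in this thread.

```python
def split_memory(total_gb, heap_num=2, heap_den=3):
    """Split total RAM (in GB) into (jvm_heap_gb, native_gb).

    The JVM heap gets at most heap_num/heap_den of the total; the remainder
    is left for native allocations such as XGBoost. Integer arithmetic is
    used so that the heap is rounded down, never up.
    """
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    heap_gb = total_gb * heap_num // heap_den
    return heap_gb, total_gb - heap_gb


def start_h2o(total_gb):
    """Start (or connect to) H2O with a capped JVM heap. Hypothetical helper."""
    import h2o  # imported lazily so split_memory works without h2o installed

    heap_gb, native_gb = split_memory(total_gb)
    # max_mem_size accepts a string with a unit suffix, e.g. "256G".
    h2o.init(max_mem_size=f"{heap_gb}G")
    return heap_gb, native_gb
```

For example, `split_memory(384)` yields a heap of 256 GB and leaves 128 GB for native memory; pass the RAM of your own machine to `start_h2o`.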
@magrenimish Thanks for adding the available memory information. I don't see any obvious reason why it should fail like this. Would you be able to provide us with logs? Please make sure there are no confidential data in the logs (the log might contain user name, column names, loaded file names etc).
@tomasfryda Here is the log: automodeler.log
Thank you @magrenimish. Unfortunately, that's not the H2O (backend) log. Please see https://docs.h2o.ai/h2o/latest-stable/h2o-docs/logs.html to find out how to get the H2O (backend) logs.
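One way to collect the backend logs from Python is sketched below, assuming the H2O cluster is still running and reachable from the client. It uses `h2o.download_all_logs`, which saves a zip archive; `list_log_members` and `fetch_h2o_logs` are hypothetical helper names added here only to show what ended up in the archive.

```python
import zipfile


def list_log_members(zip_path):
    """Return the sorted names of all files inside a log archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(zf.namelist())


def fetch_h2o_logs(dirname="."):
    """Download the backend logs and report what the archive contains.

    Requires a live cluster; if the backend has already died, the logs
    have to be collected from disk instead.
    """
    import h2o  # imported lazily so list_log_members works without h2o

    zip_path = h2o.download_all_logs(dirname=dirname)
    return zip_path, list_log_members(zip_path)
```

Listing the members before attaching the archive makes it easy to see whether it holds the per-node logs or only a single file such as `nohup.out`.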
Hi @tomasfryda I tried to access the H2O logs zip folder and after downloading it, I only see a 'nohup.out' file as attached here: automodeler_h2o_logs (1).zip
That's exactly the kind of log that we need, thank you @magrenimish. It looks like the failure occurs during the data load, so it doesn't even get to AutoML.
The log ends in the middle of a line, which I think might be due to an OOM error, but that is odd because the file should be much smaller than the available memory. @wendycwong I think this is a bug related to the parquet parser.
The end of the log:
12-26 21:21:23.722 127.0.0.1:16822 9972 FJ-3-43 DEBUG org.apache.parquet.hadoop.InternalParquetRecordReader: read value: 122370
12-26 21:21:23.722 127.0.0.1:16822 9972 FJ-3-113 DEBUG org.apache.parquet.hadoop.InternalParquetRecordReader: read value: 123275
12-26 21:21:23.722 127.0.0.1:16822 9972 FJ-3-105 DEBUG org.apache.parquet.hadoop.InternalParquetRecordReader: read value: 121590
12-26 21:21:23.722 127.0.0.1:16822 9972 FJ-3-87 DEBUG org.apache.parquet.hadoop.InternalParquetRecordReader: read value: 126722
12-26 21:21:23.722 127.0.0.1:16822 9972 FJ-3-47 DEBUG org.apache.parquet.hadoop.InternalParquetRecordReader: read value: 125185
12-26 21:21:23.722 127.0.0.1:16822 9972 FJ-3-19 DEBUG org.apache.parquet.hadoop.InternalParquetRecordReader: read value: 125131
12-26 21:21:23.722 127.0.0.1:16822 9972 FJ-3-43 DEBUG org.apache.parquet.hadoop.InternalParquetRecordReader: read value: 122371
12-26 21:21:23.722 127.0.0.1:16822 9972 FJ-3-113 DEBUG org.apache.parquet.hadoop.InternalParquetRecordReader: read value: 123276
12-26 21:21:23.722 127.0.0.1:16822 9972 FJ-3-105 DEBUG org.apache.parquet.hadoop.InternalParquetRecordReader: read value: 121591
12-26 21:21:23.722 127.0.0.1:16822 9972 FJ-3-87 DEBUG org.apache.parquet.hadoop.InternalParquetRecordReader: read value
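The "log ends in the middle of a line" observation can be checked mechanically. The following is a minimal sketch using only the standard library; the function names are hypothetical. A truncated final line is consistent with the process being killed abruptly, but it does not by itself prove an out-of-memory kill.

```python
import os


def ends_mid_line(path):
    """Return True if the file is non-empty and does not end with a newline."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return False
        f.seek(-1, os.SEEK_END)
        return f.read(1) != b"\n"


def last_line(path, max_bytes=4096):
    """Return the last (possibly partial) line, reading at most max_bytes."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - max_bytes))
        tail = f.read()
    lines = tail.splitlines()
    return lines[-1].decode("utf-8", errors="replace") if lines else ""
```

Running `ends_mid_line("nohup.out")` on the attached log would return `True` for an excerpt like the one above, where the final entry stops at `read value` without a number.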
Hi @tomasfryda @wendycwong, were you able to confirm if this was an error related to parquet parsing?
Hi Nimish:
I don't have your parquet file, so I created one for myself. I started my backend using this command:
java -Xmx50g -jar build/h2o.jar
I ran the following code. Please change the directory path to your path if you want to run my code:
import h2o
from h2o.estimators import H2OXGBoostEstimator

h2o.init()  # connects to the backend started above if it is reachable on the default port

fr = h2o.create_frame(rows=163481, cols=851, real_fraction=1.0, categorical_fraction=0, has_response=True, response_factors=2, seed=12345, missing_fraction=0.0)
h2o.export_file(fr, "/Users/wendycwong/temp/gh_16011.parquet", header=True, format="parquet")  # export as parquet file
h2o.remove_all()
fr = h2o.import_file("/Users/wendycwong/temp/gh_16011.parquet")
m = H2OXGBoostEstimator(ntrees=10, seed=1234)
m.train(x=list(range(1, fr.ncol)), y="response", training_frame=fr)
print("Done")
The code ran fine for me, so the file size is not the issue here (I was worried about that).
So, without access to your parquet file, I cannot debug what the problem is with your file. If you can convert your parquet file to .csv, perhaps that will run for you.
Thanks, Wendy
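The CSV workaround suggested above could be tried without leaving H2O. This is a minimal sketch, assuming the parquet file can still be imported into a running cluster and that the output directory is writable by the backend. It relies on `h2o.import_file` and `h2o.export_file` (which writes CSV by default); `csv_path_for` and `reimport_as_csv` are hypothetical helper names.

```python
import os


def csv_path_for(parquet_path):
    """Derive a sibling .csv path from a .parquet path."""
    root, _ext = os.path.splitext(parquet_path)
    return root + ".csv"


def reimport_as_csv(parquet_path):
    """Round-trip a parquet file through CSV and return the new frame."""
    import h2o  # imported lazily so csv_path_for works without h2o installed

    csv_path = csv_path_for(parquet_path)
    frame = h2o.import_file(parquet_path)
    # force=True overwrites an existing file at csv_path.
    h2o.export_file(frame, csv_path, force=True)
    return h2o.import_file(csv_path)
```

If training succeeds on the re-imported CSV frame but not on the parquet one, that would point towards the parquet parser; if both fail, the file format is probably not the cause.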
Hi @wendycwong: I tried your code and it worked without any issues on the sample data you created. However, I tried your code with the data I have and got the same connection error as I had mentioned before. @tomasfryda I was able to import my parquet file into an H2O dataframe successfully, so I don't think it is an issue with the data loading. Here is the process I followed and the error I see (with H2O==3.44.0.1):
>>> from h2o.estimators.xgboost import H2OXGBoostEstimator
>>> m = H2OXGBoostEstimator(ntrees=10, seed=1234)
>>> m.train(x=common_cols, y=y, training_frame=fr)
Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/home/ec2-user/.local/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/home/ec2-user/.local/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 722, in urlopen
    chunked=chunked,
  File "/home/ec2-user/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/urllib3/connection.py", line 244, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/lib64/python3.7/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib64/python3.7/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.7/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.7/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/lib64/python3.7/http/client.py", line 976, in send
    self.connect()
  File "/home/ec2-user/.local/lib/python3.7/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/home/ec2-user/.local/lib/python3.7/site-packages/urllib3/connection.py", line 187, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f1290a16110>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.7/site-packages/requests/adapters.py", line 497, in send
    chunked=chunked,
  File "/home/ec2-user/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 800, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/ec2-user/.local/lib/python3.7/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=54321): Max retries exceeded with url: /3/ModelBuilders/xgboost (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1290a16110>: Failed to establish a new connection: [Errno 111] Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.7/site-packages/h2o/backend/connection.py", line 495, in request
    stream=stream, **args)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=54321): Max retries exceeded with url: /3/ModelBuilders/xgboost (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1290a16110>: Failed to establish a new connection: [Errno 111] Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "
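Every layer of the traceback above reduces to `[Errno 111] Connection refused`, which means nothing was accepting connections on the backend port at that moment. A quick way to confirm whether the H2O process is still listening is sketched below with the standard library only; `backend_is_listening` is a hypothetical helper, and port 54321 is taken from the error message above.

```python
import socket


def backend_is_listening(host="localhost", port=54321, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts and unreachable hosts.
        return False
```

If this returns `False` right after a failed `train` call, the client-side stack trace carries no further information and the cause has to be found in the backend log or in the operating system's logs.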
The Python script with H2O (version 3.44.0.2) AutoML with only the XGB model included runs fine with < 850 training columns for the dataset but fails when the training columns exceed that limit. The total dataset size is ~12gb and I am running the script on an AWS r6a.12xlarge instance. The script fails with the following error log:
02-Jan-24 16:36:19 : 42058 : automodeler.py | ln833 | main() : ERROR : Unexpected HTTP error: HTTPConnectionPool(host='localhost', port=30364): Max retries exceeded with url: /3/Jobs/$03017f0000019d76ffffffff$_86c7d159e1c43c4bbc121d401960489f (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f21b054bcd0>: Failed to establish a new connection: [Errno 111] Connection refused'))
Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/home/ec2-user/.local/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/home/ec2-user/.local/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 722, in urlopen
    chunked=chunked,
  File "/home/ec2-user/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/urllib3/connection.py", line 244, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/lib64/python3.7/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib64/python3.7/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.7/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.7/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/lib64/python3.7/http/client.py", line 976, in send
    self.connect()
  File "/home/ec2-user/.local/lib/python3.7/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/home/ec2-user/.local/lib/python3.7/site-packages/urllib3/connection.py", line 187, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f21b054bcd0>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.7/site-packages/requests/adapters.py", line 497, in send
    chunked=chunked,
  File "/home/ec2-user/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 800, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/ec2-user/.local/lib/python3.7/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=30364): Max retries exceeded with url: /3/Jobs/$03017f0000019d76ffffffff$_86c7d159e1c43c4bbc121d401960489f (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f21b054bcd0>: Failed to establish a new connection: [Errno 111] Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.7/site-packages/h2o/backend/connection.py", line 495, in request
    stream=stream, **args)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=30364): Max retries exceeded with url: /3/Jobs/$03017f0000019d76ffffffff$_86c7d159e1c43c4bbc121d401960489f (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f21b054bcd0>: Failed to establish a new connection: [Errno 111] Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "automodeler.py", line 827, in main
    aml.train(x=common_cols[:850], y=target, training_frame=h2o_data_file)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/h2o/automl/_estimator.py", line 682, in train
    self._job.poll(poll_updates=poll_updates)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/h2o/job.py", line 69, in poll
    pb.execute(self._refresh_job_status, progress_monitor_fn=ft.partial(poll_updates, self))
  File "/home/ec2-user/.local/lib/python3.7/site-packages/h2o/utils/progressbar.py", line 187, in execute
    res = progress_fn() # may raise StopIteration
  File "/home/ec2-user/.local/lib/python3.7/site-packages/h2o/job.py", line 136, in _refresh_job_status
    jobs = self._query_job_status_safe()
  File "/home/ec2-user/.local/lib/python3.7/site-packages/h2o/job.py", line 132, in _query_job_status_safe
    raise last_err
  File "/home/ec2-user/.local/lib/python3.7/site-packages/h2o/job.py", line 114, in _query_job_status_safe
    result = h2o.api("GET /3/Jobs/%s" % self.job_key)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/h2o/h2o.py", line 122, in api
    return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/h2o/backend/connection.py", line 507, in request
    raise H2OConnectionError("Unexpected HTTP error: %s" % e)
h2o.exceptions.H2OConnectionError: Unexpected HTTP error: HTTPConnectionPool(host='localhost', port=30364): Max retries exceeded with url: /3/Jobs/$03017f0000019d76ffffffff$_86c7d159e1c43c4bbc121d401960489f (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f21b054bcd0>: Failed to establish a new connection: [Errno 111] Connection refused'))
02-Jan-24 16:36:19 : 42058 : automodeler.py | ln835 | main() : ERROR : ERROR=!!{"error": "AM_Modeling_Data_Error"}!!
02-Jan-24 16:36:19 : 42058 : MinIO.py | ln122 | minio_logger_s3_push() : INFO : LOGGER: Pushed to S3
02-Jan-24 16:36:19 : 42058 : Messaging.py | ln 74 | send_error() : INFO : Sending ERROR notification
02-Jan-24 16:36:20 : 42058 : H2O.py | ln462 | h2o_shutdown() : INFO : EXECUTION FAILED!!
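Since the report describes a threshold at roughly 850 training columns, a reproducible boundary may help narrow the bug down. The sketch below is a plain binary search and rests on two assumptions: that success is monotonic in the column count (every count below the boundary works, every count above it fails), and that the caller supplies `trains_ok`, a hypothetical callback that restarts the H2O backend, trains on the first `n` columns, and returns `False` when the connection error appears.

```python
def largest_working_column_count(trains_ok, lo, hi):
    """Find the largest n with trains_ok(n) true, given that lo works and hi fails.

    trains_ok is a user-supplied callable (hypothetical here); it is called
    about log2(hi - lo) times, so each call can afford a full backend restart.
    """
    if lo >= hi:
        raise ValueError("lo must be smaller than hi")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if trains_ok(mid):
            lo = mid  # mid works, the boundary is above it
        else:
            hi = mid  # mid fails, the boundary is at or below it
    return lo
```

The callback would typically wrap something like `m.train(x=common_cols[:n], ...)` in a `try`/`except` for `H2OConnectionError`; that part is left out because it depends on how the backend is launched.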