h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

H2O XGBoost crash: H2OConnectionError: Local server has died unexpectedly. RIP #8316

exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

While training XGBoost with a large amount of data (a mix of categorical and numerical values), I got the error "H2OConnectionError: Local server has died unexpectedly. RIP." I did trim down the data a bit...

The crash is reproducible.

I have a gzipped H2O data file that causes this crash, but it's too big to upload here: 51MB compressed, and that is after I reduced its size. In my experience, if I reduce the size by a lot, it won't crash, so I'm probably hitting some limit that isn't handled well. I can send you the file a different way if you want.

Also, I ran into a possibly similar situation where the training seems to stop, no error is displayed, and Unix top shows the Java activity dropping to 0.3% or so.

{code:python}
import h2o

h2o.init(
    strict_version_check=False,
    nthreads=1,
    log_dir="/tmp/clem-h2o/",
    log_level='TRACE'
)

killer = h2o.import_file(path="killerh2o")
x = killer.columns[:]
x.remove('y')
y = 'y'

from h2o.estimators import H2OXGBoostEstimator

param = {
    "ntrees": 15,
    "min_rows": 5,
    "max_depth": 5,
    "learn_rate": 0.02,
    "sample_rate": 0.7,
    "col_sample_rate_per_tree": 0.9,
    "seed": 42,
    "score_tree_interval": 100
}

model = H2OXGBoostEstimator(**param)
model.train(x=x, y=y, training_frame=killer)

xgboost Model Build progress:

MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_a5d4af787d456bcf321fe46f674a4332 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff489d14780>: Failed to establish a new connection: [Errno 111] Connection refused',))

H2OConnectionError: Local server has died unexpectedly. RIP.
{code}
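
The `[Errno 111] Connection refused` in the output above means nothing was accepting TCP connections on 127.0.0.1:54321 when the client polled the job, which is consistent with the JVM process having exited. As a minimal sketch (the helper name `is_listening` is hypothetical and not part of the h2o package), a standard-library probe can confirm whether the port is still open before digging into logs:

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 54321, timeout: float = 2.0) -> bool:
    """Return True if something accepts a TCP connection on host:port.

    Hypothetical helper for illustration; 54321 is the port shown in the
    error message above.
    """
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when no process is listening on the port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_listening()` right after a failed `model.train(...)` would be one way to distinguish a dead server from a server that is alive but unresponsive.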

{noformat}
H2O Version: 3.28.0.3
Python 3.6.9
Ubuntu 18.04.3 LTS
{noformat}

exalate-issue-sync[bot] commented 1 year ago

Jan Sterba commented: Hi [~accountid:5dd29cfb57e9480e500a3e69] , thanks for reporting this.

To better understand this, I need the H2O back-end logs. When you start H2O from Python like this, they are the two files mentioned at the top of the output:

{{JVM stdout: /tmp/tmp3uq9r2u7/h2o_unknownUser_started_from_python.out}}
{{JVM stderr: /tmp/tmp3uq9r2u7/h2o_unknownUser_started_from_python.err}}

Also, if you could share the data file, for example via [https://wetransfer.com/|https://wetransfer.com/], that would be very helpful.

Thank you.

exalate-issue-sync[bot] commented 1 year ago

Clem Wang commented: This link contains a tgz of all the log files and the compressed H2O data file:

[https://we.tl/t-y43CPN7QpW|https://we.tl/t-y43CPN7QpW]

exalate-issue-sync[bot] commented 1 year ago

Clem Wang commented: I don't want to open yet another bug, but this may be related. I was running Grid Search using XGBoost on data similar to what I reported in this bug, and it died in what looks like a similar way.

This is the error I got:

{noformat}
---------------------------------------------------------------------------
RemoteDisconnected Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    599 body=body, headers=headers,
--> 600 chunked=chunked)
    601

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    383 # otherwise it looks like a programming error was the cause.
--> 384 six.raise_from(e, None)
    385 except (SocketTimeout, BaseSSLError, SocketError) as e:

/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py in raise_from(value, from_value)

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    379 try:
--> 380 httplib_response = conn.getresponse()
    381 except Exception as e:

/usr/lib/python3.6/http/client.py in getresponse(self)
   1345 try:
-> 1346 response.begin()
   1347 except ConnectionError:

/usr/lib/python3.6/http/client.py in begin(self)
    306 while True:
--> 307 version, status, reason = self._read_status()
    308 if status != CONTINUE:

/usr/lib/python3.6/http/client.py in _read_status(self)
    275 # sending a valid response.
--> 276 raise RemoteDisconnected("Remote end closed connection without"
    277 " response")

RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

ProtocolError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    448 retries=self.max_retries,
--> 449 timeout=timeout
    450 )

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    637 retries = retries.increment(method, url, error=e, _pool=self,
--> 638 _stacktrace=sys.exc_info()[2])
    639 retries.sleep()

/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    366 if read is False or not self._is_method_retryable(method):
--> 367 raise six.reraise(type(error), error, _stacktrace)
    368 elif read is not None:

/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py in reraise(tp, value, tb)
    684 if value.__traceback__ is not tb:
--> 685 raise value.with_traceback(tb)
    686 raise value

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    599 body=body, headers=headers,
--> 600 chunked=chunked)
    601

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    383 # otherwise it looks like a programming error was the cause.
--> 384 six.raise_from(e, None)
    385 except (SocketTimeout, BaseSSLError, SocketError) as e:

/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py in raise_from(value, from_value)

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    379 try:
--> 380 httplib_response = conn.getresponse()
    381 except Exception as e:

/usr/lib/python3.6/http/client.py in getresponse(self)
   1345 try:
-> 1346 response.begin()
   1347 except ConnectionError:

/usr/lib/python3.6/http/client.py in begin(self)
    306 while True:
--> 307 version, status, reason = self._read_status()
    308 if status != CONTINUE:

/usr/lib/python3.6/http/client.py in _read_status(self)
    275 # sending a valid response.
--> 276 raise RemoteDisconnected("Remote end closed connection without"
    277 " response")

ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

During handling of the above exception, another exception occurred:

ConnectionError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/h2o/backend/connection.py in request(self, endpoint, data, json, filename, save_to)
    473 headers=headers, timeout=self._timeout, stream=stream,
--> 474 auth=self._auth, verify=verify, proxies=self._proxies)
    475 if isinstance(save_to, types.FunctionType):

/usr/local/lib/python3.6/dist-packages/requests/api.py in request(method, url, **kwargs)
     59 with sessions.Session() as session:
---> 60 return session.request(method=method, url=url, **kwargs)
     61

/usr/local/lib/python3.6/dist-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    532 send_kwargs.update(settings)
--> 533 resp = self.send(prep, **send_kwargs)
    534

/usr/local/lib/python3.6/dist-packages/requests/sessions.py in send(self, request, **kwargs)
    645 # Send the request
--> 646 r = adapter.send(request, **kwargs)
    647

/usr/local/lib/python3.6/dist-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    497 except (ProtocolError, socket.error) as err:
--> 498 raise ConnectionError(err, request=request)
    499

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

During handling of the above exception, another exception occurred:

H2OConnectionError Traceback (most recent call last)

in
     12 training_frame=hdftrain,
     13 validation_frame=hdftest,
---> 14 seed=42
     15 )
     16

/usr/local/lib/python3.6/dist-packages/h2o/grid/grid_search.py in train(self, x, y, training_frame, offset_column, fold_column, weights_column, validation_frame, **params)
    226 x = list(xset)
    227 parms["x"] = x
--> 228 self.build_model(parms)
    229
    230

/usr/local/lib/python3.6/dist-packages/h2o/grid/grid_search.py in build_model(self, algo_params)
    244 y = y if y in training_frame.names else training_frame.names[y]
    245 self.model._estimator_type = "classifier" if training_frame.types[y] == "enum" else "regressor"
--> 246 self._model_build(x, y, training_frame, validation_frame, algo_params)
    247
    248

/usr/local/lib/python3.6/dist-packages/h2o/grid/grid_search.py in _model_build(self, x, y, tframe, vframe, kwargs)
    272 return
    273
--> 274 grid.poll()
    275
    276 grid_json = h2o.api("GET /99/Grids/%s" % (grid.dest_key))

/usr/local/lib/python3.6/dist-packages/h2o/job.py in poll(self, poll_updates)
     58 pb.execute(self._refresh_job_status, print_verbose_info=ft.partial(poll_updates, self))
     59 else:
---> 60 pb.execute(self._refresh_job_status)
     61 except StopIteration as e:
     62 if str(e) == "cancelled":

/usr/local/lib/python3.6/dist-packages/h2o/utils/progressbar.py in execute(self, progress_fn, print_verbose_info)
    169 # Query the progress level, but only if it's time already
    170 if self._next_poll_time <= now:
--> 171 res = progress_fn() # may raise StopIteration
    172 assert_is_type(res, (numeric, numeric), numeric)
    173 if not isinstance(res, tuple):

/usr/local/lib/python3.6/dist-packages/h2o/job.py in _refresh_job_status(self)
     96 def _refresh_job_status(self):
     97 if self._poll_count <= 0: raise StopIteration("")
---> 98 jobs = h2o.api("GET /3/Jobs/%s" % self.job_key)
     99 self.job = jobs["jobs"][0] if "jobs" in jobs else jobs["job"][0]
    100 self.status = self.job["status"]

/usr/local/lib/python3.6/dist-packages/h2o/h2o.py in api(endpoint, data, json, filename, save_to)
    121 # type checks are performed in H2OConnection class
    122 _check_connection()
--> 123 return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
    124
    125

/usr/local/lib/python3.6/dist-packages/h2o/backend/connection.py in request(self, endpoint, data, json, filename, save_to)
    484 else:
    485 self._log_end_exception(e)
--> 486 raise H2OConnectionError("Unexpected HTTP error: %s" % e)
    487 except requests.exceptions.Timeout as e:
    488 self._log_end_exception(e)

H2OConnectionError: Unexpected HTTP error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
{noformat}

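
Each "During handling of the above exception, another exception occurred" separator in the traceback above marks one link in Python's exception chain. The following is a small sketch, using only the standard library and hypothetical helper names (`exception_chain`, `summarize`), that flattens such a chain so the innermost error can be quoted in a report without pasting every frame:

```python
def exception_chain(exc):
    """Return [outermost, ..., innermost] by following __cause__ / __context__.

    Hypothetical helper for illustration; it is not part of h2o, requests
    or urllib3.
    """
    chain = []
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))          # guard against cyclic chains
        chain.append(exc)
        # An explicit "raise ... from ..." sets __cause__; an exception raised
        # while another is being handled sets __context__.
        exc = exc.__cause__ or exc.__context__
    return chain


def summarize(exc):
    """One-line summary of the exception type names, outermost first."""
    return " <- ".join(type(e).__name__ for e in exception_chain(exc))
```

Wrapping the failing `train` call in `try: ... except Exception as e: print(summarize(e))` would then print a one-line summary of the chain.
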
exalate-issue-sync[bot] commented 1 year ago

Jan Sterba commented: Could you please share the code for the second crash as well? Also, thanks for uploading the data and log files. I will try to reproduce this and see what's happening.

exalate-issue-sync[bot] commented 1 year ago

Jan Sterba commented: Hi [~accountid:557058:aaa5569b-b1d8-42eb-bc9e-5ef03c903199] I have tried reproducing your issue and was not able to.

I will need more details from you:

What OS/version are you running?

What kind of CPU/RAM do you have on your machine?

Also please try the following:

Download the latest stable h2o.jar from [http://h2o-release.s3.amazonaws.com/h2o/latest_stable.html|http://h2o-release.s3.amazonaws.com/h2o/latest_stable.html].

Update your Python package via pip.

Start H2O from the command line: $ java -jar h2o.jar | tee h2o.log

Run the following script:

{noformat}
import h2o
from h2o.estimators import H2OXGBoostEstimator

h2o.init()

killer = h2o.import_file("/path/to/data.csv")
x = killer.columns[:]
x.remove('y')
y = 'y'

model = H2OXGBoostEstimator(
    ntrees=15,
    min_rows=5,
    max_depth=5,
    learn_rate=0.02,
    sample_rate=0.7,
    col_sample_rate_per_tree=0.9,
    seed=42,
    score_tree_interval=100
)
model.train(x=x, y=y, training_frame=killer)
{noformat}

Please let me know the result, and if you get an error, please share the h2o.log file.

exalate-issue-sync[bot] commented 1 year ago

Clem Wang commented: Sorry for the delay. Still crashes with 3.28.1.2

exalate-issue-sync[bot] commented 1 year ago

Jan Sterba commented: Thanks for letting me know. Could you please share the h2o.log and the output of the script above? Thanks.

exalate-issue-sync[bot] commented 1 year ago

Clem Wang commented: The log files are too big. (I forgot how I sent you the last one.)

I got the log files this way:

{{h2o.init(}}
{{    strict_version_check=False,}}
{{    nthreads=1,}}
{{    log_dir="/tmp/clem-h2o/",}}
{{    log_level='TRACE'}}
{{)}}

Is that OK? I’m not sure how to start up {{java -jar h2o.jar}} because I don’t know where {{h2o.jar}} gets installed…
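
When H2O is launched from Python, the startup banner quoted later in this thread includes a line of the form `Starting server from /usr/local/lib/python3.6/dist-packages/h2o/backend/bin/h2o.jar`, so the pip package appears to ship its own jar. Below is a sketch that looks for it, assuming the relative path `backend/bin/h2o.jar` from that banner also holds for the installed version; `find_bundled_jar` is a hypothetical helper, not an h2o API:

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_bundled_jar(package: str = "h2o",
                     relative: str = "backend/bin/h2o.jar") -> Optional[Path]:
    """Return the path of a jar shipped inside an installed package, or None.

    The default relative path is an assumption taken from the startup
    banner in this thread; it may differ between h2o versions.
    """
    # find_spec locates a top-level package without importing it.
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    for location in spec.submodule_search_locations:
        candidate = Path(location) / relative
        if candidate.is_file():
            return candidate
    return None
```

If `find_bundled_jar()` returns a path, that path could be passed to `java -jar` in place of a separately downloaded `h2o.jar`.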

exalate-issue-sync[bot] commented 1 year ago

Jan Sterba commented: You can download it here: [http://h2o-release.s3.amazonaws.com/h2o/rel-yule/2/index.html|http://h2o-release.s3.amazonaws.com/h2o/rel-yule/2/index.html]. You can send me the log files via [wetransfer.com|http://wetransfer.com].

exalate-issue-sync[bot] commented 1 year ago

Clem Wang commented: With the {{java -jar h2o.jar | tee h2o.log}} command, it still crashes. Attached is the (smaller) log file.

Here’s the error dumped out:

{noformat}
---------------------------------------------------------------------------
ConnectionRefusedError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/urllib3/connection.py in _new_conn(self)
    170 conn = connection.create_connection(
--> 171 (self._dns_host, self.port), self.timeout, **extra_kw)
    172

/usr/local/lib/python3.6/dist-packages/urllib3/util/connection.py in create_connection(address, timeout, source_address, socket_options)
     78 if err is not None:
---> 79 raise err
     80

/usr/local/lib/python3.6/dist-packages/urllib3/util/connection.py in create_connection(address, timeout, source_address, socket_options)
     68 sock.bind(source_address)
---> 69 sock.connect(sa)
     70 return sock

ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

NewConnectionError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    599 body=body, headers=headers,
--> 600 chunked=chunked)
    601

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    353 else:
--> 354 conn.request(method, url, **httplib_request_kw)
    355

/usr/lib/python3.6/http/client.py in request(self, method, url, body, headers, encode_chunked)
   1253 """Send a complete request to the server."""
-> 1254 self._send_request(method, url, body, headers, encode_chunked)
   1255

/usr/lib/python3.6/http/client.py in _send_request(self, method, url, body, headers, encode_chunked)
   1299 body = _encode(body, 'body')
-> 1300 self.endheaders(body, encode_chunked=encode_chunked)
   1301

/usr/lib/python3.6/http/client.py in endheaders(self, message_body, encode_chunked)
   1248 raise CannotSendHeader()
-> 1249 self._send_output(message_body, encode_chunked=encode_chunked)
   1250

/usr/lib/python3.6/http/client.py in _send_output(self, message_body, encode_chunked)
   1035 del self._buffer[:]
-> 1036 self.send(msg)
   1037

/usr/lib/python3.6/http/client.py in send(self, data)
    973 if self.auto_open:
--> 974 self.connect()
    975 else:

/usr/local/lib/python3.6/dist-packages/urllib3/connection.py in connect(self)
    195 def connect(self):
--> 196 conn = self._new_conn()
    197 self._prepare_conn(conn)

/usr/local/lib/python3.6/dist-packages/urllib3/connection.py in _new_conn(self)
    179 raise NewConnectionError(
--> 180 self, "Failed to establish a new connection: %s" % e)
    181

NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f301f066ac8>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

MaxRetryError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    448 retries=self.max_retries,
--> 449 timeout=timeout
    450 )

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    637 retries = retries.increment(method, url, error=e, _pool=self,
--> 638 _stacktrace=sys.exc_info()[2])
    639 retries.sleep()

/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    397 if new_retry.is_exhausted():
--> 398 raise MaxRetryError(_pool, url, error or ResponseError(cause))
    399

MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_8ab4572cf3ac442cb16cb05f2a7840b9 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f301f066ac8>: Failed to establish a new connection: [Errno 111] Connection refused',))

During handling of the above exception, another exception occurred:

ConnectionError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/h2o/backend/connection.py in request(self, endpoint, data, json, filename, save_to)
    473 headers=headers, timeout=self._timeout, stream=stream,
--> 474 auth=self._auth, verify=verify, proxies=self._proxies)
    475 if isinstance(save_to, types.FunctionType):

/usr/local/lib/python3.6/dist-packages/requests/api.py in request(method, url, **kwargs)
     59 with sessions.Session() as session:
---> 60 return session.request(method=method, url=url, **kwargs)
     61

/usr/local/lib/python3.6/dist-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    532 send_kwargs.update(settings)
--> 533 resp = self.send(prep, **send_kwargs)
    534

/usr/local/lib/python3.6/dist-packages/requests/sessions.py in send(self, request, **kwargs)
    645 # Send the request
--> 646 r = adapter.send(request, **kwargs)
    647

/usr/local/lib/python3.6/dist-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    515
--> 516 raise ConnectionError(e, request=request)
    517

ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=54321): Max retries exceeded with url: /3/Jobs/$03017f00000132d4ffffffff$_8ab4572cf3ac442cb16cb05f2a7840b9 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f301f066ac8>: Failed to establish a new connection: [Errno 111] Connection refused',))

During handling of the above exception, another exception occurred:

H2OConnectionError Traceback (most recent call last)

in
     17
     18 model = H2OXGBoostEstimator(**param)
---> 19 model.train(x=x, y=y, training_frame=killer)

/usr/local/lib/python3.6/dist-packages/h2o/estimators/estimator_base.py in train(self, x, y, training_frame, offset_column, fold_column, weights_column, validation_frame, max_runtime_secs, ignored_columns, model_id, verbose)
    110 self._train(x=x, y=y, training_frame=training_frame, offset_column=offset_column, fold_column=fold_column,
    111 weights_column=weights_column, validation_frame=validation_frame, max_runtime_secs=max_runtime_secs,
--> 112 ignored_columns=ignored_columns, model_id=model_id, verbose=verbose)
    113
    114

/usr/local/lib/python3.6/dist-packages/h2o/estimators/estimator_base.py in _train(self, x, y, training_frame, offset_column, fold_column, weights_column, validation_frame, max_runtime_secs, ignored_columns, model_id, verbose, extend_parms_fn)
    263 return
    264
--> 265 model.poll(poll_updates=self._print_model_scoring_history if verbose else None)
    266 model_json = h2o.api("GET /%d/Models/%s" % (rest_ver, model.dest_key))["models"][0]
    267 self._resolve_model(model.dest_key, model_json)

/usr/local/lib/python3.6/dist-packages/h2o/job.py in poll(self, poll_updates)
     58 pb.execute(self._refresh_job_status, print_verbose_info=ft.partial(poll_updates, self))
     59 else:
---> 60 pb.execute(self._refresh_job_status)
     61 except StopIteration as e:
     62 if str(e) == "cancelled":

/usr/local/lib/python3.6/dist-packages/h2o/utils/progressbar.py in execute(self, progress_fn, print_verbose_info)
    169 # Query the progress level, but only if it's time already
    170 if self._next_poll_time <= now:
--> 171 res = progress_fn() # may raise StopIteration
    172 assert_is_type(res, (numeric, numeric), numeric)
    173 if not isinstance(res, tuple):

/usr/local/lib/python3.6/dist-packages/h2o/job.py in _refresh_job_status(self)
     96 def _refresh_job_status(self):
     97 if self._poll_count <= 0: raise StopIteration("")
---> 98 jobs = h2o.api("GET /3/Jobs/%s" % self.job_key)
     99 self.job = jobs["jobs"][0] if "jobs" in jobs else jobs["job"][0]
    100 self.status = self.job["status"]

/usr/local/lib/python3.6/dist-packages/h2o/h2o.py in api(endpoint, data, json, filename, save_to)
    107 # type checks are performed in H2OConnection class
    108 _check_connection()
--> 109 return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
    110
    111

/usr/local/lib/python3.6/dist-packages/h2o/backend/connection.py in request(self, endpoint, data, json, filename, save_to)
    481 if self._local_server and not self._local_server.is_running():
    482 self._log_end_exception("Local server has died.")
--> 483 raise H2OConnectionError("Local server has died unexpectedly. RIP.")
    484 else:
    485 self._log_end_exception(e)

H2OConnectionError: Local server has died unexpectedly. RIP.
{noformat}

exalate-issue-sync[bot] commented 1 year ago

Jan Sterba commented: Without the generated h2o.log file, it's impossible for me to see what's causing this.

exalate-issue-sync[bot] commented 1 year ago

Jan Sterba commented: Sorry, I missed the attached log file. Unfortunately, the log file does not contain anything useful. I will keep experimenting on my side.

exalate-issue-sync[bot] commented 1 year ago

Clem Wang commented: This is a bit different. I started H2O with Python:

h2o.init(
    strict_version_check=False,
    nthreads=1,
    log_dir="/tmp/clem-h2o/",
    log_level='TRACE'
)

which generated these log files:

ls -l clem-h2o/
total 36
-rw-r--r-- 1 root root 18895 Mar 19 23:03 h2o_127.0.0.1_54321-3-info.log
-rw-r--r-- 1 root root 0 Mar 19 23:02 h2o_127.0.0.1_54321-4-warn.log
-rw-r--r-- 1 root root 0 Mar 19 23:02 h2o_127.0.0.1_54321-5-error.log
-rw-r--r-- 1 root root 0 Mar 19 23:02 h2o_127.0.0.1_54321-6-fatal.log
-rw-r--r-- 1 root root 13574 Mar 19 23:04 h2o_127.0.0.1_54321-httpd.log

The logs are attached as a tgz file.

exalate-issue-sync[bot] commented 1 year ago

Clem Wang commented: I have another test case….

I tried dropping some of the data:

{{killer = killer.drop(list(range(15000000)), axis=0)}}

The training still dies in a similar (same?) way, but the log files are MUCH bigger.

{{ls -l clem-h2o/}}
{{total 135712}}
{{-rw-r--r-- 1 root root 11099 Mar 19 23:12 h2o_127.0.0.1_54321-3-info.log}}
{{-rw-r--r-- 1 root root 138921951 Mar 19 23:12 h2o_127.0.0.1_54321-3-info.log.1}}
{{-rw-r--r-- 1 root root 0 Mar 19 23:02 h2o_127.0.0.1_54321-4-warn.log}}
{{-rw-r--r-- 1 root root 0 Mar 19 23:02 h2o_127.0.0.1_54321-5-error.log}}
{{-rw-r--r-- 1 root root 0 Mar 19 23:02 h2o_127.0.0.1_54321-6-fatal.log}}
{{-rw-r--r-- 1 root root 29407 Mar 19 23:13 h2o_127.0.0.1_54321-httpd.log}}

Would you like these logs?
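
With a rotated info log of roughly 139 MB, it may help to pre-filter before uploading. The sketch below streams a log file line by line and keeps only the most recent lines that contain one of a few marker strings. The marker list is an assumption chosen for illustration (this thread does not say which strings the H2O logs actually contain), and `scan_log` is a hypothetical helper:

```python
from collections import Counter, deque


def scan_log(path, markers=("ERROR", "FATAL", "OutOfMemoryError", "SIGSEGV"), keep=20):
    """Stream a log file and collect lines containing any marker.

    Returns (counts, hits): counts maps each marker to the number of lines
    attributed to it, and hits holds the last `keep` matching lines as
    (line_number, text) tuples. The default markers are illustrative
    guesses, not strings confirmed to appear in H2O logs.
    """
    counts = Counter()
    hits = deque(maxlen=keep)          # only the most recent matches are kept
    with open(path, "r", errors="replace") as fh:
        for lineno, line in enumerate(fh, start=1):
            for marker in markers:
                if marker in line:     # case-sensitive substring match
                    counts[marker] += 1
                    hits.append((lineno, line.rstrip("\n")))
                    break              # attribute each line to one marker only
    return counts, list(hits)
```

Because the file is read one line at a time, memory use stays small regardless of log size; for example, `scan_log("/tmp/clem-h2o/h2o_127.0.0.1_54321-3-info.log.1")` would process the large rotated file listed above.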

exalate-issue-sync[bot] commented 1 year ago

Jan Sterba commented: Yes, please. So far I have nothing to go on. Also, what HW/OS/version are you running?

exalate-issue-sync[bot] commented 1 year ago

Clem Wang commented:

{{root@clem-2ewang-jupyter:/home/clem.wang# uname -a}}
{{Linux clem-2ewang-jupyter 4.14.152-127.182.amzn2.x86_64 #1 SMP Thu Nov 14 17:32:43 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux}}
{{root@clem-2ewang-jupyter:/home/clem.wang# lsb_release -a}}
{{No LSB modules are available.}}
{{Distributor ID: Ubuntu}}
{{Description: Ubuntu 18.04.4 LTS}}
{{Release: 18.04}}
{{Codename: bionic}}
{{root@clem-2ewang-jupyter:/home/clem.wang# cat /proc/version}}
{{Linux version 4.14.152-127.182.amzn2.x86_64 (mockbuild@ip-10-0-1-129) (gcc version 7.3.1 20180712 (Red Hat 7.3.1-6) (GCC)) #1 SMP Thu Nov 14 17:32:43 UTC 2019}}

{{# lshw}} {{clem-2ewang-jupyter}} {{description: Computer}} {{width: 64 bits}} {{capabilities: smp vsyscall32}} {{-core}} {{description: Motherboard}} {{physical id: 0}} {{-memory}} {{description: System memory}} {{physical id: 0}} {{size: 747GiB}} {{-cpu:0}} {{product: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz}} {{vendor: Intel Corp.}} {{physical id: 1}} {{bus info: cpu@0}} {{width: 64 bits}} {{capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp x86-64 constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke}} {{-cpu:1}} {{product: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz}} {{vendor: Intel Corp.}} {{physical id: 2}} {{bus info: cpu@1}} {{width: 64 bits}} {{capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp x86-64 constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke}} {{-pci}} {{description: Host bridge}} {{product: 440FX - 82441FX PMC [Natoma]}} {{vendor: Intel Corporation}} {{physical id: 100}} {{bus info: pci@0000:00:00.0}} {{version: 00}} {{width: 32 
bits}} {{clock: 33MHz}} {{-isa}} {{description: ISA bridge}} {{product: 82371SB PIIX3 ISA [Natoma/Triton II]}} {{vendor: Intel Corporation}} {{physical id: 1}} {{bus info: pci@0000:00:01.0}} {{version: 00}} {{width: 32 bits}} {{clock: 33MHz}} {{capabilities: isa}} {{configuration: latency=0}} {{-generic UNCLAIMED}} {{description: Non-VGA unclassified device}} {{product: 82371AB/EB/MB PIIX4 ACPI}} {{vendor: Intel Corporation}} {{physical id: 1.3}} {{bus info: pci@0000:00:01.3}} {{version: 08}} {{width: 32 bits}} {{clock: 33MHz}} {{configuration: latency=0}} {{-display UNCLAIMED}} {{description: VGA compatible controller}} {{product: Amazon.com, Inc.}} {{vendor: Amazon.com, Inc.}} {{physical id: 3}} {{bus info: pci@0000:00:03.0}} {{version: 00}} {{width: 32 bits}} {{clock: 33MHz}} {{capabilities: vga_controller}} {{configuration: latency=0}} {{resources: memory:fe400000-fe7fffff memory:c0000-dffff}} {{-storage:0}} {{description: Non-Volatile memory controller}} {{product: Amazon.com, Inc.}} {{vendor: Amazon.com, Inc.}} {{physical id: 4}} {{bus info: pci@0000:00:04.0}} {{version: 00}} {{width: 32 bits}} {{clock: 33MHz}} {{capabilities: storage nvm_express bus_master cap_list}} {{configuration: driver=nvme latency=0}} {{resources: irq:11 memory:febe0000-febe3fff}} {{-network:0}} {{description: Ethernet controller}} {{product: Elastic Network Adapter (ENA)}} {{vendor: Amazon.com, Inc.}} {{physical id: 5}} {{bus info: pci@0000:00:05.0}} {{version: 00}} {{width: 32 bits}} {{clock: 33MHz}} {{capabilities: bus_master cap_list}} {{configuration: driver=ena latency=0}} {{resources: irq:0 memory:febe4000-febe7fff memory:fe800000-fe8fffff memory:febd0000-febdffff}} {{-network:1}} {{description: Ethernet controller}} {{product: Elastic Network Adapter (ENA)}} {{vendor: Amazon.com, Inc.}} {{physical id: 6}} {{bus info: pci@0000:00:06.0}} {{version: 00}} {{width: 32 bits}} {{clock: 33MHz}} {{capabilities: bus_master cap_list}} {{configuration: driver=ena latency=0}} {{resources: 
irq:0 memory:c0110000-c0113fff memory:c0000000-c00fffff memory:c0100000-c010ffff}} {{-storage:1}} {{description: Non-Volatile memory controller}} {{product: Amazon.com, Inc.}} {{vendor: Amazon.com, Inc.}} {{physical id: 1b}} {{bus info: pci@0000:00:1b.0}} {{version: 00}} {{width: 32 bits}} {{clock: 33MHz}} {{capabilities: storage nvm_express bus_master cap_list}} {{configuration: driver=nvme latency=0}} {{resources: irq:10 memory:c0114000-c0117fff}} {{-storage:2}} {{description: Non-Volatile memory controller}} {{product: NVMe SSD Controller}} {{vendor: Amazon.com, Inc.}} {{physical id: 1c}} {{bus info: pci@0000:00:1c.0}} {{version: 00}} {{width: 32 bits}} {{clock: 33MHz}} {{capabilities: storage nvm_express bus_master cap_list}} {{configuration: driver=nvme latency=0}} {{resources: irq:0 memory:febe8000-febebfff memory:fe900000-fe901fff}} {{-storage:3}} {{description: Non-Volatile memory controller}} {{product: NVMe SSD Controller}} {{vendor: Amazon.com, Inc.}} {{physical id: 1d}} {{bus info: pci@0000:00:1d.0}} {{version: 00}} {{width: 32 bits}} {{clock: 33MHz}} {{capabilities: storage nvm_express bus_master cap_list}} {{configuration: driver=nvme latency=0}} {{resources: irq:0 memory:febec000-febeffff memory:fe902000-fe903fff}} {{-storage:4}} {{description: Non-Volatile memory controller}} {{product: NVMe SSD Controller}} {{vendor: Amazon.com, Inc.}} {{physical id: 1e}} {{bus info: pci@0000:00:1e.0}} {{version: 00}} {{width: 32 bits}} {{clock: 33MHz}} {{capabilities: storage nvm_express bus_master cap_list}} {{configuration: driver=nvme latency=0}} {{resources: irq:0 memory:febf0000-febf3fff memory:fe904000-fe905fff}} {{-storage:5}} {{description: Non-Volatile memory controller}} {{product: NVMe SSD Controller}} {{vendor: Amazon.com, Inc.}} {{physical id: 1f}} {{bus info: pci@0000:00:1f.0}} {{version: 00}} {{width: 32 bits}} {{clock: 33MHz}} {{capabilities: storage nvm_express bus_master cap_list}} {{configuration: driver=nvme latency=0}} {{resources: irq:0 
memory:febf4000-febf7fff memory:fe906000-fe907fff}} {{*-network}} {{description: Ethernet interface}} {{physical id: 1}} {{logical name: eth0}} {{serial: da:ba:10:fb:c6:c4}} {{size: 10Gbit/s}} {{capabilities: ethernet physical}} {{configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full ip=10.86.91.6 link=yes multicast=yes port=twisted pair speed=10Gbit/s}}

Let me know if there are other useful unix commands to run.

exalate-issue-sync[bot] commented 1 year ago

Clem Wang commented: log files from this crash are here: [https://we.tl/t-hYBYzFeDqH|https://we.tl/t-hYBYzFeDqH] (too big to attach).

h2o.init( strict_version_check=False, nthreads=1, log_dir="/tmp/clem-h2o/", log_level='TRACE' )

{noformat}
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_242"; OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08); OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
  Starting server from /usr/local/lib/python3.6/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpguf4e1xq
  JVM stdout: /tmp/tmpguf4e1xq/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpguf4e1xq/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
{noformat}

|H2O cluster uptime:|01 secs|
|H2O cluster timezone:|Etc/UTC|
|H2O data parsing timezone:|UTC|
|H2O cluster version:|3.28.1.2|
|H2O cluster version age:|2 days|
|H2O cluster name:|H2O_from_python_unknownUser_rlgkzb|
|H2O cluster total nodes:|1|
|H2O cluster free memory:|4.445 Gb|
|H2O cluster total cores:|4|
|H2O cluster allowed cores:|1|
|H2O cluster status:|accepting new members, healthy|
|H2O connection url:|http://127.0.0.1:54321|
|H2O connection proxy:|{'http': None, 'https': None}|
|H2O internal security:|False|
|H2O API Extensions:|Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4|
|Python version:|3.6.9 final|
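
The startup output above names the JVM stdout and stderr files under the ice root. When the local server dies, the tail of those files may be worth checking alongside the H2O logs. A HotSpot JVM that crashes in native code normally writes an `hs_err_pid<pid>.log` file into its working directory. The helper names `tail` and `find_hs_err_logs` below are hypothetical, and which directory the JVM used as its working directory here is an assumption, so treat this as a minimal sketch rather than a prescribed procedure:

```python
from collections import deque
from pathlib import Path


def tail(path, n=20):
    """Return the last n lines of a text file, or [] if the file is missing."""
    p = Path(path)
    if not p.is_file():
        return []
    with p.open(errors="replace") as fh:
        # deque with maxlen keeps only the final n lines while streaming the file
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


def find_hs_err_logs(directory):
    """List HotSpot fatal-error logs (hs_err_pid*.log) found in a directory."""
    return sorted(Path(directory).glob("hs_err_pid*.log"))


if __name__ == "__main__":
    # Path taken from the h2o.init() output above; it may no longer exist.
    stderr_file = "/tmp/tmpguf4e1xq/h2o_unknownUser_started_from_python.err"
    for line in tail(stderr_file, n=50):
        print(line)
    # Assumption: the JVM's working directory was the notebook's current directory.
    for log in find_hs_err_logs("."):
        print("possible JVM crash log:", log)
```

If neither file shows anything, the process may have been killed from outside the JVM (for example by the kernel's out-of-memory killer), which would leave no Java-side trace.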

[41]:

{noformat}
killer = h2o.import_file(path = "killerh2o")
x = killer.columns[:]
x.remove('y')
y = 'y'
{noformat}

killer = killer.drop(list(range(15000000)), axis=0)

{noformat}killer.nrows{noformat}

[45]:

{noformat}10694786{noformat}

{noformat}
from h2o.estimators import H2OXGBoostEstimator

param = {
    "ntrees": 15,
    "min_rows": 5,
    "max_depth": 5,
    "learn_rate": 0.02,
    "sample_rate": 0.7,
    "col_sample_rate_per_tree": 0.9,
    "seed": 42,
    "score_tree_interval": 100,
}

model = H2OXGBoostEstimator(**param)
model.train(x=x, y=y, training_frame=killer)
{noformat}

{noformat}xgboost Model Build progress: |███████

ConnectionResetError Traceback (most recent call last) /usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw) 599 body=body, headers=headers, --> 600 chunked=chunked) 601

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw) 383 # otherwise it looks like a programming error was the cause. --> 384 six.raise_from(e, None) 385 except (SocketTimeout, BaseSSLError, SocketError) as e:

/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py in raise_from(value, from_value)

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw) 379 try: --> 380 httplib_response = conn.getresponse() 381 except Exception as e:

/usr/lib/python3.6/http/client.py in getresponse(self) 1345 try: -> 1346 response.begin() 1347 except ConnectionError:

/usr/lib/python3.6/http/client.py in begin(self) 306 while True: --> 307 version, status, reason = self._read_status() 308 if status != CONTINUE:

/usr/lib/python3.6/http/client.py in _read_status(self) 267 def _read_status(self): --> 268 line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1") 269 if len(line) > _MAXLINE:

/usr/lib/python3.6/socket.py in readinto(self, b) 585 try: --> 586 return self._sock.recv_into(b) 587 except timeout:

ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

ProtocolError Traceback (most recent call last) /usr/local/lib/python3.6/dist-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies) 448 retries=self.max_retries, --> 449 timeout=timeout 450 )

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw) 637 retries = retries.increment(method, url, error=e, _pool=self, --> 638 _stacktrace=sys.exc_info()[2]) 639 retries.sleep()

/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace) 366 if read is False or not self._is_method_retryable(method): --> 367 raise six.reraise(type(error), error, _stacktrace) 368 elif read is not None:

/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py in reraise(tp, value, tb) 684 if value.__traceback__ is not tb: --> 685 raise value.with_traceback(tb) 686 raise value

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw) 599 body=body, headers=headers, --> 600 chunked=chunked) 601

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw) 383 # otherwise it looks like a programming error was the cause. --> 384 six.raise_from(e, None) 385 except (SocketTimeout, BaseSSLError, SocketError) as e:

/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py in raise_from(value, from_value)

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw) 379 try: --> 380 httplib_response = conn.getresponse() 381 except Exception as e:

/usr/lib/python3.6/http/client.py in getresponse(self) 1345 try: -> 1346 response.begin() 1347 except ConnectionError:

/usr/lib/python3.6/http/client.py in begin(self) 306 while True: --> 307 version, status, reason = self._read_status() 308 if status != CONTINUE:

/usr/lib/python3.6/http/client.py in _read_status(self) 267 def _read_status(self): --> 268 line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1") 269 if len(line) > _MAXLINE:

/usr/lib/python3.6/socket.py in readinto(self, b) 585 try: --> 586 return self._sock.recv_into(b) 587 except timeout:

ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

ConnectionError Traceback (most recent call last) /usr/local/lib/python3.6/dist-packages/h2o/backend/connection.py in request(self, endpoint, data, json, filename, save_to) 473 headers=headers, timeout=self._timeout, stream=stream, --> 474 auth=self._auth, verify=verify, proxies=self._proxies) 475 if isinstance(save_to, types.FunctionType):

/usr/local/lib/python3.6/dist-packages/requests/api.py in request(method, url, **kwargs) 59 with sessions.Session() as session: ---> 60 return session.request(method=method, url=url, **kwargs) 61

/usr/local/lib/python3.6/dist-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json) 532 send_kwargs.update(settings) --> 533 resp = self.send(prep, **send_kwargs) 534

/usr/local/lib/python3.6/dist-packages/requests/sessions.py in send(self, request, **kwargs) 645 # Send the request --> 646 r = adapter.send(request, **kwargs) 647

/usr/local/lib/python3.6/dist-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies) 497 except (ProtocolError, socket.error) as err: --> 498 raise ConnectionError(err, request=request) 499

ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

H2OConnectionError Traceback (most recent call last)

in 17 18 model = H2OXGBoostEstimator(**param) ---> 19 model.train(x=x, y=y, training_frame=killer)

/usr/local/lib/python3.6/dist-packages/h2o/estimators/estimator_base.py in train(self, x, y, training_frame, offset_column, fold_column, weights_column, validation_frame, max_runtime_secs, ignored_columns, model_id, verbose) 110 self._train(x=x, y=y, training_frame=training_frame, offset_column=offset_column, fold_column=fold_column, 111 weights_column=weights_column, validation_frame=validation_frame, max_runtime_secs=max_runtime_secs, --> 112 ignored_columns=ignored_columns, model_id=model_id, verbose=verbose) 113 114

/usr/local/lib/python3.6/dist-packages/h2o/estimators/estimator_base.py in _train(self, x, y, training_frame, offset_column, fold_column, weights_column, validation_frame, max_runtime_secs, ignored_columns, model_id, verbose, extend_parms_fn) 263 return 264 --> 265 model.poll(poll_updates=self._print_model_scoring_history if verbose else None) 266 model_json = h2o.api("GET /%d/Models/%s" % (rest_ver, model.dest_key))["models"][0] 267 self._resolve_model(model.dest_key, model_json)

/usr/local/lib/python3.6/dist-packages/h2o/job.py in poll(self, poll_updates) 58 pb.execute(self._refresh_job_status, print_verbose_info=ft.partial(poll_updates, self)) 59 else: ---> 60 pb.execute(self._refresh_job_status) 61 except StopIteration as e: 62 if str(e) == "cancelled":

/usr/local/lib/python3.6/dist-packages/h2o/utils/progressbar.py in execute(self, progress_fn, print_verbose_info) 169 # Query the progress level, but only if it's time already 170 if self._next_poll_time <= now: --> 171 res = progress_fn() # may raise StopIteration 172 assert_is_type(res, (numeric, numeric), numeric) 173 if not isinstance(res, tuple):

/usr/local/lib/python3.6/dist-packages/h2o/job.py in _refresh_job_status(self) 96 def _refresh_job_status(self): 97 if self._poll_count <= 0: raise StopIteration("") ---> 98 jobs = h2o.api("GET /3/Jobs/%s" % self.job_key) 99 self.job = jobs["jobs"][0] if "jobs" in jobs else jobs["job"][0] 100 self.status = self.job["status"]

/usr/local/lib/python3.6/dist-packages/h2o/h2o.py in api(endpoint, data, json, filename, save_to) 107 # type checks are performed in H2OConnection class 108 _check_connection() --> 109 return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to) 110 111

/usr/local/lib/python3.6/dist-packages/h2o/backend/connection.py in request(self, endpoint, data, json, filename, save_to) 481 if self._local_server and not self._local_server.is_running(): 482 self._log_end_exception("Local server has died.") --> 483 raise H2OConnectionError("Local server has died unexpectedly. RIP.") 484 else: 485 self._log_end_exception(e)

H2OConnectionError: Local server has died unexpectedly. RIP.
{noformat}

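
The traceback above is a chain of four exceptions joined by "During handling of the above exception, another exception occurred": `ConnectionResetError`, then `ProtocolError`, then `ConnectionError`, then `H2OConnectionError`. The innermost one says only that the server side closed the socket; none of the Python frames explain why the JVM went away. As an illustration of how such a chain can be walked programmatically, here is a minimal sketch. `exception_chain` and `simulate` are hypothetical names, and the built-in exceptions raised in `simulate` are stand-ins for the urllib3, requests and h2o classes, not the real ones:

```python
def exception_chain(exc):
    """Return the exceptions linked to exc, from outermost to innermost."""
    chain = []
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))  # guard against reference cycles
        chain.append(exc)
        # __cause__ is set by "raise ... from ...";
        # __context__ is set implicitly when raising inside an except block
        exc = exc.__cause__ or exc.__context__
    return chain


def simulate():
    """Reproduce the shape of the chain above using built-in stand-ins."""
    try:
        try:
            raise ConnectionResetError(104, "Connection reset by peer")
        except ConnectionResetError:
            raise ConnectionError("Connection aborted.")
    except ConnectionError:
        # stand-in for H2OConnectionError
        raise RuntimeError("Local server has died unexpectedly. RIP.")


if __name__ == "__main__":
    try:
        simulate()
    except RuntimeError as err:
        for link in exception_chain(err):
            print(type(link).__name__, "-", link)
```

The last entry of the returned list is the root cause on the client side, which in this issue is the reset connection.
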
exalate-issue-sync[bot] commented 1 year ago

Jan Sterba commented: unable to reproduce

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-7321
Assignee: Jan Sterba
Reporter: Clem Wang
State: Closed
Fix Version: N/A
Attachments: Available (Count: 2)
Development PRs: N/A

Attachments From Jira

Attachment Name: clem-h2o-2020-03-19.logs.tgz
Attached By: Clem Wang
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7321/clem-h2o-2020-03-19.logs.tgz

Attachment Name: h2o.log.gz
Attached By: Clem Wang
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7321/h2o.log.gz