h2oai / h2o-3


Python API subsetting dataframe fails for different values #11593

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

I get different errors depending on how many times I've subset a dataframe. With the attached dataframe, subsetting on values that do exist in the frame either behaves as if they don't exist or throws one of the errors below (full tracebacks at the bottom). Which error appears depends on the type of the column you subset by (string or real, for example).

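{code}
import h2o
h2o.init()  # assumes a local H2O cluster can be started or is already running

tester = h2o.import_file('path/to/example_1.csv')
tester.head()
tester[tester['current_actual_upb']==188348]           # subset on a numeric column
tester[tester['monthly_reporting_period']=='200504']   # subset on a string column
{code}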

Note that the following works just fine: tester[tester['current_actual_upb']==189000]

(Note: the second time I tried this, it returned nothing at all: no result and no error message. This is in a Jupyter notebook with Python 3.5 and h2o 3.10.5.3.)

TypeError: zip_longest argument #2 must support iteration and

{code}
Error: Not a String
Request: GET /3/Frames/py_19_sid_8630
  params: {'row_count': '10'}
{code}

{code}
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.5/site-packages/IPython/core/formatters.py in __call__(self, obj)
    668 type_pprinters=self.type_printers,
    669 deferred_pprinters=self.deferred_printers)
--> 670 printer.pretty(obj)
    671 printer.flush()
    672 return stream.getvalue()

/usr/local/lib/python3.5/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    381 if callable(meth):
    382 return meth(obj, self, cycle)
--> 383 return _default_pprint(obj, self, cycle)
    384 finally:
    385 self.end_group()

/usr/local/lib/python3.5/site-packages/IPython/lib/pretty.py in _default_pprint(obj, p, cycle)
    501 if _safe_getattr(klass, '__repr__', None) not in _baseclass_reprs:
    502 # A user-provided repr. Find newlines and replace them with p.break_()
--> 503 _repr_pprint(obj, p, cycle)
    504 return
    505 p.begin_group(1, '<')

/usr/local/lib/python3.5/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    692 """A pprint that just redirects to the normal repr function."""
    693 # Find newlines and replace them with p.break_()
--> 694 output = repr(obj)
    695 for idx,output_line in enumerate(output.splitlines()):
    696 if idx:

/usr/local/lib/python3.5/site-packages/h2o/frame.py in __repr__(self)
    403 stk = traceback.extract_stack()
    404 if not ("IPython" in stk[-2][0] and "info" == stk[-2][2]):
--> 405 self.show()
    406 return ""
    407

/usr/local/lib/python3.5/site-packages/h2o/frame.py in show(self, use_pandas)
    421 IPython.display.display(self.head().as_data_frame(True))
    422 else:
--> 423 IPython.display.display_html(self._ex._cache._tabulate("html", False), raw=True)
    424 else:
    425 if use_pandas and can_use_pandas():

/usr/local/lib/python3.5/site-packages/h2o/expr.py in _tabulate(self, tablefmt, rollups)
    361 x = [v['type'], mins, v['mean'], maxs, v['sigma'], v['zero_count'], v['missing_count']] + x
    362 d[k] = x # Insert into ordered-dict
--> 363 return tabulate.tabulate(d, headers="keys", tablefmt=tablefmt)
    364
    365 def flush(self): # flush everything but the frame_id

/usr/local/lib/python3.5/site-packages/tabulate.py in tabulate(tabular_data, headers, tablefmt, floatfmt, numalign, stralign, missingval, showindex, disable_numparse)
   1107 tabular_data = []
   1108 list_of_lists, headers = _normalize_tabular_data(
-> 1109 tabular_data, headers, showindex=showindex)
   1110
   1111 # empty values in the first column of RST tables should be escaped (issue #82)

/usr/local/lib/python3.5/site-packages/tabulate.py in _normalize_tabular_data(tabular_data, headers, showindex)
    727 # likely a conventional dict
    728 keys = tabular_data.keys()
--> 729 rows = list(izip_longest(*tabular_data.values())) # columns have to be transposed
    730 elif hasattr(tabular_data, "index"):
    731 # values is a property, has .index => it's likely a pandas.DataFrame (pandas 0.11.0)

TypeError: zip_longest argument #2 must support iteration
{code}
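For reference, the last frame of this traceback transposes the dict of columns with `izip_longest(*tabular_data.values())`. The sketch below uses only the standard library to show how that call fails with this kind of `TypeError` when the second dict value is not iterable. The helper `transpose_columns` and the sample data are hypothetical and only illustrate the mechanism; they are not a confirmed diagnosis of what H2O puts into the dict.

```python
from collections import OrderedDict
from itertools import zip_longest


def transpose_columns(columns):
    """Turn a dict of column name -> list of cells into a list of row tuples.

    Mirrors the `list(izip_longest(*tabular_data.values()))` call that
    appears in the last frame of the traceback.
    """
    return list(zip_longest(*columns.values()))


# Every value is a list of cells, so the transpose succeeds.
good = OrderedDict([("col_a", ["int", 1, 2]), ("col_b", ["string", "x", "y"])])
rows = transpose_columns(good)

# Hypothetical failure case: the second value is a bare float, not a list.
bad = OrderedDict([("col_a", ["int", 1, 2]), ("col_b", 3.0)])
try:
    transpose_columns(bad)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(rows)
print(error_message)
```

If something similar happens here, one of the per-column entries built in `_tabulate` would have to be a scalar instead of a list, but that is an assumption that would need checking against the cached frame data.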

{code}
H2OResponseError                          Traceback (most recent call last)
/usr/local/lib/python3.5/site-packages/IPython/core/formatters.py in __call__(self, obj)
    668 type_pprinters=self.type_printers,
    669 deferred_pprinters=self.deferred_printers)
--> 670 printer.pretty(obj)
    671 printer.flush()
    672 return stream.getvalue()

/usr/local/lib/python3.5/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    381 if callable(meth):
    382 return meth(obj, self, cycle)
--> 383 return _default_pprint(obj, self, cycle)
    384 finally:
    385 self.end_group()

/usr/local/lib/python3.5/site-packages/IPython/lib/pretty.py in _default_pprint(obj, p, cycle)
    501 if _safe_getattr(klass, '__repr__', None) not in _baseclass_reprs:
    502 # A user-provided repr. Find newlines and replace them with p.break_()
--> 503 _repr_pprint(obj, p, cycle)
    504 return
    505 p.begin_group(1, '<')

/usr/local/lib/python3.5/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    692 """A pprint that just redirects to the normal repr function."""
    693 # Find newlines and replace them with p.break_()
--> 694 output = repr(obj)
    695 for idx,output_line in enumerate(output.splitlines()):
    696 if idx:

/usr/local/lib/python3.5/site-packages/h2o/frame.py in __repr__(self)
    403 stk = traceback.extract_stack()
    404 if not ("IPython" in stk[-2][0] and "info" == stk[-2][2]):
--> 405 self.show()
    406 return ""
    407

/usr/local/lib/python3.5/site-packages/h2o/frame.py in show(self, use_pandas)
    415 print("This H2OFrame has been removed.")
    416 return
--> 417 if not self._ex._cache.is_valid(): self._frame()._ex._cache.fill()
    418 if H2ODisplay._in_ipy():
    419 import IPython.display

/usr/local/lib/python3.5/site-packages/h2o/expr.py in fill(self, rows)
    306 if rows <= len(self):
    307 return
--> 308 res = h2o.api("GET /3/Frames/%s" % self._id, data={"row_count": rows})["frames"][0]
    309 self._l = rows
    310 self._nrows = res["rows"]

/usr/local/lib/python3.5/site-packages/h2o/h2o.py in api(endpoint, data, json, filename, save_to)
    100 # type checks are performed in H2OConnection class
    101 _check_connection()
--> 102 return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
    103
    104

/usr/local/lib/python3.5/site-packages/h2o/backend/connection.py in request(self, endpoint, data, json, filename, save_to)
    400 auth=self._auth, verify=self._verify_ssl_cert, proxies=self._proxies)
    401 self._log_end_transaction(start_time, resp)
--> 402 return self._process_response(resp, save_to)
    403
    404 except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError) as e:

/usr/local/lib/python3.5/site-packages/h2o/backend/connection.py in _process_response(response, save_to)
    723 # Client errors (400 = "Bad Request", 404 = "Not Found", 412 = "Precondition Failed")
    724 if status_code in {400, 404, 412} and isinstance(data, (H2OErrorV3, H2OModelBuilderErrorV3)):
--> 725 raise H2OResponseError(data)
    726
    727 # Server errors (notably 500 = "Server Error")

H2OResponseError: Server error java.lang.IllegalArgumentException:
  Error: Not a String
  Request: GET /3/Frames/py_19_sid_8630
    params: {'row_count': '10'}
{code}

and

where the column to subset on is an enum:

{code}
H2OServerError                            Traceback (most recent call last)
/usr/local/lib/python3.5/site-packages/IPython/core/formatters.py in __call__(self, obj)
    668 type_pprinters=self.type_printers,
    669 deferred_pprinters=self.deferred_printers)
--> 670 printer.pretty(obj)
    671 printer.flush()
    672 return stream.getvalue()

/usr/local/lib/python3.5/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    381 if callable(meth):
    382 return meth(obj, self, cycle)
--> 383 return _default_pprint(obj, self, cycle)
    384 finally:
    385 self.end_group()

/usr/local/lib/python3.5/site-packages/IPython/lib/pretty.py in _default_pprint(obj, p, cycle)
    501 if _safe_getattr(klass, '__repr__', None) not in _baseclass_reprs:
    502 # A user-provided repr. Find newlines and replace them with p.break_()
--> 503 _repr_pprint(obj, p, cycle)
    504 return
    505 p.begin_group(1, '<')

/usr/local/lib/python3.5/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    692 """A pprint that just redirects to the normal repr function."""
    693 # Find newlines and replace them with p.break_()
--> 694 output = repr(obj)
    695 for idx,output_line in enumerate(output.splitlines()):
    696 if idx:

/usr/local/lib/python3.5/site-packages/h2o/frame.py in __repr__(self)
    402 stk = traceback.extract_stack()
    403 if not ("IPython" in stk[-2][0] and "info" == stk[-2][2]):
--> 404 self.show()
    405 return ""
    406

/usr/local/lib/python3.5/site-packages/h2o/frame.py in show(self, use_pandas)
    414 print("This H2OFrame has been removed.")
    415 return
--> 416 if not self._ex._cache.is_valid(): self._frame()._ex._cache.fill()
    417 if H2ODisplay._in_ipy():
    418 import IPython.display

/usr/local/lib/python3.5/site-packages/h2o/frame.py in _frame(self, rows, fill_cache)
    473
    474 def _frame(self, rows=10, fill_cache=False):
--> 475 self._ex._eager_frame()
    476 if fill_cache:
    477 self._ex._cache.fill(rows=rows)

/usr/local/lib/python3.5/site-packages/h2o/expr.py in _eager_frame(self)
     86 if not self._cache.is_empty(): return
     87 if self._cache._id is not None: return # Data already computed under ID, but not cached locally
---> 88 self._eval_driver(True)
     89
     90 def _eager_scalar(self): # returns a scalar (or a list of scalars)

/usr/local/lib/python3.5/site-packages/h2o/expr.py in _eval_driver(self, top)
    100 def _eval_driver(self, top):
    101 exec_str = self._get_ast_str(top)
--> 102 res = ExprNode.rapids(exec_str)
    103 if 'scalar' in res:
    104 if isinstance(res['scalar'], list):

/usr/local/lib/python3.5/site-packages/h2o/expr.py in rapids(expr)
    205 :returns: The JSON response (as a python dictionary) of the Rapids execution
    206 """
--> 207 return h2o.api("POST /99/Rapids", data={"ast": expr, "session_id": h2o.connection().session_id})
    208
    209

/usr/local/lib/python3.5/site-packages/h2o/h2o.py in api(endpoint, data, json, filename, save_to)
    100 # type checks are performed in H2OConnection class
    101 _check_connection()
--> 102 return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
    103
    104

/usr/local/lib/python3.5/site-packages/h2o/backend/connection.py in request(self, endpoint, data, json, filename, save_to)
    400 auth=self._auth, verify=self._verify_ssl_cert, proxies=self._proxies)
    401 self._log_end_transaction(start_time, resp)
--> 402 return self._process_response(resp, save_to)
    403
    404 except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError) as e:

/usr/local/lib/python3.5/site-packages/h2o/backend/connection.py in _process_response(response, save_to)
    728 # Note that it is possible to receive valid H2OErrorV3 object in this case, however it merely means the server
    729 # did not provide the correct status code.
--> 730 raise H2OServerError("HTTP %d %s:\n%r" % (status_code, response.reason, data))
    731
    732

H2OServerError: HTTP 500 Server Error:
Server error water.util.DistributedException:
  Error: DistributedException from /127.0.0.1:54321: 'NewChunk has type Numeric, but the Vec is of type String'
  Request: None
{code}

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-4714
Assignee: Navdeep Gill
Reporter: Lauren DiPerna
State: Open
Fix Version: N/A
Attachments: Available (Count: 1)
Development PRs: N/A

Attachments From Jira

Attachment Name: example_1.csv
Attached By: Lauren DiPerna
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-4714/example_1.csv