h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

incorrectly conditionally changing string in a cell damages h2o DataFrame If there is another column that's numeric #8315

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

There is an element of pilot error on my part, but I seem to have damaged an H2O DataFrame through my error. When I correct the problem, the DataFrame isn't damaged, but it still doesn't seem to do the right thing. It should be more robust to dumb errors.

The bad line that provokes the problem is:

{code:python}df[df['e'] == 'a'] = 'c'{code}

But I think that even if I use the correct code, I don't get the right answer:

{code:python}df[df['e'] == 'a']['e'] = 'c'  # should only change Column 'e'
print(df)  # but it doesn't

e  n
a  1  # THIS LINE SHOULD BE c 1
b  2{code}
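One likely explanation for the unchanged output is chained indexing: `df[df['e'] == 'a']` produces a new frame, so the following `['e'] = 'c'` modifies that temporary result rather than `df`. The sketch below illustrates this with a toy class in plain Python. `ToyFrame` and its methods are hypothetical stand-ins, not H2O's API. In H2O itself, a single-step assignment such as `df[df['e'] == 'a', 'e'] = 'c'` may be the form to try, assuming `H2OFrame` accepts a (rows, cols) pair on assignment.

```python
class ToyFrame:
    """Hypothetical stand-in for a frame whose row selection returns a copy."""

    def __init__(self, columns):
        self.columns = {name: list(values) for name, values in columns.items()}

    def select_rows(self, mask):
        # Analogous to df[mask]: builds a NEW frame from the selected rows.
        return ToyFrame({
            name: [v for v, keep in zip(values, mask) if keep]
            for name, values in self.columns.items()
        })

    def set_column(self, name, value):
        # Analogous to frame[name] = value: overwrites one column in place.
        self.columns[name] = [value] * len(self.columns[name])

    def set_where(self, mask, name, value):
        # Analogous to df[mask, name] = value: one call on the original frame.
        self.columns[name] = [
            value if keep else old
            for old, keep in zip(self.columns[name], mask)
        ]


df = ToyFrame({'e': ['a', 'b'], 'n': [1, 2]})
mask = [v == 'a' for v in df.columns['e']]

# Chained form: the assignment lands on the temporary copy and is discarded.
df.select_rows(mask).set_column('e', 'c')
chained_result = list(df.columns['e'])

# Single-step form: the original frame is updated.
df.set_where(mask, 'e', 'c')
single_step_result = list(df.columns['e'])

print(chained_result)
print(single_step_result)
```

The chained call leaves `df` untouched, which matches the symptom reported above; only the single-step call changes column `e`, and column `n` is not affected.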

{code:python}import h2o
h2o.init()
ERROR = True  # Change to False to avoid error

if ERROR:
    data = {
        'e': ['a', 'b'],
        'n': [1, 2]  # creates a DF with two columns of different types
    }
else:
    data = {
        'e': ['a', 'b'],
        'n': [1, 2]
    }

df = h2o.H2OFrame(data)
print(df)
print(df.types)
print(df[df['e'] == 'a']['e'])

df[df['e'] == 'a']['e'] = 'c'

df[df['e'] == 'a'] = 'c'  # Programming error: should be df[df['e'] == 'a']['e'] = 'c'

print(df)  # Trying to display the DF causes an error

Parse progress: |█████████████████████████████████████████████████████████| 100%
e  n
a  1
b  2

{'e': 'string', 'n': 'int'}
e
a


H2OResponseError                          Traceback (most recent call last)

 in
     23 df[df['e'] == 'a'] = 'c' # Programming error: should be df[df['e'] == 'a']['e'] = 'c'
     24
---> 25 print(df)

/usr/local/lib/python3.6/dist-packages/h2o/frame.py in __repr__(self)
    546 stk = traceback.extract_stack()
    547 if not ("IPython" in stk[-2][0] and "info" == stk[-2][2]):
--> 548 self.show()
    549 return ""
    550

/usr/local/lib/python3.6/dist-packages/h2o/frame.py in show(self, use_pandas, rows, cols)
    580 print("This H2OFrame is empty.")
    581 return
--> 582 if not self._ex._cache.is_valid(): self._frame()._ex._cache.fill()
    583 if H2ODisplay._in_zep():
    584 print("%html " + self._ex._cache._tabulate("html", False, rows=rows))

/usr/local/lib/python3.6/dist-packages/h2o/frame.py in _frame(self, rows, rows_offset, cols, cols_offset, fill_cache)
    698
    699 def _frame(self, rows=10, rows_offset=0, cols=-1, cols_offset=0, fill_cache=False):
--> 700 self._ex._eager_frame()
    701 if fill_cache:
    702 self._ex._cache.fill(rows=rows, rows_offset=rows_offset, cols=cols, cols_offset=cols_offset)

/usr/local/lib/python3.6/dist-packages/h2o/expr.py in _eager_frame(self)
     93 if not self._cache.is_empty(): return
     94 if self._cache._id is not None: return # Data already computed under ID, but not cached locally
---> 95 self._eval_driver(True)
     96
     97 def _eager_scalar(self): # returns a scalar (or a list of scalars)

/usr/local/lib/python3.6/dist-packages/h2o/expr.py in _eval_driver(self, top)
    111 def _eval_driver(self, top):
    112 exec_str = self._get_ast_str(top)
--> 113 res = ExprNode.rapids(exec_str)
    114 if 'scalar' in res:
    115 if isinstance(res['scalar'], list):

/usr/local/lib/python3.6/dist-packages/h2o/expr.py in rapids(expr)
    239 :returns: The JSON response (as a python dictionary) of the Rapids execution
    240 """
--> 241 return h2o.api("POST /99/Rapids", data={"ast": expr, "session_id": h2o.connection().session_id})
    242
    243

/usr/local/lib/python3.6/dist-packages/h2o/h2o.py in api(endpoint, data, json, filename, save_to)
    121 # type checks are performed in H2OConnection class
    122 _check_connection()
--> 123 return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
    124
    125

/usr/local/lib/python3.6/dist-packages/h2o/backend/connection.py in request(self, endpoint, data, json, filename, save_to)
    476 save_to = save_to(resp)
    477 self._log_end_transaction(start_time, resp)
--> 478 return self._process_response(resp, save_to)
    479
    480 except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError) as e:

/usr/local/lib/python3.6/dist-packages/h2o/backend/connection.py in _process_response(response, save_to)
    822 # Client errors (400 = "Bad Request", 404 = "Not Found", 412 = "Precondition Failed")
    823 if status_code in {400, 404, 412} and isinstance(data, (H2OErrorV3, H2OModelBuilderErrorV3)):
--> 824 raise H2OResponseError(data)
    825
    826 # Server errors (notably 500 = "Server Error")

H2OResponseError: Server error java.lang.IllegalArgumentException:
  Error: Cannot assign value c into a vector of type Numeric.
  Request: POST /99/Rapids
    data: {'ast': "(tmp= py_37_sid_889e (:= Key_Frame__upload_84ff3042a415fdb87b09e9882bd14b12.hex 'c' [] (== (cols_py Key_Frame__upload_84ff3042a415fdb87b09e9882bd14b12.hex 'e') 'a')))", 'session_id': '_sid_889e'}{code}

{noformat}H2O Version: 3.28.0.3
Python 3.6.9
Ubuntu 18.04.3 LTS{noformat}
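The server error above says the string `'c'` cannot be assigned into a numeric vector, which is consistent with the assignment being applied to every column, including `n`. A client-side guard along the following lines could catch this before anything is sent to the server. This is only a sketch: `columns_rejecting` is a hypothetical helper, not part of H2O, and the set of numeric type names (`'int'`, `'real'`) is an assumption extrapolated from the `df.types` output shown above.

```python
def columns_rejecting(types, value):
    """Return the columns whose declared type cannot hold a string `value`.

    `types` mirrors the dict printed by `df.types`. The numeric type names
    below are an assumption, not an exhaustive list taken from H2O.
    """
    numeric_types = {'int', 'real'}
    if not isinstance(value, str):
        return []  # only the string-into-numeric case from the report is checked
    return [name for name, kind in types.items() if kind in numeric_types]


# The types reported for the frame in this issue.
types = {'e': 'string', 'n': 'int'}

bad_columns = columns_rejecting(types, 'c')
if bad_columns:
    print("Refusing to assign 'c' into numeric column(s):", bad_columns)
```

Checking before the assignment would let the mistake fail early with a clear message, instead of leaving a frame that raises when it is next displayed.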
h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-7322
Assignee: Michal Kurka
Reporter: Clem Wang
State: Open
Fix Version: Backlog
Attachments: N/A
Development PRs: N/A