bigwater opened this issue 5 years ago
Problem

I was trying to load a table into OmniSci from a pandas DataFrame. I created a DataFrame with two columns and N_GEN rows:
```python
from pymapd import connect
import pandas as pd
import numpy as np

con = connect(user="admin", password="HyperInteractive", host="localhost",
              dbname="omnisci", port=6274)

N_GEN = 2 ** 28
arr1 = np.random.rand(N_GEN)
arr2 = np.random.randint(100, size=N_GEN)
df_arr1 = pd.DataFrame(zip(arr1, arr2), columns=['num', 'grp'])
print(df_arr1.info())
print(df_arr1.shape)
```
We use N_GEN = 2 ** 28. The DataFrame uses 4.0 GB of memory (as reported by pandas' info()).
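As a quick sanity check on that figure (assuming one float64 and one int64 column, i.e. 16 bytes per row), and a hint at what follows:

```python
N_GEN = 2 ** 28
row_bytes = 8 + 8               # one float64 + one int64 column
total = N_GEN * row_bytes       # raw column data in bytes
print(total / 2 ** 30)          # 4.0 GiB, matching df_arr1.info()
print(total > 2 ** 31 - 1)      # True: already larger than INT32_MAX
```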
We use the following code to insert the DataFrame into the DB:
```python
import time

start = time.time()
con.execute('drop table if exists t2;')
con.load_table('t2', df_arr1)
end = time.time()
print(end - start)
```
However, when we tried to load it into the DB, we got an error:
```
OverflowError: size out of range: exceeded INT32_MAX
```
The error report does not make sense to me: 2^28 is much less than INT32_MAX, right?
I wonder why this happened and how I can fix it.
Thank you so much!
Config

pymapd 0.17.0 py_0 conda-forge
omnisci-os-4.8.1-20190903-e9ac6920a3
The complete error call stack:

```
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-3-6b1dc12b0e01> in <module>
      3 start = time.time()
      4 con.execute('drop table if exists t2;')
----> 5 con.load_table('t2', df_arr1)
      6 end = time.time()
      7

~/miniconda3/envs/xgbnew/lib/python3.7/site-packages/pymapd/connection.py in load_table(self, table_name, data, method, preserve_index, create)
    542         if (isinstance(data, pd.DataFrame)
    543                 or isinstance(data, pa.Table) or isinstance(data, pa.RecordBatch)):  # noqa
--> 544             return self.load_table_arrow(table_name, data)
    545
    546         elif (isinstance(data, pd.DataFrame)):

~/miniconda3/envs/xgbnew/lib/python3.7/site-packages/pymapd/connection.py in load_table_arrow(self, table_name, data, preserve_index)
    690                                   preserve_index=preserve_index)
    691         self._client.load_table_binary_arrow(self._session, table_name,
--> 692                                              payload.to_pybytes())
    693
    694     def render_vega(self, vega, compression_level=1):

~/miniconda3/envs/xgbnew/lib/python3.7/site-packages/omnisci/mapd/MapD.py in load_table_binary_arrow(self, session, table_name, arrow_stream)
   2549          - arrow_stream
   2550         """
-> 2551         self.send_load_table_binary_arrow(session, table_name, arrow_stream)
   2552         self.recv_load_table_binary_arrow()
   2553

~/miniconda3/envs/xgbnew/lib/python3.7/site-packages/omnisci/mapd/MapD.py in send_load_table_binary_arrow(self, session, table_name, arrow_stream)
   2558         args.table_name = table_name
   2559         args.arrow_stream = arrow_stream
-> 2560         args.write(self._oprot)
   2561         self._oprot.writeMessageEnd()
   2562         self._oprot.trans.flush()

~/miniconda3/envs/xgbnew/lib/python3.7/site-packages/omnisci/mapd/MapD.py in write(self, oprot)
  13681     def write(self, oprot):
  13682         if oprot._fast_encode is not None and self.thrift_spec is not None:
> 13683             oprot.trans.write(oprot._fast_encode(self, [self.__class__, self.thrift_spec]))
  13684             return
  13685         oprot.writeStructBegin('load_table_binary_arrow_args')

OverflowError: size out of range: exceeded INT32_MAX
```
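The size check above hints at the cause: the row count is fine, but the traceback shows that load_table dispatches a DataFrame to load_table_arrow, which serializes the whole frame into a single Arrow payload and writes it as one Thrift binary field, and the Thrift encoder rejects anything longer than INT32_MAX bytes. A minimal sketch of that size check, roughly mirroring what load_table_arrow builds (the pyarrow calls here are an approximation, not pymapd's exact code):

```python
import pyarrow as pa

# Serialize the whole DataFrame into one in-memory Arrow IPC stream,
# roughly what pymapd's Arrow loader hands to Thrift as a single field.
table = pa.Table.from_pandas(df_arr1, preserve_index=False)
sink = pa.BufferOutputStream()
with pa.RecordBatchStreamWriter(sink, table.schema) as writer:
    writer.write_table(table)
payload = sink.getvalue()

print(payload.size)                # ~4.3e9 serialized bytes for 2**28 rows
print(payload.size > 2 ** 31 - 1)  # True -> "exceeded INT32_MAX"
```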
Unfortunately, this looks like an error in the Arrow method. If you try `load_table(..., method='columnar')`, does it work?
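For reference, the suggested call dropped into the timing snippet from above (a minimal sketch, reusing `con` and `df_arr1`):

```python
import time

start = time.time()
con.execute('drop table if exists t2;')
# method='columnar' uses pymapd's columnar loader instead of a single
# Arrow/Thrift payload, sidestepping the INT32_MAX field-size limit.
con.load_table('t2', df_arr1, method='columnar')
print(time.time() - start)
```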
@randyzwitch thanks, it works for me. I had the same issue.
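If the Arrow path is still preferred, loading in slices should also stay under the limit. This is an untested sketch; CHUNK_ROWS is an assumption to tune for your row width:

```python
# Untested sketch: load in chunks so each serialized Arrow payload stays
# well below INT32_MAX bytes. For 16-byte rows, 2**24 rows is roughly
# 256 MiB of raw data per chunk.
CHUNK_ROWS = 2 ** 24

con.execute('drop table if exists t2;')
for lo in range(0, len(df_arr1), CHUNK_ROWS):
    chunk = df_arr1.iloc[lo:lo + CHUNK_ROWS]
    con.load_table('t2', chunk)  # each chunk's payload now fits in INT32_MAX
```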