bigwater opened this issue 5 years ago
Problem

I was trying to load a table into OmniSci from a pandas DataFrame. I created a DataFrame with two columns and N_GEN rows:
```python
from pymapd import connect
import pandas as pd
import numpy as np

con = connect(user="admin", password="HyperInteractive", host="localhost",
              dbname="omnisci", port=6274)

N_GEN = 2 ** 28
arr1 = np.random.rand(N_GEN)
arr2 = np.random.randint(100, size=N_GEN)
df_arr1 = pd.DataFrame(zip(arr1, arr2), columns=['num', 'grp'])
print(df_arr1.info())
print(df_arr1.shape)
```
We use N_GEN = 2 ** 28. The DataFrame uses 4.0 GB of memory (as reported by pandas' info()).
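As a quick sanity check on that figure (assuming one float64 and one int64 column, i.e. 16 bytes per row), and a hint at what follows:

```python
N_GEN = 2 ** 28
row_bytes = 8 + 8               # one float64 + one int64 column
total = N_GEN * row_bytes       # raw column data in bytes
print(total / 2 ** 30)          # 4.0 GiB, matching df_arr1.info()
print(total > 2 ** 31 - 1)      # True: already larger than INT32_MAX
```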
We use the following code to insert the DataFrame into the DB:
```python
import time

start = time.time()
con.execute('drop table if exists t2;')
con.load_table('t2', df_arr1)
end = time.time()
print(end - start)
```
However, when we tried to load it into the DB, we got an error:
```
OverflowError: size out of range: exceeded INT32_MAX
```
The error report does not make sense to me: 2^28 is much less than INT32_MAX, right?
I wonder why this happened and how I can fix it.
Thank you so much!
Config

pymapd 0.17.0 py_0 conda-forge
omnisci-os-4.8.1-20190903-e9ac6920a3
The complete error call stack:

```
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-3-6b1dc12b0e01> in <module>
      3 start = time.time()
      4 con.execute('drop table if exists t2;')
----> 5 con.load_table('t2', df_arr1)
      6 end = time.time()
      7

~/miniconda3/envs/xgbnew/lib/python3.7/site-packages/pymapd/connection.py in load_table(self, table_name, data, method, preserve_index, create)
    542         if (isinstance(data, pd.DataFrame)
    543                 or isinstance(data, pa.Table) or isinstance(data, pa.RecordBatch)):  # noqa
--> 544             return self.load_table_arrow(table_name, data)
    545
    546         elif (isinstance(data, pd.DataFrame)):

~/miniconda3/envs/xgbnew/lib/python3.7/site-packages/pymapd/connection.py in load_table_arrow(self, table_name, data, preserve_index)
    690                                   preserve_index=preserve_index)
    691         self._client.load_table_binary_arrow(self._session, table_name,
--> 692                                              payload.to_pybytes())
    693
    694     def render_vega(self, vega, compression_level=1):

~/miniconda3/envs/xgbnew/lib/python3.7/site-packages/omnisci/mapd/MapD.py in load_table_binary_arrow(self, session, table_name, arrow_stream)
   2549          - arrow_stream
   2550         """
-> 2551         self.send_load_table_binary_arrow(session, table_name, arrow_stream)
   2552         self.recv_load_table_binary_arrow()
   2553

~/miniconda3/envs/xgbnew/lib/python3.7/site-packages/omnisci/mapd/MapD.py in send_load_table_binary_arrow(self, session, table_name, arrow_stream)
   2558         args.table_name = table_name
   2559         args.arrow_stream = arrow_stream
-> 2560         args.write(self._oprot)
   2561         self._oprot.writeMessageEnd()
   2562         self._oprot.trans.flush()

~/miniconda3/envs/xgbnew/lib/python3.7/site-packages/omnisci/mapd/MapD.py in write(self, oprot)
  13681     def write(self, oprot):
  13682         if oprot._fast_encode is not None and self.thrift_spec is not None:
> 13683             oprot.trans.write(oprot._fast_encode(self, [self.__class__, self.thrift_spec]))
  13684             return
  13685         oprot.writeStructBegin('load_table_binary_arrow_args')

OverflowError: size out of range: exceeded INT32_MAX
```
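The size check above hints at the cause: the row count is fine, but the traceback shows that load_table dispatches a DataFrame to load_table_arrow, which serializes the whole frame into a single Arrow payload and writes it as one Thrift binary field, and the Thrift encoder rejects anything longer than INT32_MAX bytes. A minimal sketch of that size check, roughly mirroring what load_table_arrow builds (the pyarrow calls here are an approximation, not pymapd's exact code):

```python
import pyarrow as pa

# Serialize the whole DataFrame into one in-memory Arrow IPC stream,
# roughly what pymapd's Arrow loader hands to Thrift as a single field.
table = pa.Table.from_pandas(df_arr1, preserve_index=False)
sink = pa.BufferOutputStream()
with pa.RecordBatchStreamWriter(sink, table.schema) as writer:
    writer.write_table(table)
payload = sink.getvalue()

print(payload.size)                # ~4.3e9 serialized bytes for 2**28 rows
print(payload.size > 2 ** 31 - 1)  # True -> "exceeded INT32_MAX"
```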
Unfortunately, this looks like an error in the Arrow method. If you try `load_table(..., method='columnar')`, does it work?
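For reference, the suggested call dropped into the timing snippet from above (a minimal sketch, reusing `con` and `df_arr1`):

```python
import time

start = time.time()
con.execute('drop table if exists t2;')
# method='columnar' uses pymapd's columnar loader instead of a single
# Arrow/Thrift payload, sidestepping the INT32_MAX field-size limit.
con.load_table('t2', df_arr1, method='columnar')
print(time.time() - start)
```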
@randyzwitch thanks, it works for me. I had the same issue.
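If the Arrow path is still preferred, loading in slices should also stay under the limit. This is an untested sketch; CHUNK_ROWS is an assumption to tune for your row width:

```python
# Untested sketch: load in chunks so each serialized Arrow payload stays
# well below INT32_MAX bytes. For 16-byte rows, 2**24 rows is roughly
# 256 MiB of raw data per chunk.
CHUNK_ROWS = 2 ** 24

con.execute('drop table if exists t2;')
for lo in range(0, len(df_arr1), CHUNK_ROWS):
    chunk = df_arr1.iloc[lo:lo + CHUNK_ROWS]
    con.load_table('t2', chunk)  # each chunk's payload now fits in INT32_MAX
```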