uchicago-cs / deepdish

Flexible HDF5 saving/loading and other data science tools from the University of Chicago
http://deepdish.io
BSD 3-Clause "New" or "Revised" License
270 stars · 59 forks

Crashes when dealing with large datasets #34

Open juliaroquette opened 5 years ago

juliaroquette commented 5 years ago

I am trying to use deepdish to save and load large datasets in the HDF5 format, but deepdish.io.save crashes whenever the dataset is larger than about 2 GB.

For example, with a very large array such as t = bytearray(8*1000*1000*400), calling dd.io.save('testeDeepdishLimit', t) raises the error:

---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-3-26ecd71b151a> in <module>()
----> 1 dd.io.save('testeDeepdishLimit',t)

~/anaconda3/lib/python3.6/site-packages/deepdish/io/hdf5io.py in save(path, data, compression)
    594         else:
    595             _save_level(h5file, group, data, name='data',
--> 596                         filters=filters, idtable=idtable)
    597             # Mark this to automatically unpack when loaded
    598             group._v_attrs[DEEPDISH_IO_UNPACK] = True

~/anaconda3/lib/python3.6/site-packages/deepdish/io/hdf5io.py in _save_level(handler, group, level, name, filters, idtable)
    302 
    303     else:
--> 304         _save_pickled(handler, group, level, name=name)
    305 
    306 

~/anaconda3/lib/python3.6/site-packages/deepdish/io/hdf5io.py in _save_pickled(handler, group, level, name)
    170                   DeprecationWarning)
    171     node = handler.create_vlarray(group, name, tables.ObjectAtom())
--> 172     node.append(level)
    173 
    174 

~/anaconda3/lib/python3.6/site-packages/tables/vlarray.py in append(self, sequence)
    535             nparr = None
    536 
--> 537         self._append(nparr, nobjects)
    538         self.nrows += 1
    539 

tables/hdf5extension.pyx in tables.hdf5extension.VLArray._append()

OverflowError: value too large to convert to int

Is there any workaround for this issue?

twmacro commented 5 years ago

I can confirm that I get the same error when I try your example (on a Linux machine and on a Windows machine). I think the error is within PyTables. See https://github.com/PyTables/PyTables/pull/550.
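A possible workaround, judging from the traceback: the crash happens in _save_pickled, i.e. deepdish falls back to pickling the bytearray into a single PyTables VLArray row, and that row exceeds what VLArray._append can address. Converting the buffer to a NumPy array first should let deepdish take its native array path instead of the pickled one. This is a sketch, not a confirmed fix; the sizes are shrunk for illustration, and whether dd.io.save of a plain ndarray avoids the limit on your data is an assumption worth testing.

```python
import numpy as np

# The original bytearray, scaled down here for illustration.
t = bytearray(8 * 1000 * 400)

# Zero-copy reinterpretation of the buffer as a uint8 NumPy array.
# deepdish stores ndarrays as regular (chunked) HDF5 arrays rather than
# pickling them into a single VLArray row, which is where the
# OverflowError occurs.
arr = np.frombuffer(t, dtype=np.uint8)

# Then save the array instead of the bytearray (assumed usage):
# dd.io.save('testeDeepdishLimit.h5', arr)
# and recover the bytes on load with arr.tobytes() if needed.
```

If the data genuinely has to go through the pickled path, splitting it into chunks under 2 GB each (e.g. a list of smaller arrays) may also sidestep the limit, since each chunk becomes its own row.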