lucianopaz / compress_pickle

Standard python pickle, thinly wrapped with standard compression libraries
MIT License

"object of type 'pickle.PickleBuffer' has no len()" error with large numpy arrays #23

Closed PaulFlanaganGenscape closed 3 years ago

PaulFlanaganGenscape commented 3 years ago

I get an "object of type 'pickle.PickleBuffer' has no len()" error with any compression other than gzip when the data contains a large numpy array.

It works for small numpy arrays.

I'm pretty sure it's the same issue as https://github.com/pandas-dev/pandas/pull/39376

ipython
Python 3.9.0 (default, Oct 13 2020, 14:30:47)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import compress_pickle
   ...: import numpy as np
   ...: import pickle
   ...:
   ...: dnp = {"np_array": np.zeros((100, 37000, 3))}
   ...:
   ...: pickled = pickle.dumps(dnp)
   ...:

In [2]: len(pickled)
Out[2]: 88800178

In [3]:
   ...: pickled = compress_pickle.dumps(dnp, compression='gzip')
   ...: len(pickled)
Out[3]: 86506

In [4]: pickled = compress_pickle.dumps(dnp, compression='zipfile')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-0201a3824617> in <module>
----> 1 pickled = compress_pickle.dumps(dnp, compression='zipfile')

~/git/Genscape/dispatch-modeler/.venv/lib/python3.9/site-packages/compress_pickle/compress_pickle.py in dumps(obj, compression, protocol, fix_imports, buffer_callback, optimize, **kwargs)
    206     validate_compression(compression, infer_is_valid=False)
    207     with io.BytesIO() as stream:
--> 208         dump(
    209             obj,
    210             path=stream,

~/git/Genscape/dispatch-modeler/.venv/lib/python3.9/site-packages/compress_pickle/compress_pickle.py in dump(obj, path, compression, mode, protocol, fix_imports, buffer_callback, unhandled_extensions, set_default_extension, optimize, **kwargs)
    125                 io_stream.write(buff)
    126             else:
--> 127                 pickle.dump(  # type: ignore
    128                     obj,
    129                     io_stream,

~/.pyenv/versions/3.9.0/lib/python3.9/zipfile.py in write(self, data)
   1121         if self.closed:
   1122             raise ValueError('I/O operation on closed file.')
-> 1123         nbytes = len(data)
   1124         self._file_size += nbytes
   1125         self._crc = crc32(data, self._crc)

TypeError: object of type 'pickle.PickleBuffer' has no len()

In [5]: pickled = compress_pickle.dumps(dnp, compression='lz4')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-90f985e33a25> in <module>
----> 1 pickled = compress_pickle.dumps(dnp, compression='lz4')

~/git/Genscape/dispatch-modeler/.venv/lib/python3.9/site-packages/compress_pickle/compress_pickle.py in dumps(obj, compression, protocol, fix_imports, buffer_callback, optimize, **kwargs)
    206     validate_compression(compression, infer_is_valid=False)
    207     with io.BytesIO() as stream:
--> 208         dump(
    209             obj,
    210             path=stream,

~/git/Genscape/dispatch-modeler/.venv/lib/python3.9/site-packages/compress_pickle/compress_pickle.py in dump(obj, path, compression, mode, protocol, fix_imports, buffer_callback, unhandled_extensions, set_default_extension, optimize, **kwargs)
    149                 io_stream.write(buff)
    150             else:
--> 151                 pickle.dump(obj, io_stream, protocol=protocol, fix_imports=fix_imports)
    152         finally:
    153             io_stream.flush()

~/git/Genscape/dispatch-modeler/.venv/lib/python3.9/site-packages/lz4/frame/__init__.py in write(self, data)
    694         compressed = self._compressor.compress(data)
    695         self._fp.write(compressed)
--> 696         self._pos += len(data)
    697         return len(data)
    698

TypeError: object of type 'pickle.PickleBuffer' has no len()
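Both tracebacks stop at a `len(data)` call inside the compressed stream's `write` method. The following is a minimal sketch of why that call fails, using only the standard library. The `payload` name is illustrative and a `bytearray` stands in for the numpy array. `pickle.PickleBuffer` supports the buffer protocol but does not define `__len__`, whereas the `memoryview` returned by its `raw()` method does.

```python
import pickle

# A PickleBuffer wrapping 1024 bytes; stands in for the array data from the report.
payload = pickle.PickleBuffer(bytearray(1024))

try:
    len(payload)
    len_error = None
except TypeError as exc:
    # Same failure the stream's write() method runs into.
    len_error = str(exc)

# The flat memoryview exposed by raw() does support len().
raw_length = len(payload.raw())

print(len_error)
print(raw_length)
```

Any stream whose `write` method calls `len()` on its argument will therefore fail when the pickler hands it a `PickleBuffer` directly.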

With a small numpy array it works:

In [1]: import compress_pickle
   ...: import numpy as np

In [2]: dnp = {"np_array": np.zeros((10, 37, 3))}

In [3]: compress_pickle.utils.get_known_compressions()
Out[3]: [None, 'pickle', 'gzip', 'bz2', 'lzma', 'zipfile', 'lz4']

In [4]: pickled = compress_pickle.dumps(dnp, compression='zipfile')

In [5]: len(pickled)
Out[5]: 9136

In [6]: unpickled = compress_pickle.loads(pickled, compression='zipfile')

In [8]: unpickled['np_array'].shape
Out[8]: (10, 37, 3)

lucianopaz commented 3 years ago

Thanks for reporting this, @PaulFlanaganGenscape! From the PR that you linked, it looks like pickle protocol 5 is what breaks. Could you check whether compress_pickle.dump(..., protocol=4) works?

When I find some time, I'll port the fix from the pandas PR over here.
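One possible workaround is to build the whole pickle in memory with `pickle.dumps` and write the resulting `bytes` to the compressed stream, so that `write` never receives a `PickleBuffer`. It is sketched here with the standard library's `bz2` module rather than compress_pickle's internals. The helper names `dump_via_bytes` and `load_compressed` are hypothetical, and this is not necessarily how the pandas PR or the eventual fix handles it.

```python
import bz2
import os
import pickle
import tempfile


def dump_via_bytes(obj, path, protocol=pickle.HIGHEST_PROTOCOL):
    """Hypothetical helper: pickle in memory, then write plain bytes."""
    # pickle.dumps always returns a bytes object, which supports len().
    data = pickle.dumps(obj, protocol=protocol)
    with bz2.open(path, "wb") as stream:
        stream.write(data)


def load_compressed(path):
    """Hypothetical helper: read the pickle back from the bz2 file."""
    with bz2.open(path, "rb") as stream:
        return pickle.load(stream)


# A PickleBuffer stands in for the large numpy array from the report.
obj = {"big": pickle.PickleBuffer(bytearray(200_000))}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "obj.pkl.bz2")
    dump_via_bytes(obj, path, protocol=5)
    restored = load_compressed(path)
```

The trade-off is an extra in-memory copy of the pickled data, which matters for arrays as large as the ones in this report.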

PaulFlanaganGenscape commented 3 years ago

Yes, you're right. It works with protocol=4:

In [61]: lb, ub = -1, 1
    ...: x = np.random.uniform(low=lb,high=ub,size=(1,100000000))

In [62]: humanize.naturalsize( x.nbytes )
Out[62]: '800.0 MB'

In [63]: dump(x, "x.pkl.bz", compression="bz2", protocol=4)

In [64]: dump(x, "x.pkl.bz", compression="bz2")

TypeError                                 Traceback (most recent call last)
<ipython-input-87-5d854cdb6283> in <module>
----> 1 dump(x, "x.pkl.bz", compression="bz2")

.venv/lib/python3.9/site-packages/compress_pickle/compress_pickle.py in dump(obj, path, compression, mode, protocol, fix_imports, buffer_callback, unhandled_extensions, set_default_extension, optimize, **kwargs)
    149                 io_stream.write(buff)
    150             else:
--> 151                 pickle.dump(obj, io_stream, protocol=protocol, fix_imports=fix_imports)
    152         finally:
    153             io_stream.flush()

~/.pyenv/versions/3.9.0/lib/python3.9/bz2.py in write(self, data)
    234             compressed = self._compressor.compress(data)
    235             self._fp.write(compressed)
--> 236             self._pos += len(data)
    237             return len(data)
    238

TypeError: object of type 'pickle.PickleBuffer' has no len()

dom-insytesys commented 3 years ago

I'm having the same problem pickling a Pandas DataFrame. Switching to protocol=4 makes it work.

lucianopaz commented 3 years ago

Closed by #26

ghost commented 3 years ago

This is similar to the bug in https://bugs.python.org/issue44439. I will create an issue in the Python issue tracker about this later.
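Where the stream class itself calls `len(data)`, another option is a thin adapter around the stream that converts `PickleBuffer` payloads to a `memoryview` before they reach `write`. This is only a sketch under stated assumptions: `RawBufferWriter` is a hypothetical name, it is not part of compress_pickle or the standard library, and it relies on `PickleBuffer.raw()`, which requires a contiguous buffer.

```python
import bz2
import io
import pickle


class RawBufferWriter:
    """Hypothetical adapter that passes PickleBuffer payloads on as memoryviews."""

    def __init__(self, stream):
        self._stream = stream

    def write(self, data):
        if isinstance(data, pickle.PickleBuffer):
            # raw() returns a flat, byte-oriented memoryview, which supports len().
            data = data.raw()
        return self._stream.write(data)


# A PickleBuffer stands in for the large numpy array from the report.
obj = {"big": pickle.PickleBuffer(bytearray(200_000))}

sink = io.BytesIO()
with bz2.BZ2File(sink, "wb") as stream:
    # pickle.dump only needs an object with a write() method.
    pickle.dump(obj, RawBufferWriter(stream), protocol=5)

restored = pickle.loads(bz2.decompress(sink.getvalue()))
```

Unlike serializing with `pickle.dumps` first, this keeps the streaming behaviour of `pickle.dump` and avoids holding a second full copy of the pickled data in memory.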