lebedov / msgpack-numpy

Serialize numpy arrays using msgpack

Cannot create an OBJECT array from memory buffer #46

Closed econtal closed 2 years ago

econtal commented 3 years ago

Since msgpack_numpy is using np.frombuffer, it does not support de-serializing arrays with object dtypes.

>>> import numpy
>>> import msgpack
>>> import msgpack_numpy
>>> msgpack_numpy.patch()
>>> array = numpy.array(['ab', 'cd', 'ef'], dtype='O')
>>> data = msgpack.dumps(array)
>>> msgpack.loads(data)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-e433212c2b17> in <module>
----> 1 msgpack.loads(data)

~/Library/Python/3.7/lib/python/site-packages/msgpack_numpy.py in unpackb(packed, **kwargs)
    271     object_hook = kwargs.get('object_hook')
    272     kwargs['object_hook'] = functools.partial(decode, chain=object_hook)
--> 273     return _unpackb(packed, **kwargs)
    274
    275 load = unpack

msgpack/_unpacker.pyx in msgpack._cmsgpack.unpackb()

~/Library/Python/3.7/lib/python/site-packages/msgpack_numpy.py in decode(obj, chain)
     89                     descr = obj[b'type']
     90                 return np.frombuffer(obj[b'data'],
---> 91                             dtype=_unpack_dtype(descr)).reshape(obj[b'shape'])
     92             else:
     93                 descr = obj[b'type']

ValueError: cannot create an OBJECT array from memory buffer

Versions used:

msgpack==1.0.0
msgpack-numpy==0.4.7.1
numpy==1.19.2

A workaround is to avoid using:

np.frombuffer(obj[b'data'], dtype=_unpack_dtype(descr)).reshape(obj[b'shape'])

and replace by:

np.ndarray(buffer=obj[b'data'], dtype=_unpack_dtype(descr), shape=obj[b'shape'])

gerner commented 3 years ago

I'm also running into this.

I'm not sophisticated enough with numpy to know if there are other consequences of the change from @econtal but FWIW, it works for me on at least a few trivial cases.

gerner commented 3 years ago

Perhaps I spoke too soon. #47 works in a trivial case, but if you write the data to disk and read it back in a different process, it segfaults. I think the buffer being written contains memory pointers or something like that, so the actual data isn't persisted.

econtal commented 3 years ago

Indeed, I was confused about how msgpack-numpy stores the data of numpy arrays. Unlike pickle or similar serializers, it uses obj.tobytes(), which for dtype='O' stores only the memory addresses of the Python objects, not the actual data.
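To make the point about memory addresses concrete, here is a minimal sketch (not taken from msgpack-numpy itself) of what tobytes() returns for an object array. The array contents are made up for illustration, and the comparison with id() relies on CPython, where id() happens to return an object's address.

```python
import pickle

import numpy as np

# Illustrative data: three strings of different lengths.
arr = np.array(['ab', 'cd', 'a much longer string'], dtype='O')

# For dtype='O', each element slot holds a pointer to a Python object,
# so tobytes() yields one pointer-sized chunk per element, regardless
# of how long the strings are.
raw = arr.tobytes()
print(arr.dtype.itemsize, len(raw))

# Reinterpret the bytes as pointer-sized unsigned integers. On CPython
# these values match id() of the stored objects (implementation detail).
addresses = np.frombuffer(raw, dtype=np.uintp)
print(addresses.tolist() == [id(item) for item in arr])

# pickle, by contrast, serializes the objects themselves, so the data
# survives a round trip.
restored = pickle.loads(pickle.dumps(arr))
print(restored.tolist())
```

Since the bytes are only addresses valid inside the current process, rebuilding an object array from them in another process would point at arbitrary memory, which would be consistent with the segfault reported above.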

So without changing how the data is serialized, there is no hope of fixing the deserialization.

I will have to look closely at how msgpack specifies that an array of objects (e.g. variable-length strings) must be stored, and fix the serialization first.

I will also fix the test so that it fails, to account for changing memory addresses.

econtal commented 3 years ago

Actually, msgpack-numpy doesn't follow the msgpack specification for arrays, which has no optimization for C-like typed arrays such as numpy's. So I guess there is more freedom to implement this feature 😅

But if we're not trying to follow the spec, this may mean losing compatibility with other languages. In which case, wouldn't using pickle to serialize the object make sense?

jonathansp commented 2 years ago

any updates on this?

lebedov commented 2 years ago

As using pickle for serialization negates part of the reason for using msgpack, my feeling is that use cases that require the latter should either not use msgpack or define some encoder/decoder to properly handle that use case (per comments made in https://github.com/lebedov/msgpack-numpy/pull/47). That said, enabling msgpack-numpy to work out of the box for arrays with dtype='O' seems like a reasonable fallback from a usability perspective. I added support for the latter based on the work by @airwoodix and a cautionary note in the README.
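For readers who would rather define their own encoder/decoder than rely on pickle, the following is a rough sketch of one possible approach. It is not the code that was added to msgpack-numpy; the function names and the "__object_ndarray__" marker key are hypothetical. It assumes every element of the array is something msgpack can already serialize (strings, numbers, lists, and so on).

```python
import numpy as np

# Hypothetical marker key used to recognize encoded object arrays.
MARKER = "__object_ndarray__"


def encode_object_array(obj):
    """Turn an object-dtype array into a plain dict of msgpack-friendly types."""
    if isinstance(obj, np.ndarray) and obj.dtype == object:
        return {
            MARKER: True,
            "shape": list(obj.shape),
            # Store the Python objects themselves, not their addresses.
            "items": obj.ravel().tolist(),
        }
    raise TypeError(f"cannot encode object of type {type(obj)!r}")


def decode_object_array(obj):
    """Rebuild the object array from the dict produced by the encoder."""
    if isinstance(obj, dict) and obj.get(MARKER):
        items = obj["items"]
        out = np.empty(len(items), dtype=object)
        # Assign element by element so nested sequences stay intact.
        for i, item in enumerate(items):
            out[i] = item
        return out.reshape(obj["shape"])
    return obj


arr = np.array(['ab', 'cd', 'ef'], dtype='O')
payload = encode_object_array(arr)
roundtrip = decode_object_array(payload)
print(roundtrip)
```

With plain msgpack (without msgpack_numpy.patch()), these could presumably be wired in as msgpack.packb(arr, default=encode_object_array) and msgpack.unpackb(data, object_hook=decode_object_array). The trade-off is that the payload stays readable from other languages, at the cost of supporting only elements that msgpack can represent.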

asgillmor commented 2 years ago

Any chance you can publish this 0.4.8 to the python registry?

I do not see it there: https://pypi.org/project/msgpack-numpy/#history

thanks!

lebedov commented 2 years ago

Done.

Closing this issue for now.