lebedov / msgpack-numpy

Serialize numpy arrays using msgpack
Other
194 stars 33 forks source link

replace `np.frombuffer` by `np.ndarray` to support object arrays #47

Closed econtal closed 2 years ago

econtal commented 3 years ago

Fixes #46

Avoid using:

np.frombuffer(obj[b'data'], dtype=_unpack_dtype(descr)).reshape(obj[b'shape'])

and replace by:

np.ndarray(buffer=obj[b'data'], dtype=_unpack_dtype(descr), shape=obj[b'shape'])
airwoodix commented 2 years ago

@econtal did you get a chance to work on this patch?

It seems like the issues with shallow references mentioned in #46 can be circumvented by pickling arrays of objects rather than calling .tobytes(). This is for example what's done in ipython/ipython#4937.

I put a minimal version together at airwoodix/msgpack-numpy@63c0121. If this also works for you and @gerner I'm happy to submit a more comprehensive PR.

gerner commented 2 years ago

Our use case is in serializing dataframes that could include columns of supported primitive types like ints and doubles, as well as strings. We ended up checking if the column type is 'object' and convert to a list and use the underlying msgpack serializer. When we deserialize, if it comes back as a list we just turn it into an numpy.array with dtype 'object'. This works well because strings are natively supported by msgpack. And if the object is not supported, msgpack will handle it as if I'd asked it to pack some other complicated type at the top level, e.g. fall back to the default hook.

Personally, I dislike the idea of using pickle to automatically support all types. In the case of strings, there's already natural support in msgpack and I think you should use that format, but for other types, it seems like you want some user support. If they want to use pickle, then that's great, they should have a hook to do that. This is what the msgpack API already supports via the default parameter.

As a thought exercise, suppose the the numpy array were an array of other numpy arrays. In this case the top level array will be packed by msgpack-numpy. And then each other array will be pickled? It seems like you should just be packing recursively, however the packer is configured to me.

I'm imagining that you take something like the approach I suggest above, convert to list, pack it. If the type isn't something natively supported (strings, any others?), you fall back to some other user-defined packing option.

Perhaps a compromise would be to have pickle as an option (maybe even default), but let the user turn that off or override it.

I also appreciate that I'm a fairly peripheral user of msgpack-numpy, so maybe more serious maintainers can weight in. But those are my 2 cents.

lebedov commented 2 years ago

Fix for #46 added to latest version.