lebedov / msgpack-numpy

Serialize numpy arrays using msgpack

Problem with multidimensional arrays #41

Closed matiasdahl closed 4 years ago

matiasdahl commented 4 years ago

Hi. There seems to be a problem with handling multidimensional numpy arrays, at least on Linux with Python 3.8.

To reproduce

docker run -it --rm -p 5678:5678 python:3.8.1-slim-buster /bin/bash

apt -y update 
apt -y upgrade

python --version   # Python 3.8.1

pip3 install jupyter==1.0.0 msgpack==0.6.2 msgpack-numpy==0.4.6.post0 numpy==1.18.2

# Start jupyter notebook in docker
jupyter notebook --port 5678 --ip 0.0.0.0 --allow-root

# Opening a notebook browser window gives the following environment
import sys
print(sys.platform) # linux
print(sys.version) # 3.8.1 (default, Feb  2 2020, 08:49:34) [GCC 8.3.0]
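
Since the behaviour depends on which library versions the notebook kernel actually imports, it can help to print them next to the platform information. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, not part of msgpack or msgpack-numpy.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in names:
        try:
            result[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            result[name] = None
    return result


# Distribution names as given to pip in the steps above.
print(installed_versions(["msgpack", "msgpack-numpy", "numpy"]))
```

Running this inside the notebook (rather than in a separate shell) checks the environment the kernel itself uses.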

The below example from the README works. We can serialize and deserialize 1d numpy arrays.

import msgpack
import msgpack_numpy as m
import numpy as np

x = np.random.rand(5)
x_enc = msgpack.packb(x, default=m.encode)
assert len(x_enc) == 76

x_rec = msgpack.unpackb(x_enc, object_hook=m.decode)
assert (x == x_rec).all()

However, the above is a 1D array. If we serialize a 2D (or higher-dimensional) array, the code below shows that the size of the serialized data grows as expected, but deserialization somehow yields an array of only 5 elements (?).

x = np.random.rand(5, 4000)
assert x.shape == (5, 4000)
x_enc = msgpack.packb(x, default=m.encode)
assert len(x_enc) == 160039

_ = msgpack.unpackb(x_enc, object_hook=m.decode) 

Here the last line fails with the below error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-013267d9f88a> in <module>
      4 assert len(x_enc) == 160039
      5 
----> 6 _ = msgpack.unpackb(x_enc, object_hook=m.decode)

/usr/local/lib/python3.8/site-packages/msgpack/fallback.py in unpackb(packed, **kwargs)
    134     unpacker.feed(packed)
    135     try:
--> 136         ret = unpacker._unpack()
    137     except OutOfData:
    138         raise ValueError("Unpack failed: incomplete input")

/usr/local/lib/python3.8/site-packages/msgpack/fallback.py in _unpack(self, execute)
    659                     ret[key] = self._unpack(EX_CONSTRUCT)
    660                 if self._object_hook is not None:
--> 661                     ret = self._object_hook(ret)
    662             return ret
    663         if execute == EX_SKIP:

/usr/local/lib/python3.8/site-packages/msgpack_numpy.py in decode(obj, chain)
     87                 else:
     88                     descr = obj[b'type']
---> 89                 return np.frombuffer(obj[b'data'],
     90                             dtype=np.dtype(descr)).reshape(obj[b'shape'])
     91             else:

ValueError: cannot reshape array of size 5 into shape (5,4000)

If I read the code for msgpack-numpy correctly, the below is essentially the serialize-deserialize logic. This does work, which suggests that the problem is on msgpack's side and in how it handles memoryview objects (?)

x = np.random.rand(5, 4000)
serialized_data = x.data 
assert type(serialized_data) == memoryview

x_rec = np.frombuffer(serialized_data, dtype=np.dtype(x.dtype.str)).reshape(x.shape)
assert (x == x_rec).all()
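
The traceback above passes through msgpack/fallback.py, and the "array of size 5" result is consistent with a byte count derived from the first axis only. The sketch below uses only the standard library to show how that could happen. It is an assumption about where the truncation might come from, not something verified against the msgpack source in this thread: the `len(view) * view.itemsize` computation is hypothetical, and the 2D memoryview stands in for `x.data`.

```python
# Stand-in for `x.data` of a C-contiguous float64 array with shape (5, 4000).
rows, cols = 5, 4000
raw = bytes(rows * cols * 8)  # 160000 zero bytes
view = memoryview(raw).cast("d", shape=[rows, cols])

# len() of a multidimensional memoryview is the length of the first axis only.
first_axis_len = len(view)

# Hypothetical size computation that is only valid for 1D views.
naive_nbytes = len(view) * view.itemsize

# The actual number of bytes exposed by the view.
true_nbytes = view.nbytes

# How many float64 values fit in the naive byte count.
naive_float_count = naive_nbytes // view.itemsize

print(first_axis_len, naive_nbytes, true_nbytes, naive_float_count)
```

With these numbers the naive count covers 40 bytes, i.e. 5 float64 values, while the view really exposes 160000 bytes.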

Possible workarounds

The below shows that serialization/deserialization works when the memory layout is not C-contiguous; in that case msgpack-numpy uses obj.tobytes() instead of obj.data (see the msgpack-numpy source):

x0 = np.random.rand(5, 4000)
assert x0.flags['C_CONTIGUOUS'] == True

# transpose does not rewrite the array in memory. Thus x0.T has C_CONTIGUOUS=False
# and this makes it possible to build an array which is identical to x0, but
# with C_CONTIGUOUS=False
x = np.ascontiguousarray(x0.T).T
assert x.flags['C_CONTIGUOUS'] == False
assert (x0 == x).all()

# Now x can be serialized and deserialized
x_enc = msgpack.packb(x, default=m.encode)
assert len(x_enc) == 160041

x_rec = msgpack.unpackb(x_enc, object_hook=m.decode)
assert (x == x_rec).all()

This suggests the following workaround:

m.ndarray_to_bytes = lambda obj: bytes(obj.data)
# or m.ndarray_to_bytes = lambda obj: obj.tobytes()

Then serialization for multidim arrays works with the same code as in the README

x = np.random.rand(1, 2, 3, 4, 5, 6)
assert x.shape == (1, 2, 3, 4, 5, 6)
x_enc = msgpack.packb(x, default=m.encode)
assert len(x_enc) == 5801

x_rec = msgpack.unpackb(x_enc, object_hook=m.decode)
assert (x == x_rec).all()

Unit tests

When running the unit tests (in the above docker environment) I get three failing tests and the first two seem to be related to the above issue.

...............EE........F
======================================================================
ERROR: test_numpy_array_float_2d (__main__.test_numpy_msgpack)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests.py", line 199, in test_numpy_array_float_2d
    x_rec = self.encode_decode(x)
  File "tests.py", line 31, in encode_decode
    return msgpack.unpackb(x_enc, use_list=use_list,
  File "/msgpack-numpy/msgpack_numpy.py", line 255, in unpackb
    return _unpackb(packed, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/msgpack/fallback.py", line 136, in unpackb
    ret = unpacker._unpack()
  File "/usr/local/lib/python3.8/site-packages/msgpack/fallback.py", line 661, in _unpack
    ret = self._object_hook(ret)
  File "/msgpack-numpy/msgpack_numpy.py", line 89, in decode
    return np.frombuffer(obj[b'data'],
ValueError: cannot reshape array of size 5 into shape (5,5)

======================================================================
ERROR: test_numpy_array_float_2d_macos (__main__.test_numpy_msgpack)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests.py", line 208, in test_numpy_array_float_2d_macos
    x_rec = self.encode_decode(x, use_list=False, max_bin_len=50000000)
  File "tests.py", line 31, in encode_decode
    return msgpack.unpackb(x_enc, use_list=use_list,
  File "/msgpack-numpy/msgpack_numpy.py", line 255, in unpackb
    return _unpackb(packed, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/msgpack/fallback.py", line 136, in unpackb
    ret = unpacker._unpack()
  File "/usr/local/lib/python3.8/site-packages/msgpack/fallback.py", line 661, in _unpack
    ret = self._object_hook(ret)
  File "/msgpack-numpy/msgpack_numpy.py", line 89, in decode
    return np.frombuffer(obj[b'data'],
ValueError: cannot reshape array of size 5 into shape (5,5)

======================================================================
FAIL: test_str (__main__.test_numpy_msgpack)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests.py", line 62, in test_str
    assert_equal(type(self.encode_decode(u'foo')), str)
  File "/usr/local/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 428, in assert_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
 ACTUAL: <class 'bytes'>
 DESIRED: <class 'str'>

Both test_numpy_array_float_2d and test_numpy_array_float_2d_macos pass if ndarray_to_bytes is replaced with either of the above alternatives.

This became a rather long issue. Please let me know if I can provide more details or if something is unclear.

matiasdahl commented 4 years ago

Please let me know if I can make a PR. However, I would need some guidance on what should be changed.

lebedov commented 4 years ago

Using a fresh conda environment on Ubuntu 20.04.1 containing python 3.8.1 (from conda-forge), I tried installing msgpack 0.6.2, numpy 1.18.2, and msgpack-numpy 0.4.6.post0 with pip. I was unable to replicate the problem using this setup; when run directly with python at the console (i.e., without Jupyter), it executed successfully without raising any exception:

import msgpack
import msgpack_numpy as m
import numpy as np

x = np.random.rand(5)
x_enc = msgpack.packb(x, default=m.encode)

x_rec = msgpack.unpackb(x_enc, object_hook=m.decode)
np.testing.assert_array_equal(x, x_rec)

x = np.random.rand(5, 4000)
x_enc = msgpack.packb(x, default=m.encode)
x_rec = msgpack.unpackb(x_enc, object_hook=m.decode)
np.testing.assert_array_equal(x, x_rec)

The python binary provided by conda-forge is built with a different version of gcc than that in the docker image you are using, but I'm not sure why that would make a difference.

Incidentally, msgpack-numpy deliberately uses memoryview when possible to avoid the slight slowdown imposed by invocation of tobytes() (which does make a difference in execution time when serialization/deserialization is performed repeatedly).
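
The difference between the two approaches can be illustrated with the standard library alone. In this sketch a `bytearray` stands in for an array's underlying buffer (an assumption made for illustration): the memoryview plays the role of `obj.data` and shares memory with the buffer, while `bytes(...)` plays the role of `obj.tobytes()` and makes an independent copy.

```python
buf = bytearray(8)  # stand-in for an array's underlying buffer, all zeros

view = memoryview(buf)  # zero-copy: shares memory with buf
snapshot = bytes(buf)   # copy: independent of buf from here on

buf[0] = 0xFF  # mutate the original buffer after both were created

# The view reflects the mutation because no data was copied.
view_sees_change = view[0] == 0xFF

# The snapshot still holds the old contents because it is a copy.
snapshot_sees_change = snapshot[0] == 0xFF

print(view_sees_change, snapshot_sees_change)
```

The copy is what costs time for large arrays; the view costs essentially nothing to create, which is the motivation given above.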

Out of curiosity, can you try using more recent versions of msgpack (1.0.0) and numpy (1.19.1) and see what happens?

matiasdahl commented 4 years ago

Hi. Perfect! Yes, upgrading to msgpack 1.0.0 and numpy 1.19.1 fixed the issue. It now works in docker, both in plain Python and in Jupyter. I also checked that all unit tests pass after upgrading.

I should have checked the library versions first. No idea why I was running old versions.

Thank you for your help, and thank you for this library :+1:

I am taking the liberty of closing this issue.

lebedov commented 4 years ago

Great. It is still puzzling that I couldn't reproduce the issue with the exact same package versions you indicated. There is support for pre-1.0.0 versions of msgpack in the code, but in light of your experience I think it's time to make 1.0.0 a hard requirement.
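
For illustration only, a minimum-version requirement could be checked along the following lines. This is a simplified sketch, not how msgpack-numpy enforces the requirement; `parse_release` and `meets_minimum` are hypothetical helpers, and suffixes such as `.post0` or `rc1` are simply ignored.

```python
def parse_release(version_string):
    """Return the leading numeric components, e.g. '0.6.2' -> (0, 6, 2)."""
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric component, e.g. 'post0'
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(version_string, minimum=(1, 0, 0)):
    """True if the numeric release is at least `minimum`."""
    release = parse_release(version_string)
    # Pad short releases with zeros so that '1.0' compares equal to (1, 0, 0).
    release = release + (0,) * (len(minimum) - len(release))
    return release >= minimum


print(meets_minimum("0.6.2"), meets_minimum("1.0.0"))
```

In a packaged library the same constraint would more commonly be declared as an install requirement rather than checked at runtime.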