jcrist / msgspec

A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
https://jcristharif.com/msgspec/
BSD 3-Clause "New" or "Revised" License

Is msgspec suitable for high-performance serialization of binary data? #174

Open · shenker opened this issue 2 years ago

shenker commented 2 years ago

I have a small number of large numpy arrays (1-100MB total) stored in a structured way, with some additional JSON metadata. Something like:

{
    "my_array": numpy.array([0, 1, 2, ...]),
    "list_of_array": [numpy.array([0, 0, 0]), numpy.array([1, 1, 1]), ...],
    "dict_of_arrays": {"a": numpy.array([0, 0, 0]), "b": numpy.array([1, 1, 1]), ...},
    "some_metadata": "blah",
    "some_other_metadata": {"a": 1, "b": 2}
}

Let's say I want to send data structures like that over a socket (e.g., zeromq). I could pickle, but would prefer a serialization format that's more version- and language-agnostic.

Is msgspec (in JSON or msgpack mode) suitable for this use case? Are there any other considerations I should keep in mind? (I know a couple years ago a lot of work went into pickle5/dask comms to make serialization zero-copy and fast for this kind of use case.)
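For context, the transport side is simple once everything is bytes. A minimal pyzmq sketch (the endpoint and socket type here are illustrative) of shipping just the metadata, which msgspec already handles natively:

import zmq
import msgspec.msgpack

ctx = zmq.Context()
sock = ctx.socket(zmq.PUSH)
sock.connect("tcp://localhost:5555")  # illustrative endpoint

# msgspec encodes the JSON-like metadata out of the box; the open
# question is how to handle the numpy arrays efficiently.
payload = {"some_metadata": "blah", "some_other_metadata": {"a": 1, "b": 2}}
sock.send(msgspec.msgpack.encode(payload))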

jack-mcivor commented 1 year ago

I also would like to serialize numpy arrays into msgpack. At the moment I'm using ndarray.tolist() in an encoding/decoding hook, but this is slow because the array is copied into Python builtins. (Originally suggested here.)
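A rough sketch of that approach, for reference (Encoder(enc_hook=...) and msgspec.msgpack.decode are real msgspec APIs; the payload is made up):

import msgspec.msgpack
import numpy as np

def enc_hook(obj):
    # Fall back to plain Python lists for arrays: simple, but slow,
    # since every element is copied into a Python object.
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise NotImplementedError(f"Objects of type {type(obj)} are not supported")

enc = msgspec.msgpack.Encoder(enc_hook=enc_hook)
buf = enc.encode({"my_array": np.arange(5)})
# Decoding gives back plain lists; the dtype and shape are lost.
print(msgspec.msgpack.decode(buf))  # {'my_array': [0, 1, 2, 3, 4]}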

Please let me know if anyone has any suggestions

saolof commented 1 year ago

Seconding this. Numpy arrays are my main reason for wanting to switch to a binary serialization format.

rns70 commented 1 year ago

Here's the fastest way I've found to serialize and deserialize NumPy arrays to msgpack using msgspec:

import msgspec.msgpack
import numpy as np
import timeit

# Compact container for an array's metadata plus its raw buffer.
class NumpySerializedRepresentation(msgspec.Struct, gc=False, array_like=True):
    dtype: str
    shape: tuple
    data: bytes

numpy_array_encoder = msgspec.msgpack.Encoder()
numpy_array_decoder = msgspec.msgpack.Decoder(type=NumpySerializedRepresentation)

def encode_hook(obj):
    if isinstance(obj, np.ndarray):
        # Wrap the array in a msgpack extension type (code 1). obj.data is a
        # memoryview, so the buffer is not copied on the way in.
        return msgspec.msgpack.Ext(
            1,
            numpy_array_encoder.encode(
                NumpySerializedRepresentation(
                    dtype=obj.dtype.str, shape=obj.shape, data=obj.data
                )
            ),
        )
    # Per the msgspec docs, enc_hook should raise for unsupported types.
    raise NotImplementedError(f"Objects of type {type(obj)} are not supported")

def ext_hook(code, data: memoryview):
    if code == 1:
        rep = numpy_array_decoder.decode(data)
        return np.frombuffer(rep.data, dtype=rep.dtype).reshape(rep.shape)
    return data

dec = msgspec.msgpack.Decoder(ext_hook=ext_hook)
enc = msgspec.msgpack.Encoder(enc_hook=encode_hook)

############
# BENCHMARK
import pandas as pd

def benchmark(func, *args, **kwargs):
    # Average of 50 single-shot runs.
    results = timeit.repeat(lambda: func(*args, **kwargs), number=1, repeat=50)
    return np.array(results).mean()

time_per_num_of_bytes_enc = {}
time_per_num_of_bytes_dec = {}
sizes = [2**i for i in range(1, 12)]
for size in sizes:
    arr = np.random.rand(size, size)
    time_enc = benchmark(enc.encode, arr)
    encoded = enc.encode(arr)
    time_dec = benchmark(dec.decode, encoded)
    time_per_num_of_bytes_enc[arr.nbytes] = time_enc
    time_per_num_of_bytes_dec[arr.nbytes] = time_dec

df = pd.DataFrame.from_dict(time_per_num_of_bytes_enc, orient="index", columns=["encode time (s)"])
df.index.name = "size (bytes)"
df["decode time (s)"] = time_per_num_of_bytes_dec.values()
print(df)

This gives the following results on my machine:

              encode time (s)  decode time (s)
size (bytes)                                  
32                   0.000002         0.000002
128                  0.000002         0.000002
512                  0.000002         0.000002
2048                 0.000002         0.000002
8192                 0.000003         0.000003
32768                0.000005         0.000004
131072               0.000012         0.000007
524288               0.000274         0.000018
2097152              0.001283         0.000059
8388608              0.002074         0.000218
33554432             0.009023         0.004744

I would love to know whether anyone has found a faster way!

EDIT: This only works for the most basic kind of NumPy array. Notably, it does not work when the array kind (arr.dtype.kind) is object or void, and it also fails when the C_CONTIGUOUS flag is not set. See the code linked in the comment below for most of the edge cases.
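For completeness, a quick round-trip with the enc/dec pair above on a payload shaped like the one in the original question (a sanity-check sketch, not part of the benchmark):

payload = {
    "my_array": np.arange(10),
    "list_of_arrays": [np.zeros(3), np.ones(3)],
    "some_metadata": "blah",
}
buf = enc.encode(payload)
out = dec.decode(buf)
# Note: decoded arrays are read-only, since they view the decoded bytes.
assert (out["my_array"] == payload["my_array"]).all()
assert out["some_metadata"] == "blah"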

jack-mcivor commented 1 year ago

@rensje nice! I think this is similar in spirit to msgpack_numpy - see https://github.com/lebedov/msgpack-numpy/blob/fd7032a3045f268f84d26baa7cc9fb7f3cfec99d/msgpack_numpy.py#L54

rns70 commented 1 year ago

@jack-mcivor Thanks! It's good to see all the edge cases in (de)serializing numpy arrays listed out like that. I'll modify my comment to make it clear that this only works for the most basic case.
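For anyone hitting the contiguity case, one simple (copying) workaround is to normalize the array inside the hook before encoding. A sketch, adapting the hook from above (np.ascontiguousarray is standard NumPy):

def encode_hook(obj):
    if isinstance(obj, np.ndarray):
        if not obj.flags.c_contiguous:
            # Copies, so it costs time and memory, but makes obj.data
            # a contiguous buffer that is safe to ship.
            obj = np.ascontiguousarray(obj)
        return msgspec.msgpack.Ext(
            1,
            numpy_array_encoder.encode(
                NumpySerializedRepresentation(
                    dtype=obj.dtype.str, shape=obj.shape, data=obj.data
                )
            ),
        )
    raise NotImplementedError(f"Objects of type {type(obj)} are not supported")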

kemingy commented 10 months ago

I have the same use case but didn't find this issue earlier. I have created a library, numbin, that is similar to @rensje's answer. It also uses msgpack when you need complex types (say, numpy arrays mixed with Python dicts or something else). The difference is that it reuses numpy.ndarray.tobytes() instead of the msgpack representation, which might be faster for pure numpy objects.
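A rough sketch of that tobytes() idea (just the general shape of it; this is not numbin's actual wire format):

import numpy as np

def pack(arr: np.ndarray) -> bytes:
    # Length-prefixed header carrying dtype and shape, then the raw buffer.
    header = f"{arr.dtype.str}|{','.join(map(str, arr.shape))}".encode()
    return len(header).to_bytes(4, "big") + header + arr.tobytes()

def unpack(buf: bytes) -> np.ndarray:
    n = int.from_bytes(buf[:4], "big")
    dtype, shape = buf[4:4 + n].decode().split("|")
    shape = tuple(int(s) for s in shape.split(",") if s)
    return np.frombuffer(buf[4 + n:], dtype=dtype).reshape(shape)

arr = np.random.rand(4, 4)
assert (unpack(pack(arr)) == arr).all()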