jcrist / msgspec

A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
https://jcristharif.com/msgspec/
BSD 3-Clause "New" or "Revised" License

Consideration for decode array to `tuple` by default option. #30

Closed goodboy closed 3 years ago

goodboy commented 3 years ago

msgpack-python's unpacker has a `use_list=False` option that makes it decode MessagePack arrays to `tuple` by default.

I noticed in the docs that tuples are only used for array types when they appear as hashable keys.

Is there a reason there isn't a way to either offer tuple-as-default via a manual flag, or just decode to `tuple` by default, considering tuples are ostensibly more performant in Python than lists?

jcrist commented 3 years ago

Apologies for the delay here. Tuples aren't more performant than lists to create or use. If you read through the answers in the link above, you'd see that only constant tuples (e.g. (1, 2, 3)) are "faster" since they're built only once by the compiler. Both lists and tuples have similar representations in cpython, and take equivalent time to construct dynamically. A quick benchmark using msgspec:

In [5]: data = list(range(1000))

In [6]: dec_list = msgspec.Decoder(list)

In [7]: dec_tuple = msgspec.Decoder(tuple)

In [8]: buf = msgspec.encode(data)

In [9]: %timeit dec_list.decode(buf)
12.9 µs ± 20.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [10]: %timeit dec_tuple.decode(buf)
12.9 µs ± 17.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
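The constant-folding point above can be seen directly with the stdlib `dis` module (an illustration added here, not from the original thread): a literal tuple of constants is precomputed by the compiler, while a list literal must be built on every call.

```python
import dis

def make_tuple():
    # Literal tuple of constants: folded into a single constant
    # by the compiler, so no construction happens at call time.
    return (1, 2, 3)

def make_list():
    # List literal: a fresh list object is built on every call.
    return [1, 2, 3]

tuple_ops = [i.opname for i in dis.get_instructions(make_tuple)]
list_ops = [i.opname for i in dis.get_instructions(make_list)]

# No BUILD_TUPLE opcode needed: the tuple is already a constant.
print("BUILD_TUPLE" in tuple_ops)  # False
# The list still requires a BUILD_LIST at runtime.
print("BUILD_LIST" in list_ops)    # True
```

Dynamically constructed tuples (e.g. `tuple(x)` or the decoder's output) get no such benefit, which matches the benchmark above.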

I'm not enthused about adding a `use_list`-like option. Lists are the natural default type for MessagePack's array type. If you want to use a different type for arrays then you likely have a schema you're following, and I'd direct you to use msgspec's support for typed serialization.

goodboy commented 3 years ago

@jcrist I learn something new every time I report something here 🏄🏼

You'd think I would have double-checked the tuple-creation speed claim 🙄 Is it possible the `.encode()` step here is faster though?

Honestly, keeping it as-is works for me, as simpler is always better imo. I can close this if no one else has qualms.

> and I'd direct you to use msgspec's support for typed serialization.

Yeah, I think focusing on a struct schema is really the right way to design things anyway 👍🏼

jcrist commented 3 years ago

No problem, happy to help.

> Is it possible the `.encode()` step here is faster though?

Both store their data as an array of PyObject*, so I wouldn't expect a difference. Easy enough to benchmark though:

In [1]: import msgspec

In [2]: enc = msgspec.Encoder()

In [3]: msg_tuple = tuple(range(1000))

In [4]: msg_list = list(range(1000))

In [5]: %timeit enc.encode(msg_tuple)
10.3 µs ± 26.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [6]: %timeit enc.encode(msg_list)
10.3 µs ± 20.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
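The "array of `PyObject*`" point can also be seen in memory terms with `sys.getsizeof` (a quick illustration added here, not from the original thread): both containers hold the same flat pointer array, differing only by a small fixed header.

```python
import sys

items = list(range(1000))
as_list = list(items)    # built from a known length, so no over-allocation
as_tuple = tuple(items)

# Both store 1000 PyObject* pointers; only the header sizes differ.
print(sys.getsizeof(as_tuple))
print(sys.getsizeof(as_list))
print(sys.getsizeof(as_list) - sys.getsizeof(as_tuple))  # a few bytes
```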

> I can close this if no one else has qualms.

Closing!