Closed KodrAus closed 2 years ago
This isn't necessarily limited to byte arrays, it could be useful for any type of [T; N]
so it might make more sense to add an option on SerializeSeq that hints that the sequence has a fixed size (similarly to https://docs.rs/serde/1.0.130/serde/trait.Serializer.html#method.is_human_readable) .
Formats which do not care would just ignore that option.
@Marwes You’d still need to hint to the deserializer that it’s expecting a fixed-size array, which in serde
is a tuple. So looking at sequences on the serializer side and tuples on the deserializer side seems a bit asymmetric.
So I thought it might be better not to try open that can of worms and just look at fixed-size byte arrays as a complement to byte slices. They can be treated as orthogonal to arrays vs tuples, because they’re just another kind of value like a str
or a u16
.
Updated to pass arrays by reference when serializing, and by value when deserializing. That seems like a more natural fit.
Also thinking defaulting to unsupported rather than using a fallback is probably a better approach. Basically the same as 128bit numbers, but with an expectation that any format should be able to support it. That way formats can choose an appropriate implementation in their own time.
I can see how it may be appropriate to uuid, but my feeling is that this case isn't going to be broadly applicable enough to justify dedicated Serializer and Deserializer and Visitor methods, and the amount of benefit when applicable is also quite small.
Thanks anyway for the suggestion and the writeup!
Here (1 ,2) is another use case where having specific support for fixed sized arrays would be great, especially fixed sized binary blobs.
The issue is not really that the serde data model doesn't have the ability to represent fixed sized array (Tuples do this nicely), but that there is no way to tell the serializer that all objects in the the Tuple/Sequence are of the same 'type' which prevents them from optimizing how they are encoded.
In this case, Bincode users will prefer serializing [u8;N] as a Tuple because Bincode doesn't have any type information, so Tuples end up as binary blobs ([0xa1, 0xb2,0xc3]
serialized as a tuple becomes 0x"a1b2c3"
in bincode) while serializing via a slice would have length information, which is less efficient. Even then, encoding as a tuple probably comes at a performance cost due to using serialize_element
repeatedly instead of just copying the bytes. Other formats like MessagePack or CBOR with a tag for each object will prefer serializing as a byte slice instead, because otherwise it is encoded less efficiently. [0xa1, 0xb2,0xc3]
serialized as a tuple becomes 0x"93 cca1ccb2ccb3
in MessagePack because of the u8
tag 0xcc
for each element (the 0x93
at the beginning is the tag for a 3 element array). Using a slice would make it use MessagePack's byte, getting rid of all the 0xcc
in the middle, with a result like 0x"c403 a1b2c3"
(0xc4
is the tag for binary and 0x03
its length).
This proposal would enable both Serializers to encode data optimally. A similar addition would likely be required for the Deserializing side.
I can see how it may be appropriate to uuid, but my feeling is that this case isn't going to be broadly applicable enough to justify dedicated Serializer and Deserializer and Visitor methods, and the amount of benefit when applicable is also quite small.
This use case also appears for cryptographic things. Signatures, hashes, etc. usually have a fixed size that is known up-front.
I can confirm this is very painful for cryptographic use cases.
We were previously using serialize_tuple
to serialize fixed-size byte arrays so as to avoid a length prefix for types whose length is known at compile time, however this encodes poorly with MessagePack.
We are going to move to serialize_bytes
specifically for this reason, however this introduces a breaking change to our serde serializers: a length prefix, which is redundant because the length is fixed.
This also comes with performance penalties. We saw a larger than 5x speedup by handrolling our serialization instead of using serde
with postcard
, just because the Rust compiler couldn't see through the loops of serialize_tuple
s. (On real-life tree data structures containing lots of hashes.)
Summary
Add new methods to
Serializer
,Deserializer
, andVisitor
to support cases where a binary format can take advantage of the fact that a byte slice has a fixed size.Background
883
Motivations
In
uuid
we've been looking at the trade-offs of various representations for a 128bit value for binary formats. The current options are:Serializer::serialize_bytes
using&[u8]
. This is a natural fit, but requires a redundant length field, even though the value is guaranteed to always be 16 bytes. That redundancy may result in anywhere from 1 to 8 additional bytes of overhead.Serializer::serialize_tuple
using[T; N]
. This can avoid the need for a redundant field, but the lazy serialization may impact performance. It can also introduce more overhead in other formats that don't encode tuples as sequences.Serializer::serialize_u128
usingu128
. This can avoid the drawbacks of the above approaches, but support is still spotty. Some formats that need to interoperate with other languages simply won't or can't support 128bit numbers.Each of these approaches comes with drawbacks that are problematic for different groups of end-users. They can be mitigated with the introduction of a new
serialize_byte_array
method:Proposal
Add the following method to
Serializer
:and the following methods to
Deserializer
andVisitor
:This is a complementary API to
Serializer::serialize_bytes
that allows datatypes to communicate to a format that the byte buffer has a fixed length. The format may choose to optimize that case by treating the byte buffer as a tuple instead of as a slice.serde
considers[T; N]
to be equivalent to(T, ..N)
. This proposal doesn't attempt to change that.A datatype that serializes using
serialize_byte_array
will need to deserialize usingdeserialize_byte_array
with its length supplied. The underlying format may treat that as a request for a byte buffer, a sequence, or a tuple, depending on its implementation.Formats like
bincode
will need to be updated to make use of this new method in coordination with aserde
release that enabled them, so they have a chance to decide what semantics they want before inheriting the default.Drawbacks
This is an arguably niche case that increases the burden on formats. It requires coordination and consideration to support.
It may also not be possible for
serde
's MSRV to parse theconst N: usize
syntax.Adding default methods to the
Deserializer
that forward to others might be an undesirable approach, because it's been a source of bugs in the past.Alternatives
Avoid const generics in favor of something like:
and
where the length is implicitly fixed by the length of the passed in slice. This might work better for the
Deserializer
than trying to propagate consts. The downside of that approach is that it doesn't clearly distinguish fromserialize_bytes
. Besides the name they look the same.Pass byte arrays by reference instead of by value.
Error in the default implementations instead of using a fallback. This would let formats opt-in to support in their own time, but as a widely used type,
Uuid
would create churn while they do.