serde-rs / serde

Serialization framework for Rust
https://serde.rs/
Apache License 2.0
9.14k stars 772 forks source link

Consider a `Serializer::serialize_byte_array` method #2120

Closed KodrAus closed 2 years ago

KodrAus commented 3 years ago

Summary

Add new methods to Serializer, Deserializer, and Visitor to support cases where a binary format can take advantage of the fact that a byte slice has a fixed size.

Background

Motivations

In uuid we've been looking at the trade-offs of various representations for a 128bit value for binary formats. The current options are:

  1. Serializer::serialize_bytes using &[u8]. This is a natural fit, but requires a redundant length field, even though the value is guaranteed to always be 16 bytes. That redundancy may result in anywhere from 1 to 8 additional bytes of overhead.
  2. Serializer::serialize_tuple using [T; N]. This can avoid the need for a redundant field, but the lazy serialization may impact performance. It can also introduce more overhead in other formats that don't encode tuples as sequences.
  3. Serializer::serialize_u128 using u128. This can avoid the drawbacks of the above approaches, but support is still spotty. Some formats that need to interoperate with other languages simply won't or can't support 128bit numbers.

Each of these approaches comes with drawbacks that are problematic for different groups of end-users. They can be mitigated with the introduction of a new serialize_byte_array method:

  1. Formats can optimize away the redundant length, since the datatype has explicitly declared the byte array as having a fixed size.
  2. Formats can serialize the array as a tuple in-place without needing to run through the lazy machinery.
  3. Support for byte arrays is universal.

Proposal

Add the following method to Serializer:

trait Serializer {
    fn serialize_byte_array<const N: usize>(self, bytes: &[u8; N]) -> Result<Self::Ok, Self::Error> {
        self.serialize_bytes(bytes)
    }
}

and the following methods to Deserializer and Visitor:

trait Deserializer<'de> {
    fn deserialize_byte_array<const N: usize, V>(self, visitor: V) -> Result<V::Value, Self::Error>
    where
        V: Visitor<'de>,
    {
        self.deserialize_bytes(visitor)
    }
}

trait Visitor<'de> {
    fn visit_byte_array<const N: usize, E>(self, v: [u8; N]) -> Result<Self::Value, E>
    where
        E: Error, 
    {
        self.visit_bytes(&v)
    }
}

This is a complementary API to Serializer::serialize_bytes that allows datatypes to communicate to a format that the byte buffer has a fixed length. The format may choose to optimize that case by treating the byte buffer as a tuple instead of as a slice. serde considers [T; N] to be equivalent to (T, ..N). This proposal doesn't attempt to change that.

A datatype that serializes using serialize_byte_array will need to deserialize using deserialize_byte_array with its length supplied. The underlying format may treat that as a request for a byte buffer, a sequence, or a tuple, depending on its implementation.

Formats like bincode will need to be updated to make use of this new method in coordination with a serde release that enabled them, so they have a chance to decide what semantics they want before inheriting the default.

Drawbacks

This is an arguably niche case that increases the burden on formats. It requires coordination and consideration to support.

It may also not be possible for serde's MSRV to parse the const N: usize syntax.

Adding default methods to the Deserializer that forward to others might be an undesirable approach, because it's been a source of bugs in the past.

Alternatives

Avoid const generics in favor of something like:

fn serialize_byte_array(self, bytes: &[u8]) -> Result<Self::Ok, Self::Error> { .. }

and

fn deserialize_byte_array<V>(self, len: usize, visitor: V) -> Result<V::Value, Self::Error>
where
    V: Visitor<'de>,
{ .. }

where the length is implicitly fixed by the length of the passed in slice. This might work better for the Deserializer than trying to propagate consts. The downside of that approach is that it doesn't clearly distinguish from serialize_bytes. Besides the name they look the same.

Pass byte arrays by reference instead of by value.

Error in the default implementations instead of using a fallback. This would let formats opt-in to support in their own time, but as a widely used type, Uuid would create churn while they do.

Marwes commented 3 years ago

This isn't necessarily limited to byte arrays, it could be useful for any type of [T; N] so it might make more sense to add an option on SerializeSeq that hints that the sequence has a fixed size (similarly to https://docs.rs/serde/1.0.130/serde/trait.Serializer.html#method.is_human_readable) .

Formats which do not care would just ignore that option.

KodrAus commented 3 years ago

@Marwes You’d still need to hint to the deserializer that it’s expecting a fixed-size array, which in serde is a tuple. So looking at sequences on the serializer side and tuples on the deserializer side seems a bit asymmetric.

So I thought it might be better not to try open that can of worms and just look at fixed-size byte arrays as a complement to byte slices. They can be treated as orthogonal to arrays vs tuples, because they’re just another kind of value like a str or a u16.

KodrAus commented 3 years ago

Updated to pass arrays by reference when serializing, and by value when deserializing. That seems like a more natural fit.

KodrAus commented 2 years ago

Also thinking defaulting to unsupported rather than using a fallback is probably a better approach. Basically the same as 128bit numbers, but with an expectation that any format should be able to support it. That way formats can choose an appropriate implementation in their own time.

dtolnay commented 2 years ago

I can see how it may be appropriate to uuid, but my feeling is that this case isn't going to be broadly applicable enough to justify dedicated Serializer and Deserializer and Visitor methods, and the amount of benefit when applicable is also quite small.

Thanks anyway for the suggestion and the writeup!

sgued commented 2 years ago

Here (1 ,2) is another use case where having specific support for fixed sized arrays would be great, especially fixed sized binary blobs.

The issue is not really that the serde data model doesn't have the ability to represent fixed sized array (Tuples do this nicely), but that there is no way to tell the serializer that all objects in the the Tuple/Sequence are of the same 'type' which prevents them from optimizing how they are encoded.

In this case, Bincode users will prefer serializing [u8;N] as a Tuple because Bincode doesn't have any type information, so Tuples end up as binary blobs ([0xa1, 0xb2,0xc3] serialized as a tuple becomes 0x"a1b2c3" in bincode) while serializing via a slice would have length information, which is less efficient. Even then, encoding as a tuple probably comes at a performance cost due to using serialize_element repeatedly instead of just copying the bytes. Other formats like MessagePack or CBOR with a tag for each object will prefer serializing as a byte slice instead, because otherwise it is encoded less efficiently. [0xa1, 0xb2,0xc3] serialized as a tuple becomes 0x"93 cca1ccb2ccb3 in MessagePack because of the u8 tag 0xcc for each element (the 0x93 at the beginning is the tag for a 3 element array). Using a slice would make it use MessagePack's byte, getting rid of all the 0xcc in the middle, with a result like 0x"c403 a1b2c3" (0xc4 is the tag for binary and 0x03 its length).

This proposal would enable both Serializers to encode data optimally. A similar addition would likely be required for the Deserializing side.

hrxi commented 1 year ago

I can see how it may be appropriate to uuid, but my feeling is that this case isn't going to be broadly applicable enough to justify dedicated Serializer and Deserializer and Visitor methods, and the amount of benefit when applicable is also quite small.

This use case also appears for cryptographic things. Signatures, hashes, etc. usually have a fixed size that is known up-front.

tarcieri commented 12 months ago

I can confirm this is very painful for cryptographic use cases.

We were previously using serialize_tuple to serialize fixed-size byte arrays so as to avoid a length prefix for types whose length is known at compile time, however this encodes poorly with MessagePack.

We are going to move to serialize_bytes specifically for this reason, however this introduces a breaking change to our serde serializers: a length prefix, which is redundant because the length is fixed.

See also: https://github.com/rozbb/rust-hpke/pull/53

hrxi commented 12 months ago

This also comes with performance penalties. We saw a larger than 5x speedup by handrolling our serialization instead of using serde with postcard, just because the Rust compiler couldn't see through the loops of serialize_tuples. (On real-life tree data structures containing lots of hashes.)

https://github.com/est31/serde-big-array/issues/19