near / borsh-rs

Rust implementation of Binary Object Representation Serializer for Hashing
https://borsh.io/
Apache License 2.0

BorshSchema vs custom serialisation #211

Closed mina86 closed 11 months ago

mina86 commented 1 year ago

Say I’d like to use varint in borsh. Or have a custom SmallVec type which is encoded with 8-bit length rather than 32-bit length.

This is easy enough to do by implementing custom BorshSerialize and BorshDeserialize. However, BorshSchema becomes an issue. Varint could be modelled as a nested enum with 256 variants. Similarly SmallVec could be modeled as an enum with 256 variants each being an array. That’s hardly a clean solution though.
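As a rough illustration of the two encodings mentioned above, here is a std-only sketch. It assumes the varint is unsigned LEB128 (the thread does not say which varint flavour is meant), and the function names are hypothetical. In a real crate, bodies like these would sit inside `BorshSerialize` and `BorshDeserialize` implementations, which are built around `std::io::Write` and `std::io::Read`.

```rust
use std::io::{self, Read, Write};

/// Writes `value` as an unsigned LEB128 varint (assumed flavour): 7 payload
/// bits per byte, high bit set on every byte except the last.
pub fn write_varint_u32<W: Write>(mut value: u32, writer: &mut W) -> io::Result<()> {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            return writer.write_all(&[byte]);
        }
        writer.write_all(&[byte | 0x80])?;
    }
}

/// Reads a varint written by `write_varint_u32`, rejecting encodings that
/// are longer than five bytes or that overflow 32 bits.
pub fn read_varint_u32<R: Read>(reader: &mut R) -> io::Result<u32> {
    let mut result: u32 = 0;
    for shift in (0..35).step_by(7) {
        let mut buf = [0u8; 1];
        reader.read_exact(&mut buf)?;
        let payload = (buf[0] & 0x7f) as u32;
        // The fifth byte may only carry the top 4 bits of a u32.
        if shift == 28 && payload > 0x0f {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "varint overflows u32"));
        }
        result |= payload << shift;
        if buf[0] & 0x80 == 0 {
            return Ok(result);
        }
    }
    Err(io::Error::new(io::ErrorKind::InvalidData, "varint longer than 5 bytes"))
}

/// Writes `items` with a one-byte length prefix instead of a four-byte one,
/// which is the "SmallVec" idea from the comment above.
pub fn write_small_bytes<W: Write>(items: &[u8], writer: &mut W) -> io::Result<()> {
    if items.len() > u8::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "more than 255 items"));
    }
    writer.write_all(&[items.len() as u8])?;
    writer.write_all(items)
}
```

Neither byte layout can be named by the existing schema definitions without resorting to the 256-variant enum trick, which is the problem the issue raises.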

Do you guys have any thoughts on that?

frol commented 1 year ago

I would avoid expanding the scope of borsh spec with varint/smallvec specializations. I would treat these types as application-specific ones and leave app developers to optimize their custom types on their end.

mina86 commented 1 year ago

So my question is: how do I implement BorshSchema for such a type? There’s no Definition for an application-specific encoding. The options seem to be:

Perhaps it would make sense to have Definition::AppSpecific with at least a rudimentary description of the format (e.g. min and max encoded length). For varint, for example, this would mean the declaration "VarInt<u32>" mapping to the definition Definition::AppSpecific(1..5).
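To make the proposal concrete, here is a minimal sketch of what such a variant might look like. The enum below is a local stand-in, not the real `borsh::schema::Definition`, and the `AppSpecific` variant, its `encoded_len` field and the helper names are all hypothetical. It uses an inclusive range `1..=5` on the assumption that the bounds are meant as "between 1 and 5 bytes", which is what a LEB128-style `u32` varint would need.

```rust
use std::ops::RangeInclusive;

/// Local stand-in for a schema definition; NOT the real borsh type.
#[derive(Debug, Clone, PartialEq)]
pub enum Definition {
    /// Hypothetical variant: an opaque, application-defined encoding for
    /// which only the minimum and maximum encoded length (in bytes) is known.
    AppSpecific { encoded_len: RangeInclusive<usize> },
}

/// Declaration name and definition a `VarInt<u32>` type might register.
pub fn varint_u32_definition() -> (String, Definition) {
    (
        "VarInt<u32>".to_string(),
        Definition::AppSpecific { encoded_len: 1..=5 },
    )
}

/// Checks that an encoded value respects the declared length bounds.
/// This is all a schema consumer could verify for an opaque encoding.
pub fn fits(def: &Definition, encoded: &[u8]) -> bool {
    match def {
        Definition::AppSpecific { encoded_len } => encoded_len.contains(&encoded.len()),
    }
}
```

Even this rudimentary description would let a schema consumer bound the size of a serialised value, although it could not decode the value itself.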

I think this may also relate to https://github.com/near/borsh-rs/issues/181. Perhaps it would make sense to extend Sequence and Enum by adding length_size and tag_size fields respectively? So currently we’d have Sequence { length_size: 4, elements: ... } and Enum { tag_size: 1, variants: ... }. This would allow expressing smallvec and enums with a different tag representation.
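The extension suggested here can be sketched as follows. `ProposedDefinition` is a local model of the idea, not the real `borsh::schema::Definition`, and `write_sequence_header` is a hypothetical helper. It assumes the length prefix stays little-endian, as borsh's existing 4-byte prefix is, and only varies in width.

```rust
use std::io::{self, Write};

/// Local model of the proposal; NOT the real borsh schema type.
#[derive(Debug, Clone, PartialEq)]
pub enum ProposedDefinition {
    /// `length_size` is the width of the length prefix in bytes.
    Sequence { length_size: u8, elements: String },
    /// `tag_size` is the width of the variant tag in bytes.
    Enum { tag_size: u8, variants: Vec<(String, String)> },
}

/// Writes the length prefix of a sequence as a little-endian integer that is
/// exactly `length_size` bytes wide, failing if `len` does not fit.
pub fn write_sequence_header<W: Write>(
    def: &ProposedDefinition,
    len: usize,
    writer: &mut W,
) -> io::Result<()> {
    let size = match def {
        ProposedDefinition::Sequence { length_size, .. } => *length_size as usize,
        ProposedDefinition::Enum { .. } => {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "not a sequence"));
        }
    };
    if size == 0 || size > 8 {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "unsupported length_size"));
    }
    let bytes = (len as u64).to_le_bytes();
    // Every byte beyond the prefix width must be zero, otherwise `len` overflows.
    if bytes[size..].iter().any(|&b| b != 0) {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "length does not fit prefix"));
    }
    writer.write_all(&bytes[..size])
}
```

With `length_size: 4` this reproduces the current `Vec` header, and with `length_size: 1` it gives the SmallVec layout, so both could share one schema variant.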

dj8yfo commented 1 year ago

A vector of varints, Vec<VarInt>, can first be encoded into a Vec<u8> and then presented to borsh as such, if the compression that varint provides is needed. The info about the total number of VarInt-s will be lost; the info about the total number of bytes will not. So, with respect to the schema, it will look like a Sequence { elements: "u8".to_string() }.

It's about the same with Rust's String at the moment. A String is essentially a Vec<VarInt>. It's serialized as Vec<u8> with info about total characters lost in serialized form, and it has a "string" Declaration for itself and an empty Definition (second option in comment).

Similarly to String, one can define a type VarintsVec(Vec<VarInt>), serialize and deserialize its contents as Vec<u8>, with error checking during deserialization (on the lengths of the encountered varints), and define its BorshSchema as a special "varint_vector" Declaration with an empty Definition.
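A std-only sketch of that approach follows. It assumes unsigned LEB128 varints over `u32`, and the method names are hypothetical. The byte layout mimics how borsh encodes a `Vec<u8>` (a little-endian `u32` byte count followed by the payload), and decoding rejects truncated or oversized varints.

```rust
/// Wrapper whose wire form is a plain byte vector holding packed varints.
#[derive(Debug, PartialEq)]
pub struct VarintsVec(pub Vec<u32>);

impl VarintsVec {
    /// Encodes as: 4-byte little-endian payload length, then the varints.
    /// Payloads over `u32::MAX` bytes are not handled in this sketch.
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut payload = Vec::new();
        for &value in &self.0 {
            let mut v = value;
            loop {
                let byte = (v & 0x7f) as u8;
                v >>= 7;
                if v == 0 {
                    payload.push(byte);
                    break;
                }
                payload.push(byte | 0x80);
            }
        }
        let mut out = (payload.len() as u32).to_le_bytes().to_vec();
        out.extend_from_slice(&payload);
        out
    }

    /// Decodes the layout above, checking the length prefix and every varint.
    pub fn from_bytes(input: &[u8]) -> Result<Self, String> {
        if input.len() < 4 {
            return Err("missing length prefix".into());
        }
        let (head, rest) = input.split_at(4);
        let len = u32::from_le_bytes([head[0], head[1], head[2], head[3]]) as usize;
        if rest.len() != len {
            return Err("length prefix does not match payload".into());
        }
        let mut values = Vec::new();
        let (mut current, mut shift) = (0u32, 0u32);
        for &byte in rest {
            let payload = (byte & 0x7f) as u32;
            if shift > 28 || (shift == 28 && payload > 0x0f) {
                return Err("varint overflows u32".into());
            }
            current |= payload << shift;
            if byte & 0x80 == 0 {
                values.push(current);
                current = 0;
                shift = 0;
            } else {
                shift += 7;
            }
        }
        if shift != 0 {
            return Err("truncated varint".into());
        }
        Ok(VarintsVec(values))
    }
}
```

Note that the length prefix counts bytes, not elements, which matches the point that the number of varints is not recoverable from the header alone.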

A SmallVec type will on average be 127 bytes long (with the minimal nonzero length of a type defined as 1 byte according to #209), and defining a header_size field in Definition::Sequence to save 3 bytes of header on an average payload of ~120 bytes doesn't appear to be a big gain compared to just using Vec.

mina86 commented 1 year ago

It's about the same with rust's String at the moment. A String is essentially a Vec<VarInt>. It's serialized as Vec<u8> with info about total characters lost in serialized form, and having a string Declaration for itself and empty Definition.

That’s not quite the same though. In the String case, I can deserialise Vec<u8> and then convert it to String with no additional allocations. With Vec<VarInt> I’d have to first deserialise Vec<u8> and then allocate a new (say) Vec<VarInt<u32>>.
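The difference can be seen in a small std-only sketch. `String::from_utf8` takes ownership of the byte vector and reuses its buffer, whereas decoding varints has to build a second vector because the element size changes. The function names are hypothetical and the varint flavour is assumed to be unsigned LEB128.

```rust
/// Converts deserialised bytes into a `String`. `String::from_utf8` validates
/// the bytes and then reuses the same heap buffer, so nothing is copied.
pub fn bytes_into_string(bytes: Vec<u8>) -> Result<String, std::string::FromUtf8Error> {
    String::from_utf8(bytes)
}

/// Decodes packed varints into a freshly allocated `Vec<u32>`. The input
/// buffer cannot be reused because each element becomes 4 bytes wide.
pub fn bytes_into_varints(bytes: &[u8]) -> Option<Vec<u32>> {
    let mut out = Vec::new();
    let (mut current, mut shift) = (0u32, 0u32);
    for &byte in bytes {
        let payload = (byte & 0x7f) as u32;
        // Reject varints that would not fit in 32 bits.
        if shift > 28 || (shift == 28 && payload > 0x0f) {
            return None;
        }
        current |= payload << shift;
        if byte & 0x80 == 0 {
            out.push(current);
            current = 0;
            shift = 0;
        } else {
            shift += 7;
        }
    }
    // A trailing continuation byte means the last varint was cut short.
    if shift != 0 {
        return None;
    }
    Some(out)
}
```

Comparing the buffer pointer before and after `bytes_into_string` shows that it is unchanged, while `bytes_into_varints` necessarily returns a vector with its own allocation.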

However, this is a bit beside the point. Of course, I can always write serialisation which can be described by BorshSchema. The question is what to do when the serialisation I’m using cannot be described by BorshSchema.