Add Definition::Enum::tag_width field

mina86 commented 12 months ago

The field allows specifying how many bytes the discriminant byte takes. This in turn allows for better support of custom encoding formats with more than 256 variants and untagged unions.

    Enum {
        /// Width in bytes of the discriminant tag.
        ///
        /// Zero indicates this is an untagged union.  In standard borsh
        /// encoding this is one however custom encoding formats may use larger
        /// width if they need to encode more than 256 variants.
        ///
        /// Note: This definition must not be used if the tag is not encoded
        /// using little-endian format.
        tag_width: u8,

        /// Possible variants of the enumeration.
        variants: Vec<(VariantName, Declaration)>,
    },

relates to: https://github.com/near/borsh-rs/issues/181

mina86 commented 12 months ago

Another idea is to introduce a Width type:

pub enum Width {
    NoWidth,
    LE8 = 1,
    LE16 = 2,
    LE24 = 3,
    LE32 = 4,
    LE40 = 5,
    LE48 = 6,
    LE56 = 7,
    LE64 = 8,
}

which would specify how the tag is encoded. That would leave possibility for extensions in the future.

dj8yfo commented 12 months ago

Offtopic:

    /// A fixed-size tuple with the length known at the compile time and the
    /// elements of different types.
    Tuple {
        /// Types of elements of the tuple.
        elements: Vec<Declaration>,
    },
    /// A structure, structurally similar to a tuple.
    Struct {
        /// Named fields of the structure with their types.
        ///
        /// Tuple-like structures are modelled as a `Tuple`.
        fields: Vec<(FieldName, Declaration)>,
    },

will be better than

     Tuple {
        /// Types of elements of the tuple.
        elements: Vec<(Option<FieldName>, Declaration)>,
    },

as the latter allows a mix of named and unnamed fields.

dj8yfo commented 12 months ago

Please update branch with new changes in trunk as well as (periodically) other prs. Occasionally compile or other errors occur despite a pr being able to be merged cleanly.

dj8yfo commented 12 months ago

@mina86 the change to Definition, proposed in comment is a little bit inconsistent with respect to mixing information about serialized form and the rust syntax forms that serialized representation originated from.

Let's take Enum first as an example. Current version in master contains this:

    Enum {
        variants: Vec<(VariantName, Declaration)>,
    },

This pr adds some info about serialized form:

    Enum {
        tag_width: u8,
        variants: Vec<(VariantName, Declaration)>,
    },

But, some info was skipped. With the addition of enum discriminants the enum tag in serialized form for VariantName cannot be computed just by its index in containing Vec<...>, so the full definition, containing both full info about serialized form and VariantName-s, which are relevant only for hinting, what the original enum in rust looked like, would look like:

    Enum {
        tag_width: u8,
        variants: Vec<(u64, VariantName, Declaration)>,
    },

If it were supposed to only convey information about serialized form, than it would look like:

    Enum {
        tag_width: u8,
        variants: BTreeMap<u64, Declaration>,    // the order of variants, as they were declared in rust enum,  
                                                                          // is irrelevant to serialization
    },

Similarly, this change for Structs, proposed in the same comment

    /// A fixed-size tuple with the length known at the compile time and the
    /// elements of different types.
    Tuple {
        /// Types of elements of the tuple.
        elements: Vec<Declaration>,
    },
    /// A structure, structurally similar to a tuple.
    Struct {
        /// Named fields of the structure with their types.
        ///
        /// Tuple-like structures are modelled as a `Tuple`.
        fields: Vec<(FieldName, Declaration)>,
    },

contains some loss of information about originating forms in rust, as some Tuple-s originated from tuples in rust, and some - from tuple structs, that info is lost, while structs with named fields don't lose that kind of information.

If it were supposed to only convey information about serialized form, than it would look like this both for tuples and structs on rust side:

    Tuple { elements: Vec<Declaration> },

as the names of the fields are irrelevant.

A variant with full info (both serialization & origin) is what it looks like now:

    Tuple { elements: Vec<Declaration> },
    Struct { fields: Fields },
    ...
pub enum Fields {
    NamedFields(Vec<(FieldName, Declaration)>),
    UnnamedFields(Vec<Declaration>),
    Empty,
}

thus, i believe, to keep up the same logic about the Definition as a whole, it should be either:

// max info
pub enum Definition {
    Sequence {    // extends prev `Sequence` and replaces `Array`
        length_width: u8,
        min_length: u64,
        max_length: u64,
        elements: Declaration,
    },
    Tuple {    // stays the same
        elements: Vec<Declaration>,
    },
    Enum {  // extended with `tag_width` and `u64` added to variants tuple
        tag_width: u8,
        variants: Vec<(u64, VariantName, Declaration)>,
    },
    Struct { fields: Fields },  // stays the same
}

pub enum Fields {
    NamedFields(Vec<(FieldName, Declaration)>),
    UnnamedFields(Vec<Declaration>),
    Empty,
}

or

// only serialized form
pub enum Definition {
    Sequence {  // changes
        length_width: u8,
        min_length: u64,
        max_length: u64,
        elements: Declaration,
    },
    Tuple {  // stays the same, replaces `Struct`
        elements: Vec<Declaration>,
    },
    Enum {    // changes
        tag_width: u8,
        variants: BTreeMap<u64, Declaration>,    // the order of variants, as they were declared in rust enum,  
    }
}

@frol what do you think?

As there's also process of deserialization, which turns serialized bytes back into rust forms, i vote for variant with max info, where adding u64 to Vec<(u64, VariantName, Declaration)> would be the most complex change.

On the other hand, if BorshSchema's job is to only concisely describe the format of output data, and the task of implementing actual conversions is delegated to BorshSerialize and BorshDeserialize traits, then the minimal variant would be better.

mina86 commented 12 months ago

@mina86 the change to Definition, proposed in comment is a little bit inconsistent with respect to mixing information about serialized form and the rust syntax forms that serialized representation originated from.

Note that one more piece of information one has to go on is declaration. Tuples are called Tuple<...> which can be used to distinguish a tuple from a struct. (Though as a side note I’d argue for replacing the declaration by (...) to not clash with someone creating a generic struct called Tuple). Even if it’s important to keep information about the rust types, I don’t think Fields is necessary for it. (I’d be happy removing all information not related to encoding but I don’t have strong feelings).

Lastly, Enum::tag_width is being added in both situations so can we get this PR merged regardless? ;)

frol commented 12 months ago

@dj8yfo I don't have full context about Borsh Schema, and I still believe there are more problems with it than it solves, so anything that kind of works is fine with me. I am fine with iterative approach and this PR only adds tag_width, so we don't need to block it.

dj8yfo commented 12 months ago

@mina86 with all of that articulated,

so can we get this PR merged regardless?

yes i'd be happy to see

    Sequence {    // extends prev `Sequence` and replaces `Array`
        length_width: u8,
        min_length: u64,
        max_length: u64,
        elements: Declaration,
    },

in another pr.

If you also considered adding info about enum discriminants' values (in this or another pr), that would be consistent with current pr's drive for adding details about data layout:

    Enum {  // extended with `tag_width` and `u128` added to variants tuple
        tag_width: u8,
        variants: Vec<(u128, VariantName, Declaration)>,
    },

The change of this pr might be useful not in the context of having enums defined with more than 256 variants, but having enums defined with reasonable number of variants which have discriminants outside of u8 range.

Tuples are called Tuple<...>

Due to the clash of names, i think there's no need to introduce any change to Definition::{Struct, Tuple} , it looks more like a change for the sake of a change, not adding anything new.

mina86 commented 12 months ago

If you also considered adding info about enum discriminants' values (in this or another pr), that would be consistent with current pr's drive for adding details about data layout:

I can look into it, but I’m essentially doing all this in my spare time so I give no promises that I’ll add discriminants to the definition.

dj8yfo commented 12 months ago

it might actually be

    Enum {  // extended with `tag_width` and `i64` added to variants tuple
        tag_width: u8,
        variants: Vec<(i64, VariantName, Declaration)>,
    },

as rust allows isize , if BorshSchema were to mirror what rust specifies for discriminants

I give no promises that I’ll add discriminants to the definition

no, i just meant it would be a more meaningful investment of time for this repo than churning Struct/Tuple again

mina86 commented 12 months ago

no, i just meant it would be a more meaningful investment of time for this repo than churning Struct/Tuple again

Difference is I already have Struct/Tuple mostly written. And adding discriminant to Enum seems like a huge can of worms.

dj8yfo commented 12 months ago

Difference is I already have Struct/Tuple mostly written

well, then you should be ready it gets rejected if it doesn't look absolutely stunning )

dj8yfo commented 12 months ago

seems like a huge can of worms.

adding #[borsh(tag_type = "i16")] for derives would be, just adding the discriminants to variants' tuples shouldn't be too hard

mina86 commented 12 months ago

I think I don’t understand GitHub… I’m trying to comment on your changes but my comment don’t seem to appear.

Firstly, why is actually dynamically-sized sequence of zero-sized encoding forbidden?

Secondly, rather than calling is_zero_size it’s probably better to check sz == 0 not to duplicate the work.

dj8yfo commented 12 months ago

I’m trying to comment on your changes but my comment don’t seem to appear.

there must be this button in upper right corner Finish your review

near / borsh-rs

Add Definition::Enum::tag_width field #215