Textual representation of the wire format of a type (recursively)

ia0 commented 1 month ago

I would like to have a textual representation of the wire format of a type. I would use it as a generated file in my repository during reviews. The format could look like an annotated grammar of the wire format. Here is an example of the workflow I have in mind:

Let's assume a crate foo in a repository, that contains a few serializable types like Foo and Bar.
Next to that crate, I would have some foo.postcard file containing a textual representation of Foo in wire format (essentially the language recognized by deserialization, which is a bit more than the one produced by serialization because postcard is not canonical).
I have a CI test to make sure that file is in sync with the code.
During review, if that CI test is green, then I can review the postcard file to see how the wire format is changing, and if I consider it an acceptable change or not (and also how it impacts versioning).
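
The sync check in this workflow could be sketched as follows. This is a minimal illustration under assumptions, not something postcard provides: the names `first_stale_line` and `check_in_sync` are hypothetical, and the generated text is assumed to come from whatever tool renders the wire format.

```rust
use std::fs;
use std::path::Path;

/// Compares two texts line by line (a missing final newline is ignored).
/// Returns the 1-based number of the first differing line, or `None` when in sync.
fn first_stale_line(checked_in: &str, generated: &str) -> Option<usize> {
    let mut old = checked_in.lines();
    let mut new = generated.lines();
    let mut line = 1;
    loop {
        match (old.next(), new.next()) {
            (None, None) => return None,
            (a, b) if a == b => line += 1,
            _ => return Some(line),
        }
    }
}

/// Hypothetical CI helper: reads the checked-in file (e.g. `foo.postcard`)
/// and returns an error message when it no longer matches the generated text.
fn check_in_sync(path: &Path, generated: &str) -> Result<(), String> {
    let checked_in =
        fs::read_to_string(path).map_err(|e| format!("{}: {e}", path.display()))?;
    match first_stale_line(&checked_in, generated) {
        None => Ok(()),
        Some(line) => Err(format!(
            "{} is out of date (first difference at line {line})",
            path.display()
        )),
    }
}
```

A test in the crate would call `check_in_sync` with the path of the checked-in file and the freshly generated text, and fail on `Err`.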

The textual representation can simply be some annotated grammar of the language recognized by postcard for the type. For example:

enum Foo { A(FooA), B { b1: FooB1, b2: FooB2 } }
struct Bar { a: BarA, b: BarB }

Would become something like this:

Foo |=
| A=0x00 FooA
| B=0x01 b1=FooB1 b2=FooB2

Bar &=
& a=BarA
& b=BarB

Note how named things are prefixed with name=. Those annotations do not affect the wire format (the language recognized). Terminals are just bytes (0x00 to 0xff). Identifiers (Rust paths) are non-terminals unless they are a name. Also note the difference between "unions" using |= and "sequences" using &= (the symbol is repeated on each line to support empty unions and sequences). Ideally the file would recursively contain all definitions (here FooA, FooB1, etc).
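
To make the notation concrete, here is a minimal sketch of a renderer for it. The types `Symbol`, `Item`, and `Rule` and the function names are hypothetical and only model the grammar described above; they are not part of postcard.

```rust
/// One symbol on the right-hand side of a rule.
enum Symbol {
    /// A terminal: one literal byte on the wire.
    Byte(u8),
    /// A non-terminal: a reference to another rule by its Rust path.
    Rule(&'static str),
}

/// A symbol with an optional `name=` annotation (no effect on the wire format).
struct Item {
    name: Option<&'static str>,
    symbol: Symbol,
}

/// A rule is a union (`|=`) of alternatives or a sequence (`&=`) of items.
enum Rule {
    Union(Vec<Vec<Item>>),
    Sequence(Vec<Item>),
}

fn render_item(item: &Item) -> String {
    let symbol = match &item.symbol {
        Symbol::Byte(b) => format!("0x{b:02x}"),
        Symbol::Rule(path) => path.to_string(),
    };
    match item.name {
        Some(name) => format!("{name}={symbol}"),
        None => symbol,
    }
}

/// Renders one rule; the operator is repeated on every line, so an empty
/// union or sequence is just the header line.
fn render_rule(name: &str, rule: &Rule) -> String {
    let mut out = String::new();
    match rule {
        Rule::Union(alternatives) => {
            out.push_str(&format!("{name} |=\n"));
            for alternative in alternatives {
                let items: Vec<String> = alternative.iter().map(render_item).collect();
                out.push_str(&format!("| {}\n", items.join(" ")));
            }
        }
        Rule::Sequence(items) => {
            out.push_str(&format!("{name} &=\n"));
            for item in items {
                out.push_str(&format!("& {}\n", render_item(item)));
            }
        }
    }
    out
}

/// The `Foo` enum from the example, written out by hand.
fn foo_rule() -> Rule {
    Rule::Union(vec![
        vec![
            Item { name: Some("A"), symbol: Symbol::Byte(0x00) },
            Item { name: None, symbol: Symbol::Rule("FooA") },
        ],
        vec![
            Item { name: Some("B"), symbol: Symbol::Byte(0x01) },
            Item { name: Some("b1"), symbol: Symbol::Rule("FooB1") },
            Item { name: Some("b2"), symbol: Symbol::Rule("FooB2") },
        ],
    ])
}
```

Calling `render_rule("Foo", &foo_rule())` reproduces the `Foo |=` block shown above.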

In my case, I would consider the following change acceptable (it forgets a name, thus doesn't affect the wire format):

-enum Foo { A(FooA), B { b1: FooB1, b2: FooB2 } }
+enum Foo { A(FooA), B(FooB1, FooB2) }

Would result in the following diff:

Foo |=
| A=0x00 FooA
-| B=0x01 b1=FooB1 b2=FooB2
+| B=0x01 FooB1 FooB2

I would also consider the following diff acceptable (when deprecating a variant):

-enum Foo { A(FooA), B { b1: FooB1, b2: FooB2 } }
+enum Foo { _A(Infallible), B { b1: FooB1, b2: FooB2 } }

Foo |=
-| A=0x00 FooA
+| _A=0x00 Infallible
| B=0x01 b1=FooB1 b2=FooB2

Infallible |=

I suspect the experimental "schema" feature could be useful, however I see at least 2 problems:

One that I'm trying to fix in #142.
The fact that schema is not exactly the wire format, but something a bit more high-level. It's possible to write the function as a user, but that would encode a bit of the postcard format logic in user code. I think it would be better for postcard to directly provide a wire-level description like this:

pub type WireSchema = Named<WireRule>; // name is the type name (could be required)
pub struct Named<T> {
    name: Option<&'static str>,
    object: T,
}
pub enum WireRule {
    Union(&'static [Named<WireSequence>]), // name is the variant name (could be required)
    Sequence(WireSequence),
}
pub type WireSequence = &'static [Named<WireToken>]; // name is the field name (optional)
pub enum WireToken {
    Varint(Varint),
    U8, I8, F32, F64, // notice that Char is missing (it can be defined from the other tokens)
    Seq(WireSequence), // length-encoded sequence, pretty-printed as `n*(...)`
    Constant(u8),
    Schema(&'static str),
}

Note that maps are really just n*(k v) where k and v are schema names. And byte sequences are just n*u8. By definition of Seq, n is varint(usize).
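
As an illustration of the token level, here is a self-contained sketch of how a map and a byte sequence could be written and pretty-printed with the proposed types. The `Varint` enum is left undefined in the proposal, so the one below is an assumption, as are the `render_*` functions and the choice to drop the parentheses around a single-token body.

```rust
/// Assumed definition: the proposal leaves `Varint` unspecified.
pub enum Varint {
    U32,
    Usize,
}

pub struct Named<T> {
    pub name: Option<&'static str>,
    pub object: T,
}

pub type WireSequence = &'static [Named<WireToken>];

pub enum WireToken {
    Varint(Varint),
    U8,
    I8,
    F32,
    F64,
    Seq(WireSequence), // length-encoded sequence, `n` is varint(usize)
    Constant(u8),
    Schema(&'static str),
}

/// A map: `n*(k v)` where `k` and `v` are schema names.
pub const MAP: WireToken = WireToken::Seq(&[
    Named { name: None, object: WireToken::Schema("k") },
    Named { name: None, object: WireToken::Schema("v") },
]);

/// A byte sequence: `n*u8`.
pub const BYTES: WireToken = WireToken::Seq(&[Named { name: None, object: WireToken::U8 }]);

pub fn render_sequence(items: &[Named<WireToken>]) -> String {
    let rendered: Vec<String> = items
        .iter()
        .map(|item| {
            let token = render_token(&item.object);
            match item.name {
                Some(name) => format!("{name}={token}"),
                None => token,
            }
        })
        .collect();
    rendered.join(" ")
}

pub fn render_token(token: &WireToken) -> String {
    match token {
        WireToken::Varint(Varint::U32) => "varint(u32)".to_string(),
        WireToken::Varint(Varint::Usize) => "varint(usize)".to_string(),
        WireToken::U8 => "u8".to_string(),
        WireToken::I8 => "i8".to_string(),
        WireToken::F32 => "f32".to_string(),
        WireToken::F64 => "f64".to_string(),
        WireToken::Constant(byte) => format!("0x{byte:02x}"),
        WireToken::Schema(name) => name.to_string(),
        // Parentheses are omitted when the body is a single token.
        WireToken::Seq(items) if items.len() == 1 => format!("n*{}", render_sequence(items)),
        WireToken::Seq(items) => format!("n*({})", render_sequence(items)),
    }
}
```

`WireRule` and `WireSchema` are omitted here to keep the sketch short; they would wrap `WireSequence` as in the proposal.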

jamesmunns commented 1 month ago

Hey @ia0, I'm definitely open to a separate tool that consumes the Schema information and outputs some consistent grammar or other reviewable output. If you do this, please feel free to open a PR to the README to link to it, and I would be open to potentially upstreaming it in the future.

At the moment, I believe you could print the schema with Debug, or serialize it to some format such as JSON, and check that with something like insta, but I am open to a more purpose-built tool.

I think this would be a useful stepping stone towards eventually generating ser/de impls in languages other than Rust (and not with Serde).

ia0 commented 1 month ago

I'll try to get something working on my side first, since the wire format is stable (I don't expect it to change on the parts that I'm using; I'm not using char, so I'm not worried about #101). If I'm satisfied with what I have and believe it makes sense to be part of postcard, I'll update this issue.

jamesmunns commented 1 month ago

For what it's worth, I plan to release postcard 2.0 soon, BUT I plan to keep the 1.0 wire format, e.g. I won't address #101 in postcard 2.0, but rather at the next breaking wire format change.

I'm definitely interested in seeing what you build! Part of the intent of the Schema derive was to be able to do these kinds of things, I just hadn't gotten to it yet.

ia0 commented 1 month ago

I went with this wire representation and convert from SdmTy here. Here is what the output looks like:

% cd crates/protocol
% cargo run --example=schema --features=_schema
DeviceError=0: {} -> (space:u8 code:u16)
AppletRequest=1: (applet_id:{Default:(0:u32)} request:(n:usize u8^n)) -> ()
AppletResponse=2: {Default:(0:u32)} -> (response:{None:(0:u32) Some:(1:u32 (n:usize u8^n))})
PlatformReboot=3: () -> {}
AppletTunnel=4: (applet_id:{Default:(0:u32)} delimiter:(n:usize u8^n)) -> ()
PlatformInfo=5: () -> (serial:(n:usize u8^n) version:(n:usize u8^n))
PlatformVendor=6: (n:usize u8^n) -> (n:usize u8^n)

The general format is <variant>=<discriminant>: <request> -> <response> where only request and response are a pretty-print of the wire format. The pretty print uses parentheses for concatenation and curly braces for disjoint union (using a u32 discriminant). Names are optional and prefixed as <name>:. The special (n:usize <wire>^n) is a length-encoded array. Maybe I'll use [<wire>] instead.
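
For readers who want to see the shape of such a printer, here is a minimal sketch that reproduces the notation above. It is not the actual implementation: the `Wire` and `Field` types and all function names are hypothetical, and treating `0:u32` as a discriminant value followed by its type is my reading of the output.

```rust
enum Wire {
    /// A primitive such as `u8` or `u16`.
    Prim(&'static str),
    /// A u32 discriminant value, printed as `<value>:u32` (assumed reading).
    Tag(u32),
    /// Concatenation, printed in parentheses.
    Concat(Vec<Field>),
    /// Disjoint union, printed in curly braces.
    Union(Vec<Field>),
    /// Length-encoded array, printed as `(n:usize <wire>^n)`.
    Array(Box<Wire>),
}

struct Field {
    name: Option<&'static str>,
    wire: Wire,
}

fn named(name: &'static str, wire: Wire) -> Field {
    Field { name: Some(name), wire }
}

fn unnamed(wire: Wire) -> Field {
    Field { name: None, wire }
}

fn render_fields(fields: &[Field]) -> String {
    let rendered: Vec<String> = fields
        .iter()
        .map(|field| match field.name {
            Some(name) => format!("{name}:{}", render(&field.wire)),
            None => render(&field.wire),
        })
        .collect();
    rendered.join(" ")
}

fn render(wire: &Wire) -> String {
    match wire {
        Wire::Prim(name) => name.to_string(),
        Wire::Tag(value) => format!("{value}:u32"),
        Wire::Concat(fields) => format!("({})", render_fields(fields)),
        Wire::Union(fields) => format!("{{{}}}", render_fields(fields)),
        Wire::Array(item) => format!("(n:usize {}^n)", render(item)),
    }
}

/// One output line: `<variant>=<discriminant>: <request> -> <response>`.
fn render_line(variant: &str, discriminant: u32, request: &Wire, response: &Wire) -> String {
    format!("{variant}={discriminant}: {} -> {}", render(request), render(response))
}

/// The response side of the `AppletResponse` line, built by hand.
fn applet_response() -> Wire {
    let bytes = Wire::Array(Box::new(Wire::Prim("u8")));
    Wire::Concat(vec![named(
        "response",
        Wire::Union(vec![
            named("None", Wire::Concat(vec![unnamed(Wire::Tag(0))])),
            named("Some", Wire::Concat(vec![unnamed(Wire::Tag(1)), unnamed(bytes)])),
        ]),
    )])
}
```

An empty `Union` renders as `{}` and an empty `Concat` as `()`, which matches the `DeviceError` and `PlatformReboot` lines.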

jamesmunns / postcard

Textual representation of the wire format of a type (recursively) #143