Open jamesmunns opened 1 year ago
Also: If you think you might want this or want to try this out before it releases, feel free to sound off here as well. I'll keep you in the loop whenever I have something ready to try.
Here's some prior art that was shown to me (a different way of generating the schema): https://docs.rs/serde-reflection/latest/serde_reflection/
It's good to see their Serde Data Model types match fairly 1:1 with what we came up with.
Pros: It doesn't require a second derive.
Cons: It can't be built at compile time.
This issue is more about what to do with that schema, but I should probably review their "Features and Limitations", as we will likely have similar constraints.
I'm not sure if the following thoughts are related, but here's what I faced recently:
What surprised me at first was that, even though postcard uses serde's derives, it doesn't support everything serde does for other formats. I get that there are no guarantees that a format would implement all serde features, but there are some surprises (e.g. #[serde(untagged)] will only error when deserializing, not when serializing).
This also means that we can only support the lowest common set of features for the formats we support (that's just JSON and postcard right now). Offering an untagged enum for a client API feels nice for forward compatibility, but we can't use the same types since that's unsupported by postcard.
Given there's no support for untagged enums, unless we've used enums from the very beginning to version every input/output, we can't make our schema evolve. Even if we decided to move from a non-enum to an enum, this is a breaking change.
That's because we store our data in a persistent store and retrieving it (deserializing it in the process) is not possible if the types have changed in any way.
This is the big reason why we need to version things: we can't add new fields, even if they have default values. I'm looking at alternative formats that would at least allow that. For example, speedy supports default_on_eof. That's likely not something postcard could support because it is bound to serde's API. It would have to either make a breaking change (though it's kind of an additive change, not sure if it should be considered breaking) or switch to non-serde.
For context, we're accepting client data as JSON, we're exchanging data between nodes and storing data in postcard's format, and we're also storing data in SQLite. When we add new columns in SQLite, we have to either give them a default value or make them nullable. This works well with new struct fields marked #[serde(default)], but that doesn't work (adding fields at the end of a struct) with postcard, forcing us to create a new "version" of our schema and add a lot more logic to handle all versions.
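To make the "new trailing field with a default" problem concrete, here is a minimal sketch. It uses a hypothetical hand-rolled format (one byte per field, in declaration order), not postcard's or speedy's actual encoding, and the Reading type and DEFAULT_BATTERY value are made up for illustration. A strict decoder rejects old records once a field is appended, while a decoder that falls back to a default at end of input keeps reading them.

```rust
// Hypothetical hand-rolled wire format, NOT postcard's encoding:
// every field is a single byte, written in declaration order.

#[derive(Debug, PartialEq)]
struct Reading {
    temp: u8,
    humidity: u8,
    // Added in a later release; records written earlier do not contain it.
    battery: u8,
}

// Assumed default for records that predate the `battery` field.
const DEFAULT_BATTERY: u8 = 100;

#[derive(Debug, PartialEq)]
enum DecodeError {
    UnexpectedEof,
}

/// Strict decoding: every field must be present in the input.
fn decode_strict(bytes: &[u8]) -> Result<Reading, DecodeError> {
    let mut it = bytes.iter().copied();
    let temp = it.next().ok_or(DecodeError::UnexpectedEof)?;
    let humidity = it.next().ok_or(DecodeError::UnexpectedEof)?;
    let battery = it.next().ok_or(DecodeError::UnexpectedEof)?;
    Ok(Reading { temp, humidity, battery })
}

/// Lenient decoding: if the input ends before the trailing field,
/// fall back to a default instead of failing.
fn decode_default_on_eof(bytes: &[u8]) -> Result<Reading, DecodeError> {
    let mut it = bytes.iter().copied();
    let temp = it.next().ok_or(DecodeError::UnexpectedEof)?;
    let humidity = it.next().ok_or(DecodeError::UnexpectedEof)?;
    let battery = it.next().unwrap_or(DEFAULT_BATTERY);
    Ok(Reading { temp, humidity, battery })
}
```

With a two-byte record written before the field existed, decode_strict returns an error, while decode_default_on_eof yields a Reading with the default battery value. This only works for fields appended at the end.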
Hey @jeromegn - thanks for the input, and particularly for reminding me about the limitations around serde tunables like untagged, flatten, etc.
I don't know if I have any answers yet, but these are really good data points, so I appreciate it!
At the moment with the "schema on the side" approach, I do expect "deserializing with schema" to be slower than "deserializing without schema", just because it has to do more. I have no idea the order of magnitude increase tho.
For use cases in a database, it might be possible to do an "upgrade" approach, either as part of a migration or "update on access" to switch to the newest schema when you run into old schemas, but I don't have a great story around that yet. Mostly I don't want to make postcard "worse" for existing users who are fine with the current limitations, which means I'm sorta limited to either doing something "on the side", or to make a different library "inspired" by postcard which is more flexible, at some perf/size cost. This would bring it more in line with things like cbor or protobuf.
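A rough sketch of what "update on access" could look like, assuming records are stored with a leading version byte. The RecordV1/RecordV2 types, the byte layout, and the in-memory HashMap standing in for the database are all hypothetical; nothing here is postcard API.

```rust
use std::collections::HashMap;

/// Old stored shape (hypothetical).
#[derive(Debug, PartialEq, Clone)]
struct RecordV1 {
    temp: u8,
}

/// Newest stored shape (hypothetical).
#[derive(Debug, PartialEq, Clone)]
struct RecordV2 {
    temp: u8,
    humidity: u8,
}

#[derive(Debug, PartialEq, Clone)]
enum Stored {
    V1(RecordV1),
    V2(RecordV2),
}

impl Stored {
    /// Assumed layout: one version byte, then one byte per field.
    fn decode(bytes: &[u8]) -> Option<Stored> {
        match bytes {
            [1, temp] => Some(Stored::V1(RecordV1 { temp: *temp })),
            [2, temp, humidity] => Some(Stored::V2(RecordV2 {
                temp: *temp,
                humidity: *humidity,
            })),
            _ => None,
        }
    }

    fn encode(&self) -> Vec<u8> {
        match self {
            Stored::V1(r) => vec![1, r.temp],
            Stored::V2(r) => vec![2, r.temp, r.humidity],
        }
    }

    /// Convert any known version into the newest one.
    fn upgrade(self) -> RecordV2 {
        match self {
            // Humidity did not exist in V1, so pick a default.
            Stored::V1(old) => RecordV2 { temp: old.temp, humidity: 0 },
            Stored::V2(current) => current,
        }
    }
}

/// Read a record and, if it was stored in an old layout,
/// write it back in the newest layout.
fn read_and_upgrade(store: &mut HashMap<u32, Vec<u8>>, key: u32) -> Option<RecordV2> {
    let stored = Stored::decode(store.get(&key)?)?;
    let was_old = matches!(stored, Stored::V1(_));
    let latest = stored.upgrade();
    if was_old {
        store.insert(key, Stored::V2(latest.clone()).encode());
    }
    Some(latest)
}
```

The cost is that every old version has to stay decodable until all records have been touched or migrated, which is the extra versioning logic described earlier in the thread.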
LoRaWan
If you're not familiar with lorawan and want to know more about it, TTN has great docs. For this discussion all you really need to know is that data rates are low (~1-22kb/s), and devices communicate directly and exclusively with a gateway.
Currently, packets either comply with the CayenneLPP spec, or are hand crafted on the device and hand parsed by a "codec" on the gateway. The codec is, per that spec, written in JS. Cayenne is pretty nice, but if your application doesn't fit into its mold then the fallback of hand crafting packets is pretty dire.
I see two ways to make this situation better. The first would be to extend the gateway to allow webasm binaries in addition to JS. This would be similar to how JS is used now, except you could use postcard as it currently exists and get around hand-crafting packets. But, that's an aside as far as this issue is concerned.
The other solution would be something like a self describing postcard. The gateway would still need to be extended. That's not a big problem though, because the de-facto[1] standard gateway, ChirpStack, was recently rewritten in Rust, which makes adding this feature (once it exists in postcard) nearly trivial. This is similar to Cayenne, but much more flexible. The big downside is that, at least currently, the codec stores no state. The gateway does store state for each device though, so it might be possible to store the description there.
For this application, additional size cost is of concern, but the perf cost is negligible. Lorawan devices don't typically uplink data more often than once a minute.
[1] The two biggest public networks are TTN and Helium. TTN uses chirpstack, and Helium is planning on moving to chirpstack. AFAIK, most ISPs that deploy a lorawan network also use chirpstack, but I don't know if that's always true.
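As a sketch of the "store the description in the gateway's per-device state" idea: the gateway keeps a field list per device and decodes uplinks from that description alone, without knowing the Rust type. Everything here is an assumption made for illustration: the Gateway, Field and Value types, the u64 device id, and the byte layout (one byte per u8, four little-endian bytes per f32). It is not ChirpStack's codec API or postcard's schema format.

```rust
use std::collections::HashMap;

/// Hypothetical field description, far simpler than a real schema.
#[derive(Debug, Clone, PartialEq)]
enum FieldType {
    U8,
    F32,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    ty: FieldType,
}

#[derive(Debug, PartialEq)]
enum Value {
    U8(u8),
    F32(f32),
}

/// Decode a payload using only the field description.
/// Assumed layout: u8 is one byte, f32 is four little-endian bytes.
fn decode_with_schema(schema: &[Field], mut payload: &[u8]) -> Option<Vec<(String, Value)>> {
    let mut out = Vec::new();
    for field in schema {
        let value = match field.ty {
            FieldType::U8 => {
                let (&byte, rest) = payload.split_first()?;
                payload = rest;
                Value::U8(byte)
            }
            FieldType::F32 => {
                if payload.len() < 4 {
                    return None;
                }
                let (head, rest) = payload.split_at(4);
                payload = rest;
                let mut buf = [0u8; 4];
                buf.copy_from_slice(head);
                Value::F32(f32::from_le_bytes(buf))
            }
        };
        out.push((field.name.clone(), value));
    }
    // Trailing bytes mean the payload does not match the description.
    if payload.is_empty() {
        Some(out)
    } else {
        None
    }
}

/// Hypothetical gateway state: one description per device id.
struct Gateway {
    schemas: HashMap<u64, Vec<Field>>,
}

impl Gateway {
    /// Store the description once, e.g. when the device joins.
    fn register(&mut self, device_id: u64, schema: Vec<Field>) {
        self.schemas.insert(device_id, schema);
    }

    /// Decode an uplink using the description stored for that device.
    fn decode_uplink(&self, device_id: u64, payload: &[u8]) -> Option<Vec<(String, Value)>> {
        decode_with_schema(self.schemas.get(&device_id)?, payload)
    }
}
```

Because the description is sent once and kept on the gateway, the per-uplink size cost stays at zero extra bytes, which matters more here than decode speed.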
A usecase we have in practice that was not mentioned here: we are also looking for a way to just "hash" the schema cryptographically, so we don't necessarily want to understand the schema. This way we can give postcard data stronger typing and assert that postcard data has the semantics we expect. JSON is more advantageous in this sense because named fields give a few more guarantees about the semantics of the data.
Thanks for the hint about the serde_reflection crate! I did some more research and saw that supporting schemas in relation to serde has already been discussed a few times, e.g. see https://github.com/serde-rs/serde/issues/345 which is about proposing a generalized way to create schemas in serde. In https://github.com/serde-rs/serde/issues/1785#issuecomment-624493760 a few very interesting crates (also serde_reflection) are mentioned, most notably schemars.
I believe that this ultimately requires general support (not wanting to say "serde support"). But I believe serde should provide a way to walk across the AST of a serde structure. Protocol implementations can then provide a schema generator that infers a postcard schema, JSON schema or (for our usecase) a schema "hash". I reckon that in combination with const_trait_impl, see https://github.com/rust-lang/rust/issues/67792, this crate's MAX_SIZE can also be implemented without a macro.
I understand that such a thing has not been accepted into serde because it is hard to get right. It could be a strategy to align this crate's Schema implementation with the implementation of schemars, find common patterns and then hopefully bring these into serde as a general concept.
@therealfrauholle for reference, the experimental schema capabilities of postcard here: https://docs.rs/postcard/latest/postcard/experimental/schema/index.html, DO support Hash (edit: on the generated schema field), and you likely could come up with your own cryptographic way of hashing the schema if the default hasher doesn't fit your needs.
edit: you could send this hash as part of the "header" or "ID" of a message type to ensure coherence.
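A minimal sketch of the "hash in the header" idea, assuming the schema hash is already available as a u64 and is sent as an 8-byte little-endian prefix. The frame/unframe helpers and the FrameError type are hypothetical names for illustration, not part of postcard.

```rust
/// Prefix a payload with an 8-byte schema hash (little-endian).
fn frame(schema_hash: u64, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + payload.len());
    out.extend_from_slice(&schema_hash.to_le_bytes());
    out.extend_from_slice(payload);
    out
}

#[derive(Debug, PartialEq)]
enum FrameError {
    /// Fewer than 8 bytes: no room for the hash prefix.
    TooShort,
    /// The sender used a different schema than the receiver expects.
    SchemaMismatch { expected: u64, found: u64 },
}

/// Check the prefix against the hash the receiver expects and
/// return the remaining payload bytes only if they match.
fn unframe(expected: u64, bytes: &[u8]) -> Result<&[u8], FrameError> {
    if bytes.len() < 8 {
        return Err(FrameError::TooShort);
    }
    let (head, payload) = bytes.split_at(8);
    let mut buf = [0u8; 8];
    buf.copy_from_slice(head);
    let found = u64::from_le_bytes(buf);
    if found == expected {
        Ok(payload)
    } else {
        Err(FrameError::SchemaMismatch { expected, found })
    }
}
```

The receiver never needs to understand the schema itself; it only compares the prefix against the hash of the type it was compiled with and refuses to deserialize on a mismatch. Eight bytes per message may be too much for a constrained link, in which case the hash could be exchanged once per connection instead.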
The largest reason this hasn't stabilized yet is that I haven't decided whether the schema should hash for JUST "structural" typing or "structural AND nominal" typing.
As an example:
// A - base case
struct Example {
    temp: f32,
    humidity: f32,
}

// B - Type name changed
struct ExaMple {
    temp: f32,
    humidity: f32,
}

// C - fields reordered, but type sequence still the same
struct Example {
    humidity: f32,
    temp: f32,
}

// D - one field renamed, no semantic or structural change
struct Example {
    temperature: f32,
    humidity: f32,
}
Which of these structs should be "the same schema"? If we JUST use structural typing, they are ALL the same (basically: (f32, f32)).
If we only look at nominal typing of the FIELDS, A + B would be equivalent, but none of the others are.
If we look at ALL nominal typing, NONE would be equivalent.
Chances are, the best option is to pick "nominal and structural of types and fields" as the default, but document how someone could implement something different.
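The three options can be sketched with a toy schema description and the standard library hasher. StructSchema, Mode and schema_hash are hypothetical names, not postcard's experimental schema types, and DefaultHasher is only a stand-in: its output is not guaranteed to be stable across Rust releases, so a real protocol would want a fixed, documented hash function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical, minimal schema description of a struct.
struct StructSchema {
    type_name: &'static str,
    /// (field name, field type), in declaration order.
    fields: Vec<(&'static str, &'static str)>,
}

#[derive(Clone, Copy)]
enum Mode {
    /// Only the sequence of field types.
    Structural,
    /// Field names and field types, but not the type name.
    FieldNames,
    /// Type name, field names and field types.
    FullyNominal,
}

fn schema_hash(schema: &StructSchema, mode: Mode) -> u64 {
    // Stand-in hasher; not stable across Rust versions.
    let mut hasher = DefaultHasher::new();
    if let Mode::FullyNominal = mode {
        schema.type_name.hash(&mut hasher);
    }
    for (name, ty) in &schema.fields {
        if !matches!(mode, Mode::Structural) {
            name.hash(&mut hasher);
        }
        ty.hash(&mut hasher);
    }
    hasher.finish()
}

/// Helper to build the example schemas A to D from the comment above.
fn example(type_name: &'static str, fields: &[(&'static str, &'static str)]) -> StructSchema {
    StructSchema {
        type_name,
        fields: fields.to_vec(),
    }
}
```

Building A to D with example(...) and hashing them in each mode reproduces the cases listed above: Structural makes all four hashes equal, FieldNames makes only A and B equal, and FullyNominal makes all four differ.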
TL;DR: I want to hear from you if you have ever needed postcard to do something ("something" is defined below) that it doesn't today.
Background
Postcard is generally very efficient on the wire, partly because it is not "self describing" - the messages themselves give no hint or expectation on how they are to be deserialized.
In optimal cases, where both sides of the communication are Rust, and use the same serde representation/type definition (e.g. they share a common "types" crate that defines the wire types), this is great, and both sides understand each other.
However there are some sub-optimal cases:
Today
I'm currently looking into ways it would be possible to augment postcard data with schema information, so the "sub-optimal" cases listed above could be handled.
To be clear - postcard's core format will not change.
Ideally this would be an "optional add-on" - something you can use contextually, sometimes even after the fact, to enable those suboptimal cases, or as "extra metadata" you could send either with every message, or "on first connection", or "on request", or whatever makes sense for your link budget.
If that isn't possible, I'd probably look into making this a second crate "inspired by postcard", which can be used when a little more overhead is worth the flexibility.
That being said - I'm trying not to focus as much on "how" to make this possible yet, and instead looking at "what is needed". Discussions of "how to do this" are out of scope for this issue's comments.
What I need
Instead of blindly implementing what I THINK would be useful (to me, at least), I'd like to hear from folks who have run into the sub-optimal cases above, or even ones that I didn't list above. This will help me make sure whatever I end up researching/implementing covers the actual needs/gaps in today's postcard.
Ideally, I'd like to keep this discussion public, but I am also willing to have a private chat via email or matrix (contact info on my profile, or ask here), and I am willing to sign/provide an MNDA to discuss any proprietary usage that might benefit from changes like the ones proposed.
Thank you!