turt2live opened 1 year ago
The following four encoding options were mentioned in @rohan-wire's IETF 118 slides: TLS Presentation Language, CBOR, Protobuf, and MessagePack.
One of our goals with MIMI is to create a protocol that is easy for people to implement, in order to drive rapid adoption (and developer happiness!). A mature, well-tested encoding library ensures correctness and speed when deployed at scale.
Encoding schemes which do not include the structure as part of the encoded output will naturally have a small footprint. The downside is that the receiver must know the schema before decoding the payload.
This can cause forwards-compatibility issues. For instance, if we have the following struct (taken from draft-ietf-mimi-content-01):
struct MimiContent {
    MessageId messageId;
    uint64 timestamp;
    MessageId replaces;
    Octets topicId;
    uint32 expires;
    ReplyToInfo inReplyTo;
    NestablePart body;
};
and want to add a new field:
std::vector<MessageId> lastSeen;
...old clients will not be able to deserialise a MimiContent struct with the new field. Adding new fields to top-level structs is likely to be an infrequent affair, however. We can also scope these changes to a specific version of a chat room (similar to how Matrix room versions specify which new features can be used in a room) in order to require clients to update to support the new schemas before participating in a chat room that uses them.
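As a rough illustration of the breakage (not the actual MIMI wire format), a fixed-schema decoder written against the old struct rejects a payload that carries a field it does not know about:

import struct

# Hypothetical "old" decoder for a two-field struct: uint64 timestamp, uint32 expires.
def decode_v1(buf: bytes) -> dict:
    if len(buf) != 12:
        raise ValueError("length mismatch: payload does not match the known schema")
    timestamp, expires = struct.unpack("!QI", buf)
    return {"timestamp": timestamp, "expires": expires}

# A "new" sender appends a field the old decoder knows nothing about.
new_payload = struct.pack("!QIQ", 1_700_000_000, 3600, 42)
decode_v1(new_payload)  # raises ValueError: the old client cannot parse the new struct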
I was initially more concerned about the extensibility of messaging content, as we would like to support vendors adding custom message types. This is also useful when testing new message types in real implementations before adding them to an upcoming MIMI protocol version.
However, we can get around this by encoding message content using a separate schema struct. The current draft semantics already cover this, where a SinglePart looks like:
struct SinglePart {
    String contentType;  // An IANA media type {10}
    Octets content;      // The actual content
};
with content being opaque to the top-level schema. A client implementation can look at the contentType and check whether it recognises it. If so, it uses the schema it has for that contentType to decode content. Otherwise, it does not attempt to decode it. This is similar to a client looking at a type and refusing to parse the associated JSON body - it doesn't understand the type, so why do so?
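A minimal sketch of that dispatch in Python, with made-up decoder names (nothing here is taken from the draft):

# Hypothetical decoders; a real client would have one per media type it understands.
def decode_text(content: bytes) -> str:
    return content.decode("utf-8")

DECODERS = {
    "text/plain": decode_text,
    "text/markdown": decode_text,
}

def handle_single_part(content_type: str, content: bytes):
    decoder = DECODERS.get(content_type)
    if decoder is None:
        # Unrecognised media type: leave the bytes opaque rather than guessing.
        return None
    return decoder(content)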
Thus I don't believe using an encoding which requires a schema (Protobuf, TLS Presentation Language) is an issue.
CBOR, Protobuf and MessagePack all have types such as string, map and boolean defined. TLS Presentation Language does not define these, instead giving the user number, array, enum and struct types to play with. You end up with the same functionality, but the former are nicer to work with when designing a schema.
While not important for deserialising a binary format, a canonical representation is important for MIMI, as being able to verify the signature of an encoded message when it is received at the server level is a useful property. Encodings such as CBOR and MessagePack may struggle here, as the ordering of fields is not canonical (you'll end up with different binary representations of the same data). Protobuf's docs also call this out as a shortcoming, stating "You cannot compare two messages for equality without fully parsing them". Parsing messages before verifying that their contents came from the expected source is not efficient.
We ran into this problem in Matrix, which currently uses JSON, and solved it by defining Canonical JSON, which requires JSON fields to be sorted before sending them over the wire and when verifying them.
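For reference, a rough approximation of that canonicalisation in Python (Matrix's Canonical JSON also constrains numbers and encoding details beyond what is shown here):

import json

def canonical_json(obj) -> bytes:
    # Sorted keys, no insignificant whitespace, UTF-8 output.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False).encode("utf-8")

assert canonical_json({"b": 1, "a": 2}) == canonical_json({"a": 2, "b": 1})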
CBOR has a suggestion for canonicalisation in the spec, which we could mandate. The cbor-rust library has implemented support for it.
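From a library user's perspective, mandating that could look like the following sketch, assuming the Python cbor2 package and its canonical mode (the cbor-rust API will differ):

import cbor2  # third-party: pip install cbor2

a = cbor2.dumps({"b": 1, "a": 2}, canonical=True)
b = cbor2.dumps({"a": 2, "b": 1}, canonical=True)
assert a == b  # same map, same bytes, so a signature over the encoding is stable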
MessagePack appears to have support for canonical field ordering in some libraries, but not others.
The spec for TLS Presentation Language does not mandate field order, so library support (hah) for it may be spotty. But if we need to write libraries anyway, then we can mandate a canonical version.
This is easy enough to test with a benchmark, but I suspect we should consider other merits initially. I don't believe any of these would be significantly faster or slower than the other.
A note: I did start to write a micro-benchmark in Python. TLS Presentation Language was the hardest to support as it has no readily-available library!
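For what it's worth, the rough shape of such a micro-benchmark looks like this (cbor2 and msgpack are third-party packages, the payload is made up, and TLS Presentation Language is omitted for the reason above):

import json
import timeit

import cbor2    # pip install cbor2
import msgpack  # pip install msgpack

sample = {"messageId": "m1", "timestamp": 1_700_000_000, "body": "hello " * 100}

for name, encode in [
    ("json", lambda: json.dumps(sample).encode("utf-8")),
    ("cbor", lambda: cbor2.dumps(sample)),
    ("msgpack", lambda: msgpack.packb(sample)),
]:
    print(name, timeit.timeit(encode, number=10_000))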
If I remember correctly, one of the considerations is that it must be defined either by the WG itself or elsewhere in the IETF.
Ideally elsewhere, if I remember correctly. While MIMI is specifying protocols for interoperability, it could be considered out of scope to also define an encoding specification.
Good point!
I've updated my comment above :)
Thanks for putting this together! A few comments:
Forward compatibility: The way MLS solves this is by including a protocol version field in the highest-level message struct, as well as in the room state, which I think is what you suggest, too. If there should be a need to change structs independently of protocol versions, we could do this via per-room extensions, which would be part of what would be called a room version in Matrix terminology.
Availability of libraries: I can't speak for other languages, but there is the tls_codec crate that implements the encoding of Rust structs. It works quite well in OpenMLS and also (I believe) mls-rs.
Ergonomics: If we feel that it's useful, we should be able to extend the TLS presentation language to a certain degree. At least we've done so for MLS and it didn't turn out to be problematic.
Field order: The TLS presentation language does mandate field order. The way I read it, structs are just a convenient way of representing an (ordered) sequence of fields.
You already mentioned the requirement for fields to be ordered for signing and verification. At this point, I don't see an advantage in choosing something that is not ordered.
I don't have a strong preference for the TLS presentation language, but it's served us well so far in TLS and MLS, so I'm somewhat biased towards that option. Also, MIMI stacks will either already have TLS presentation language encoding/decoding in their stack, because they use MLS, or they will have it in their stack at some point when they make the switch from DR. If we find a better performing option without significant drawbacks, I'm happy to change my mind, though.
If I remember correctly, one of the considerations is that it must be defined either by the WG itself or elsewhere in the IETF.
It requires a "stable reference". Protobuf 3 probably does not qualify because it is still being changed. Protobuf 2 probably qualifies.
Cons of TLS: This is less an issue for the transport protocol than for the content format, but I was unable to locate a JavaScript TLS presentation language parser that seems to be actively maintained. The most obvious TLS library for JS, forge, still mentions Flash in its README and the last commit was in spring of 2022. Also, TLS unfortunately does not have typedefs, so when you define a type for reuse or clarity, you have to access it through the type.
Frankly, I think none of these 4 encodings is really focused on what we would need for a gatekeeper to gatekeeper implementation, which is parsing speed über alles. Protobuf base-128 encodes all its integers. CBOR shares its type and length in many cases in the same byte. These are efficiencies for compactness which are undesirable in the MIMI case.
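To make that concrete, here is a hand-rolled illustration of the two encodings (for clarity only, not a suggestion to implement them by hand):

def protobuf_varint(n: int) -> bytes:
    # Base-128 varint: 7 value bits per byte, high bit set on every byte but the last.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def cbor_small_uint(n: int) -> bytes:
    # CBOR packs unsigned integers 0..23 into the same byte as the major type.
    assert 0 <= n <= 23
    return bytes([n])  # major type 0, value in the low five bits

print(protobuf_varint(300))   # b'\xac\x02': two bytes, each needing bit twiddling to read
print(cbor_small_uint(10))    # b'\n': one byte carrying both type and value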
Thanks Andrew for sending out a nice summary.
Andrew said:
Do we care about the encoding scheme being defined by an existing RFC? If this is a blocker, why are we considering options other than TLS Presentation Language or CBOR? Similarly, if this is a blocker, it severely limits our options. What would we need to do to use, say, Protobufs? Would we need to get the encoding spec into an I-D?
If we have a good reason to use non-IETF specs, we can as long as there is a stable reference. The key here is the requirements.
Is a canonical representation of the data on the wire a hard requirement? Yes, I think so. I want to point out that a solution/compromise to this problem in CBOR is to define an array of the mandatory types (so those fields are always in order), and then a map of extensions (which could be empty). This allows safe extensions with only limited impact on speed (you only look for an extension when you need one). You can probably do this in TLS if you juggle the fields hard enough, but it is not especially natural.
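A sketch of that layout, again assuming the Python cbor2 package (field names and values are illustrative):

import cbor2  # pip install cbor2

# Mandatory fields occupy fixed array positions, so their order is always the same;
# extensions live in a trailing map that stays empty until an extension is actually used.
message = [
    "m1",            # messageId
    1_700_000_000,   # timestamp
    {},              # extensions
]
wire = cbor2.dumps(message, canonical=True)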
This is my first time posting in a MIMI discussion, so please bear with me if I write things that are wrong or irrelevant, but I would still like to share my opinion as a developer of distributed systems.
I would put a strong emphasis on using a serialization protocol that is defined by an RFC or another formal standard published by a recognized organism, which leaves CBOR and TLS presentation language (TLS-PL) as good options.
Given the scope of the task that MIMI aims for, I think we would all agree that making a protocol that can be upgraded and extended is a strong requirement. At first glance, CBOR looks like it would be a good option, as field names are serialized with the data, and new fields can therefore be added easily. However this is relatively dangerous and can break down when the interpretation of existing fields changes, leading to the following rule: the context in which a message should be interpreted must be specified outside of the message itself.

This can be done using protocol versions, but I believe this outcome can be achieved in a much more versatile way by standardizing individual message structures (the equivalent of protobuf schemas) and assigning IDs to these types in the RFCs, which would then be prepended to each individual message. Once a message type has been published with its corresponding schema and type identifier, it cannot be changed anymore, and updated/extended versions must be published as separate schemas with a different type identifier. This does not need to be coordinated on a global scale using protocol versions (only the attribution of non-overlapping type IDs has to be coordinated), and individual RFCs could be published with specific extensions or updated message types (with such a process, the RFC numbers could be embedded in type identifiers to ensure that they are non-overlapping).

Implementations would be required to still support older message types (unless they are specifically made obsolete by a new RFC), and their semantics can be upgraded on-the-fly to newer protocol semantics. This is the solution I've converged on in the software I develop, albeit for internal data storage formats with support for migration, and not for a public interchange protocol.
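A minimal sketch of that framing, with a made-up 32-bit type identifier and no opinion on how identifiers would actually be allocated:

import struct

TYPE_TEXT_MESSAGE_V1 = 0x0001_0001  # hypothetical identifier tied to one published, immutable schema

def frame(type_id: int, encoded_body: bytes) -> bytes:
    # The type identifier tells the receiver which schema to decode the body against.
    return struct.pack("!I", type_id) + encoded_body

def unframe(buf: bytes) -> tuple[int, bytes]:
    (type_id,) = struct.unpack("!I", buf[:4])
    return type_id, buf[4:]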
As a consequence of using exterior type identifiers to determine the semantics of encoded messages, the value proposition of CBOR, which encodes field names, is much less relevant, and TLS-PL becomes a compelling option as well. Since both are standardized, I believe they are relatively equivalent in their merits, and the remaining distinctions are relatively minor. I'd argue that CBOR might be a better option due to seemingly better library availability for many mainstream programming languages, and semantics that are quite similar to JSON, which is known by everyone, so it might be a better choice to ease the implementation of the MIMI standard and help it be deployed widely as soon as possible. While TLS-PL might have better compactness and encoding/decoding performance, I'm not sure this is a very relevant argument, given that CBOR is already orders of magnitude faster than JSON, and there will be many costs in other places (storage access times, network delays, conversion to other formats including JSON in the various proprietary client-server APIs, ...).
Concerning the need for a canonical representation in order to be able to compute and verify signatures, I feel that this whole mess is fundamentally caused by misuse of serialization and signing primitives, and could be totally avoided by storing the signature outside and next to the serialized message, in a second layer of serialization. A signed message would always be a tuple (signature, data bytes), serialized in any relevant way (potentially embedded in a larger struct with other data), where the data bytes would always be a byte slice of the signed message, serialized previously in a separate step, transmitted as a single unit next to its signature, and never modified. Not respecting this principle when building serialization formats is the cause of needless headaches such as needing to determine what a "canonical representation" means, and also exposes us to a higher risk of security flaws through misuse of cryptographic primitives.

Of course, there would be significant performance issues when doing this with text-based encoding formats such as JSON, but for binary encoding formats that support inline byte slices, this can be implemented very efficiently without the need for re-encoding steps such as base64 or excessive memory copies. In fact, this could even help achieve better performance, as sub-serialized fields would not need to be deserialized in all cases, and in particular not when interpreting their content is not needed.
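A sketch of that layering in Python, using HMAC as a stand-in for a real signature scheme and cbor2 as the outer serialisation (both are assumptions made purely for illustration):

import hmac
from hashlib import sha256

import cbor2  # pip install cbor2

KEY = b"example-shared-key"  # stand-in for real signing material

def sign_and_wrap(message: dict) -> bytes:
    inner = cbor2.dumps(message)                # serialise once; these bytes are never re-encoded
    sig = hmac.new(KEY, inner, sha256).digest()
    return cbor2.dumps([sig, inner])            # (signature, data bytes) travel together as opaque slices

def verify_and_unwrap(wire: bytes) -> dict:
    sig, inner = cbor2.loads(wire)
    expected = hmac.new(KEY, inner, sha256).digest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("signature check failed")
    return cbor2.loads(inner)                   # only deserialise after verification succeeds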
Hope this helps.
A comment on Protobuf: Some popular implementations do not fully support Protobuf 2, but only the subset of features that is also present in Protobuf 3, with groups being the most popular example. Also, Protobuf 2 support is considered deprecated in some implementations, so specifying the use of Protobuf 2 at this point seems weird.
Both CBOR and TLS-PL seem to be reasonable options to me.
The WG came to consensus on CBOR for MIMI-content, and I believe we have consensus to continue with TLS encoding for MIMI-protocol. @tgeoghegan?
I'm not sure we've had the transport debate yet as a working group. There's several open questions about whether HTTP, TLS-PL, etc are correct - the current stuff was presented as 'notional'.
Currently we use TLS-serialized structs.