ietf-wg-mimi / draft-ietf-mimi-content


MIMI Message Syntax: Matrix's Extensible Events #5

Open anoadragon453 opened 1 year ago

anoadragon453 commented 1 year ago

As a whole, the semantics laid out in the adopted content format draft look sound to me. At this point I'd like to propose a syntax for how messages could concretely look.

"Matrix's Extensible Events" is one possible syntax. It is a content format we've been developing as part of the Matrix Foundation C.I.C., and are actively looking to move the Matrix network over to in the medium term. The format is flexible, with re-usable and nestable pieces and supports clear fallback semantics.

Below is a quick overview of Matrix's Extensible Events and some examples mapping the current draft's content format semantics to it. I will be using JSON to represent these structures, but one could easily use any message format that supports similar semantics to JSON (such as CBOR (RFC 8949), MessagePack, Protobuf, etc.).

Overview of Matrix's Extensible Events

A typical "extensible event" looks like this:

{
    "type": "m.some.type",
    "content": {
        <content blocks>
    },
    "sender": "@andrewm:example.org",
    "event_id": "$ct3Os0acx2nUKFTmbedS8MhPT1W1J8c2cMWyWEI0ADw",
    "room_id": "!cEXuEjziVcCzbxbqmN:example.org"
}

Content Blocks

A "content block" is identified by an identifier and some renderable data. For example, this is a message of type m.message with a m.text content block:

{
    "type": "m.message",
    "content": {
        "m.text": [
            {"body": "<i>Hello World!</i>", "mimetype": "text/html"},
            {"body": "Hello world!"}
        ]
    }
}

Content blocks are simply the top-level keys in the content field. Their values can be of any type that is legal in a message generally (string, integer, object, etc.).

The m.text content block is defined as an ordered array of representations, each pairing a string body with an optional MIME type and an optional language code, that together represent a single marked-up blob of text. The mimetype is optional and defaults to text/plain;charset=utf-8. The m.message type requires an m.text content block, and defines no optional content blocks.
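
For instance, an m.text block carrying an HTML body, a plain-text body and a translated alternative might look like the sketch below. Note that the lang field name is an assumption made here for illustration; the examples in this post only show body and mimetype.

{
    "m.text": [
        // "lang" is a hypothetical field name, used for illustration only.
        {"body": "<i>Hallo Welt!</i>", "mimetype": "text/html", "lang": "de"},
        {"body": "Hallo Welt!", "lang": "de"},
        {"body": "Hello world!"}
    ]
}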

A useful property of defining the m.message message type and the m.text content block separately is that the latter can be reused by other message types. It can also be used to specify a rendering fallback for message types that some clients may not understand.

Rendering & Fallback

When a client receives a message, it first looks at the type of the message. If it is a type that the client recognises (i.e. it supports a version of MIMI that defines the type, or it is a custom vendor type it knows), then it can continue to look for the content blocks that the type defines should be rendered.

If the client does not recognise the type, then it should proceed to look at the content blocks that the message contains and attempt to associate them with a message type it does know.

That sums up our fallback semantics. For example, given the following message with a custom type:

{
    "type": "com.large-vendor.location-pin",
    "content": {
        "com.large-vendor.location": {
            "uri": "geo:51.5008,0.1247;u=35",
            "description": "My current location",
            "zoom_level": 15
        },
        "m.text": [{"body": "My current location is geo:51.5008,0.1247;u=35"}]
    },
    // extra fields omitted...
}

This message has a com.large-vendor.location content block in it, which contains a location defined by a geo: URI (RFC 5870). The name of the content block is namespaced to the vendor, allowing other vendors/users to experiment with similar content block designs without clients misinterpreting one for another.

Imagine you have a room with two clients in it: Large Vendor App (LVA) and Other Large Vendor App (OLVA). LVA will recognise the com.large-vendor.location-pin message type and render it accordingly, while OLVA may not recognise that type. In that case, OLVA should fall back to the other available content blocks and attempt to render them as a message type that it does know. The priority ordering of types to try is up to the implementation.

Other Large Vendor App knows that m.message requires a single m.text content block, so it will render this message as an m.message with the text "My current location is geo:51.5008,0.1247;u=35". These fallback semantics allow vendors, hobbyists and anyone else to send messages while remaining confident that other clients in the room will still be able to understand their communication.
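
In other words, OLVA would effectively render the message as though it had been sent as the following (a sketch, with the extra fields again omitted):

{
    "type": "m.message",
    "content": {
        "m.text": [{"body": "My current location is geo:51.5008,0.1247;u=35"}]
    },
    // extra fields omitted...
}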

Nesting

Along with other similarities to a NestablePart, content blocks can be nested inside each other.

The m.image message type defines a required m.file content block, a required m.text block (for fallback), and optional m.thumbnail and m.caption blocks, among others:

{
  "type": "m.image",
  "content": {
    "m.text": [ // required - fallback for text-only clients
      {"body": "my-dog.png (127 KB) https://example.org/img/my-dog.png"}
    ],
    "m.file": { // required - the image itself
      "url": "https://example.org/img/my-dog.png",
      "name": "my-dog.png",
      "mimetype": "image/png",
      "size": 130530
    },
    "m.thumbnail": [ // optional
      {
        // A thumbnail is an m.file+m.image, or a small image
        "m.file": {
          "url": "https://example.org/thumb/my-dog.png",
          "mimetype": "image/jpeg",
          "size": 1702
        },
        // thumbnail image dimensions omitted.
      }
    ],
    "m.caption": { // optional - goes above/below image
      "m.text": [{"body": "And this is the 568th photo of my dog!"}]
    }
    // other optional content blocks (alt text, image dimensions, etc.) omitted.
  }
}

As m.text is quite a versatile content block, we can use it in many areas, even within other content blocks. The same is done for m.file, which is used by m.thumbnail to point to a smaller-resolution version of the image.

These content blocks should not be used just anywhere, however: message types must define the required and optional content blocks that clients should expect when rendering them. There is, though, another kind of content block that can be used at the top level of any message type, called a mixin.

Mixin Content Blocks

A "mixin" is a class of content block that can be added to any message type. An example of a mixin is a m.automated content block - for which the presence of such denotes that the message was sent by an automation rather than by a user directly. Slack, for example, uses this information to display an APP label next to bot messages.

Another type of mixin is m.thread - informing the client that a given message took place in a thread. This saves message types from all needing to define m.thread as an optional content block. Instead, it can be applied to any message type.
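
As a rough sketch, a threaded bot reply could then look like the following. The exact shapes of the mixin blocks are assumptions made here for illustration; this post does not pin them down.

{
    "type": "m.message",
    "content": {
        "m.text": [{"body": "Build passed!"}],
        // The shapes of the two mixin blocks below are assumptions, for illustration only.
        "m.automated": {},
        "m.thread": {"message_id": "thread_root_message_id@domain"}
    }
}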

References

There are several scenarios where a message needs to reference another message. Replies, reactions and deletions are a few examples. An m.references content block is an array of zero or more "references", each containing the type of reference and the ID of another message.

For instance, to delete a message a client would send the following:

{
    "type": "m.delete",
    "content": {
        "m.references": [
            {
                "type": "m.delete",
                "message_id": "message_id@domain"
            }
        ]
    }
}

Receiving clients would interpret this to mean that the past message with ID message_id@domain should be removed from the timeline.


Using this system, we can translate all of the currently defined functionality (reactions, expiring, VoIP, etc.) to concrete representations.
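
For example, a reaction might look roughly like the following. Both the m.reaction type and the m.reaction content block holding the emoji are hypothetical names used here for illustration; only the m.references shape follows the deletion example above.

{
    "type": "m.reaction",
    "content": {
        // "m.reaction" (the type and the content block) are hypothetical names.
        "m.reaction": {"key": "👍"},
        "m.references": [
            {
                "type": "m.reaction",
                "message_id": "message_id@domain"
            }
        ]
    }
}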

We've already proposed a number of definitions for various message types (listed in the introduction of MSC1767), which can serve as a starting point.

The semantics are very similar to the current content format draft. The main difference is that the receiving implementation is in full control of which content blocks it decides to process and in which order. This was done deliberately to allow for the fallback semantics, while still hinting at what content block should be rendered via the defined message type.

mar-v-in commented 1 year ago

Not sure if this is the right discussion venue for this, but I'll put my feedback here anyway ;)

The event_id field indicates the message ID, which is unique across all rooms. This is analogous to the MessageId field in the current draft.

Except that the current draft's messageId consists of two parts: a domain-unique message id and the domain.

The type field states the type of the message. This tells the receiver what kind of content (in content) to expect. This is not quite analogous to the current draft, but is similar to a combination of the "PartSemantics" and "Disposition" mechanisms we currently define.

Except that the part semantics / disposition are per part and thus allow a message to contain arbitrarily many contents (via singleUnit/processAll semantics). It also allows for much more flexibility, as e.g. a file can be marked for inline display. (Looking at MSC1767, it seems to be impossible to create a new type that falls back to what type=m.image does, because m.image is not a content block, the behavior of m.file does not ask for the file to be displayed inline, and there is no disposition field to indicate that.)

When a client receives a message, it first looks at the type of the message. If it is a type that the client recognise [...] then they can continue to look for the content blocks that type defines should be rendered. If the client does not recognise the type, then it should proceed to look at the content blocks that the message contains and attempt to associate them with a message type it does know.

So essentially, the type doesn't really matter, because the receiving client will look into the content blocks nonetheless and pick any that it does support (e.g. in your example, a client that understands the com.large-vendor.location content block will render that, while others will render the text, regardless of what the type is).

The semantics are very similar to the current content format draft. The main difference is that the receiving implementation is in full control of which content blocks it decides to process and in which order. This was done deliberately to allow for the fallback semantics, while still hinting at what content block should be rendered via the defined message type.

The current draft specifies that the order of entries in the chooseOne semantics indicates the sender's preference over the different parts, but that the recipient is allowed to pick any of them. This means the receiving entity remains in full control, while the sender's ordering essentially gives you the feature you provide through the message type.

In summary: the current draft seems to be more powerful than the MSC1767 syntax and more targeted at the use case of MIMI.

As for the format, I would suggest going for something like EXI rather than JSON/CBOR/MessagePack/...: it is easily extensible, supports binary data, is generally efficient in encoding, has a reasonable plain-text representation (it can be converted from/to XML losslessly), and also has a mechanism for string lookup tables. The latter could become relevant, as with the multipart system we might see the same longer strings (like URLs) appear in multiple parts within the message.

anoadragon453 commented 1 year ago

The event_id field indicates the message ID, which is unique across all rooms. This is analogous to the MessageId field in the current draft.

Except that the current draft's messageId consists of two parts: a domain-unique message id and the domain.

Yes, I had conflated the ID of an event at the transport layer (what the "Event ID" in my proposal referred to) with a message ID inside the MLS application message. Depending on how the transport layer develops, these two IDs may end up being the same value, but for now that's up in the air (blocked on the transport side).

In Matrix the Event ID is derived from a hash of some of the fields in the message, which include the domain name. Thus the domain name need not be included directly in the ID.

Regardless, that change should be discussed separately. I'll update the post to mention the difference as pointed out.

The type field states the type of the message. This tells the receiver what kind of content (in content) to expect. This is not quite analogous to the current draft, but is similar to a combination of the "PartSemantics" and "Disposition" mechanisms we currently define.

It also allows for much more flexibility as e.g. a file can be marked for inline display

Just to make sure I understand this correctly, would this translate to e.g. an ODT file being rendered inside the timeline of the room? That might be a case of the receiving client deciding whether to do so for an m.file based on the mimetype?

(looking at the MSC1767 it seems to be impossible to create a new type that fallbacks to what type=m.image does, because m.image is not a content block, the behavior of m.file does not ask for the file to be displayed inline and there is no disposition field to indicate that)

If all the necessary content blocks of an m.image type are present (m.file, m.text), then ideally you would be able to fall back. The problem I see is that there is already an m.file there being used for the video. One way to solve this is to use an m.image_details content block with an m.file nested inside of it, roughly as sketched below; then you would not have a conflict between two top-level m.file content blocks. Still, that conflict feels like a valid limitation. The current draft gets around this by nesting everything.
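
Roughly, the idea would look like the sketch below. Everything here is hypothetical and for illustration only: the com.large-vendor.video-with-preview type, the m.image_details block and the field values are all invented to show the nesting.

{
    "type": "com.large-vendor.video-with-preview",
    "content": {
        "m.file": { // the video itself
            "url": "https://example.org/vid/my-dog.mp4",
            "mimetype": "video/mp4"
        },
        "m.image_details": { // hypothetical block wrapping the preview image
            "m.file": {
                "url": "https://example.org/img/my-dog.png",
                "mimetype": "image/png"
            }
        },
        "m.text": [{"body": "my-dog.mp4 https://example.org/vid/my-dog.mp4"}]
    }
}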

When a client receives a message, it first looks at the type of the message. If it is a type that the client recognise [...] then they can continue to look for the content blocks that type defines should be rendered. If the client does not recognise the type, then it should proceed to look at the content blocks that the message contains and attempt to associate them with a message type it does know.

So essentially, the type doesn't really matter, because the receiving client will look into the content blocks nonetheless and pick any that it does support (e.g. in your example, a client that understands com.large-vendor.location content block will render that, others will render the text - even if type was any)

Types do help in the case where you have two or more separate events which share the same set of required content blocks, but which you'd want to render differently. For instance, the types m.video, m.image, m.audio and m.file all only require the m.file and m.text content blocks. The type informs the receiving client what the sender intended to have rendered. Of course, if a client didn't understand m.audio, it could fall back to m.file (just displaying a file download) or even just m.text.

Now m.file does have an (optional) mimetype field, and a receiving client could use this instead to determine whether to show a video player/audio player/etc. This is essentially the route that the current draft proposal takes. But it does mean the client needs to use heuristics such as checking against a list of mimetypes in order to know what content to render, rather than an explicit m.video vs. m.audio type. We'd need to make similar exceptions (with different fields to check) for other groups of types that require the same content blocks.
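
As a sketch of the difference (the field values are illustrative assumptions), the same pair of content blocks could appear under m.audio or m.file; only the type tells the receiver to show an audio player rather than a plain download, without inspecting the mimetype:

{
    "type": "m.audio", // change to "m.file" and only the rendering intent changes
    "content": {
        "m.file": {
            "url": "https://example.org/audio/voice-note.ogg",
            "mimetype": "audio/ogg",
            "size": 52100
        },
        "m.text": [{"body": "voice-note.ogg https://example.org/audio/voice-note.ogg"}]
    }
}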

It's also important to note that it's up to the receiving client to decide which known types it would and would not fall back to. For instance, I suspect clients will not want to fall back to a poll type.

The semantics are very similar to the current content format draft. The main difference is that the receiving implementation is in full control of which content blocks it decides to process and in which order. This was done deliberately to allow for the fallback semantics, while still hinting at what content block should be rendered via the defined message type.

The current draft specifies that the order of entries in the chooseOne semantics indicates the sender's preference over the different parts, but that the recipient is allowed to pick any of them. This means the receiving entity remains in full control, while the sender's ordering essentially gives you the feature you provide through the message type.

Good point! Looking at this again, I don't think there's any inherent benefit over the current draft in terms of priority of fallbacks in an event.

As for the format, I would suggest going for something like EXI rather than JSON/CBOR/MessagePack/...: it is easily extensible, supports binary data, is generally efficient in encoding, has a reasonable plain-text representation (it can be converted from/to XML losslessly), and also has a mechanism for string lookup tables. The latter could become relevant, as with the multipart system we might see the same longer strings (like URLs) appear in multiple parts within the message.

Interesting, thank you. I've not heard of this format before - I'll need to take a look!

mar-v-in commented 1 year ago

Just to make sure I understand this correctly, would this translate to e.g. an ODT file being rendered inside the timeline of the room? That might be a case of the receiving client deciding whether to do so for an m.file based on the mimetype?

Types do help in the case where you have two or more separate events which share the same set of required content blocks, but which you'd want to render differently. For instance, the types m.video, m.image, m.audio and m.file all only require the m.file and m.text content blocks. The type informs the receiving client what the sender intended to have rendered. Of course, if a client didn't understand m.audio, it could fall back to m.file (just displaying a file download) or even just m.text.

Now m.file does have an (optional) mimetype field, and a receiving client could use this instead to determine whether to show a video player/audio player/etc. This is essentially the route that the current draft proposal takes. But it does mean the client needs to use heuristics such as checking against a list of mimetypes in order to know what content to render, rather than an explicit m.video vs. m.audio type. We'd need to make similar exceptions (with different fields to check) for other groups of types that require the same content blocks.

I fail to follow your reasoning here: