matrix-org / matrix-spec

The Matrix protocol specification
Apache License 2.0

[MSC stub] Tagging binary data #816

Open ShadowJonathan opened 3 years ago

ShadowJonathan commented 3 years ago

(Dumping my thoughts here from a discussion I had with someone else in a room.)

Proposal

This idea is for a Matrix-"canonical" way of tagging binary data in a native-but-compatible form.

Currently, in JSON you'd encode binary data as base64, but in CBOR you might want it to be "raw binary" (as that spec allows).

For JSON, then, you can use an object with exactly one key, where the key starts with _ and the rest of the key names the data encoding.

Example formats could be:

_b32 (base 32)
_b64 (base 64)
_0x (hexadecimal)
_0o (octal)
_0b (binary)
_u (as-is UTF-8 data, but recognised as binary)

This formatting method is (partially) inspired by Python-esque and Rust-esque literal prefixes and suffixes (123i32, 0xDEADBEEF, u"hi, this is unicode").

This would look as follows.

Take the following abstract object:

{
  "key": <binary>
}

This could then be JSON-encoded into:

{
  "key": {"_b64": "<base64 encoded binary>"}
}
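As a rough illustration of the wrapping above, here is a minimal Python sketch. The names `CODECS`, `wrap_binary`, and `unwrap_binary` are hypothetical and not part of any Matrix API, and `_0o` and `_0b` are left out because the idea does not pin down their exact text form.

```python
import base64
import json

# Hypothetical codec table: tag -> (bytes to text, text to bytes).
CODECS = {
    "_b64": (lambda b: base64.b64encode(b).decode("ascii"),
             lambda s: base64.b64decode(s, validate=True)),
    "_b32": (lambda b: base64.b32encode(b).decode("ascii"),
             base64.b32decode),
    "_0x": (lambda b: b.hex(), bytes.fromhex),
    # "_u" carries the bytes as-is, so they must already be valid UTF-8.
    "_u": (lambda b: b.decode("utf-8"), lambda s: s.encode("utf-8")),
}


def wrap_binary(data: bytes, tag: str = "_b64") -> dict:
    """Wrap raw bytes in a single-key object whose key names the encoding."""
    encode, _ = CODECS[tag]
    return {tag: encode(data)}


def unwrap_binary(obj: dict) -> bytes:
    """Recover raw bytes from a single-key tagged object."""
    if not isinstance(obj, dict) or len(obj) != 1:
        raise ValueError("expected an object with exactly one key")
    ((tag, text),) = obj.items()
    if tag not in CODECS:
        raise ValueError(f"unknown binary tag: {tag!r}")
    _, decode = CODECS[tag]
    return decode(text)


print(json.dumps({"key": wrap_binary(b"\x00\xff")}))
# prints {"key": {"_b64": "AP8="}}
```

A deserializer built this way only needs to look for single-key objects whose key starts with `_`; everything else in the document is left untouched.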

The encoder could select whichever encoding makes the most sense: maybe the data packs more easily as b64, maybe as b32, maybe as hexadecimal, or maybe it just needs to be UTF-8.
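One possible selection rule, sketched in Python, is to try every candidate encoding and keep the one whose JSON form is shortest. This heuristic is an assumption made for illustration (the idea itself does not say how an encoder should choose), and `candidate_encodings` and `pick_shortest` are hypothetical names.

```python
import base64
import json


def candidate_encodings(data: bytes) -> dict:
    """Return every tagged text form that can represent the given bytes."""
    candidates = {
        "_b64": base64.b64encode(data).decode("ascii"),
        "_b32": base64.b32encode(data).decode("ascii"),
        "_0x": data.hex(),
    }
    try:
        # "_u" is only an option when the bytes are valid UTF-8.
        candidates["_u"] = data.decode("utf-8")
    except UnicodeDecodeError:
        pass
    return candidates


def pick_shortest(data: bytes) -> dict:
    """Pick the tagged form with the shortest serialized JSON (assumed heuristic)."""
    candidates = candidate_encodings(data)
    # Measuring the serialized length accounts for JSON escaping, e.g. of
    # control characters that would otherwise make "_u" look cheap.
    tag = min(candidates, key=lambda t: len(json.dumps({t: candidates[t]})))
    return {tag: candidates[tag]}


print(pick_shortest(b"hello"))  # prints {'_u': 'hello'}
print(pick_shortest(b"\xff"))   # prints {'_0x': 'ff'}
```

Other rules are equally plausible, for example always using `_b64` for simplicity, since any decoder has to accept every tag anyway.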

CBOR and some other formats might represent this data as "native" binary.
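For comparison, CBOR has a dedicated byte string type (major type 2), so no text encoding or wrapper object is needed. The sketch below hand-rolls just that one item type in Python to show the shape of the output; it is not a full CBOR encoder, and `cbor_byte_string` is a hypothetical helper name.

```python
import struct


def cbor_byte_string(data: bytes) -> bytes:
    """Encode bytes as a definite-length CBOR byte string (major type 2)."""
    major = 2 << 5  # major type 2 lives in the top three bits: 0x40
    n = len(data)
    if n < 24:
        head = bytes([major | n])            # length fits in the initial byte
    elif n < 0x100:
        head = bytes([major | 24, n])        # 1-byte length follows
    elif n < 0x10000:
        head = bytes([major | 25]) + struct.pack(">H", n)  # 2-byte length
    elif n < 0x100000000:
        head = bytes([major | 26]) + struct.pack(">I", n)  # 4-byte length
    else:
        head = bytes([major | 27]) + struct.pack(">Q", n)  # 8-byte length
    return head + data


print(cbor_byte_string(b"\x01\x02\x03\x04").hex())  # prints 4401020304
```

Four bytes of payload cost one byte of overhead here, whereas the JSON form needs the wrapper object plus the text encoding.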

Reasoning

The reason for this canonical binary format is to allow a "Binary" type in the spec. This could open it up to a whole array of functionality, whether of interest to the Foundation or to outside parties, where binary blobs could be encoded in an easily formattable way. If this is "canonicalised", deserializers in any programming language that wishes to interface with Matrix could pick up on it in a uniform way.

The different encodings allow different "efficient" ways of encoding the data, and the object wrapping gives a key-"value" abstraction that keeps serializing and deserializing efficient across a myriad of programming languages.

Potential problems

On its surface, this does not preserve "roundtrip" information: a deserialized object with binary data might be serialized differently. If roundtrip information is to be preserved, languages would have to add an extra "tag" to the binary data that records which encoding was used when it was read. This is only required when the application, at that moment, has an interest in preserving this information; if the data is only intended to be consumed, it could be read "lossily".
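The extra "tag" could be sketched in Python as a small value type that remembers the encoding it was read with. Everything here (`TaggedBytes`, `read_tagged`, `write_tagged`, the `keep_tag` flag, and the `_b64` default) is a hypothetical design for illustration, limited to two encodings for brevity.

```python
import base64
from dataclasses import dataclass
from typing import Optional

DECODERS = {"_b64": base64.b64decode, "_0x": bytes.fromhex}
ENCODERS = {
    "_b64": lambda b: base64.b64encode(b).decode("ascii"),
    "_0x": lambda b: b.hex(),
}


@dataclass(frozen=True)
class TaggedBytes:
    data: bytes
    # Encoding seen when the value was read; None means it was read "lossily".
    source_tag: Optional[str] = None


def read_tagged(obj: dict, keep_tag: bool = True) -> TaggedBytes:
    """Decode a single-key tagged object, optionally remembering its tag."""
    ((tag, text),) = obj.items()
    return TaggedBytes(DECODERS[tag](text), tag if keep_tag else None)


def write_tagged(value: TaggedBytes, default_tag: str = "_b64") -> dict:
    """Re-encode with the remembered tag, or the default if none was kept."""
    tag = value.source_tag or default_tag
    return {tag: ENCODERS[tag](value.data)}


original = {"_0x": "deadbeef"}
print(write_tagged(read_tagged(original)))                  # prints {'_0x': 'deadbeef'}
print(write_tagged(read_tagged(original, keep_tag=False)))  # prints {'_b64': '3q2+7w=='}
```

Note that this preserves only the choice of encoding, not textual details: `{"_0x": "DEADBEEF"}` would still be written back in lowercase.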

ShadowJonathan commented 3 years ago

Note: I'm making this a spec idea instead of a full-fledged MSC because I think this is a solution without a use case. While it is nicely thought out, I don't particularly know where it could fit in the current spec, or what purpose it would serve.

Still, I'd like to publish the idea.