matrix-org / matrix-spec

The Matrix protocol specification
Apache License 2.0

New transport protocol (sunsetting JSON) #1736

Open Saiv46 opened 7 months ago

Saiv46 commented 7 months ago

Problem: The Client-Server API is not bandwidth-efficient enough. matrix-org/matrix-spec-proposals#3079 was supposed to help with that, but its scope is too broad and it has issues of its own. One of them is E2E: despite using CBOR as the data format, encrypted messages are still represented as Base64 strings.

Suggestion: I suggest allowing clients and servers to specify request and response media types as described in RFC 7231, Section 5.3.2. This does not affect the HTTP API itself and shouldn't require any changes to it.

Clients and servers must set the supported formats in the Accept and Content-Type headers accordingly.

Accept: application/msgpack, example/vnd.elementx, application/json
Content-Type: application/msgpack
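
The negotiation suggested above can be sketched roughly as follows. This is only an illustration, not part of any Matrix specification: the `SUPPORTED` list, the `negotiate` helper and the fallback to JSON are assumptions made for the example. The parser handles only `q` values and `*/*`, not partial wildcards such as `application/*`, and it treats the client's ordering as a tie-breaker, which RFC 7231 does not require.

```python
# Hypothetical list of encodings this server can produce, in server preference order.
SUPPORTED = ["application/msgpack", "application/json"]


def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    entries = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue  # skip empty elements, e.g. from an empty header
        q = 1.0  # q defaults to 1 when the parameter is absent
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not acceptable"
        entries.append((media_type, q))
    # sorted() is stable, so entries with equal q keep the client's order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def negotiate(accept_header, supported=SUPPORTED, default="application/json"):
    """Pick the response media type; fall back to JSON (an assumption)."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue  # q=0 means the client explicitly rejects this type
        if media_type == "*/*":
            return supported[0]
        if media_type in supported:
            return media_type
    return default


if __name__ == "__main__":
    header = "application/msgpack, example/vnd.elementx, application/json"
    print(negotiate(header))
```

A client that only understands JSON keeps working unchanged, because a missing or unmatched Accept header falls through to the JSON default in this sketch.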

In this example I'm using MessagePack instead of CBOR, because its encoding is less complex (and wasn't a product of IETF bikeshedding), so it would provide better performance with the same bandwidth efficiency.

After some time, a new proposal could be made to switch the underlying E2E data encoding to the most efficient one already in use, but only if MSC3079 proves useful enough to justify such changes.

richvdh commented 7 months ago

Related: https://github.com/matrix-org/matrix-spec/issues/1460

kegsay commented 6 months ago

I doubt JSON would ever be "sunsetted", it is too ubiquitous, human-readable and easy to use.

MessagePack and CBOR encodings are extremely similar, given they started out as one proposal before diverging. For "common" types they are effectively interchangeable. For bandwidth optimisation, both are as efficient as each other. Encode the following and you'll see they are both 131 bytes:

{
  "name": "Alice",
  "age": 42,
  "refs": [11, 65540, null, 4294967300],
  "friends": {
    "Bob": "AABBCCDDEEFFGG",
    "Carol": "HHIIJJKKLLMMNNOO"
  },
  "x": 0.1,
  "y": null,
  "valid": true
}

As for performance, I think it's not true to state "use MessagePack because its encoding is less complex, so it would provide better performance with the same bandwidth efficiency." Most benchmarks like https://prataprc.github.io/msgpack-vs-cbor.html show they are broadly equivalent, where most of the performance is due to the quality of the library rather than the protocol itself. Because there is so little to differentiate them, CBOR at least has a standardised specification as an RFC, whereas MessagePack only has a specification on GitHub.

The problem with saying "just use MessagePack/CBOR" is that it actually doesn't compress very well when you're dealing with string keys and string values, because the strings are basically left alone when encoding. There's a reason why MSC3079 uses enum integer keys. Look at those 131 bytes and you'll see a lot is due to key names, which you can compress significantly if you swap them out for integers. The longer the key name, the greater the benefit, and Matrix JSON objects often have long key names. You can't go all the way and omit the key names entirely à la Protocol Buffers either, because Matrix events need to be extensible (so you must be able to add keys that are not known ahead of time). Protocol Buffers does support extensible keys, but it's clunky (often just falling back to using JSON as a serialisation format) and you lose a lot of the things you would use Protocol Buffers for in the first place, like static type checking.
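
To make the key-name overhead concrete, here is a rough sketch that encodes the same object twice, once with string keys and once with integer keys. It rests on several assumptions: the encoder is a hand-written subset of CBOR (RFC 8949) covering only integers, text strings, lists, maps, booleans and null; the `event` is made up; and `KEY_IDS` is a hypothetical mapping, not the key table MSC3079 actually defines.

```python
import struct


def _head(major, n):
    """Encode a CBOR initial byte plus argument for major type `major`."""
    if n < 24:
        return bytes([major << 5 | n])
    if n < 1 << 8:
        return bytes([major << 5 | 24]) + struct.pack(">B", n)
    if n < 1 << 16:
        return bytes([major << 5 | 25]) + struct.pack(">H", n)
    if n < 1 << 32:
        return bytes([major << 5 | 26]) + struct.pack(">I", n)
    return bytes([major << 5 | 27]) + struct.pack(">Q", n)  # up to 2**64 - 1


def cbor_encode(value):
    """Encode a small subset of Python values as CBOR (no floats, no bytes)."""
    if value is None:
        return b"\xf6"
    if value is True:
        return b"\xf5"
    if value is False:
        return b"\xf4"
    if isinstance(value, int):
        return _head(0, value) if value >= 0 else _head(1, -1 - value)
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return _head(3, len(raw)) + raw  # the string bytes are copied as-is
    if isinstance(value, list):
        return _head(4, len(value)) + b"".join(cbor_encode(v) for v in value)
    if isinstance(value, dict):
        body = b"".join(cbor_encode(k) + cbor_encode(v) for k, v in value.items())
        return _head(5, len(value)) + body
    raise TypeError(f"unsupported type: {type(value).__name__}")


# Hypothetical key table for illustration only (not the one in MSC3079).
KEY_IDS = {"type": 1, "sender": 2, "origin_server_ts": 3,
           "content": 4, "msgtype": 5, "body": 6}


def compact_keys(value):
    """Swap known string keys for integers; unknown keys stay as strings."""
    if isinstance(value, dict):
        return {KEY_IDS.get(k, k): compact_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [compact_keys(v) for v in value]
    return value


# Made-up event, shaped loosely like a Matrix message event.
event = {
    "type": "m.room.message",
    "sender": "@alice:example.org",
    "origin_server_ts": 1700000000000,
    "content": {"msgtype": "m.text", "body": "hi"},
}

plain = cbor_encode(event)
compact = cbor_encode(compact_keys(event))

if __name__ == "__main__":
    print(len(plain), len(compact))
```

With small integer ids, each replaced key shrinks from one header byte plus its UTF-8 name to a single byte, so the saving grows with key length. Keys that are not in the table, such as custom event fields, still pass through as strings, which keeps the format extensible.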

tl;dr: converting from JSON to MessagePack won't get you as much as you think; you need additional tweaks for it to be worthwhile, which is why MSC3079 exists. You could feasibly split MSC3079 into "CBOR bits" and "everything else" and use the Accept header, which would definitely be a possible stepping stone towards low-bandwidth support in Matrix.