Grammar for completely opaque IDs (SPEC-388)

matrixbot commented 8 years ago

"Grammar" might be too strong a word, but we should probably make explicit that the following IDs are entirely implementation-specific byte sequences. The originators are allowed to create them however they like, and the recipient has to send them back as they arrived.

~~Call IDs (as exposed in m.call... events)~~ fixed by MSC2746; now specced at https://spec.matrix.org/v1.7/client-server-api/#grammar-for-voip-ids
Filter IDs (as returned by POST /user/$id/filter)
Media IDs (from mxc:// URIs) (https://github.com/matrix-org/matrix-spec/issues/503)
Session IDs (as used in the UIA API)
Transaction IDs (as used in /send and other transactional PUT endpoints)
Device IDs (as used in the device API and others)
Message IDs (as used in the store-and-forward messaging server API)
Key IDs (as used in the federation protocol)

(Imported from https://matrix.org/jira/browse/SPEC-388)

(Reported by @richvdh)

matrixbot commented 8 years ago

Jira watchers: @richvdh

matrixbot commented 8 years ago

Links exported from Jira:

relates to SPEC-1

matrixbot commented 8 years ago

Hrm; there are encoding difficulties here.

Some of these IDs end up in JSON strings, which means that they must be interpreted as a sequence of unicode characters - they are not just byte sequences. Likewise, because our URIs are %-encoded UTF-8, having opaque byte sequences in our URIs would require part of a URI to be parsed as UTF-8, and part as 8-bit data, which most URI parsers would not be happy with.

As I see it there are two options here:

Allow any unicode characters in these IDs, which puts the onus on recipients to correctly handle unicode characters - for instance, a client would need to parse UTF-16 \uXXXX sequences in the JSON response to POST /user/$id/filter, and then encode it as %-encoded UTF-8 in subsequent URI parameters.
Restrict to a common set of ASCII, which puts the onus on originators to make sure that they aren't generating other characters.
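To make the first option concrete, here is a minimal sketch of what a client would have to do with a filter ID that contains non-ASCII characters. The function name `filter_path` is hypothetical, the `/user/.../filter/...` path is abbreviated in the same way as the endpoint above, and the `filter_id` response key is assumed from the filter API.

```python
import json
from urllib.parse import quote


def filter_path(user_id: str, response_body: str) -> str:
    """Build a filter URI path from the raw JSON body of a filter-creation response.

    Hypothetical helper: the path layout is abbreviated and the "filter_id"
    key is an assumption about the response shape.
    """
    # json.loads decodes \uXXXX escapes (including surrogate pairs) into a str.
    filter_id = json.loads(response_body)["filter_id"]
    # quote() encodes the str as UTF-8 and %-escapes every byte outside the
    # unreserved set; safe="" means "/" is escaped as well.
    return "/user/" + quote(user_id, safe="") + "/filter/" + quote(filter_id, safe="")


print(filter_path("@alice:example.org", '{"filter_id": "caf\\u00e9/1"}'))
```

Every client and every hacky script would need both of those steps to be right, which is the burden the second option avoids.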

Postel's law should guide us here. My inclination is to restrict these IDs to unreserved URI characters (ie, [A-Za-z0-9._~-]: see RFC3986) - but also to recommend that, if you receive such an ID, you parse it as a unicode string and re-encode it correctly when sending it on. This has the advantage that if you're writing a hacky bash script, you don't need to worry about escaping at all, whilst those creating IDs can still use base-64 to encode whatever they want.
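As a rough sketch of the restriction, the check and a matching generator might look like the following. The names `is_valid_opaque_id` and `new_opaque_id` are hypothetical, and rejecting the empty string is my assumption rather than anything decided here. Note that standard base-64 emits `+`, `/` and `=`, none of which are unreserved, so this sketch assumes the URL-safe alphabet with the padding stripped.

```python
import base64
import os
import re

# Unreserved URI characters from RFC 3986, section 2.3.
UNRESERVED = re.compile(r"[A-Za-z0-9._~-]+")


def is_valid_opaque_id(value: str) -> bool:
    """Return True if value is non-empty and uses only unreserved URI characters."""
    return UNRESERVED.fullmatch(value) is not None


def new_opaque_id(num_bytes: int = 16) -> str:
    """Generate an ID from random bytes using unpadded URL-safe base-64."""
    raw = os.urandom(num_bytes)
    # urlsafe_b64encode uses "-" and "_" in place of "+" and "/";
    # the "=" padding is stripped because it is not an unreserved character.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


print(is_valid_opaque_id(new_opaque_id()))
print(is_valid_opaque_id("a/b"))
```

An ID that passes this check can be dropped into a URI path or a JSON string unchanged, with no escaping needed in either.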

-- @richvdh

matrixbot commented 8 years ago

* is used as a wildcard for device id, so must be forbidden as a device id.

-- @richvdh

richvdh commented 3 years ago

Since the links are hard to find above:

Proposals:

MSC1597 contains our current best proposal for this in general
MSC2746 includes proposals for call IDs

Other tracking issues:

https://github.com/matrix-org/matrix-doc/issues/1514 is a parent epic issue for all things Grammar
https://github.com/matrix-org/matrix-doc/issues/2177 focusses specifically on media IDs
matrix-org/matrix-spec#593 focusses specifically on Filter IDs
https://github.com/matrix-org/matrix-doc/issues/2568 focusses specifically on access tokens

matrix-org / matrix-spec

Grammar for completely opaque IDs (SPEC-388) #174