matrix-org / matrix-spec

The Matrix protocol specification
Apache License 2.0
188 stars 94 forks source link

Acceptable characters for MXC opaque identifiers should be more clearly described #503

Open joepie91 opened 5 years ago

joepie91 commented 5 years ago

From the spec:

Content locations are represented as Matrix Content (MXC) URIs. They look like:

mxc://<server-name>/<media-id>

<server-name> : The name of the homeserver where this content originated, e.g. matrix.org
<media-id> : An opaque ID which identifies the content.

However, it is not specified anywhere (as far as I can tell) what the valid character range is for these opaque media-ids. In particular, it doesn't specify whether the media-id may contain slashes - a detail that's quite semantically important for a lot of HTTP request routing implementations, which often treat a "URL parameter" as a string of any characters other than a slash, eg. as in /media/:id where /media/foo/bar would not match.

richvdh commented 5 years ago

This is mentioned under #174, and an attempt to answer it is in the as-yet-unimplemented MSC1597: https://github.com/matrix-org/matrix-doc/blob/rav/proposals/id_grammar/proposals/1597-id-grammar.md#opaque-ids.

joepie91 commented 5 years ago

Whoops, my bad - I didn't realize that that included MXC identifiers as well. I guess this one can be closed, then?

richvdh commented 5 years ago

It's probably useful to keep it open as a specific place to discuss media ids

uhoreg commented 4 years ago

matrix-org/matrix-spec-proposals#1597 (a.k.a. matrix-org/matrix-spec-proposals#1598) was merged as an MSC, so really this is just a spec-pr-missing

richvdh commented 3 years ago

matrix-org/matrix-spec-proposals#1597 (a.k.a. matrix-org/matrix-spec-proposals#1598) was merged as an MSC, so really this is just a spec-pr-missing

as noted over on matrix-org/matrix-spec-proposals#1598, the merge of that MSC dates to a time where MSCs could be merged before implementation, so although matrix-org/matrix-spec-proposals#1597 presents a proposal which seems to have some support, it's never been adopted by the ecosystem.

richvdh commented 3 years ago

some notes on this while I'm in the area:

turt2live commented 3 years ago

for additional context:

richvdh commented 3 years ago

@neilalexander reports that Dendrite's media ids are

Hex 64 characters I think

Which I assume means that they consist of the characters [0-9a-f].

It'd be interesting to confirm what other media repo impls (eg Conduit) use for their media IDs, but I'd be a bit surprised if they fell outside the range proposed by MSC1597, which, for the record, is:

must be strings consisting entirely of the characters [0-9a-zA-Z.=_-]. Their length must not exceed 255 characters and they must not be empty.

I wonder what we can do to progress this, or what's actually blocking us fixing it.

KitsuneRal commented 3 years ago

AIUI, enforcement might imply that homeservers reject requests to download media with a non-compliant id. That would make the particular media unreachable through normal means but at least clients would be exempted from checking for stray / or an unescaped #.

Ideally, of course, the grammar should be enforced when uploading media; but I have no good ideas here except client applications nagging users to bug homeserver owners if what the client app received for a media id is not compliant.

richvdh commented 3 years ago

per https://github.com/matrix-org/matrix-doc/pull/1597#discussion_r561097599, I think allowing = and forbidding ~ is an error.

turt2live commented 3 years ago

oh, and ftr mmr's IPFS media ID is ipfs:someotherid per https://github.com/matrix-org/matrix-doc/pull/2706

turt2live commented 7 months ago

It appears Ruma discovered the text in the spec well before we did:

https://github.com/ruma/ruma/blob/68c9bb0930f2195fa8672fbef9633ef62737df5d/crates/ruma-identifiers-validation/src/mxc_uri.rs#L20-L22

https://spec.matrix.org/v1.9/client-server-api/#security-considerations-5

[...] homeservers MUST sanitise mxc:// URIs by allowing only alphanumeric (A-Za-z0-9), _ and - characters in the server-name and media-id values.

Therefore, this is a clarification issue imo. In practice, Synapse clearly does not apply this sanitization, but aside from edge cases where folks are using |, :, etc characters the vast majority of implementations appear to be aligned on generating compatible IDs.

Extending the character set requires an MSC.

morguldir commented 7 months ago

For the record, original PR and jira issue here: https://github.com/matrix-org/matrix-spec-proposals/pull/103 https://matrix.org/jira/browse/SPEC-165

also: https://github.com/element-hq/synapse/issues/1323