Open joepie91 opened 5 years ago
This is mentioned under #174, and an attempt to answer it is in the as-yet-unimplemented MSC1597: https://github.com/matrix-org/matrix-doc/blob/rav/proposals/id_grammar/proposals/1597-id-grammar.md#opaque-ids.
Whoops, my bad - I didn't realize that that included MXC identifiers as well. I guess this one can be closed, then?
It's probably useful to keep it open as a specific place to discuss media ids
matrix-org/matrix-spec-proposals#1597 (a.k.a. matrix-org/matrix-spec-proposals#1598) was merged as an MSC, so really this is just a spec-pr-missing
as noted over on matrix-org/matrix-spec-proposals#1598, the merge of that MSC dates to a time where MSCs could be merged before implementation, so although matrix-org/matrix-spec-proposals#1597 presents a proposal which seems to have some support, it's never been adopted by the ecosystem.
some notes on this while I'm in the area:
`[a-zA-Z]` (https://github.com/matrix-org/synapse/blob/v1.25.0/synapse/rest/media/v1/media_repository.py#L160, https://github.com/matrix-org/synapse/blob/v1.25.0/synapse/util/stringutils.py#L36).
`/`, `#` or `?`: https://github.com/matrix-org/synapse/blob/v1.25.0/synapse/storage/databases/main/room.py#L663.
for additional context:
@neilalexander reports that Dendrite's media ids are "Hex 64 characters I think", which I assume means that they consist of the characters `[0-9a-f]`.
It'd be interesting to confirm what other media repo impls (eg Conduit) use for their media IDs, but I'd be a bit surprised if they fell outside the range proposed by MSC1597, which, for the record, is:
"must be strings consisting entirely of the characters `[0-9a-zA-Z.=_-]`. Their length must not exceed 255 characters and they must not be empty."
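For reference, the MSC1597 grammar quoted above is small enough to express as a single regular expression. The following is only a sketch of what a check could look like; the function name `is_msc1597_media_id` is made up for illustration and is not taken from any homeserver implementation.

```python
import re

# MSC1597 opaque-ID grammar as quoted above:
# 1 to 255 characters, each drawn from [0-9a-zA-Z.=_-].
MSC1597_OPAQUE_ID = re.compile(r"[0-9a-zA-Z.=_-]{1,255}")


def is_msc1597_media_id(media_id: str) -> bool:
    """Return True if media_id fits the MSC1597 opaque-ID grammar.

    Hypothetical helper, for illustration only.
    """
    # fullmatch() requires the whole string to match, so a trailing
    # newline or any stray character makes the check fail.
    return MSC1597_OPAQUE_ID.fullmatch(media_id) is not None
```

Under this check a 64-character hex id such as Dendrite's would pass, while an empty string or anything containing `/`, `#` or `?` would not.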
I wonder what we can do to progress this, or what's actually blocking us fixing it.
Is there any reason we can't simply add an explicit grammar for media IDs to the spec, along the lines of MSC1597's proposal? It doesn't seem like it would be a breaking change in practice. If we did so, should we do a new MSC, or are we happy with MSC1597 to stand in for it?
If the spec today did give a specific grammar for media IDs, what difference would that make in practice? Would the grammar actually be enforced anywhere, or would it simply be for reference? If it's not enforced, we run the risk of one implementation breaking another by using IDs outside the grammar, but what would enforcement look like?
AIUI, enforcement might imply that homeservers reject requests to download media with a non-compliant id. That would make the particular media unreachable through normal means but at least clients would be exempted from checking for stray `/` or an unescaped `#`.
Ideally, of course, the grammar should be enforced when uploading media; but I have no good ideas here except client applications nagging users to bug homeserver owners if what the client app received for a media id is not compliant.
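As a rough illustration of the client-side check described above, a client could split the `mxc://` URI it gets back from an upload and test the media id against whatever grammar ends up being specified. The sketch below assumes the MSC1597 character set; `split_mxc` and `uploaded_id_is_compliant` are hypothetical names, not part of any SDK.

```python
import re
from typing import Tuple

# Assumption: the grammar being enforced is the one proposed in MSC1597.
COMPLIANT_ID = re.compile(r"[0-9a-zA-Z.=_-]{1,255}")


def split_mxc(uri: str) -> Tuple[str, str]:
    """Split mxc://<server-name>/<media-id> into (server_name, media_id)."""
    prefix = "mxc://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an mxc:// URI: {uri!r}")
    # Split on the first slash only, so that any later slash stays part
    # of the media id and is caught by the grammar check below.
    server_name, sep, media_id = uri[len(prefix):].partition("/")
    if not server_name or not sep:
        raise ValueError(f"malformed mxc:// URI: {uri!r}")
    return server_name, media_id


def uploaded_id_is_compliant(content_uri: str) -> bool:
    """Return True if the media id in content_uri fits the assumed grammar."""
    _, media_id = split_mxc(content_uri)
    return COMPLIANT_ID.fullmatch(media_id) is not None
```

A client that gets `False` here can still use the URI; the check only tells it that the homeserver handed out an id outside the grammar.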
per https://github.com/matrix-org/matrix-doc/pull/1597#discussion_r561097599, I think allowing `=` and forbidding `~` is an error.
oh, and ftr mmr's IPFS media ID is `ipfs:someotherid` per https://github.com/matrix-org/matrix-doc/pull/2706
It appears Ruma discovered the text in the spec well before we did:
https://spec.matrix.org/v1.9/client-server-api/#security-considerations-5
"[...] homeservers MUST sanitise `mxc://` URIs by allowing only alphanumeric (`A-Za-z0-9`), `_` and `-` characters in the `server-name` and `media-id` values."
Therefore, this is a clarification issue imo. In practice, Synapse clearly does not apply this sanitization, but aside from edge cases where folks are using `|`, `:`, etc. characters, the vast majority of implementations appear to be aligned on generating compatible IDs.
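To make the difference between the two character sets concrete, here is a small sketch comparing the spec's sanitisation rule with the MSC1597 proposal, applied to the media id only. The quoted spec text gives no length limit, so the "at least one character" requirement on the spec pattern is an assumption, and `classify` is a made-up helper name.

```python
import re
from typing import Tuple

# Character set from the spec's security considerations, as quoted above.
# Assumption: at least one character is required; the quoted text does not
# state any length bounds.
SPEC_SANITISED = re.compile(r"[A-Za-z0-9_-]+")

# Character set and length bounds proposed by MSC1597.
MSC1597 = re.compile(r"[0-9a-zA-Z.=_-]{1,255}")


def classify(media_id: str) -> Tuple[bool, bool]:
    """Return (allowed_by_spec_rule, allowed_by_msc1597) for a media id."""
    return (
        SPEC_SANITISED.fullmatch(media_id) is not None,
        MSC1597.fullmatch(media_id) is not None,
    )
```

With these patterns, an id containing `.` or `=` is accepted by MSC1597 but not by the spec rule, and ids using `|` or `:` (such as `ipfs:someotherid`) are rejected by both.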
Extending the character set requires an MSC.
For the record, original PR and jira issue here: https://github.com/matrix-org/matrix-spec-proposals/pull/103 https://matrix.org/jira/browse/SPEC-165
From the spec:
However, it is not specified anywhere (as far as I can tell) what the valid character range is for these opaque `media-id`s. In particular, it doesn't specify whether the `media-id` may contain slashes - a detail that's quite semantically important for a lot of HTTP request routing implementations, which often treat a "URL parameter" as a string of any characters other than a slash, eg. as in `/media/:id` where `/media/foo/bar` would not match.
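The routing concern can be sketched with a regular expression that mimics how many routers translate a `:id` parameter, namely as "one or more characters other than a slash". This is not the code of any particular router; `ROUTE` and `match_media_route` are illustrative names.

```python
import re
from typing import Optional

# Assumed translation of the route pattern /media/:id, where the parameter
# matches one or more characters other than a slash.
ROUTE = re.compile(r"/media/(?P<id>[^/]+)")


def match_media_route(path: str) -> Optional[str]:
    """Return the captured id if path matches /media/:id, else None."""
    m = ROUTE.fullmatch(path)
    return m.group("id") if m else None
```

Here `/media/foo` yields `foo`, while `/media/foo/bar` does not match at all, so a media id containing a slash would never reach the handler under such a route.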