w3c / activitypub

http://w3c.github.io/activitypub/

Recommend URI syntax normalization + scheme normalization for identifiers? (Also consider query component rules) #483

Open trwnh opened 5 hours ago

trwnh commented 5 hours ago

Related:

ActivityPub developers and implementers using HTTPS identifiers ought to be aware of the "normalization and comparison" considerations for HTTPS URIs.

For HTTPS scheme normalization, refer to RFC 9110 Section 4.2.3: https://datatracker.ietf.org/doc/html/rfc9110#section-4.2.3

For URI syntax normalization, refer to RFC 3986 Section 6: https://datatracker.ietf.org/doc/html/rfc3986#section-6

Some common considerations in imperative form

Considerations that do not exist at the URI/HTTPS level and must be addressed at the protocol level

Query component processing

Per https://datatracker.ietf.org/doc/html/rfc3986#section-3.4:

query         = *( pchar / "/" / "?" )
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

Query components are by default opaque. At the level of an HTTPS URI, the first unencoded ? delimits the query component, which ends only when encountering a # (delimiting the start of the fragment component) or the end of the URI.
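To make the delimiting rule concrete, here is a small Python sketch (the helper name `split_query` is made up for illustration). It treats the query as an opaque string: everything between the first `?` and the first `#`, or the end of the URI.

```python
def split_query(uri: str):
    """Split a URI into (before_query, query, fragment).

    query and fragment are None when the component is absent.
    The query is returned as-is, without any parameter parsing.
    """
    # The fragment starts at the first "#"; a "?" after that is fragment content.
    rest, hash_sep, fragment = uri.partition("#")
    # The query starts at the first "?"; any later "?" is part of the query.
    before, question_sep, query = rest.partition("?")
    return (
        before,
        query if question_sep else None,
        fragment if hash_sep else None,
    )


before, query, fragment = split_query("https://example.com/a?x=1?y=2#frag?z")
# query is "x=1?y=2" and fragment is "frag?z"
```

Note that the second `?` does not start a new component, and nothing in this step assigns meaning to `=` or `&`.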

Purely by convention, application servers commonly try to parse "query parameters" out of the query component of the URI. Arguably this is a misfeature and an antipattern, because the ordering of such parameters should have no bearing on the identity of the resource: /?foo=1&bar=2 is semantically equivalent to /?bar=2&foo=1 when used to extract request parameters, yet changing the order of the parameters produces a completely different identifier. Such "request parameters" should go on the request itself, not on the identifier. Unfortunately, the practice of using = and & to parse a query component as a series of request parameters is very widespread (although at some point around the era of HTML4 it was recommended that the delimiter between such "parameters" be ; instead of &).
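A quick Python sketch of the mismatch, using the standard library's `urllib.parse`. The helper names `same_identifier` and `same_parameters` are hypothetical, and the second one reflects the `=`/`&` convention only, not anything defined by RFC 3986.

```python
from urllib.parse import parse_qsl, urlsplit


def same_identifier(a: str, b: str) -> bool:
    """Compare two URIs as identifiers: plain string comparison."""
    return a == b


def same_parameters(a: str, b: str) -> bool:
    """Compare two URIs by the parameters a typical server would extract."""
    params_a = sorted(parse_qsl(urlsplit(a).query, keep_blank_values=True))
    params_b = sorted(parse_qsl(urlsplit(b).query, keep_blank_values=True))
    return params_a == params_b


a = "https://example.com/?foo=1&bar=2"
b = "https://example.com/?bar=2&foo=1"

print(same_identifier(a, b))  # False: different identifiers
print(same_parameters(a, b))  # True: same extracted request parameters
```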

ActivityPub should probably also warn about this or give guidance that query components in id are opaque and SHOULD NOT be parsed as parameters for the purposes of reference or equivalence.

If ActivityPub ever prescribes specific query parameter processing, then the ordering of such query parameters will need to be canonicalized with some kind of normalization algorithm.

At the very least, implementers who use the query component to encode request parameters SHOULD normalize/canonicalize the order of these parameters when normalizing/canonicalizing their URIs, before including those URIs as id on any object(s).
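One possible shape for such a canonicalization step, as a Python sketch. The function name `canonicalize_query_order` is hypothetical. The choice to sort by parameter name only, keeping repeated names in their original relative order, is an assumption of this sketch rather than anything ActivityPub specifies. It also assumes the URI has already gone through syntax normalization, so that names compare consistently.

```python
def canonicalize_query_order(uri: str, delimiter: str = "&") -> str:
    """Reorder delimiter-separated query parameters by name.

    Parameters are never decoded or re-encoded; only their order changes.
    Python's sort is stable, so repeated names keep their relative order.
    """
    rest, hash_sep, fragment = uri.partition("#")
    before, _, query = rest.partition("?")
    if not query:
        return uri  # no query, or an empty one: nothing to reorder
    params = sorted(
        query.split(delimiter),
        key=lambda param: param.partition("=")[0],  # sort on the name only
    )
    return before + "?" + delimiter.join(params) + hash_sep + fragment


# Both orderings map to the same id before it is placed on an object.
print(canonicalize_query_order("https://example.com/?foo=1&bar=2"))
print(canonicalize_query_order("https://example.com/?bar=2&foo=1"))
# both print https://example.com/?bar=2&foo=1
```

Sorting on the raw name avoids changing the percent-encoding of any parameter, so the step composes with syntax normalization instead of undoing it.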

Recommendations