w3c / did-core

W3C Decentralized Identifier Specification v1.0
https://www.w3.org/TR/did-core/
Other
395 stars 93 forks source link

Fix improperly URI Encoded values in DID parameters. #813

Open msporny opened 2 years ago

msporny commented 2 years ago

The DID parameter values that we had in the specification were never properly URI Encoded per the rules for query parameters in RFC3986 (which is what we normatively reference). This PR fixes that, and is an editorial change (because it only updates examples). We got the normative reference right, but flubbed on the DID parameter examples.


Preview | Diff

dlongley commented 2 years ago

@peacekeeper,

Per RFC 3986, wherever characters in a URI component would cause a conflict with other characters used as delimiters, the conflicting data must be percent-encoded. In the example you highlighted, we have a / character appearing inside of a query component, which conflicts with the use of / to delimit the path. Therefore, the / character inside of the query must be percent-encoded.

See Section 2.2, specifically (emphasis mine):

URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm. If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

See also: https://stackoverflow.com/a/6017928/399274

Now, regarding the #degree in the example, if it is not percent-encoded, then it must be interpreted as a reference into the DID document. If it is percent-encoded, then it must be interpreted as a reference into the external resource (that would be retrieved via a process involving the rest of the relativeRef value). Based on the naming of the fragment here, I think the intent was for it to be a reference into the external resource, so it should be percent-encoded.

peacekeeper commented 2 years ago

wherever characters in a URI component would cause a conflict with other characters used as delimiters, the conflicting data must be percent-encoded

Yes, but a / inside a query component does not cause a conflict, because in that component it is not used as a delimiter. Section 3.4. explicitly says that / is valid in the query component. And Section 2.1 says:

A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component.

In the case of query, the / is neither outside the allowed set, nor is it used as a delimiter of, or within, the component.

Now, regarding the #degree in the example, if it is not percent-encoded, then it must be interpreted as a reference into the DID document. If it is percent-encoded, then it must be interpreted as a reference into the external resource

I don't think that's how it works. The fragment is used as a reference to something that is part of, or related to, the result of dereferencing the primary resource (without fragment). That primary resource may be the DID document, or it may be something else, such as an external resource. The relativeRef parameter is only one example where DID URLs can be dereferenced to something other than the DID document. Another example is the resource parameter. It is not up to us to "redefine" how fragments in URIs work.

dlongley commented 2 years ago

Additionally, we would be asking for pain to deviate from common tooling here:

const params = new URLSearchParams({relativeRef: '/test'});
console.log(params.toString());

Prints:

'relativeRef=%2Ftest'

My sense is that, almost invariably, whenever people build query strings (especially using common tooling), they URI encode the component parts. This is inline with the interpretation of RFC 3986 I presented above.

TallTed commented 2 years ago

@dlongley -- I'm afraid that common tooling is in error.

A / is parsed as part of the path segment until the first un-encoded ?, which starts the query segment, after which point, / is no longer interpreted as part of the path segment.

Yes, these strings are confusing to humans, and when humans write software, because humans do a terrible job of internalizing the rules, they may persist their misunderstandings.

Please see 3.4. Query which explicitly states this --

      query       = *( pchar / "/" / "?" )

-- and incorporates these by reference --

      pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
      pct-encoded   = "%" HEXDIG HEXDIG
      unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
      sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                    / "*" / "+" / "," / ";" / "="

The un-encoded solidus / is valid within the query and fragment components, which follow the first un-encoded ? and #, respectively.

dlongley commented 2 years ago

@TallTed,

In re-reading the relevant sections of RFC3986 again, I think both the percent-encoded and non-percent-encoded versions of the example are legal. Of course, whether the fragment part is encoded or not will change its interpretation, specifically, whether or not it is part of the relativeRef value. I recommend we offer both examples to resolve this.

dlongley commented 2 years ago

As for the history here, this PR added relativeRef: https://github.com/w3c/did-core/issues/349

From Markus (emphasis mine):

"Since matrix parameters were removed from the DID URL syntax, one consequence is that an extra DID URL parameter is needed that can contain an encoded service endpoint path and query string."

The example there clearly encodes the / and # characters. That PR also links to my comment here with more exposition:

https://github.com/w3c/did-core/issues/159#issuecomment-598286814

peacekeeper commented 2 years ago

I agree with @dlongley that both an encoded and unencoded value of the relativeRef parameter would be valid. You're right that I have also used encoded examples a few times, see e.g.:

https://w3c-ccg.github.io/did-resolution/#example-5

And I can understand why it would be preferable to encode some characters such as / even if they don't have to be encoded. So I think I wouldn't strongly oppose making that change. I'm a bit worried however:

  1. That this might be a breaking change, since we only referenced RFC3986 Section 2.1 and didn't further specify exactly which characters need to be encoded: https://www.w3.org/TR/did-core/#did-parameters
  2. This might complicate things if we ever embed a DID URL inside e.g. an HTTP URL, then some characters may get double-percent-encoded, or there may be ambiguity on which level of the nested URLs a character was percent-encoded. But I guess that could be solved.

Regarding the fragment in the relativeRef case, I feel pretty strongly that this should not be percent-encoded, even though @dlongley is right that I also used examples of percent-encoded fragments in the past. The example I linked above at the beginning of this message shows this:

did:example:1234?service=files&relativeRef=%2Fmyresume%2Fdoc%3Fversion%3Dlatest#intro

The reason why I really prefer this is that processing of the fragment should happen after the primary dereferencing step. After processing service and relativeRef, the constructed service endpoint URL "inherits" the fragment, similar to how that works in HTTP redirects.

TallTed commented 2 years ago

Unfortunately, encoding characters when they shouldn't be shouldn't produce the same result as leaving them unencoded in that location. %2F is not the same as /. Dereferencing a URL containing a parameter value of %2Ftest (where the actual target is /test) should result in a 4xx. Tragically, these errors are common enough that many implementers will check whether parameter values seem to be encoded, and deliver the results that are obtained after decoding. Postel's robustness principle is the usual excuse. I think that even if delivering payloads on that basis, some feedback should be provided that the encoding is a wrongness that should be corrected in future requests for that resource.