Open fawind opened 3 months ago
Also relevant is RFC 3986, Section 2.4 ("When to Encode or Decode"), which seems to indicate that percent-encoding the colon in a path component is valid, though not required, as the colon is not an unreserved character:
Under normal circumstances, the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts. This is when an implementation determines which of the reserved characters are to be used as subcomponent delimiters and which can be safely used as data. Once produced, a URI is always in its percent-encoded form.
When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters. The only exception is for percent-encoded octets corresponding to characters in the unreserved set, which can be decoded at any time. For example, the octet corresponding to the tilde ("~") character is often encoded as "%7E" by older URI processing implementations; the "%7E" can be replaced by "~" without changing its interpretation.
Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.
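The two hazards in the quoted passage can be made concrete with a small sketch. This is not Dialogue code: `PercentDecodingSketch`, `decodeOnce`, and `splitThenDecode` are hypothetical names for illustration, and the decoder assumes well-formed ASCII input in which every `%` is followed by two hex digits.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical helper, not part of Dialogue or any library.
final class PercentDecodingSketch {

    // Decodes each %XX triplet exactly once; every other byte passes through.
    // Assumes well-formed input (each '%' is followed by two hex digits).
    static String decodeOnce(String s) {
        byte[] in = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '%') {
                String hex = new String(in, i + 1, 2, StandardCharsets.US_ASCII);
                out.write(Integer.parseInt(hex, 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write(in[i]);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Separates the path into segments first, then decodes each segment,
    // so an encoded slash (%2F) stays data instead of becoming a delimiter.
    static List<String> splitThenDecode(String rawPath) {
        List<String> segments = new ArrayList<>();
        for (String raw : rawPath.split("/")) {
            segments.add(decodeOnce(raw));
        }
        return segments;
    }

    public static void main(String[] args) {
        // Parse before decoding: two segments, the first containing a literal slash.
        System.out.println(splitThenDecode("a%2Fb/c"));
        // Decoding before parsing loses that distinction: three segments.
        System.out.println(List.of(decodeOnce("a%2Fb/c").split("/")));
        // Decoding twice turns the data "%41" into "A".
        System.out.println(decodeOnce("%2541"));
        System.out.println(decodeOnce(decodeOnce("%2541")));
    }
}
```

In this sketch, splitting first yields the segments `a/b` and `c`, whereas decoding first yields `a`, `b`, and `c`. Likewise, a single decode of `%2541` gives the literal data `%41`, while a second decode misreads it as `A`.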
We attempt to aggressively encode parameters because not all server implementations implement the same spec (or do so correctly). For most known server implementations, this produces slightly more verbose, but less ambiguous results.
It's possible that this proposal wouldn't harm compatibility with known webservers, but that's difficult to know ahead of time.
Before this PR
Putting this up for potential discussion - not sure if we actually want to make this change.
Context: We ran into a case where requests from a Dialogue client get rejected by google-container-registry because Dialogue URL-encodes the colon `:` in path segments (e.g. `sha256:c48bxxx`), while GCR only accepts a non-encoded `:` in path segments.
Looking into Dialogue's URL encoding, I noticed that Dialogue's implementation doesn't fully match the referenced RFC 3986. Most notably, Dialogue defines the pchar matcher as `pchar = unreserved`, while the RFC is a bit more permissive here and also includes sub-delims, `:`, and `@`.
Note that we have another explicit divergence for query params, but this one is well documented and exists for compatibility reasons:
https://github.com/palantir/dialogue/blob/05ea07174aa8d9ea77cd09585f0852ea3554b354/dialogue-core/src/main/java/com/palantir/dialogue/core/BaseUrl.java#L247-L251
Unclear points:
`pchar` also includes sub-delims. But given the comment above, it seems like we want to purposefully encode sub-delims?
After this PR
Extend the pchar matcher to also include `:` and `@`. This will result in those characters no longer being URL-encoded in path segments.
==COMMIT_MSG==
Include ':' and '@' in pchar definition for url encoding
==COMMIT_MSG==
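To illustrate the effect of the change, here is a minimal sketch of a path-segment encoder with a configurable set of extra safe characters. It is not Dialogue's actual implementation: `PathSegmentEncodingSketch`, `encodeSegment`, and the `extraSafe` parameter are hypothetical names for illustration. Passing an empty string models the current `pchar = unreserved` behavior, and passing `":@"` models the proposed behavior.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: hypothetical encoder, not Dialogue's BaseUrl code.
final class PathSegmentEncodingSketch {

    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // RFC 3986: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
    static boolean isUnreserved(int b) {
        return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') || (b >= '0' && b <= '9')
                || b == '-' || b == '.' || b == '_' || b == '~';
    }

    // Percent-encodes every UTF-8 byte that is neither unreserved nor
    // listed in extraSafe (extraSafe is assumed to contain ASCII only).
    static String encodeSegment(String segment, String extraSafe) {
        StringBuilder sb = new StringBuilder();
        for (byte raw : segment.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xFF;
            boolean safe = isUnreserved(b) || (b < 0x80 && extraSafe.indexOf(b) >= 0);
            if (safe) {
                sb.append((char) b);
            } else {
                sb.append('%').append(HEX[b >> 4]).append(HEX[b & 0xF]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Current behavior (unreserved only): the colon is encoded.
        System.out.println(encodeSegment("sha256:c48bxxx", ""));
        // Proposed behavior (':' and '@' allowed): the colon is left as-is.
        System.out.println(encodeSegment("sha256:c48bxxx", ":@"));
        // Sub-delims such as '+' and the delimiter '/' are still encoded.
        System.out.println(encodeSegment("a+b/c", ":@"));
    }
}
```

In this sketch, `sha256:c48bxxx` encodes to `sha256%3Ac48bxxx` with the unreserved-only set and stays unchanged once `:` and `@` are allowed. Sub-delims remain encoded in both cases, which matches the intent to keep encoding them aggressively.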
Possible downsides?