[Draft] Include ':' and '@' in pchar definition for url encoding

fawind commented 3 months ago

Before this PR

Putting this up for potential discussion - not sure if we actually want to make this change.

Context: We ran into a case where requests from a Dialogue client were rejected by Google Container Registry (GCR) because Dialogue URL-encodes the colon : in path segments (e.g. sha256:c48bxxx becomes sha256%3Ac48bxxx), while GCR only accepts a non-encoded : in path segments.

Looking into Dialogue's URL encoding, I noticed that the implementation doesn't fully match the referenced RFC 3986. Most notably, Dialogue defines the pchar matcher as pchar = unreserved, while the RFC is more permissive and also includes sub-delims, :, and @:

pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
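To make the difference concrete, here is a minimal sketch of the three character sets being discussed. The class and method names are hypothetical and are not Dialogue's actual matcher code; `pct-encoded` is left out because it is a three-character sequence rather than a single character.

```java
/** Hypothetical illustration of the pchar variants; not Dialogue's real classes. */
final class PcharSketch {
    // RFC 3986 section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
    static boolean isUnreserved(int c) {
        return (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // RFC 3986 section 2.2: sub-delims
    static boolean isSubDelim(int c) {
        return "!$&'()*+,;=".indexOf(c) >= 0;
    }

    // Behaviour described in this PR as the status quo: pchar = unreserved
    static boolean isPcharCurrent(int c) {
        return isUnreserved(c);
    }

    // Behaviour proposed in this PR: unreserved / ":" / "@"
    static boolean isPcharProposed(int c) {
        return isUnreserved(c) || c == ':' || c == '@';
    }

    // Single-character part of the RFC definition: unreserved / sub-delims / ":" / "@"
    static boolean isPcharRfc(int c) {
        return isUnreserved(c) || isSubDelim(c) || c == ':' || c == '@';
    }

    public static void main(String[] args) {
        for (char c : ":@+/~".toCharArray()) {
            System.out.println(c + " current=" + isPcharCurrent(c)
                    + " proposed=" + isPcharProposed(c)
                    + " rfc=" + isPcharRfc(c));
        }
    }
}
```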

Note that there is another explicit divergence for query params, but that one is well documented and exists for compatibility reasons:

https://github.com/palantir/dialogue/blob/05ea07174aa8d9ea77cd09585f0852ea3554b354/dialogue-core/src/main/java/com/palantir/dialogue/core/BaseUrl.java#L247-L251

Unclear points:

- In the RFC, pchar also includes sub-delims. But given the comment above, it seems like we want to purposefully encode sub-delims?
- This logic has been around since early 2019 (PR). Given this never came up as an issue, maybe we don't feel like it's worth touching this code?
- This is just a spec, and it's hard for me to judge the impact of such a change across all the consumers.

After this PR

Extend the pchar matcher to also include : and @. This will result in those characters no longer being url encoded in path segments.
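A rough sketch of the observable effect, assuming a simplified byte-wise percent-encoder. `PathSegmentEncodingSketch`, `encodeSegment`, and the `keepColonAndAt` flag are hypothetical names used only for illustration and do not mirror Dialogue's internals.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical before/after comparison; not Dialogue's actual encoder. */
final class PathSegmentEncodingSketch {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~"
    static boolean isUnreserved(int c) {
        return (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // Percent-encodes every UTF-8 byte that is not allowed to appear literally.
    // keepColonAndAt=false models pchar = unreserved;
    // keepColonAndAt=true models pchar = unreserved / ":" / "@".
    static String encodeSegment(String segment, boolean keepColonAndAt) {
        StringBuilder out = new StringBuilder();
        for (byte b : segment.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean literal = isUnreserved(c) || (keepColonAndAt && (c == ':' || c == '@'));
            if (literal) {
                out.append((char) c);
            } else {
                out.append('%').append(HEX[c >> 4]).append(HEX[c & 0xF]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String digest = "sha256:c48bxxx";
        System.out.println(encodeSegment(digest, false)); // sha256%3Ac48bxxx
        System.out.println(encodeSegment(digest, true));  // sha256:c48bxxx
    }
}
```

Characters outside the extended set, such as / or +, are still percent-encoded in both modes of this sketch.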

==COMMIT_MSG== Include ':' and '@' in pchar definition for url encoding ==COMMIT_MSG==

Possible downsides?

schlosna commented 3 months ago

Also relevant: RFC 3986 Section 2.4, "When to Encode or Decode", which seems to indicate that percent-encoding the colon in the path component is valid, though not required, since the colon is not an unreserved character:

Under normal circumstances, the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts. This is when an implementation determines which of the reserved characters are to be used as subcomponent delimiters and which can be safely used as data. Once produced, a URI is always in its percent-encoded form.

When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters. The only exception is for percent-encoded octets corresponding to characters in the unreserved set, which can be decoded at any time. For example, the octet corresponding to the tilde ("~") character is often encoded as "%7E" by older URI processing implementations; the "%7E" can be replaced by "~" without changing its interpretation.

Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.
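The double-decoding hazard in the last paragraph can be illustrated with a small sketch. It uses the JDK's `java.net.URLDecoder`, which implements form decoding (it also turns + into a space) and is only a stand-in here for a URI percent-decoder; the class name `DoubleDecodeSketch` is hypothetical.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of why a string must not be decoded twice. */
final class DoubleDecodeSketch {
    // Stand-in percent-decoder; URLDecoder uses form semantics, which is
    // sufficient for inputs that contain no '+' characters.
    static String decodeOnce(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The data is the literal three-character string "%7E", so its
        // percent sign is carried on the wire as "%25".
        String onTheWire = "%257E";
        String once = decodeOnce(onTheWire); // "%7E": the intended data
        String twice = decodeOnce(once);     // "~": the data has been corrupted
        System.out.println(once);
        System.out.println(twice);
    }
}
```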

carterkozak commented 3 months ago

We attempt to aggressively encode parameters because not all server implementations implement the same spec (or do so correctly). For most known server implementations, this produces slightly more verbose, but less ambiguous results.

It's possible that this proposal wouldn't harm compatibility with known webservers, but that's difficult to know ahead of time.
