package-url / purl-spec

A minimal specification for purl aka. a package "mostly universal" URL, join the discussion at https://gitter.im/package-url/Lobby
https://github.com/package-url/purl-spec
Other
673 stars 157 forks source link

Percent encoding spec and : and / #39

Open jdillon opened 6 years ago

jdillon commented 6 years ago

The notes in the specification about percent encoding of ":" are a bit confusing:

the '#', '?', '@' and ':' characters must NOT be encoded when used as separators. They may need to be encoded elsewhere
the ':' type separator does not need to and must NOT be encoded. It is unambiguous unencoded everywhere

It seems like these 2 contradict either other. The former indicates that ":" may need to be encoded elsewhere. The later indicates that ":" is "unambiguous unencoded everywhere".

Similarly the qualifier component documentation says:

A value must be must be a percent-encoded string

... and does not mention anything about "/". But the test-suite-data.json references canonical_purl representations like repository_url=repo.spring.io/release.

And the percent-encoding docs state:

the '/' used as namespace/name and subpath segments separator does not need to and must NOT be percent-encoded. It is unambiguous unencoded everywhere

My interpretation of this boils down to... "/" is never encoded, but the language in the parts of the specification are unclear. The name, namespace and subpath parts are clear wrt to "/", which leaves the qualifier and version bits as vague as to if "/" is supposed to be percent-encoded or not.

stevespringett commented 6 years ago

With reference to:

pkg:docker/customer/dockerimage@sha256:244fd47e07d1004f0aed9c?repository_url=gcr.io

It is my opinion that the testsuite in the .NET, Python, and Java implementations as well as in the spec repo are incorrect and violate the spec. Whereas the testsuite in the Rust implementation adheres to the spec (as written).

Spec states:

Therefore, the canonicalized purl should be changed to:

pkg:docker/customer/dockerimage@sha256%3A244fd47e07d1004f0aed9c?repository_url=gcr.io

We need clarification on this.

stevespringett commented 6 years ago

With reference to:

pkg:Maven/org.apache.xmlgraphics/batik-anim@1.9.1?repositorY_url=repo.spring.io/release&classifier=sources

It is my opinion that the testsuite in the .NET, Python, and Java implementations as well as in the spec repo are incorrect and violate the spec. Whereas the testsuite in the Rust implementation adheres to the spec (as written).

Spec states:

Therefore, the canonicalized purl should be changed to:

pkg:maven/org.apache.xmlgraphics/batik-anim@1.9.1?classifier=sources&repository_url=repo.spring.io%2Frelease

We need clarification on this.

stevespringett commented 6 years ago

Based on my understanding of the spec, there are a total of three test cases that violate the spec as its currently written. The Rust implementation has went ahead and changed its test cases, which I'm in agreement with. But we need clarification as I've run into a situation with the Java implementation which requires additional encoding, but the additional encoding breaks the above test cases. However, adopting the testcase changes made by the Rust impl makes the Java impl happy as well.

stevespringett commented 5 years ago

@pombredanne this ticket should be resolved prior to 1.0 #50

gernot-h commented 5 years ago

While I also stumbled hard about this and also would welcome a clearer wording, I see no contradiction here and would consider current testcases correct.

First statement summarizes where the #', '?', '@' and ':' characters MUST NOT be encoded, but it makes no definite statement about where they MUST be encoded. This is further clarified in the following statement which clearly says that it is "unambiguous unencoded everywhere".

To keep purls readable for humans, I think encoding shall be avoided where possible. Think about Debian versions - something like pkg:deb/debian/file@1%3A5.30-1%2Bdeb9u2?arch=amd64 looks quite unreadable to me.

pombredanne commented 5 years ago

@stevespringett re

Therefore, the canonicalized purl should be changed to: pkg:maven/org.apache.xmlgraphics/batik-anim@1.9.1?classifier=sources&repository_url=repo.spring.io%2Frelease We need clarification on this.

let's update the tests to match the spec and ...

@gernot-h very good point. In fact on common packages such as scoped npms that have a namespace that has an @ sign prefix the encoding was added because it was encoded in the public npm registry. But this is no longer the case and this is not needed as there are never any ambiguities there. And your debian example is also telling. So let's devise the bare absolute minimum encoding that would be needed such that Package URLs stay non ambiguous and are the most readable in the majority of the cases.

gernot-h commented 5 years ago

@pombredanne Great, anything I can do to help sorting this out? Prepare some pull requests? (I will be on vacation for the next 1.5 weeks, but be interested to help afterwards)

Regarding the confusion about the two statements regarding ":", what about rephrasing the first statement to:

From:

the '#', '?', '@' and ':' characters must NOT be encoded when used as separators. They may need to be encoded elsewhere

To:

the '#', '?', '@' and ':' characters must NOT be encoded when used as separators. Some of them need to be encoded elsewhere as specified in the rules below.

matt-phylum commented 1 year ago

The WHATWG URL spec has a great section on percent encoding that defines a set of character classes and their exact encoding rules. It would be great if the PURL spec defined its percent encoding rules in relation to those. For example, "the PURL path percent-encode set is the WHATWG URL path percent-encode set and U+0040 (@)".

michaelbprice commented 1 year ago

My interpretation of the rules when putting together a PR (#245) for a vcpkg purl type, was that any value in a key=value qualifier should be percent-encoded, and the tests I've added there reflect that decision. It tests that after parsing, the resulting values are not percent-encoded. IMO, readability of purl strings themselves are secondary to unambiguous and fast parsing by tools (which then might have a presentation layer that helps users understand the data).

matt-phylum commented 12 months ago

This comment is about qualifiers only. The PURL spec is fairly clear (IMO) that colons should not be encoded in version numbers, but could be more explicit.

This ticket was originally about qualifiers and I've found a potentially serious problem with the way qualifiers are encoded and decoded.

Neither PURL nor the WHATWG or RFC URL specs say that : or / need to be encoded when used in a query string. Any character is allowed to be encoded, so technically it's allowable, even if not canonical, and the qualifier should parse correctly either way.

However, PURL qualifiers look suspiciously like x-www-form-urlencoded parameters, and people may be tempted to use the x-www-form-urlencoded rules for encoding and decoding qualifiers. The WHATWG x-www-form-url-encoded spec does not require them to be encoded when parsing, but says that : and / do need to be encoded when serializing (as well as many other characters that the PURL spec never mentions like ().

If the set of encoded characters were the only difference, it would be fine if some implementations were using x-www-form-urlencoded and some were not, but there is one critical difference with x-www-form-urlencoded: + is replaced with ` during parsing, so it must be encoded as%2Bwhen serializing. The current PURL spec never says anything about special handling of+`. If some PURL implementations implement qualifiers using x-www-form-urlencoded and some implement qualifiers exactly as written in the current PURL spec, the implementations will interoperate just fine most of the time, but for certain cases the meaning of the qualifier changes when passing from one implementation to the other.

Besides needing to be more explicit about what characters need to be encoded in what parts of the PURL, to avoid PURLs that have different meanings to different implementations, the spec needs to clearly specify whether the qualifiers are x-www-form-urlencoded or if they are a simpler format that is like x-www-form-urlencoded but not.

jdalton commented 3 weeks ago

FYI package-url/packageurl-js has taken the following approach in their v2.0.0 release. (From comment https://github.com/package-url/packageurl-js/issues/43#issuecomment-2289941728):

This is handled in https://github.com/package-url/packageurl-js/pull/73 by using URLSearchParams to encode and then turning + into %20 for better portability. I sided with the Rust implementation.

Also leveraging standard URLSearchParams. Deferring to standard encoders like URLSearchParams and encodeURIComponent for base encoding and then applying tweaks allows for less chances of mistakes (I trust standard implementations over myself).

matt-phylum commented 3 weeks ago

It's inconsistent in packageurl-js 2.0.0. Plus signs parse as plus signs, except for in qualifier values where they parse as spaces.

jdalton commented 3 weeks ago

@matt-phylum

Plus signs parse as plus signs, except for in qualifier values where they parse as spaces.

URLSearchParams (docs) are qualifier specific. I was referring to how qualifiers are encoded. I'm saying that in v2+ we encode the slashes in the qualifier as Rust does. We also follow Rust in that we do NOT encode spaces as +. The JS implementation is aligned with Rust on encoding of slashes as %2F and spaces as %20 in encoded qualifiers.