package-url / purl-spec

A minimal specification for purl aka. a package "mostly universal" URL, join the discussion at https://gitter.im/package-url/Lobby
https://github.com/package-url/purl-spec
Other
699 stars 161 forks source link

Add new spec for `go` package URLs #338

Open maceonthompson opened 3 weeks ago

maceonthompson commented 3 weeks ago

The current PURL specification for Go was created before Go 1.11 modules and thus has namespace inconsistencies and lacks semantic versioning.

Although in many cases a module path corresponds directly to the URL of the hosting repository, that is not always true. The URL formed from the module path may be an endpoint that serves a redirect to the true host. This indirection protects projects that for whatever reason must change their hosting provider: their module names will continue to work. Consequently, it is undesirable to encode any aspect of the underlying hosting system as part of the PURL.

In essence, all Go modules form a single namespace. Since it is used by the majority of Go programmers, we propose to represent this namespace by the empty string. Though not included in this commit, other namespaces could be possible and would represent package managers and/or build tools that are alternatives to the go command.

The go type proposed here fixes the current issues by removing the namespace, using valid Go module versions (including pseudoversions), and adds some extra functionality to encode optional information about specific builds (GOOS, GOARCH, etc).

If accepted, all tools maintained by the Go project (such as govulncheck and pkg.go.dev) that surface PURLs will use this new type to provide canonical PURLs for Go modules and packages

maceonthompson commented 3 weeks ago

See also #196 #294 #308

Thanks for pointing at these! This is essentially a combination of #196 and #308 (with the addition of qualifiers for build info). They go into more detail than this proposal, but especially in the case of namespaces https://github.com/package-url/purl-spec/issues/63#issuecomment-1276924737 is a good example as to why dropping name in favor of an entirely coded namespace would be more useful. I understand that having a bunch of %2F in the PURL is ugly for humans, but is (we feel) necessary to ensure that go PURLs are consistent (which is to say that go module -> PURL is injective, a go module cannot be represented by different PURLs).

Say you have a module with the path host.com/maybeuser/module. With the current type definition, both pkg:golang/host.com/maybeuser/module and pkg:golang/host.com/maybeuser%2Fmodule, could represent that module. In order for PURLs to canonically and uniquely define go modules in the way that they are defined on pkg.go.dev or the go module proxy, they must be unique as well.

matt-phylum commented 3 weeks ago

Say you have a module with the path host.com/maybeuser/module. With the current type definition, both pkg:golang/host.com/maybeuser/module and pkg:golang/host.com/maybeuser%2Fmodule, could represent that module. In order for PURLs to canonically and uniquely define go modules in the way that they are defined on pkg.go.dev or the go module proxy, they must be unique as well.

I think the better solution to this problem is that pkg:golang/host.com/maybeuser%2Fmodule stays illegal. It'd be better if the documentation explicitly stated it were illegal, but based on the examples and test cases the correct form is pkg:golang/host.com/maybeuser/module, and based on the reference parsing and formatting algorithms it's clear that these PURLs are distinct.

However, "a go module cannot be represented by different PURLs" is not generally the case:

jkowalleck commented 2 weeks ago

This is a breaking change that affects all software utilizing PURL for Go.

I'd disagree. In fact, it is non-breaking, as it adds a completely new purl type. Therefore, no breaking changes are introduced.

matt-phylum commented 2 weeks ago

It is breaking because no existing PURL software expects pkg:go, and new PURL software will not expect pkg:golang. This creates a compatibility problem where either the PURL is rejected as an unrecognized type or software on different sides of the breakage don't understand each other. If this is merged, all software that works with Go PURLs will need to be updated to accept both types of Go PURL and convert before they interoperate again.

jkowalleck commented 2 weeks ago

It is breaking because no existing PURL software expects pkg:go [...]

this is true to every newly proposed PURL Type :-) And none of them is a breaking change - neither in spec nor in behaviour.

this PR is trying to add a new type go. the existing golang is not touched at all.

matt-phylum commented 2 weeks ago

The problem is that this is not a new type. The go type is intended to replace golang.

jkowalleck commented 2 weeks ago

The problem is that this is not a new type.

it is not? Could you point me to the existing go type?

The go type is intended to replace golang.

I wonder how you come to this conclusion. this very PR adds a new type, it does neither obsolete nor deprecate the existing golang type.

matt-phylum commented 2 weeks ago

I wonder how you come to this conclusion. this very PR adds a new type, it does neither obsolete nor deprecate the existing golang type.

From the PR description:

If accepted, all tools maintained by the Go project (such as govulncheck and pkg.go.dev) that surface PURLs will use this new type to provide canonical PURLs for Go modules and packages

golang is the type currently used for Go modules and packages. For example: https://github.com/anchore/syft/blob/3c070e0ad9d69c0f2191be52e2f2fb4904bcd558/syft/pkg/cataloger/golang/package_test.go#L24 . This PR is introducing a second, more preferred type for the same purpose.

jkowalleck commented 2 weeks ago

I wonder how you come to this conclusion. this very PR adds a new type, it does neither obsolete nor deprecate the existing golang type.

From the PR description:

If accepted, all tools maintained by the Go project (such as govulncheck and pkg.go.dev) that surface PURLs will use this new type to provide canonical PURLs for Go modules and packages

which is a behavioural change in a downstream application. This is out of scope of this spec, and not in our hands at all - we have no authority there.

golang is the type currently used for Go modules and packages. For example: https://github.com/anchore/syft/blob/3c070e0ad9d69c0f2191be52e2f2fb4904bcd558/syft/pkg/cataloger/golang/package_test.go#L24 . This PR is introducing a second, more preferred type for the same purpose.

exactly this paragraph makes it clear: this is a non-breaking change.

Causing no breaking change is the whole point of introducing a new purl type, instead of modifying an exising one.

jkowalleck commented 2 weeks ago

I'm sure it can be fixed without making this level of breaking change.

i don't think so. #308 makes this clear: the existing spec has flaws that require breaking changes to fix them

The only way to fix golang is

matt-phylum commented 2 weeks ago

Introducing a new type for an existing type is a breaking change to the PURL ecosystem. Implementations that use golang can continue to use golang and their golang PURLs will still be golang PURLs, but PURL has no negotiation mechanism where all the software that's going to read the PURLs agrees with the software that writes the PURLs on whether to use go or golang to describe Go dependencies.

If you start writing SBOMs that have go, they will be processed incorrectly by software that doesn't support go. If you continue writing SBOMs that have golang, they will be processed incorrectly by software that doesn't support golang. If you combine SBOMs using software that doesn't understand that go and golang are really the same type, the dependencies will be duplicated in the output. If you query go or golang packages against a vulnerability database, you have a 50/50 chance of finding the vulnerabilities unless the database understands both and converts golang to go.

Keeping golang is incompatible with the "a go module cannot be represented by different PURLs" goal of this PR.

You cannot just fix a PURL type by introducing a new type. Even if PURL libraries are updated to support transparently upgrading the old type into the new type on read, any software that is comparing pre-canonicalized PURL strings will need updates.

the existing spec has flaws that require breaking changes to fix them

What are the flaws that require breaking changes? #308 is about the path being incorrectly converted to lowercase, which is much more easily fixed by just not doing that.

jkowalleck commented 2 weeks ago

Introducing a new type for an existing type is a breaking change to the PURL ecosystem.

how?

If a tool that produced purls would change it's behaviour by using the new purl-type, where they've used the other one before - this would be a breaking change in that very tool. This is out of the scope of the purl spec -- we do not have authority there.

Implementations that use golang can continue to use golang and their golang PURLs will still be golang PURLs, but PURL has no negotiation mechanism where all the software that's going to read the PURLs agrees with the software that writes the PURLs on whether to use go or golang to describe Go dependencies.

So? This is true to every purl type that is added over time. An implementation written 2 years ago might not know the purl type that was defined yesterday. This is by design and was never an issue. This is out of the scope of the purl spec -- we do not have authority there.

Keeping golang is incompatible with the "a go module cannot be represented by different PURLs" goal of this PR.

A PR tells a story, and the effective patch gets updated along with the discussions on a PR. the initial PR description is usually not updated in accordance with the effective patch.

(PS: I review the content of the PR. and at the time of review, I saw no breaking change. I was starting the "breaking" discussion in expectation that you'd agree that is no longer a breaking change, based on the current state of the PR. I am happy we are discussing the topic anyway, i might be wrong, and I still need to learn.)

You cannot just fix a PURL type by introducing a new type. Even if PURL libraries are updated to support transparently upgrading the old type into the new type on read, any software that is comparing pre-canonicalized PURL strings will need updates.

how comes?

the existing spec has flaws that require breaking changes to fix them

What are the flaws that require breaking changes? #308 is about the path being incorrectly converted to lowercase, which is much more easily fixed by just not doing that.

the curerent golang spec says: the path MUST be lowercased. This is wrong in terms of actual go dependency management: the path MUST NOT be lowercased. Changing MUST to MUST NOT in golang purl-type is a breaking change of the specification.

matt-phylum commented 2 weeks ago

Adding a new type for a new type is much different than adding a new type for an existing type. An old tool not recognizing a truly new type is expected, but an old tool not recognizing Go PURLs anymore because a tool producing the data says that golang is now spelled go is a breaking change. You can argue that this isn't a breaking change in the PURL spec itself because it doesn't change golang, but it necessitates a breaking change in every current implementation of Go PURLs and complicates implementations of Go PURL consuming software as long as there are both go and golang PURLs going around.

Changing "MUST be lowercased" to "MUST NOT be lowercased" is a much less impactful change than this. From what I've seen, names with uppercase characters are uncommon, and an outdated implementation that is incorrectly lowercasing is still working correctly for all names that do not contain uppercase characters to lowercase. I would even say that on a larger scale it is not a breaking change because:

In both cases, the PURL is still parsed successfully and the meaning of the PURL is unchanged with respect to the current "MUST be lowercased" spec. The only differences would be that the canonical form changes¹ and a new consumer receiving a PURL from an old producer might be more likely to expect that the ID refers to the correct package, but since there is no good way for an outdated consumer to recover the correct ID after an outdated producer lowercases it, any consumer that relies on getting the correct ID (eg to resolve the package files) is likely already broken and not lowercasing the name can only improve the behavior in that situation.

This causes the same alignment problems as introducing a go type, except that if the correct ID is lowercase, no problem occurs because lowercasing is already producing the correct PURL.

¹ Due to underspecification in the text and tests, I wouldn't trust incoming PURLs to be in the canonical form as my implementation understands it. There are numerous minor differences in which characters are escaped when (and sometimes how), so if you're accepting PURLs from an external source, even if you don't expect user-entered, non-canonical PURLs in that source, you should be canonicalizing those PURLs yourself if your application depends on them all being canonical for the same definition of canonical.

matt-phylum commented 1 week ago

Go isn't the only ecosystem that has this problem of incorrect name normalization rules in this repo. I'm also aware of:

zpavlinovic commented 1 week ago

Introducing a new type for an existing type is a breaking change to the PURL ecosystem.

If this is indeed true, then there is something really wrong with PURL: it does not allow for evolution. On the one hand, we cannot add modifications to the existing specification that could introduce breaking changes. On the other hand, we cannot introduce a new type because somehow that is a breaking change as well. So one is pretty much stuck with slight variations of the initial spec. Specs should be allowed to evolve just the way the software does.

There should really be a way to add versioning on top of PURL itself. What is being proposed here might in essence be just that for the go spec.

pombredanne commented 1 week ago

@maceonthompson Thanks for putting this together! this makes a lot sense, and we have an issue with Go alright. Let me look at the comments in details and come back with my 2 cents!

pombredanne commented 1 week ago

@matt-phylum re:

Introducing a new type for an existing type is a breaking change to the PURL ecosystem.

I am not sure that's hte case, but a new type vs. updating the existing type demands some careful thinking :)

pombredanne commented 1 week ago

BTW, an elephant in the room is whether the distinction between a namespace and name makes sense not only here, but also in the whole spec, globally.

I found myself using a variable with a "namespace/name" substring more often than not. Then, how to split this in optional namespace and name could become a type-specific distinction, but the general concept would be that of "namespace/name", which could look like:

With this the whole google.golang.org/genproto/googleapis/api/annotations would be the namespace/name and would not have a specific split in Go, all would be in the name? (and the same could apply where relevant to other package types)

It could have a minimal impact on the spec.

pombredanne commented 1 week ago

@matt-phylum you wrote

See also:

  • 196

  • 294

  • 308

This is a breaking change that affects all software utilizing PURL for Go. Personally, I don't think there's anything fundamentally wrong with pkg:golang except that the description is outdated, and I'm sure it can be fixed without making this level of breaking change.

Thanks for the links! I tend to think along the same lines, and we can likely salvage the golang type.

Maintaining the separation of namespace and name and putting the entire Go package ID into the PURL name makes PURLs difficult for human users to work with.

I need to pounder this. See my other comment wrt. the namespace/name above in https://github.com/package-url/purl-spec/pull/338#issuecomment-2483190304

zpavlinovic commented 1 week ago

However, "a go module cannot be represented by different PURLs" is not generally the case:

  • The PURL spec describes a canonical format for PURLs, but users and even commonly used PURL implementations often get this wrong and produce non-canonical PURLs which must still be considered equal. For example, pkg:golang/host%2Ecom/maybeuser/module is a non-canonical, valid, PURL which refers to the same package.
  • A PURL may have qualifiers which may or may not be critical to the PURL. A PURL with a ?goarch is a different PURL which refers to the same module, but a PURL with a ?repository_url (or however the module proxy is specified) is a different PURL which may refer to a different module (probably more likely in other ecosystems).

It is fine that PURL spec allows for more flexibility, but there should be only one way the Go module and package information is encoded. This simplifies the work for clients. It is easy to drop qualifiers from a PURL. It is annoying to generate multiple module+package encodings to see if the incoming PURL applies to your code.

In general, this proposal tries to make it simple and clear to generate and accurately check against PURLs. It might not be the most user-friendly solution, but tools that render PURLs can easily prettify the output. We believe this is worth the sacrifice.

rhalar commented 6 days ago

Could it also be clarified how standard library packages are to be represented?

Go has special handling for these, and the 'module' is never explicitly required when using them. But the module does exist for std and cmd https://github.com/golang/go/blob/master/src/go.mod#L1 https://github.com/golang/go/blob/master/src/cmd/go.mod#L1

Go uses stdlib when reporting vulnerabilities though https://vuln.go.dev/ID/GO-2024-3105.json

but the exact module name would make more sense we believe.

matt-phylum commented 6 days ago

stdlib is probably good, if it needs to be specified (maybe specifying the compiler is enough). If it were std or cmd it would be a special case for tools that are not necessarily Go-centric to handle the case that pkg:golang/cmd@v1.2.3 is related to stdlib v1.2.3. Maybe pkg:golang/stdlib@v1.2.3#cmd is best if somebody is going to care that it's using the cmd package. However it's done it should be documented.

OSV is already using pkg:golang/stdlib, but that doesn't mean it's necessarily right. https://osv.dev/vulnerability/GO-2024-3105

rhalar commented 6 days ago

I'm not sure I follow how it would make it easier for tools? The problem I have with pkg:golang/stdlib@v1.2.3#cmd is that cmd and std aren't packages, but modules.

matt-phylum commented 6 days ago

Are they modules? They have go.mod files in their source code, but they aren't included in your go.mod and when you import their packages the compiler recognizes that you're trying to use the standard library and provides its own copy of the code from $GOROOT/src instead of downloading a module: https://github.com/golang/go/blob/8f22369136b264567955fb86cff491c247b45b8b/src/cmd/go/internal/modload/build.go#L42-L46

zpavlinovic commented 6 days ago

I'm not sure I follow how it would make it easier for tools? The problem I have with pkg:golang/stdlib@v1.2.3#cmd is that cmd and std aren't packages, but modules.

The cmd and std should not appear as modules in any of the actual go tools/artifacts. These modules are used internally to do vendoring of trusted packages developed independently elsewhere.

I think we can safely use stdlib for standard library or go tools (which are again in standard library) artifacts. We will add this to the proposal.

rhalar commented 5 days ago

I believe I get the argument, and I will defer to your judgement as the more knowledgeable about Go concepts and internals.

However, I do still feel that it would make sense for PURLs to use the "modules". PURLs are primarily unique identifiers so does it really matter what Go tooling does and how the internals work? stdlib comes a bit out of nowhere and it doesn't seem that different from just using std, which seems more consistent at least, even if it technically doesn't follow the same rules. It still identifies the artifact in a unique way. Omitting any of the two is also fine, which would result in just, e.g.

pkg:golang/crypto/ecdsa@...

But I'm guessing that would cause issues, and it also doesn't follow the current convention for PURLs.

As I said, I'll defer to better judgements, just giving a mostly uneducated opinion. :)