package-url / purl-spec

A minimal specification for purl aka. a package "mostly universal" URL, join the discussion at https://gitter.im/package-url/Lobby
https://github.com/package-url/purl-spec

How are golang sub-modules supposed to be expressed by purl? #63

Open andrewstein opened 5 years ago

andrewstein commented 5 years ago

I am confused reading the spec for purl in relation to golang sub-modules. For example, looking at the submodule expressed in this go.mod file: https://github.com/go-modules-by-example/submodules/blob/master/a/go.mod, released by the a/v1.0.0 tag: https://github.com/go-modules-by-example/submodules/releases

Is the purl:

  1. pkg:golang/github.com/go-modules-by-example/submodules/a@v1.0.0
  2. pkg:golang/github.com/go-modules-by-example%2Fsubmodules%2Fa@v1.0.0
  3. pkg:golang/github.com/go-modules-by-example@v1.0.0#submodule/a
  4. pkg:golang/github.com/go-modules-by-example/submodule@v1.0.0#a
  5. pkg:golang/github.com%2Fgo-modules-by-example%2Fsubmodules%2Fa@v1.0.0

It basically comes down to what is the namespace (if any), what is the name and what is the sub-path (if any) for this submodule.

andrewstein commented 5 years ago

A follow-up note: even setting Go submodules aside, I am not sure the spec reflects how Go works. https://github.com/package-url/purl-spec#known-purl-types gives the example

pkg:golang/github.com/gorilla/context@234fd47e07d1004f0aed9c

This implies that the namespace is github.com/gorilla and the name is context. This does not seem right to me. I would expect the name to be gorilla/context in the github.com namespace, or preferably github.com/gorilla/context without a namespace. Leading to one of the following purls:

  1. pkg:golang/github.com/gorilla%2Fcontext@234fd47e07d1004f0aed9c
  2. pkg:golang/github.com%2Fgorilla%2Fcontext@234fd47e07d1004f0aed9c

But maybe I am just being argumentative here.

jdillon commented 5 years ago

I don't believe use of subpath here is appropriate, as IIUC subpath is used to point to something inside of a package. The spec defines it as "subpath: extra subpath within a package, relative to the package root".

It's certainly a bit wrinkly with Go modules' definitions of repository and module, though, and maybe subpath should be expanded for that use case? Still, I think the analogy with HTML anchors holds: just as a URL fragment points to something inside a page, a purl subpath points to something inside a specific package.

jdillon commented 5 years ago

Regarding the github org/user and repository bits, IIUC Go's module system doesn't require a module to be a github url or a git repository (though it may most commonly be such).

It does not appear that the coordinates used for Go modules really care about that. After a very brief scan of the docs I didn't see the value for require even defined (but I could have missed it); it looks generally just like a "host:path version".

For a git submodule package, it looks like the only wrinkle is that to find the tag you need to know the root repository location, so you can then figure out the path to the submodule.

It may also depend on what one would do with a golang purl. It seems like no matter how you spin it some translation would have to be done, but I think that's probably fine. For example, a maven purl with dot notation in groupId would have to get translated to slash notation for resolving a file on disk or a remote repository location.
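To make the Maven comparison concrete, here is a minimal sketch of that dot-to-slash translation. It assumes the default Maven repository layout and a plain jar artifact; the function name `maven_repository_path` and the version `1.9.1` are illustrative, not taken from this thread.

```python
def maven_repository_path(group_id, artifact_id, version):
    """Map Maven coordinates to a relative path in a default-layout repository.

    Assumption: jar packaging and no classifier.
    """
    # Dots in the groupId become directory separators.
    group_dir = group_id.replace(".", "/")
    return f"{group_dir}/{artifact_id}/{version}/{artifact_id}-{version}.jar"


if __name__ == "__main__":
    # pkg:maven/org.apache.xmlgraphics/batik-anim@1.9.1 would resolve to this path.
    print(maven_repository_path("org.apache.xmlgraphics", "batik-anim", "1.9.1"))
```

The point is only that a purl consumer already performs ecosystem-specific translation for other types, so needing some translation for Go is not unusual in itself.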

So my guess is that avoiding any front-loaded assumptions on the golang package url is probably simplest, and that your first example:

pkg:golang/github.com/go-modules-by-example/submodules/a@v1.0.0

... is probably reasonable.

Just my 0.02 though... i'm not a golang module expert by far ;-)

jdillon commented 5 years ago

from https://github.com/golang/go/wiki/Modules:

Modules must be semantically versioned according to semver, usually in the form v(major).(minor).(patch), such as v0.1.0, v1.2.3, or v1.5.0-rc.1. The leading v is required. If using Git, tag released commits with their versions. Public and private module repositories and proxies are becoming available (see FAQ below).

If the "leading v is required" then maybe the purl form is:

pkg:golang/github.com/go-modules-by-example/submodules/a@1.0.0

... though it's not really clear if that's a hard requirement or not.

bradcupit commented 4 years ago

I think this is an actual problem.

Here's some real life examples:

| Go module name | namespace | name | |
|---|---|---|---|
| github.com/gorilla/context | github.com/gorilla | context | :+1: |
| github.com/Azure/go-autorest/logger | github.com/Azure/go-autorest | logger | :woman_shrugging: |
| rsc.io/quote/v3 | rsc.io/quote | v3 | :-1: |

1st one makes sense to me. 2nd one could go either way: some might consider it correct, others might say it should be namespace = github.com/Azure, name = go-autorest/logger. 3rd one is a problem. Go treats major version numbers as a separate module. So rsc.io/quote v1.0.0 is different than rsc.io/quote/v3 v3.0.0 (and it's illegal to say rsc.io/quote v3.0.0 without the /v3).

To ensure consistency we should document how to handle submodules.

Some low effort options I can think of:

  1. Continue splitting things up the way they are now, where name = v3: rsc.io/quote/v3 and github.com/Azure/go-autorest/logger. Action: nothing.
  2. Say all submodules and/or major versions are part of the name field: rsc.io/quote%2Fv3 and github.com/Azure/go-autorest%2Flogger. Action: add some README examples.
  3. Say Go only cares about the name, not namespace, so all slashes need percent encoding: rsc.io%2Fquote%2Fv3 and github.com%2FAzure%2Fgo-autorest%2Flogger. Action: change README examples.
  4. Say the repository (rsc.io, github.com) is the namespace, everything else is the name: rsc.io/quote%2Fv3 and github.com/Azure%2Fgo-autorest%2Flogger. Action: change README examples.

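The four options above can be sketched as follows. This is an illustration only: `build_purl` and `purl_options` are hypothetical helpers, not part of any purl library. It assumes the module path has at least three elements, that option 2 folds only the last element into the name, and it leaves out versions and case normalization.

```python
from urllib.parse import quote


def build_purl(namespace_segments, name):
    """Join percent-encoded namespace segments and a percent-encoded name."""
    encoded = [quote(segment, safe="") for segment in namespace_segments]
    # safe="" makes quote() encode '/' inside the name as %2F.
    encoded.append(quote(name, safe=""))
    return "pkg:golang/" + "/".join(encoded)


def purl_options(module_path):
    """Return the purl each of the four options would produce (no version)."""
    parts = module_path.split("/")
    if len(parts) < 3:
        raise ValueError("this sketch assumes at least three path elements")
    return {
        # 1. Status quo: last element is the name, the rest is the namespace.
        1: build_purl(parts[:-1], parts[-1]),
        # 2. The trailing submodule or major-version element joins the name.
        2: build_purl(parts[:-2], "/".join(parts[-2:])),
        # 3. No namespace at all: the whole module path is the name.
        3: build_purl([], module_path),
        # 4. The host is the namespace, everything else is the name.
        4: build_purl(parts[:1], "/".join(parts[1:])),
    }


if __name__ == "__main__":
    for path in ("rsc.io/quote/v3", "github.com/Azure/go-autorest/logger"):
        for number, purl in purl_options(path).items():
            print(number, purl)
```

Only option 3 needs no knowledge of where one component ends and the next begins, which is the property the rest of this thread keeps coming back to.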
pombredanne commented 4 years ago

The original intent has been to use subpath for Go, but this pre-dates the rise of modules. Actually, AFAICR subpath was added specifically to support Go "packages".

@andrewstein with your examples:

@bradcupit with your examples:

My personal preference would be to avoid overloading the namespace and name, and to continue using the subpath, if this makes sense generally for the Go community and experienced Go folks ( @robpike ping! ).

The rationale is that in practice a good number, if not a majority, of public Go modules do end up fitting this approach: there is some repo or web site (Github, Gitlab, Bitbucket) that has mostly a two-level structure, "org or owner or user"/"name of project". That level is typically what has a common set of attributes (ownership, team, release process, licensing, etc.), and there are "subpaths" that extend inside it, which are the things effectively imported in Go.

To the best of my knowledge this ("org or owner or user"/"name of project") is also what the Go toolchain would fetch into a workspace: the whole namespace/name would be fetched and specific subpaths would be selectively imported (I may be wrong there as I did not dive deep inside go get and Go modules code.)

Side note: IMHO there would not be many Package URL use cases to reference a specific deeply nested piece of Go code (e.g. using a subpath as suggested here) as opposed to the whole ns/name at once. What would be yours?

bradcupit commented 4 years ago

@pombredanne thank you so much for responding!

tl;dr: Though the existing purl spec works, I think we've accidentally made something impossible for our users.

  • module github.com/go-modules-by-example/submodules/a should be: pkg:golang/github.com/go-modules-by-example#submodules/a

That proposal works with all the existing code and examples. Users can take pkg:golang/github.com/go-modules-by-example#submodules/a and one of the purl libraries can split it to the various namespace, name, version, etc. parts. From there if a user wants to determine the Go module name, they can do so easily. We don't have to change anything in the spec or libraries.

Having said that, users (including the team I'm on) will write code that converts a Go module name and version to a purl string. This is easy for well known repos like github and bitbucket, but difficult for custom module names. Here's a real-world example:

v.io/x/ref/lib/flags/sitedefaults

Where does the parent module end and the submodule begin? What's the namespace and what's the name? We can't tell the answer to either without analyzing the Go module's git repo.

If users write their own code to do this should they set namespace = v.io, name = x, and subpath = ref/lib/flags/sitedefaults? In this particular case we can look at the Go module's git repo and see the parent module name is v.io so there is no namespace. That means our users would've chosen the incorrect namespace, and got a different purl string as the final output: pkg:golang/v.io/x#ref/lib/flags/sitedefaults vs pkg:golang/v.io#x/ref/lib/flags/sitedefaults (the # appears in a different spot).

Ultimately we can only guide our users and the onus is on them to split things up correctly. But I can't see a reliable way to split go modules into namespace and name without analyzing the module's git repo. And I'd assume most code converting a module name to a purl string will just have the module name string, not the entire git repo, as is the case for my company.

Idea

tl;dr just README changes, no code changes, but we percent-encode a lot more

Perhaps we should consider changing the README examples so they don't use namespace and instead only use name? And since we can't always tell where the parent module ends and the submodule begins we could also treat submodules the same as module names, instead of like subpaths. These two suggestions make it much easier for users to set the right values for namespace (which would always be blank now) and name, and then get consistent purl strings as the output. The downside: names are percent encoded, so the README purl strings would change. Examples:

| Go module /submodule | before | after |
|---|---|---|
| github.com/gorilla/context | pkg:golang/github.com/gorilla/context | pkg:golang/github.com%2Fgorilla%2Fcontext |
| rsc.io/quote/v3 | pkg:golang/rsc.io/quote@v3.0.0#v3 | pkg:golang/rsc.io%2Fquote%2Fv3@v3.0.0 |
| v.io/x/ref/lib/flags/sitedefaults | pkg:golang/v.io#x/ref/lib/flags/sitedefaults | pkg:golang/v.io%2Fx%2Fref%2Flib%2Fflags%2Fsitedefaults |

I can't think of any other way to make these two problems easier on users. Thoughts?

andrewstein commented 4 years ago

@bradcupit I agree with your proposal: for Go, there is no “namespace/name” concept. And if one is to drag submodules into the mix, there is no way to know, just by looking at the import path, where the module ends and the submodule begins. Treating the whole thing as a single name is the only way as far as I can see.

athos-ribeiro commented 4 years ago

In @bradcupit's proposal, would using subpath to point to subpackages (not declared as submodules) still make sense?

For instance, if a purl should point to v.io/x/ref, would it make any difference to assemble the purl as pkg:golang/v.io%2Fx%2Fref or as pkg:golang/v.io#x/ref? It seems like it would still make sense to use the first option and not use the subpath here since we could suffer from the same issue of not knowing where to split the components. However, would the second form still be valid?

In other words, should the approach be valid for both subpackages and submodules?

gotthardp commented 4 years ago

Please also consider the readability and auditability of the PURL. From the usability perspective, pkg:golang/v.io#x/ref or even pkg:golang/v.io/x/ref (because that is the actual package name) is more easily readable and auditable. The pkg:golang/v.io%2Fx%2Fref is perhaps easier to process for machines, but I prefer usability even if the implementation is a bit harder.

bradcupit commented 4 years ago

@athos-ribeiro said

would using subpath to point to subpackages (not declared as submodules) still make sense? ... would the second form still be valid?

Sorry for the late reply! It would make sense to me, assuming you need to know the subpath. I don't have a use case for that myself, but if you wanted to point to a particular file inside a Go repo, using the #subpath would still be valid.

The only reason we're percent encoding the / in the name is because we have to according to the purl spec. If there are no slashes in the name (because they've moved to the subpath and you're trying to point to a subpath instead of identifying a submodule) then there's nothing to percent encode.


@gotthardp said:

Please consider also readability and auditability of the PURL

Yeah, I personally hated what's in my suggestion. I very much prefer the version that's easier to read, meaning, the one without percent encoding, but I don't think the pretty version is realistic.

The pkg:golang/v.io%2Fx%2Fref is perhaps easier to process for machines, but I prefer usability even if the implementation is a bit harder.

That makes sense, and from the perspective of writing the purl-spec it makes sense too, but I think we have to consider how people are going to use the purl-spec. People will have the 'coordinates' of a package and want to convert that into a purl string.

For maven the coordinates are the groupId, artifactId, and version, which is enough to compute a purl string. For Go you can't just have the module name to generate the purl string: you'd need the whole url. So if you instead have a git repo URL as your coordinates you may not have enough info to generate a purl string with the current spec. It works for normal cases, like github repos, but it fails for odd cases like v.io/x/ref. So either you have to require both the VCS repo and the go module name, or you require just the VCS repo, then programmatically clone it and parse the go module name. Now you'd have enough info to generate the purl string.

Or, we just do the simple thing: require only the repo URL and stuff it all in the name field and percent encode it.

bradcupit commented 4 years ago

~I think the proposal we put forth violates a part of the purl spec:~

~namespace: ... When percent-decoded, a segment: must not contain a '/'~

~So if we go with the solution proposed here we'd have to change the above part of the spec too, or make an exception for Go.~

bradcupit commented 4 years ago

@jdillon told me how the namespace encoding works (namespaces can contain slashes and we only encode what's between the slashes) -- plus I was totally wrong, we're proposing ditching the namespace for Go, so please ignore the previous comment.

bradcupit commented 4 years ago

He also mentioned it wasn't clear what this issue is proposing, so here's the shorter version of what @andrewstein proposed (and what I echo):

Problem

For some repos (not github, not gitlab, but others) it's impossible to convert a repo URL + submodule paths or Go module name + submodule paths to a purl string.

Proposal

  1. stop using namespace in Go purl strings
  2. put the entire Go module name in the name, and percent encode it
  3. make minor updates to the spec, no code changes required

Example

pkg:golang/github.com%2Fgorilla%2Fcontext

pombredanne commented 3 years ago

Another consideration could be to remove the notion of namespace entirely and merge ns and name into a name component where you can have as many segments as you like. It could be made such that this is backward compatible for every package type. I should say that Go's notion of a package, which is really a subdirectory in some repo, is not really amenable to clean identification (and leads to an explosion of the number of imports being tracked if you care to track things this way for software composition analysis).

bradcupit commented 3 years ago

Another consideration could be to remove the notion of namespace entirely and merge ns and name into a name component where you can have as many segments as you like

yes @pombredanne ! 💯 👏 🏆

maxhbr commented 2 years ago

Just as a snapshot of how tools handle that today (for the example https://pkg.go.dev/github.com/russross/blackfriday/v2 in version v2.1.0):

tiegz commented 2 years ago

Another consideration could be to remove the notion of namespace entirely and merge ns and name into a name component where you can have as many segments as you like

Just wanted to mention I've written a bunch of special-casing code for golang this week to try to parse the namespace. The difficulty lies in guessing the number of slashes in a namespace, e.g.

And there are even examples of Go modules that don't have a namespace, e.g. gotest.tools (this is the full name of the module)

Knowing the number of slashes is important so you can split on them, and guess which part is the namespace, name, or subpath, but it's nearly impossible to do for go. For instance, it's not clear which of these cases is git.host/foo/bar/baz:

So splitting on slashes or even having prior knowledge of a VCS host is not really enough to make out the namespace vs the name. Given that, I agree with @bradcupit to squash the idea of a namespace for golang.
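The ambiguity can be shown by enumerating every way a slash-separated path could be read as namespace, name and subpath. This is a minimal sketch with hypothetical helper names (`candidate_splits`, `to_purl`); it assumes the name is exactly one path element and that no element needs percent encoding.

```python
def candidate_splits(import_path):
    """Return every (namespace, name, subpath) reading of a Go import path."""
    parts = import_path.split("/")
    splits = []
    for i in range(len(parts)):
        namespace = "/".join(parts[:i])     # everything before the name
        name = parts[i]                     # exactly one element
        subpath = "/".join(parts[i + 1:])   # everything after the name
        splits.append((namespace, name, subpath))
    return splits


def to_purl(namespace, name, subpath):
    """Render one split as a versionless pkg:golang purl."""
    purl = "pkg:golang/"
    if namespace:
        purl += namespace + "/"
    purl += name
    if subpath:
        purl += "#" + subpath
    return purl


if __name__ == "__main__":
    for split in candidate_splits("git.host/foo/bar/baz"):
        print(split, "->", to_purl(*split))
```

Every one of the four results is a syntactically valid purl, and nothing in the string itself says which one matches the actual module boundary.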

matt-phylum commented 1 year ago

I think the only possible answer is 1: pkg:golang/github.com/go-modules-by-example/submodules/a@v1.0.0, what Syft and SCTK are already doing. ORT is close but unnecessarily difficult for humans.

2: pkg:golang/github.com/go-modules-by-example%2Fsubmodules%2Fa@v1.0.0 uses github.com as the namespace and go-modules-by-example/submodules/a as the name. If you join them together with a slash you get the module name expected by Go tools, so it could work, but it's difficult for humans to read because of the percent encoding, and easy for programmers to mess up by not being careful with their percent encoding or by relying on existing URL parsing code that may try to normalize the path component. Compared to 5, this version gives special meaning to path components from the second component onward compared to the first component (and assumes multiple components), which makes some sense for GitHub, but less sense for other sources. If everything is github.com/owner/repo, the owner/repo seems like a good name, but if you throw in v.io suddenly you have a problem.

3: pkg:golang/github.com/go-modules-by-example@v1.0.0#submodule/a uses go-modules-by-example as the package name, but github.com/go-modules-by-example is not the name of a package or module. It's the name of a GitHub user. You cannot use this name with Go tools, and the version v1.0.0 makes no sense in this context because users are not versioned.

4: pkg:golang/github.com/go-modules-by-example/submodule@v1.0.0#a looks like it could be right because github.com/go-modules-by-example/submodules is a Git repository and there is a go.mod file in the root of that repository, making it a module, but that module does not have a version v1.0.0 (In fact, it has no versions. Only a and b have versions.), and since it has no version v1.0.0 it cannot contain a subpath a in that version. Additionally, if done this way, because a is part of the real module name, it becomes difficult to refer to files within that module because you're combining part of the module name and the path within the module into the PURL subpath field.

5: pkg:golang/github.com%2Fgo-modules-by-example%2Fsubmodules%2Fa@v1.0.0 works as a slightly better alternative to 2, but with similar problems.

On the topic of pkg:golang/v.io#x/ref/lib/flags/sitedefaults: this is the correct PURL. pkg:golang/v.io/x/ref#lib/flags/sitedefaults is incorrect because v.io/x/ref is not a module. v.io is the thing that is versioned. If pkg:golang/v.io/x/ref#lib/flags/sitedefaults were correct, it becomes impossible to know where the module ends and the package begins. If you run go get -x v.io/x/ref, you can see that Go's own tools try downloading https://proxy.golang.org/v.io/x/ref/@v/list and https://proxy.golang.org/v.io/x/@v/list and only https://proxy.golang.org/v.io/@v/list exists. It's not a problem if you just want to install the module and can make multiple HTTP requests to find the answer, but if you want to determine whether PURLs like pkg:golang/v.io/x/ref@v1.0.0 (incorrect) and pkg:golang/v.io@v1.0.0 refer to the same module (eg you know something about pkg:golang/v.io@v1.0.0) you need to start making those same HTTP requests to external services. It also becomes more difficult to find the file sitedefaults, because once you resolve the module to its code you need to insert the x/ref path components that were removed from the end of the namespace+name into the subpath.
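The probing order described above can be sketched without touching the network. This is an illustration under assumptions: `proxy_list_urls` is a hypothetical helper, it only builds the candidate URLs from longest prefix to shortest, and it skips the escaping a real proxy client applies to uppercase letters in module paths.

```python
def proxy_list_urls(import_path, proxy="https://proxy.golang.org"):
    """List the @v/list URLs to try, longest candidate module path first."""
    parts = import_path.split("/")
    urls = []
    # Drop one trailing element at a time: v.io/x/ref, v.io/x, v.io.
    for end in range(len(parts), 0, -1):
        candidate = "/".join(parts[:end])
        urls.append(f"{proxy}/{candidate}/@v/list")
    return urls


if __name__ == "__main__":
    for url in proxy_list_urls("v.io/x/ref"):
        print(url)
```

A tool that only has the import path would have to request these URLs until one exists, which is exactly the external lookup a purl library would ideally avoid.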

bradcupit commented 1 year ago

if you want to determine whether PURLs like pkg:golang/v.io/x/ref@v1.0.0 (incorrect) and pkg:golang/v.io@v1.0.0 refer to the same module ... you need to start making those same HTTP requests to external services

You are correct, but the situation is not ideal. This may be more of a problem of Go than purl, but many users would be surprised to find a purl library making external network calls while creating a purl string. It also wouldn't work in an air-gapped environment, and may cause performance/scale issues when processing hundreds of thousands of requests.

Mikcl commented 1 year ago

@matt-phylum thanks for outlining the different options available.

I see that you have raised a case for option 1, but I would like to raise the case for option 5 which seems to not be fully covered.

Using the same module example https://github.com/go-modules-by-example/submodules/blob/56ad34e87f3359a8dd4c781941829322edcf0ad6/a/go.mod#L1

here are what the different options purl, decoded namespace and name will be:

| option | purl | namespace | name |
|---|---|---|---|
| 1 | pkg:golang/github.com/go-modules-by-example/submodules/a@v1.0.0 | github.com/go-modules-by-example/submodules | a |
| 5 | pkg:golang/github.com%2Fgo-modules-by-example%2Fsubmodules%2Fa@v1.0.0 | nil | github.com/go-modules-by-example/submodules/a |
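As a rough check of the decoded values, here is a minimal sketch of how a parser would split the two forms. `split_golang_purl` and `module_path` are hypothetical helpers, not a real purl library API; the sketch ignores qualifiers and subpath contents and does no validation beyond the type.

```python
from urllib.parse import unquote


def split_golang_purl(purl):
    """Return (namespace, name, version) for a pkg:golang purl, None if absent."""
    scheme, _, remainder = purl.partition(":")
    if scheme != "pkg":
        raise ValueError("not a purl")
    remainder = remainder.split("#", 1)[0]  # drop the subpath
    remainder = remainder.split("?", 1)[0]  # drop the qualifiers
    if "@" in remainder:
        path, _, version = remainder.rpartition("@")
    else:
        path, version = remainder, ""
    segments = [s for s in path.strip("/").split("/") if s]
    if len(segments) < 2 or segments[0] != "golang":
        raise ValueError("not a pkg:golang purl")
    # Only unencoded '/' separates segments; %2F stays inside a segment.
    name = unquote(segments[-1])
    namespace = "/".join(unquote(s) for s in segments[1:-1]) or None
    return namespace, name, unquote(version) or None


def module_path(namespace, name):
    """Rebuild the Go module path from the decoded components."""
    return name if namespace is None else namespace + "/" + name


if __name__ == "__main__":
    for purl in (
        "pkg:golang/github.com/go-modules-by-example/submodules/a@v1.0.0",
        "pkg:golang/github.com%2Fgo-modules-by-example%2Fsubmodules%2Fa@v1.0.0",
    ):
        print(split_golang_purl(purl))
```

Both forms rebuild the same module path once namespace and name are joined; they differ only in which field carries it.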

name

From the go documentation, the name is:

A module path is the canonical name for a module

Option 5 satisfies this, whereas option 1 does not?

as the module path here is github.com/go-modules-by-example/submodules/a (and will be written in this form in files such as go.mod, which scanner tools consume)

namespace

Given the module path must uniquely identify your module, the concept of a namespace for Go modules does not seem to provide much utility? There is more discussion: [1](https://github.com/package-url/purl-spec/issues/63#issuecomment-1276924737) [2](https://github.com/package-url/purl-spec/issues/63#issuecomment-852235468).

Option 5 follows the outcome of these discussions (setting nil). Why keep the namespace that 1 proposes?

Tooling

it's difficult for humans to read [option 2 (and 5?)] because of the percent encoding,

I agree but from my understanding, purl is not designed explicitly for humans to read, it is primarily for tooling?

and easy for programmers to mess up by not being careful with their percent encoding or by relying on existing URL parsing code that may try to normalize the path component

That is (if anything) a limitation of the purl ecosystem and should not influence the "correctness" of what the name and namespace values are?

Overall I don't think the points mentioned raise significant obstacles for tooling.

tldr

I think option 5 is the most correct approach, as the go canonical name (module path) is represented in the purl name field.

However I am open to hear counterpoints or if something was missed?


Curious to hear how a consensus will be formed?

matt-phylum commented 1 year ago

I agree that with option 5 it's nice that the PURL name and the Go module name are the same, but I'm not sure it's worth the escaping to make that happen, and it would be the only package type to commonly contain %2Fs.

NPM has an optional namespace (scope), which is a critical part of the package name if present. In pkg:npm/%40angular/animation, you must use the name @angular/animation or else you will get the wrong package. This has the same annoyance as in Go where the PURL namespace must be prefixed onto the PURL name to form the full package name as used by the package manager. However, the slash separator between the PURL namespace and the PURL name makes the PURL look similar to the NPM name.

GitHub has a required namespace (owner), which is a critical part of the package name. In pkg:github/package-url/purl-spec, if you leave out package-url, the name no longer refers to this repository. It's the same as NPM except that the namespace is always required.

Maven has a required namespace (group), which is a critical part of the package name, but what is written in PURL as pkg:maven/org.apache.xmlgraphics/batik-anim would be written in Maven as org.apache.xmlgraphics:batik-anim. This is annoying because it uses a different separator when joining together namespace and name, but still uses the namespace field, creating PURLs that don't look like names that are used by the package ecosystem's native tools.

I think Swift has the exact same issue as Go here. The PURL spec gives examples like pkg:swift/github.com/Alamofire/Alamofire@5.4.3. It can probably have the same v.io case where there is no namespace at all.

It'd be nice if PURL didn't differentiate namespace from name since it seems like every package type will either have no namespace or it will have a unique definition for what the namespace means and how it must be used (often they are prefixed to the name, but some seem redundant with the repository_url qualifier). Since they always mean something different, is there a benefit to having them broken out? Maybe the spec could be rewritten such that the namespace is not part of it, without breaking compatibility, and future PURL libraries just treat it as part of the name. In that case, there would be no debate over whether a Go PURL should look like "pkg:golang/" + module_name (namespace+name) or "pkg:golang/" + encode(module_name) (name only). Without forcing the concept of namespaces into a package ecosystem that doesn't have them, there is no reason to encode the slashes to create option 5.

Mikcl commented 1 year ago

I have also encountered the Maven issue mentioned, and agree that a workaround was needed to meaningfully parse the purl.

I think it's a symptom of:

it seems like every package type will have [its own implementation]

Which leads to https://github.com/package-url/purl-spec#problem

It seems like there are two options (referenced as 1 and 5) for the golang package type:



sschuberth commented 1 year ago

I think the only possible answer is 1: pkg:golang/github.com/go-modules-by-example/submodules/a@v1.0.0, what Syft and SCTK are already doing. ORT is close but unnecessarily difficult for humans.

ORT isn't making it "unnecessarily difficult". As mentioned here already:

The only reason we're percent encoding the / in the name is because we have to according to the purl spec.

Esp. if you squash the namespace into the name (which is what ORT actually already does; we use an empty namespace in ORT's own data model for Go), you'll most likely end up with / in the name field that you have to escape according to the spec. IMO Syft and SCTK are simply not PURL-conforming.

sschuberth commented 1 year ago

[...] purl is not designed explicitly for humans to read, it is primarily for tooling?

[...] should not influence the "correctness" of what the name and namespace values are?

I agree.

sschuberth commented 1 year ago

There is some divergence in the golang package type purls, which would be nice if reconciled but may cause downstream breaking-changes for their users? (cc @sschuberth from ORT tool which has diverged from the "example")

I don't have a problem with making breaking changes in ORT if needed to fix PURL correctness. Internally, ORT is using its own package ids anyway. However, as it stands I still believe ORT is doing it right (WRT the PURL specs), and Syft and SCTK are doing it wrong.