separate type from provider

package-url / purl-spec

A minimal specification for purl aka. a package "mostly universal" URL, join the discussion at https://gitter.im/package-url/Lobby

https://github.com/package-url/purl-spec


separate type from provider #33

Open jeffmcaffer opened 6 years ago

jeffmcaffer commented 6 years ago

In the current spec the type of a package and the provider of a package are compressed into the type element. For example, type = npm implies npmjs.com as the provider. While this is true in general, it gets complicated when talking about a package type that can live on different providers (e.g., an npm on GitHub).

One possible path is to use the git-style + approach to get something like

pkg://npm+github/myorg/foo@a68381e

or more generally

pkg:type[+provider][/namespace]/[name][@version]

This example indicates that there is an npm formatted entity on github in the foo repo in the myorg org with commit hash a68381e.

In this way, the current type element remains the type or format of the entity being located by the purl, but the provider (if supplied) dictates the rest of the purl structure in the same way that the type does currently. If the provider is omitted, then a spec'd default provider for the given type is used (e.g., npmjs for npm).

The purl spec should enumerate separately the set of types and providers with canonical values. For providers it is likely best if the values are as symbolic as possible. That is, use npmjs rather than npmjs.com. This simplifies the URLs for the user (npmjs.com? npmjs.org? www.npmjs.*?) and insulates URLs from changes in the provider's deployment.
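A minimal sketch of how the proposed pkg:type[+provider][/namespace]/[name][@version] form could be parsed, assuming a per-type default provider table. The DEFAULT_PROVIDERS values and the ProposedPurl and parse_proposed names are invented for illustration and are not part of the purl spec. Percent-encoding, qualifiers and subpaths are deliberately ignored.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical defaults: the proposal calls for a spec'd default provider
# per type, but these particular values are assumptions for illustration.
DEFAULT_PROVIDERS = {"npm": "npmjs", "maven": "mavencentral"}


@dataclass(frozen=True)
class ProposedPurl:
    type: str
    provider: Optional[str]
    namespace: Optional[str]
    name: str
    version: Optional[str]


def parse_proposed(purl: str) -> ProposedPurl:
    """Parse pkg:type[+provider][/namespace]/name[@version] (simplified)."""
    scheme, _, rest = purl.partition(":")
    if scheme != "pkg":
        raise ValueError(f"not a purl: {purl!r}")
    rest = rest.lstrip("/")  # tolerate the pkg:// spelling used in the example
    path, _, version = rest.partition("@")
    segments = path.split("/")
    if len(segments) < 2 or not segments[-1]:
        raise ValueError(f"missing name: {purl!r}")
    # The git-style "+" separates the type from the optional provider.
    type_, _, provider = segments[0].partition("+")
    provider = provider or DEFAULT_PROVIDERS.get(type_)
    namespace = "/".join(segments[1:-1]) or None
    return ProposedPurl(type_, provider, namespace, segments[-1], version or None)


on_github = parse_proposed("pkg://npm+github/myorg/foo@a68381e")
on_default = parse_proposed("pkg:npm/foo@1.2.3")
```

Here on_github carries the explicit github provider, while on_default falls back to the assumed npmjs default because no provider was given.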

pombredanne commented 6 years ago

@jeffmcaffer

In the current spec the type of a package and the provider of a package are compressed into the type element. For example, type = npm implies npmjs.com as the provider. While this is true in general, it gets complicated when talking about a package type that can live on different providers (e.g., an npm on GitHub).

The npm type implies two things:

the overall protocol to actually get to a package is that implemented in the npm client and the npm registry
a default public "repository" or "registry" that exists for this type (e.g. https://registry.npmjs.org or https://registry.npmjs.com here TBD which one is the canonical one exactly)

By provider I assume you mean some extra "protocol" used to fetch an actual package when this is not on a registry proper, such as on GitHub or a git repo or similar?

Here I guess that both npm, but also Pypi and Rubygems have specific conventions to effectively reference a package that would be fetched from version control or some not-on-registry remote URL.

Now there is the other use case that you detail where a given package may have multiple incarnations.

E.g. a repo on GitHub that contains the source code for an npm is also itself an npm. (And this is also true of most, if not all, other package types.)

The difficulty in this case is that there could be multiple ways to reference a package:

this is effectively a repo on GitHub with a possible commit version or tag and a name that is based on the user/repo name
this is also an npm as defined in its package.json, with a name that may differ from the actual repo name, and the same goes for the version: it may or may not relate loosely, through convention, to a tag. In all cases the version is what comes from package.json
it could also be considered a "version control" URL (and not a purl)

So this could be resolved in a few ways:

using the approach you proposed, adding a provider would give an indication of what the references are in this "provider". But this would lose the fact that the actual manifest possibly contains another name and version (and again the same is true for most package manifests)
another approach is to consider this package as identified by not one but multiple package URLs that would end up pointing to mostly the same thing:
2.1 something like pkg:npm/foo@1.2.3 that is eventually published at the npmjs registry
2.2 something like pkg:npm/foo@1.2.3?repository_url=http://another.registry.com that is pushed/discovered in a registry other than the default public one
2.3 something like pkg:github/joeuser/foo-javascript@2342423ABC or pkg:github/joeuser/foo-javascript@version_1.2.3 that corresponds to this same package in a GitHub repo
2.4 yet something else which is not a purl, like git+https://mygitrepo.com/foo-javascript.git@2342423ABC, that corresponds to that same package in some remote git repo identified with an SPDX-like VCS URL

For your consideration... but I feel like it might be simpler to use multiple package URLs in this case rather than trying to combine multiple "personalities" in a type+provider.

In particular the same GitHub or VCS URL can have multiple personalities: a single repo may contain a top level package.json, a bower.json and a pom.xml and more. Or a nuget.spec and a package.json. Or an RPM spec, a Debian control file and a setup.py for PyPI, or a gemspec for Rubygems.
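The "multiple personalities" point can be sketched as a lookup from the manifests found at the top level of a repo to the package types it could be addressed as. The MANIFEST_TO_TYPE table and the personalities function are hypothetical and cover only fixed-name manifests mentioned above.

```python
from typing import Iterable, List

# Illustrative mapping from manifest file name to purl type; this table is
# an assumption for the sketch, not something defined by the purl spec.
MANIFEST_TO_TYPE = {
    "package.json": "npm",
    "bower.json": "bower",
    "pom.xml": "maven",
    "setup.py": "pypi",
}


def personalities(top_level_files: Iterable[str]) -> List[str]:
    """Return the package types a repo could be addressed as, sorted."""
    found = {MANIFEST_TO_TYPE[f] for f in top_level_files if f in MANIFEST_TO_TYPE}
    return sorted(found)


types_in_repo = personalities(["README.md", "package.json", "pom.xml"])
```

Under the multiple-purl approach, each entry in types_in_repo would get its own purl, with the name and version taken from the corresponding manifest rather than from the repo.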

jeffmcaffer commented 6 years ago

Thanks for the detail @pombredanne.

There is a bit of a miscommunication here.

the overall protocol to actually get to a package is that implemented in the npm client and the npm registry

In this proposal the type talks about the shape of the thing identified by the purl. It is an npm, a gem, ... There may be a default way of getting the thing (e.g., talk npm protocol to npmjs) but that is just a simplification.

The provider is the place/protocol to use to get the thing. So npm+github would mean "use the git protocol to talk to github and get the thing at the supplied org/repo/commit and treat it like an npm"

In practice the name in the manifest at the end of the purl may well be different than that indicated by the purl. This is to be expected and is, for example, the way that npm works. That's ok as long as the other information represents an immutable value.

sschuberth commented 6 years ago

To take up the discussion again: @jeffmcaffer, I agree purl's current type should not hard-code what you call the "provider". And it doesn't: Just like you suggest it implies a default of e.g. "npmjs.org" in case of the "npm" type, but this can be overridden using the repository_url optional qualifier.

To me, type should describe the type of "server-side layout" and also imply the protocol. I.e. a "maven" type indicates that artifacts follow the Maven directory structure on the server, the server needs to be queried via HTTP(S), and the default server is Maven Central. If you want to retrieve an artifact from e.g. JCenter instead of the default of Maven Central, you need to set repository_url to "https://jcenter.bintray.com/".

So in a way, repository_url is what you call "provider". Or am I missing something?
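To illustrate the "type implies server-side layout" reading for Maven, here is a small sketch that builds a download URL from coordinates, defaulting to Maven Central and accepting an override that plays the role of repository_url. The function name and the coordinates in the usage lines are illustrative; the path shape follows the standard Maven repository layout.

```python
# Default server for the "maven" type, as described in the comment above.
MAVEN_CENTRAL = "https://repo.maven.apache.org/maven2"


def maven_artifact_url(
    group: str,
    artifact: str,
    version: str,
    repository_url: str = MAVEN_CENTRAL,
    extension: str = "jar",
) -> str:
    """Build an artifact URL using the standard Maven directory layout."""
    # Dots in the group id become directory separators on the server.
    group_path = group.replace(".", "/")
    base = repository_url.rstrip("/")
    return f"{base}/{group_path}/{artifact}/{version}/{artifact}-{version}.{extension}"


from_central = maven_artifact_url("org.example", "demo", "1.0")
from_jcenter = maven_artifact_url(
    "org.example", "demo", "1.0", repository_url="https://jcenter.bintray.com/"
)
```

Only the host part changes between from_central and from_jcenter; the layout after it is dictated by the type.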

jeffmcaffer commented 6 years ago

Thanks @sschuberth . I mostly agree with you and quite like the idea of unifying on purls.

There are still some lingering issues.

In our model the type is more about the form of the thing itself, that is, the actual package. It is completely independent of where/how you might get the thing. It IS an npm regardless of whether it came from my hard drive, GitHub, by unzipping a NuGet, or getting it from npmjs.
The provider is all about where/how you get the thing. That is independent of the type and of the server-side layout. The 'mavencentral' provider indicates that the client should use HTTP with a particular url structure to get the content.
repository_url as a name is too specific to capture provider. For example, npmjs.com and npmjs.org are the same place but different urls. Further, there may be different access patterns for different providers that would be hard to describe in what people normally think of as a "url" (for example, what is the url to indicate that the package comes from GitHub releases vs a GitHub repo?). It also ties up identity with physical location. Service names and urls change but the identity of the thing identified by a purl should be durable.
A really interesting case to consider is Go package imports and Go modules. I don't have all the details there, but I would really like to understand how purls and Go work together.
Query params are less fun if you are trying to use the purl as an identifier. It means that you generally have to parse, sort and filter the params for comparison. That is fine for rare things but not great for key elements. For example, are purl:maven/foo@1.3 and purl:maven/foo@1.3?repository_url=http://mavencentral.org the same (assuming mavencentral is the default provider for maven things)?
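A rough sketch of the "parse, sort and filter" work described in the last point, written with the pkg: scheme that purl actually uses. The DEFAULT_REPOSITORY table reuses the placeholder URL from the question and is an assumption, as is the comparison_key name. Subpaths and percent-encoding rules are ignored.

```python
from urllib.parse import parse_qsl, urlencode

# Assumption: one default repository per type, using the placeholder URL
# from the question above rather than a real canonical value.
DEFAULT_REPOSITORY = {"maven": "http://mavencentral.org"}


def comparison_key(purl: str) -> str:
    """Normalize a purl so that equivalent spellings compare equal."""
    # Parse: split the base from the qualifier string.
    base, _, query = purl.partition("?")
    purl_type = base.split(":", 1)[1].split("/", 1)[0]
    qualifiers = dict(parse_qsl(query))
    # Filter: an explicit default repository adds no information.
    if qualifiers.get("repository_url") == DEFAULT_REPOSITORY.get(purl_type):
        qualifiers.pop("repository_url", None)
    # Sort: qualifier order must not affect equality.
    ordered = sorted(qualifiers.items())
    suffix = "?" + urlencode(ordered, safe=":/") if ordered else ""
    return base + suffix


implicit = comparison_key("pkg:maven/foo@1.3")
explicit = comparison_key("pkg:maven/foo@1.3?repository_url=http://mavencentral.org")
```

With this normalization implicit and explicit are the same string, but every consumer that uses purls as keys has to apply it consistently, which is the cost being pointed out.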

sschuberth commented 6 years ago

Thanks for the detailed explanation, you have some good points there. Seeing that Sonatype has already adopted purl I was about to do so for ORT, too, but now I feel these lingering issues need to be resolved first.

@jeffmcaffer, do you still think purl is "fixable" to capture what you need, e.g. by using the + style from your original post, or is e.g. purl's use of query strings too cumbersome and you'd prefer a "reboot"? I'm just curious.

Also, is there a full spec of the identifier ClearlyDefined uses? I've found https://github.com/clearlydefined/clearlydefined/blob/master/docs/providers.md, but that's only about the provider part.

stevespringett commented 6 years ago

@jeffmcaffer FYI OWASP Dependency-Track and CycloneDX have both also adopted PackageURL.

But I'm confused/concerned about the provider. When you state

is all about where/how you get the thing

This is only one of many use-cases for PackageURL. I do not think having a provider is possible (or even desirable) in the specification. This should be the job of the application that is implementing PackageURL and is what Dependency-Track does for example.

Besides downloading the content (which itself can have different auth/proxy/network config issues, so it's not as simple as just downloading), there are use cases for identifying old/outdated versions of components using the repository's native APIs. For example, if I have a PackageURL for Apache Commons IO, I may not want to download it, but to query the repository for the current version of the thing to see if what I have is current or not. In this case, the provider example would be useless, especially since various repo implementations have various levels of API support. For example, I can download something from npmjs.org and query for the current version of the thing, and I can also do that with the other npmjs repos, but if I try to do it with a Sonatype Nexus 3 repo, it won't work even though it supports npmjs (it simply doesn't support the necessary API). I'm struggling to find benefits of having a provider as part of the spec without unnecessarily increasing complexity.

jeffmcaffer commented 6 years ago

The misunderstanding may be explained by something @pombredanne said a few comments ago

The npm type implies two things:

the overall protocol to actually get to a package is that implemented in the npm client and the npm registry

a default public "repository" or "registry" that exists for this type (e.g. https://registry.npmjs.org or https://registry.npmjs.com here TBD which one is the canonical one exactly)

What I am proposing is that this is actually the provider. The type is the shape of the package, not how to get it. For example, you can look at https://github.com/foo/bar as a bunch of source (git repo), an npm (has a package.json), a maven thing (there is a POM), .... Each of these is a different type but they all have the same provider and you would use the same protocol (e.g., git clone) and only some of these types are in the canonical spot (git on github).

Put another way, the point of the provider is precisely to capture the vagaries of accessing the particular host like auth/proxy/.. and API level support. With the provider approach you would talk about nexus3 (using @stevespringett 's example) as a provider and npm as a type and write code that knows how to get an npm from nexus3. If nexus has (or is missing) some APIs then that code can account for that. How would that be done with purl? To draw an analogy, provider is more like a URL scheme and type is more like content-type -- one talks about transport, the other about content.
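One way to picture "code that knows how to get an npm from nexus3" is a dispatch table keyed by (provider, type). This is only a sketch: the strategy functions are hypothetical and return a textual plan instead of doing any network I/O, and the provider names are the symbolic labels used in this thread.

```python
from typing import Callable, Dict, Tuple

Fetcher = Callable[[str, str], str]


# Hypothetical strategies; each describes what it would do rather than doing it.
def npm_from_npmjs(name: str, version: str) -> str:
    return f"npm registry protocol: {name}@{version} from npmjs"


def npm_from_github(name: str, version: str) -> str:
    return f"git clone github.com/{name} at {version}, then read package.json"


def npm_from_nexus3(name: str, version: str) -> str:
    # Assumption taken from the example above: no version-listing API here.
    return f"download tarball for {name}@{version} from nexus3"


# The provider selects the transport; the type selects how content is treated.
FETCHERS: Dict[Tuple[str, str], Fetcher] = {
    ("npmjs", "npm"): npm_from_npmjs,
    ("github", "npm"): npm_from_github,
    ("nexus3", "npm"): npm_from_nexus3,
}


def fetch_plan(provider: str, type_: str, name: str, version: str) -> str:
    """Look up the strategy for a provider/type pair and describe the fetch."""
    try:
        fetcher = FETCHERS[(provider, type_)]
    except KeyError:
        raise LookupError(f"no strategy for {type_} on {provider}") from None
    return fetcher(name, version)


plan = fetch_plan("github", "npm", "myorg/foo", "a68381e")
```

Provider-specific quirks such as missing APIs live inside the individual strategies, which is the separation being argued for; whether that belongs in the spec or in each application is the open question in this thread.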

There is also a fundamental question about identity: does the identity of a package include the place from which it comes? If so, then "provider" (call it what you want) is an integral part of the structure. The spec can allow for it to be undefined or default to the common provider but fundamentally it would still be part of the identity. If OTOH you don't want that characteristic, and assume all npms called "foo 1.0" are the same regardless of where they come from, that's ok but is a different identity model. While I get that

In the ClearlyDefined scenarios, we need to be able to get to and identify things that have different forms and are hosted in different places (npms as github repos, or github releases, or wrapped in a NuGet, or... )

I do not claim to have resolved all the corner cases. In fact, we very much would like to use purl as it is more robust in other dimensions. I am however having trouble figuring out how to pragmatically code/design with purl where we need/want the separation I'm describing.

@sschuberth, you should talk to @tsteenbe about this. He and I talked some and IIRC he perceived the same sorts of issues with the overloading of type in purl.

sschuberth commented 5 years ago

@jeffmcaffer I have a concrete question about how you see type and provider being used for Python. Some Python packages for a specific version are available as both .egg and .whl files. As I understand your approach, you'd then use "EGG" or "WHL" (or maybe "PythonEgg" or "PythonWheel") as the type, and "PyPI" as the provider (given that the packages are hosted at https://pypi.org/). Is that correct?

grv87 commented 5 years ago

@jeffmcaffer,

I think you miss the point that having a POM from a GitHub repository doesn't mean that you have the package. You need a compiled jar, and git can't produce it. Downloading packages directly from source repositories is a feature available for some package types only.
Even if you've cloned the repo with git and the package can be installed from sources, you have to install it with the original tool.

jeffmcaffer commented 5 years ago

The fundamental issue is that there is a difference between the format (aka type) of the thing you are getting (the package, git repo, tgz file, ...), the protocol you use to get it (e.g., npm install, git clone, wget, ftp, ...) and the location from which you get it. (Note that in the discussion above I only separated out protocol but should also have talked about location)

For example, you can get an NPM from many different places using different protocols (npm install, tgz fetch, git clone, ...). You can get a git repo by git clone or, in the case of GitHub, by downloading and exploding a zip of the repo. GitHub supports Maven protocol but also the downloading of jar releases.

purl does a good job of capturing type and it conceptually infers/defaults the protocol and location based on the type. That makes a lot of sense and keeps the simple case simple.

When the "package" does not conform to a norm, you can spec a repository_url query param to identify the location. You may still be able to infer the protocol but not always. The protocol is interesting because in a number of cases we want to do more than just download the package. We want to interrogate the repository for metadata (e.g., other versions to know if this is the latest). So knowing the protocol that can be used to talk to the location is super useful.

To illustrate, getting an npm from GitHub could be done using the npm protocol to the GitHub package registry, by cloning the git repo, or by downloading a release. All locations are github.com/ but knowing the protocol allows us to talk different APIs to the location (e.g., we can ask GitHub for other releases, or for other branches/tags). We could use the location (i.e., repository_url) and reverse-engineer the protocol, but that is a) a lot of work at scale and b) subject to change.
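The format/protocol/location split can be written down as a small data model. In this sketch the protocol labels and the QUERYABLE table are assumptions chosen to mirror the GitHub example; they are not a proposed vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coordinates:
    format: str    # shape of the thing: npm, git repo, jar, ...
    protocol: str  # how to talk to the location
    location: str  # where the thing lives


# Three ways to obtain the same npm from one location (labels are illustrative).
NPM_ON_GITHUB = [
    Coordinates("npm", "npm-registry", "github.com"),
    Coordinates("npm", "git", "github.com"),
    Coordinates("npm", "release-download", "github.com"),
]

# Assumed metadata each protocol lets a client ask the location for.
QUERYABLE = {
    "npm-registry": {"versions"},
    "git": {"branches", "tags"},
    "release-download": {"releases"},
}


def can_query(coords: Coordinates, what: str) -> bool:
    """Tell whether the protocol supports asking the location for `what`."""
    return what in QUERYABLE.get(coords.protocol, set())


locations = {c.location for c in NPM_ON_GITHUB}
protocols = {c.protocol for c in NPM_ON_GITHUB}
```

The location alone does not distinguish the three entries; only the protocol tells a client which API it can use for metadata.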

As mentioned above, I don't claim to have the answers but would love to collaborate to figure it out.

blaumeiser-at-bosch commented 5 years ago

Going over the discussion so far, it seems that I have a different understanding of what a purl references. For my use cases, a purl references a unique id/coordinates for a component release unifying the different ways of how components are uniquely represented in a certain technology.

That means, that for one technology (e.g. maven) I would expect exactly one purl that references a version of a component. Having said that, a technology means basically a packaging type or a package manager.

There are several consequences:

1. A component release available in different package manager technologies has multiple purls, each of which uniquely identifies the component release, i.e., the sources from which the used component is built.
2. I do not care about the provider or source of the component: if it is the same content, it has the corresponding purl of the package manager technology, since both instances are the same thing, i.e., built from the same source code.
3. If the content is different, then it is a different release, i.e., it has a different purl.
4. Whether I trust that the release has the content the purl indicates is not a matter of the purl but a matter of whether I trust the source of this release.
5. The origin of the release is something I would like to know, but this is out of scope of the purl spec.
6. The purl type identifies the technology with which I use the component in the build, aka the package manager technology. I can still download the component from the source and build it on my own, but normally I would build the corresponding library and use it as if I had downloaded it directly from the internet: e.g., build a Jar, deploy it to a maven repo manager and reference it with the repo manager as provider in my pom.xml.
7. If I do not use it as described in 6, e.g., if I download the source and directly build it together with my sources, this still is the same content, but from my point of view I would need to use a different purl, e.g., with type github, because the way I use it is different. I do not use the component as a Maven component anymore, but as a bunch of source files, so a different purl.

So looking at your separation, @jeffmcaffer, for me: type and protocol are basically the same thing; they describe how the component is "typically" used for the implied technology. Location is important information, but it is out of scope for a purl, because I want to identify the "essence" of a thing, independent of where it comes from. It might have multiple locations; it still is the same thing.

grv87 commented 5 years ago

@blaumeiser-at-bosch, what about different repositories? Should exactly the same jars, published with different coordinates to Maven Central and JCenter, have the same purl? If yes, who should assign it?

sschuberth commented 5 years ago

published with different coordinates to Maven Central and JCenter

Or even worse, published with the same coordinates to Maven Central and to JCenter/Bintray (or any other Maven-compatible repository).

I guess the underlying question is whether PURL should "only" identify the package in the sense of the contents of the package, i.e. I do care that it's the same package / file that I'm referring to, but I do not care where I got it from. Or should PURL also document where I got this copy of the same package from?

As PURL's goals are described as "reliably identify and locate software packages" (emphasis mine), I believe it should also document the where from. That would make PURL also more usable in the ClearlyDefined context where provenance matters, and I believe that's where @jeffmcaffer is coming from.

The next question would be how to integrate the where from, i.e. the provider, into the PURL standard. As I guess it's too late to define a dedicated field in the "base URL" for that, an option that was already discussed is to use URL qualifiers. And that option is actually not too bad: Users who do not care about the provider, but only about it being the same package (with the same hash) could just compare the base URL, whereas users who care about the provider need to additionally take the provider into account.

But if doing that we'd need a standardized (i.e. non-type-specific) name for a qualifier describing the provider, plus a documented default provider per type if no provider is specified.
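The two comparison levels described above can be sketched as follows, using repository_url as the provider qualifier and a documented default per type. The DEFAULT_PROVIDER table, the function names and the org.example coordinates are assumptions for illustration.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qsl

# Assumed documented default provider per type.
DEFAULT_PROVIDER = {"maven": "https://repo.maven.apache.org/maven2"}


def split_purl(purl: str) -> Tuple[str, Optional[str]]:
    """Return (base URL without qualifiers, effective provider)."""
    base, _, query = purl.partition("?")
    purl_type = base.split(":", 1)[1].split("/", 1)[0]
    qualifiers = dict(parse_qsl(query))
    provider = qualifiers.get("repository_url", DEFAULT_PROVIDER.get(purl_type))
    return base, provider


def same_package(a: str, b: str) -> bool:
    """For users who only care about what the package is."""
    return split_purl(a)[0] == split_purl(b)[0]


def same_package_and_provider(a: str, b: str) -> bool:
    """For users who also care where the package came from."""
    return split_purl(a) == split_purl(b)


central = "pkg:maven/org.example/lib@1.0"
jcenter = central + "?repository_url=https://jcenter.bintray.com/"
```

Comparing central and jcenter with same_package treats them as one package, while same_package_and_provider keeps them apart because the effective providers differ.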

stevespringett commented 5 years ago

I believe it should also document the where from

It does. There's a default repo for most PURL types. For Maven, the default is Maven Central. If an artifact with the same coordinates exists in bintray, the repository_url qualifier should be added to differentiate the two as well as provide location information.

sschuberth commented 5 years ago

There's a default repo for most PURL types.

I know. My point was that the name of the qualifier which specifies a non-default repo is not standardized, as repository_url is just an example. But re-reading the spec again it seems I was wrong, and using a qualifier named repository_url actually is in the standard.

blaumeiser-at-bosch commented 5 years ago

My point is that I want to reference content that has certain properties: a license, a copyright, ... That this content comes in different flavors is another aspect, but this does not falsify the intention to identify the content. So coming back to your question @grv87, yes, I would appreciate being able to easily detect when two purls reference the same content, i.e., that they are equal or that the equality can be easily determined.

I do not get your statement concerning assigning, because my understanding is, that the purl is defined by the properties of the component, namely technology/package manager type, namespace, name and version, these properties build the main part of the purl.

Interestingly, there are also technical aspects: E.g., as far as I understand, maven assumes that two binary files with the same group id, artifact id and version contain the same content independent of the repository they are downloaded from. So for Maven the source does not matter or in other words, if this is not the case, a user of the component is getting into trouble, because Maven does not guarantee the delivery of the right alternative. In the Nuget universe we face the issue, that there are components delivered from nuget.org and some direct way within Visual Studio but the delivered files differ in their hash. I assume that they were built differently although with the same content. And same content means same metadata concerning licenses and copyrights, and that is, from a compliance perspective, the transformation I need to get the used licenses fulfilled.

The question is, what am I talking of, when I have the purl. The concrete instance of the open source component found somewhere in a repository, or the original open source component which was instantiated for deployment. IMO, I prefer to identify the original component and have additional metadata associated with this component, like known deployed instances and locations to get it from. Perhaps I am missing something, but for me this is not needed to identify the thing I want to identify.

grv87 commented 5 years ago

@blaumeiser-at-bosch, for Java, the package is jar, and it has no inherent namespace, name or version. The same jar provided by Maven could also be downloaded with Ivy from its own repositories, which have different metadata. Or it could be downloaded from direct URL as standalone jar, without any package manager at all. So, I don't see how PURL could provide unique URL covering all cases, without someone assigning it manually.

One could say that the jar is not important too, since Java only cares about packages and classes inside. Their content and licenses are what matters, not the (re)packed jar.

grv87 commented 5 years ago

there are components delivered from nuget.org and some direct way within Visual Studio but the delivered files differ in their hash. I assume that they were built differently although with the same content. And same content means same metadata concerning licenses and copyrights

I think this is not correct in the general case. The same content can be built under different licenses. Maven has a distribution subtag in the license tag to specify that.

blaumeiser-at-bosch commented 5 years ago

@grv87 You are right, that the same piece of OSS code could have multiple PURLs each of which should identify this piece of software clearly, so yes, there are multiple ways of referencing the component.

But still I struggle with the notion that the same thing from different locations are different things. Even the wording is strange. 😃

The situation is different if it is not the same thing, e.g., because it is the same piece of software licensed differently. In this case, I would absolutely appreciate the two things having different purls, and ideally differing not only in the location part of the purl but in some substantial way. If we cannot rely on one purl referencing the component unambiguously, the whole business of identifying dependencies and attaching known metadata to the detected dependencies becomes very difficult.

jeffmcaffer commented 5 years ago

It will likely be hard for purl to reconcile package identity semantics across all the ecosystems. It feels even harder if we start mixing additional package metadata like licenses etc. Perhaps purls should be focused on locating the package rather than describing it.

Note that a difference in the purl case (vs the url case) is that purls help you locate the content and the metadata (e.g., registry info). If we just needed the content then a plain url to the zip, jar, tgz, ... would be fine. With a purl the user knows what protocol to talk to what registry and how to address the package of interest. Anything beyond that (e.g., copyrights, ...) can be left to the content of the package or registry supplied metadata about the package.