Open jeffmcaffer opened 6 years ago
@jeffmcaffer
In the current spec the type of a package and the provider of a package are compressed into the type element. For example, type=npm implies npmjs.com as the provider. While this is true in general, it gets complicated when talking about a package type that can live on different providers (e.g., an npm on GitHub).
The npm type implies two things:
- the overall protocol to actually get to a package is that implemented in the npm client and the npm registry
- a default public "repository" or "registry" that exists for this type (e.g. https://registry.npmjs.org or https://registry.npmjs.com here TBD which one is the canonical one exactly)
By provider I assume you mean some extra "protocol" used to fetch an actual package when it is not on a registry proper, such as on GitHub, in a git repo, or similar?
Here I guess that npm, but also PyPI and RubyGems, have specific conventions to reference a package that would be fetched from version control or some not-on-registry remote URL.
Now there is the other use case that you detail where a given package may have multiple incarnations.
E.g. a repo on GitHub that contains the source code for npm is also itself some npm. (And this is also true for most if not all other package types.)
The difficulty in this case is that there could be multiple ways to reference a package.
So this could be resolved in a few ways:
1. Using the approach you proposed, adding a provider would provide an indication of what the references are in this "provider". But this would lose the fact that the actual manifest possibly contains another name and version (and again the same is true for most package manifests).
2. Another approach is to consider this package as identified by not one but multiple package URLs that would end up pointing to mostly the same thing:
2.1 something like pkg:npm/foo@1.2.3 that is eventually published at the npmjs registry
2.2 something like pkg:npm/foo@1.2.3?repository_url=http://another.registry.com that is pushed/discovered in another registry than the default public one
2.3 something like pkg:github/joeuser/foo-javascript@2342423ABC or pkg:github/joeuser/foo-javascript@version_1.2.3 that corresponds to this same package in a GitHub repo
2.4 yet something else which is not a purl, like git+https://mygitrepo.com/foo-javascript.git@2342423ABC, that corresponds to that same package in some remote git repo identified with an SPDX-like VCS URL
For your consideration... but I feel like it might be simpler to use multiple package URLs in this case rather than trying to combine multiple "personalities" in a type+provider.
In particular the same GitHub or VCS URL can have multiple personalities: a single repo may contain a top level package.json, a bower.json and a pom.xml and more. Or a nuget.spec and a package.json. Or an RPM spec, a Debian control file, and a setup.py for PyPI or a gemspec for RubyGems.
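To make the identifiers in 2.1 to 2.4 concrete, here is a minimal sketch of splitting a purl into its parts. The `parse_purl` helper is hypothetical and deliberately simplified (it drops the subpath and skips most of the spec's normalization rules); it is meant only to show that the first three identifiers share one structure while the VCS URL does not.

```python
from urllib.parse import parse_qsl, unquote


def parse_purl(purl):
    """Split a purl into type, namespace, name, version and qualifiers.

    Simplified sketch: drops any '#subpath' and does not apply the
    per-type normalization rules of the purl spec.
    """
    scheme, sep, rest = purl.partition(":")
    if not sep or scheme != "pkg":
        raise ValueError(f"not a purl: {purl!r}")
    rest, _, query = rest.partition("?")
    query = query.partition("#")[0]   # ignore subpath after the qualifiers
    rest = rest.partition("#")[0]     # ignore subpath when there are no qualifiers
    path, _, version = rest.partition("@")
    segments = [s for s in path.split("/") if s]
    if len(segments) < 2:
        raise ValueError(f"purl needs at least a type and a name: {purl!r}")
    return {
        "type": segments[0].lower(),
        "namespace": "/".join(segments[1:-1]) or None,
        "name": unquote(segments[-1]),
        "version": unquote(version) or None,
        "qualifiers": dict(parse_qsl(query)),
    }


if __name__ == "__main__":
    for identifier in (
        "pkg:npm/foo@1.2.3",
        "pkg:npm/foo@1.2.3?repository_url=http://another.registry.com",
        "pkg:github/joeuser/foo-javascript@2342423ABC",
    ):
        print(parse_purl(identifier))
```

Passing the `git+https://...` string from 2.4 raises `ValueError`, since it is a VCS URL rather than a purl; a tool that wants to treat all four as the same package would need its own mapping between them.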
Thanks for the detail @pombredanne. There is a bit of a miscommunication here.
the overall protocol to actually get to a package is that implemented in the npm client and the npm registry
In this proposal the type talks about the shape of the thing identified by the purl. It is an npm, a gem, ... There may be a default way of getting the thing (e.g., talk npm protocol to npmjs) but that is just a simplification.
The provider is the place/protocol to use to get the thing. So npm+github would mean "use the git protocol to talk to github and get the thing at the supplied org/repo/commit and treat it like an npm".
In practice the name in the manifest at the end of the purl may well be different than that indicated by the purl. This is to be expected and is, for example, the way that npm works. That's ok as long as the other information represents an immutable value.
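As a rough illustration of the type/provider split described here, the sketch below separates a combined field such as npm+github into its two halves. The function name `split_type` and the `DEFAULT_PROVIDER` table are hypothetical; the thread only names npmjs as the implied default for npm, so that is the only entry.

```python
from typing import Optional, Tuple

# Hypothetical table of default providers per type. Only the npm -> npmjs
# pairing is taken from the discussion; a real list would live in a spec.
DEFAULT_PROVIDER = {"npm": "npmjs"}


def split_type(type_field: str) -> Tuple[str, Optional[str]]:
    """Split 'type+provider' into (type, provider).

    When no provider is given, fall back to the default for that type,
    or None if no default is known.
    """
    ptype, _, provider = type_field.partition("+")
    if not provider:
        provider = DEFAULT_PROVIDER.get(ptype)
    return ptype, provider


if __name__ == "__main__":
    print(split_type("npm+github"))  # explicit provider
    print(split_type("npm"))         # provider filled in from the default table
```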
To take up the discussion again: @jeffmcaffer, I agree purl's current type should not hard-code what you call the "provider". And it doesn't: Just like you suggest it implies a default of e.g. "npmjs.org" in case of the "npm" type, but this can be overridden using the optional repository_url qualifier.
To me, type should describe the type of "server-side layout" and also imply the protocol. I.e. a "maven" type indicates that artifacts follow the Maven directory structure on the server, the server needs to be queried via HTTP(S), and the default server is Maven Central. If you want to retrieve an artifact from e.g. JCenter instead of the default of Maven Central, you need to set repository_url to "https://jcenter.bintray.com/".
So in a way, repository_url is what you call "provider". Or am I missing something?
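The "default unless overridden" behaviour described above can be sketched as follows. The helper `repository_for` is hypothetical, the default URLs in `DEFAULT_REPOSITORY` are assumptions about what each type's default registry is (the thread itself only names Maven Central and npmjs), and `com.example/demo` is a made-up coordinate.

```python
from urllib.parse import parse_qsl

# Assumed default registries per purl type (not stated as URLs in the thread).
DEFAULT_REPOSITORY = {
    "maven": "https://repo.maven.apache.org/maven2",
    "npm": "https://registry.npmjs.org",
}


def repository_for(purl: str) -> str:
    """Return the repository_url qualifier if present, else the type's default."""
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a purl: {purl!r}")
    body, _, query = purl.partition("?")
    query = query.partition("#")[0]  # a subpath, if any, follows the qualifiers
    ptype = body[len("pkg:"):].split("/", 1)[0]
    override = dict(parse_qsl(query)).get("repository_url")
    if override:
        return override
    try:
        return DEFAULT_REPOSITORY[ptype]
    except KeyError:
        raise LookupError(f"no default repository known for type {ptype!r}")


if __name__ == "__main__":
    print(repository_for("pkg:maven/com.example/demo@1.0.0"))
    print(repository_for(
        "pkg:maven/com.example/demo@1.0.0?repository_url=https://jcenter.bintray.com/"
    ))
```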
Thanks @sschuberth . I mostly agree with you and quite like the idea of unifying on purls.
There are still some lingering issues.
- type is more about the form of the thing itself, that is, the actual package. It is completely independent of where/how you might get the thing. It IS an npm regardless of whether it came from my hard drive, GitHub, by unzipping a NuGet, or getting it from npmjs.
- provider is all about where/how you get the thing. That is independent of the type and of the server-side layout. The 'mavencentral' provider indicates that the client should use HTTP with a particular url structure to get the content.
- repository_url as a name is too specific to capture provider. For example, npmjs.com and npmjs.org are the same place but different urls. Further, there may be different access patterns for different providers that would be hard to describe in what people normally think of as a "url" (for example, what is the url to indicate that the package comes from GitHub releases vs a GitHub repo?). It also ties up identity with physical location. Service names and urls change but the identity of the thing identified by a purl should be durable.
Thanks for the detailed explanation, you have some good points there. Seeing that Sonatype has already adopted purl I was about to do so for ORT, too, but now I feel these lingering issues need to be resolved first.
@jeffmcaffer, do you still think purl is "fixable" to capture what you need, e.g. by using the + style from your original post, or is e.g. purl's use of query strings too cumbersome and you'd prefer a "reboot"? I'm just curious.
Also, is there a full spec of the identifier ClearlyDefined uses? I've found https://github.com/clearlydefined/clearlydefined/blob/master/docs/providers.md, but that's only about the provider part.
@jeffmcaffer FYI OWASP Dependency-Track and CycloneDX have both also adopted PackageURL.
But I'm confused/concerned about the provider. When you state
is all about where/how you get the thing
This is only one of many use-cases for PackageURL. I do not think having a provider is possible (or even desirable) in the specification. This should be the job of the application that is implementing PackageURL and is what Dependency-Track does for example.
Besides downloading the content (which itself can have different auth/proxy/network config issues, so it's not as simple as just downloading), there are use-cases for identifying old/outdated versions of components using the repository's native APIs. For example, if I have a PackageURL for Apache Commons IO, I may not want to download it, but to query the repository for the current version of the thing to see if what I have is current or not. In this case, the provider example would be useless, especially since various repo implementations have various levels of API support. For example, I can download something from npmjs.org and query for the current version of the thing, and I can also do that with the other npmjs repos, but if I try to do it with a Sonatype Nexus 3 repo, it won't work even though it supports npmjs (it simply doesn't support the necessary API). I'm struggling to find benefits of having a provider as part of the spec without unnecessarily increasing complexity.
The misunderstanding may be explained by something @pombredanne said a few comments ago
The npm type implies two things:
- the overall protocol to actually get to a package is that implemented in the npm client and the npm registry
- a default public "repository" or "registry" that exists for this type (e.g. https://registry.npmjs.org or https://registry.npmjs.com here TBD which one is the canonical one exactly)
What I am proposing is that this is actually provider. The type is the shape of the package, not how to get it. For example, you can look at https://github.com/foo/bar as a bunch of source (git repo), an npm (has a package.json), a maven thing (there is a POM), .... Each of these is a different type but they all have the same provider and you would use the same protocol (e.g., git clone) and only some of these types are in the canonical spot (git on github).
Put another way, the point of the provider is precisely to capture the vagaries of accessing the particular host like auth/proxy/.. and API level support. With the provider approach you would talk about nexus3 (using @stevespringett 's example) as a provider and npm as a type and write code that knows how to get a npm from nexus 3. If nexus has (or is missing) some APIs then that code can account for that. How would that be done with purl? To draw an analogy, provider is more like a URL scheme and type is more like content-type -- one talks about transport, the other about content.
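One way to picture "code that knows how to get an npm from nexus 3" is a lookup keyed by (provider, type). The sketch below is purely illustrative: the registry, the decorator, and the handler names are all hypothetical, and the handlers return a description of what they would do instead of touching the network. The note about Nexus 3 lacking the version query mirrors the example given earlier in the thread and is not verified here.

```python
from typing import Callable, Dict, Tuple

Handler = Callable[[str, str], str]
HANDLERS: Dict[Tuple[str, str], Handler] = {}


def handles(provider: str, ptype: str):
    """Register a handler for one (provider, type) combination."""
    def register(fn: Handler) -> Handler:
        HANDLERS[(provider, ptype)] = fn
        return fn
    return register


@handles("npmjs", "npm")
def npm_from_npmjs(name: str, revision: str) -> str:
    return f"npm protocol: query registry metadata, then fetch {name}@{revision}"


@handles("github", "npm")
def npm_from_github(name: str, revision: str) -> str:
    return f"git clone {name}, check out {revision}, read package.json"


@handles("nexus3", "npm")
def npm_from_nexus3(name: str, revision: str) -> str:
    # Assumption taken from the thread: download works, the
    # "latest version" query does not, so this handler skips it.
    return f"npm protocol: fetch {name}@{revision} (no latest-version query)"


def plan(provider: str, ptype: str, name: str, revision: str) -> str:
    """Pick the handler by transport (provider) and content (type)."""
    try:
        handler = HANDLERS[(provider, ptype)]
    except KeyError:
        raise LookupError(f"no handler for type {ptype!r} on provider {provider!r}")
    return handler(name, revision)


if __name__ == "__main__":
    print(plan("github", "npm", "myorg/foo", "a68381e"))
```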
There is also a fundamental question about identity: does the identity of a package include the place from which it comes? If so, then "provider" (call it what you want) is an integral part of the structure. The spec can allow for it to be undefined or default to the common provider but fundamentally it would still be part of the identity. If OTOH you don't want that characteristic, and assume all npms called "foo 1.0" are the same regardless of where they come from, that's ok but is a different identity model. While I get that
In the ClearlyDefined scenarios, we need to be able to get to and identify things that have different forms and are hosted in different places (npms as github repos, or github releases, or wrapped in a NuGet, or... )
I do not claim to have resolved all the corner cases. In fact, we very much would like to use purl as it is more robust in other dimensions. I am however having trouble figuring out how to pragmatically code/design with purl where we need/want the separation I'm describing.
@sschuberth, you should talk to @tsteenbe about this. He and I talked some and IIRC he perceived the same sorts of issues with the overloading of type in purl.
@jeffmcaffer I have a concrete question about how you see type and provider being used for Python. Some Python packages for a specific version are available as both .egg and .whl files. As I understand your approach, you'd then use "EGG" or "WHL" (or maybe "PythonEgg" or "PythonWheel") as the type, and "PyPI" as the provider (given that the packages are hosted at https://pypi.org/). Is that correct?
@jeffmcaffer,
git
can't produce it.
Downloading packages directly from source repositories is a feature available for some package types only.
git
and package can be installed from sources, you have to install it with original tool.
The fundamental issue is that there is a difference between the format (aka type) of the thing you are getting (the package, git repo, tgz file, ...), the protocol you use to get it (e.g., npm install, git clone, wget, ftp, ...) and the location from which you get it. (Note that in the discussion above I only separated out protocol but should also have talked about location.)
For example, you can get an NPM from many different places using different protocols (npm install, tgz fetch, git clone, ...). You can get a git repo by git clone or, in the case of GitHub, by downloading and exploding a zip of the repo. GitHub supports Maven protocol but also the downloading of jar releases.
purl does a good job of capturing type and it conceptually infers/defaults the protocol and location based on the type. That makes a lot of sense and keeps the simple case simple.
When the "package" does not conform to a norm, you can spec a repository_url query param to identify the location. You may still be able to infer the protocol but not always. The protocol is interesting because in a number of cases we want to do more than just download the package. We want to interrogate the repository for metadata (e.g., other versions to know if this is the latest). So knowing the protocol that can be used to talk to the location is super useful.
To illustrate, getting an npm from GitHub could be done using the npm protocol to the GitHub package registry, by cloning the git repo, or by downloading a release. All locations are github.com/location
(i.e., repository_url) and reverse-engineers the protocol but a) lot of work at scale and b) subject to change.
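The three-way split between format, protocol and location can be written down as a small record. In this sketch the `Access` class and the protocol labels ("npm-registry", "git-clone", "release-download") are hypothetical names chosen for illustration; the three entries correspond to the three ways of getting an npm from GitHub listed above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    format: str    # what the thing is (the purl "type")
    protocol: str  # how you talk to the host
    location: str  # where the host is


# Three ways to get an npm from GitHub, all sharing format and location.
WAYS = [
    Access("npm", "npm-registry", "github.com"),
    Access("npm", "git-clone", "github.com"),
    Access("npm", "release-download", "github.com"),
]


def protocols_by_format_and_location(ways):
    """Group protocols under the (format, location) pair they share."""
    grouped = defaultdict(set)
    for way in ways:
        grouped[(way.format, way.location)].add(way.protocol)
    return dict(grouped)


if __name__ == "__main__":
    for key, protocols in protocols_by_format_and_location(WAYS).items():
        print(key, sorted(protocols))
```

All three entries collapse onto a single (format, location) key, which is the point being made: knowing only the type and the location does not tell a client which protocol to use.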
As mentioned above, I don't claim to have the answers but would love to collaborate to figure it out.
Going over the discussion so far, it seems that I have a different understanding of what a purl references. For my use cases, a purl references a unique id/coordinates for a component release unifying the different ways of how components are uniquely represented in a certain technology.
That means, that for one technology (e.g. maven) I would expect exactly one purl that references a version of a component. Having said that, a technology means basically a packaging type or a package manager.
There are several consequences:
So looking at your separation, @jeffmcaffer, for me: type and protocol are basically the same thing; they describe how the component is "typically" used for the implied technology. location is important information, but it is out of scope for a purl, because I want to identify the "essence" of a thing, independent of where it comes from. It might have multiple locations, but it still is the same thing.
@blaumeiser-at-bosch, what about different repositories? Should exactly the same jars, published with different coordinates to Maven Central and JCenter, have the same purl? If yes, who should assign it?
published with different coordinates to Maven Central and JCenter
Or even worse, published with the same coordinates to Maven Central and to JCenter/Bintray (or any other Maven-compatible repository).
I guess the underlying question is whether PURL should "only" identify the package in the sense of the contents of the package, i.e. I do care that it's the same package / file that I'm referring to, but I do not care where I got it from. Or should PURL also document where I got this copy of the same package from.
As PURL's goals are described as "reliably identify and locate software packages" (emphasis mine), I believe it should also document the where from. That would make PURL also more usable in the ClearlyDefined context where provenance matters, and I believe that's where @jeffmcaffer is coming from.
The next question would be how to integrate the where from, i.e. the provider, into the PURL standard. As I guess it's too late to define a dedicated field in the "base URL" for that, an option that was already discussed is to use URL qualifiers. And that option is actually not too bad: Users who do not care about the provider, but only about it being the same package (with the same hash) could just compare the base URL, whereas users who care about the provider need to additionally take the provider into account.
But if doing that we'd need a standardized (i.e. non-type-specific) name for a qualifier describing the provider, plus a documented default provider per type if no provider is specified.
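The two kinds of comparison described above can be sketched like this. Everything here is an illustration under stated assumptions: the function names are hypothetical, repository_url stands in for the provider qualifier, the Maven Central URL in `DEFAULT_PROVIDER` is an assumed default, `com.example/demo` is a made-up coordinate, and URLs are compared as plain strings (so a trailing slash would count as a difference).

```python
from urllib.parse import parse_qsl

# Assumed default provider per type, used when no qualifier is given.
DEFAULT_PROVIDER = {"maven": "https://repo.maven.apache.org/maven2"}


def base_purl(purl: str) -> str:
    """Strip qualifiers and subpath, leaving type/namespace/name@version."""
    return purl.partition("?")[0].partition("#")[0]


def provider_of(purl: str):
    """Explicit repository_url qualifier, else the default for the type."""
    query = purl.partition("?")[2].partition("#")[0]
    explicit = dict(parse_qsl(query)).get("repository_url")
    ptype = base_purl(purl)[len("pkg:"):].split("/", 1)[0]
    return explicit or DEFAULT_PROVIDER.get(ptype)


def same_package(a: str, b: str) -> bool:
    """For users who only care that it is the same package."""
    return base_purl(a) == base_purl(b)


def same_package_and_provider(a: str, b: str) -> bool:
    """For users who also care where the package came from."""
    return same_package(a, b) and provider_of(a) == provider_of(b)


if __name__ == "__main__":
    central = "pkg:maven/com.example/demo@1.0.0"
    jcenter = central + "?repository_url=https://jcenter.bintray.com/"
    print(same_package(central, jcenter))
    print(same_package_and_provider(central, jcenter))
```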
I believe it should also document the where from
It does. There's a default repo for most PURL types. For Maven, the default is Maven Central. If an artifact with the same coordinates exists in Bintray, the repository_url qualifier should be added to differentiate the two as well as provide location information.
There's a default repo for most PURL types.
I know. My point was that the name of the qualifier which specifies a non-default repo is not standardized, as repository_url is just an example. But re-reading the spec it seems I was wrong, and using a qualifier named repository_url actually is in the standard.
My point is that I want to reference a content that has certain properties: a license, a copyright, ... That this content comes in different flavors is another aspect, but this does not falsify the intention to identify the content. So coming back to your question @grv87: yes, I would appreciate it if I could easily detect when two purls reference the same content, i.e., that they are equal or that the equality can be easily determined.
I do not get your statement concerning assigning, because my understanding is, that the purl is defined by the properties of the component, namely technology/package manager type, namespace, name and version, these properties build the main part of the purl.
Interestingly, there are also technical aspects: E.g., as far as I understand, Maven assumes that two binary files with the same group id, artifact id and version contain the same content independent of the repository they are downloaded from. So for Maven the source does not matter or, in other words, if this is not the case, a user of the component is getting into trouble, because Maven does not guarantee the delivery of the right alternative. In the NuGet universe we face the issue that there are components delivered from nuget.org and in some direct way within Visual Studio, but the delivered files differ in their hash. I assume that they were built differently although with the same content. And same content means same metadata concerning licenses and copyrights, and that is from a compliance perspective the transformation I need to get the used licenses fulfilled.
The question is, what am I talking of, when I have the purl. The concrete instance of the open source component found somewhere in a repository, or the original open source component which was instantiated for deployment. IMO, I prefer to identify the original component and have additional metadata associated with this component, like known deployed instances and locations to get it from. Perhaps I am missing something, but for me this is not needed to identify the thing I want to identify.
@blaumeiser-at-bosch, for Java, the package is jar, and it has no inherent namespace, name or version. The same jar provided by Maven could also be downloaded with Ivy from its own repositories, which have different metadata. Or it could be downloaded from direct URL as standalone jar, without any package manager at all. So, I don't see how PURL could provide unique URL covering all cases, without someone assigning it manually.
One could say that the jar is not important too, since Java only cares about packages and classes inside. Their content and licenses are what matters, not the (re)packed jar.
there are components delivered from nuget.org and some direct way within Visual Studio but the delivered files differ in their hash. I assume that they were built differently although with the same content. And same content means same metadata concerning licenses and copyrights
I think this is not correct in general case. The same content can be built under different licenses. Maven has distribution subtag in license tag to specify that.
@grv87 You are right, that the same piece of OSS code could have multiple PURLs each of which should identify this piece of software clearly, so yes, there are multiple ways of referencing the component.
But still I struggle with the notion that the same thing from different locations are different things. Even the wording is strange. 😃
The situation is different if it is not the same thing, e.g., because it is the same piece of software licensed differently. In this case, I would absolutely appreciate the two things having different purls, and ideally not only in the location part of the purl but with some substantial difference. If we cannot rely on one purl referencing the component unambiguously, the whole thing with identifying dependencies and attaching known metadata to the detected dependencies becomes very difficult.
It will likely be hard for purl to reconcile package identity semantics across all the ecosystems. It feels even harder if we start mixing additional package metadata like licenses etc. Perhaps purls should be focused on locating the package rather than describing it.
Note that a difference in the purl case (vs the url case) is that purls help you locate the content and the metadata (e.g., registry info). If we just needed the content then a plain url to the zip, jar, tgz, ... would be fine. With a purl the user knows what protocol to talk to what registry and how to address the package of interest. Anything beyond that (e.g., copyrights, ...) can be left to the content of the package or registry supplied metadata about the package.
In the current spec the type of a package and the provider of a package are compressed into the type element. For example, type=npm implies npmjs.com as the provider. While this is true in general, it gets complicated when talking about a package type that can live on different providers (e.g., an npm on GitHub).
One possible path is to use the git-style + approach to get something like
or more generally
This example indicates that there is an npm formatted entity on github in the foo repo in the myorg org with commit hash a68381e.
In this way, the current type element remains the type or format of the entity being located by the purl but the provider (if supplied) dictates the rest of the purl structure in the same way that the type does currently. If the provider is omitted then a spec'd default provider for the given type is used (e.g., npmjs for npm).
The purl spec should enumerate separately the set of types and providers with canonical values. For providers it is likely best if the values are as symbolic as possible. That is, use npmjs rather than npmjs.com. This simplifies the URLs for the user (npmjs.com? npmjs.org? www.npmjs.*?) and insulates URLs from changes in the provider's deployment.
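To show why symbolic provider names insulate identifiers from hostname variations, here is a minimal sketch that maps several hostnames onto one symbolic value. The `HOST_TO_PROVIDER` table and the `symbolic_provider` function are hypothetical; a real canonical list would have to come from the spec, as suggested above.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical mapping from hostnames to symbolic provider names.
HOST_TO_PROVIDER = {
    "npmjs.com": "npmjs",
    "npmjs.org": "npmjs",
    "registry.npmjs.org": "npmjs",
    "registry.npmjs.com": "npmjs",
    "github.com": "github",
}


def symbolic_provider(url: str) -> Optional[str]:
    """Map a URL or bare hostname to a symbolic provider name, if known."""
    # urlsplit only recognizes a host when the string contains '//'.
    parts = urlsplit(url if "//" in url else "//" + url)
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return HOST_TO_PROVIDER.get(host)


if __name__ == "__main__":
    for candidate in ("npmjs.com", "https://www.npmjs.org/package/foo",
                      "https://registry.npmjs.org"):
        print(candidate, "->", symbolic_provider(candidate))
```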