Documentation around using PURLs as unique identifiers

mlieberman85 commented 1 year ago

There is currently some confusion in the community about what practices someone should follow to ensure that a PURL resolves to one specific, unique package. I don't know if unique identification is a core use case, but it is currently unclear what folks can do to help eliminate ambiguity. Some ecosystems, like containers, can easily use a sha256 digest, which is suitably unique, but in other ecosystems that might not be possible. Also, a lot of tools today generate PURLs that don't include suitably unique information.

A potential solution is to provide some documentation around best practices for using PURL for the identifier use case. I know that each ecosystem might be different, but I think some high-level guidelines would help alleviate confusion.

nishakm commented 1 year ago

rnjudge commented 1 year ago

There's some discussion that happened way back that also might be relevant: https://github.com/package-url/purl-spec/issues/127

nishakm commented 1 year ago

For example: pkg:docker/cassandra@latest, pkg:docker/cassandra@123456abcdef, pkg:docker/cassandra@sha256%123456abcdef, pkg:oci/cassandra@abcdef123456 and pkg:oci/my/local/cas@abcdef123456 are all the same thing. The pURL has to be detailed enough for a person or tool to have high confidence that they mean only one thing.

nishakm commented 1 year ago

Another example: pkg:deb/kdenlive and pkg:generic/kdenlive_etc_etc?download_url=<link to deb package> are the same package. The pURL tells you how the package was downloaded but doesn't indicate that it is the same package. My opinion is that pURL does the former extremely well, but to do the latter, we need some formality around how to craft a pURL.

bureado commented 1 year ago

For example: pkg:docker/cassandra@latest, pkg:docker/cassandra@123456abcdef, pkg:docker/cassandra@sha256%123456abcdef, pkg:oci/cassandra@abcdef123456 and pkg:oci/my/local/cas@abcdef123456 are all the same thing. The pURL has to be detailed enough for a person or tool to have high confidence that they mean only one thing.

The first example is certainly not the same as the rest. The OCI examples are the same by virtue of the ecosystem deciding to use a normalized content-sensitive digest as the version. The Docker examples might not be the same ones, since the Docker type allows tags as shown in your first example.
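To make the tag-versus-digest distinction concrete, here is a minimal Python sketch that checks whether a PURL's version looks like a content digest. The helper names `purl_version` and `is_content_addressed` are hypothetical, the parsing is deliberately simplified (it is not a full PURL parser), and treating only `sha256:<64 hex chars>` as "pinned" is an assumption made for illustration.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Assumption for this sketch: only a full sha256 digest counts as content-addressed.
DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")


def purl_version(purl: str) -> Optional[str]:
    """Return the percent-decoded version of a PURL, or None if it has none."""
    # Drop the subpath (#...) and qualifiers (?...) before looking for '@'.
    core = purl.split("#", 1)[0].split("?", 1)[0]
    if "@" not in core:
        return None
    return unquote(core.rsplit("@", 1)[1])


def is_content_addressed(purl: str) -> bool:
    """True when the version is a digest, False for tags or a missing version."""
    version = purl_version(purl)
    return version is not None and DIGEST_RE.match(version) is not None
```

With this check, `pkg:docker/cassandra@latest` is reported as not content-addressed, while a PURL whose version is a percent-encoded `sha256:` digest is.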

bureado commented 1 year ago

Another example: pkg:deb/kdenlive and pkg:generic/kdenlive_etc_etc?download_url=<link to deb package> are the same package. The pURL tells you how the package was downloaded but doesn't indicate that it is the same package. My opinion is that pURL does the former extremely well, but to do the latter, we need some formality around how to craft a pURL.

I wouldn't consider those the same package for many purl use cases. In a "download attribution" use case, let's say you did apt download kdenlive and that got logged using pkg:generic, honestly I wouldn't find that an appropriate resolution. If you were using pkg:generic here to perhaps reflect an AppImage downloaded straight from the home page, then I can think of several use cases where I wouldn't want pkg:deb/kdenlive to be considered the same as pkg:generic/kdenlive.

mlieberman85 commented 1 year ago

To bring it back to the original point: it can be difficult to understand the intent of a given PURL. Is this PURL being given purely as a "locator" or as a "unique identifier"? This leads to a lot of ambiguity.

pombredanne commented 1 year ago

A PURL is a locator and a mostly unique way to identify a package. But this does not mean that there is a single unique PURL for a given package. This is pretty much the same way that any URL can locate a web page and act as a mostly unique identifier for the page. But there can be multiple URLs that point to the same page, like with https://www.example.com and http://example.com.
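As a rough illustration of the URL analogy, the sketch below normalizes only the purely syntactic differences between two PURL strings: the case of the scheme and type, and the order of the qualifiers. The function name `canonicalize` is hypothetical and this is only a partial canonical form, not the spec's full build algorithm.

```python
def canonicalize(purl: str) -> str:
    """Normalize scheme/type case and qualifier order; leave everything else as-is."""
    rest, _, subpath = purl.partition("#")
    rest, _, query = rest.partition("?")
    scheme, _, path = rest.partition(":")
    ptype, _, remainder = path.lstrip("/").partition("/")

    pairs = []
    for item in query.split("&"):
        key, sep, value = item.partition("=")
        if sep and value:  # skip empty or valueless qualifiers
            pairs.append((key.lower(), value))
    pairs.sort()

    out = f"{scheme.lower()}:{ptype.lower()}/{remainder}"
    if pairs:
        out += "?" + "&".join(f"{key}={value}" for key, value in pairs)
    if subpath:
        out += "#" + subpath
    return out
```

Two spellings of the same Debian PURL with reordered qualifiers compare equal after this step, but `pkg:deb/kdenlive` and `pkg:generic/kdenlive` stay different: syntactic normalization cannot decide that two genuinely different PURLs name the same thing.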

@mlieberman85 you wrote:

A potential solution to this is in providing some documentation around best practices for using PURL for the identifier use case. I know that each ecosystem might be different, but some high level guidelines I think would help alleviate confusion.

This makes sense. Some improved docs and also actual code examples that generate the PURLs.

@nishakm you wrote:

For example: pkg:docker/cassandra@latest, pkg:docker/cassandra@123456abcdef, pkg:docker/cassandra@sha256%123456abcdef, pkg:oci/cassandra@abcdef123456 and pkg:oci/my/local/cas@abcdef123456 are all the same thing. The pURL has to be detailed enough for a person or tool to have high confidence that they mean only one thing.

This was never the intent, nor is it possible to have something that guarantees a unique identifier. You could use a checksum for this, but it does not convey much beyond a unique content id. You could add a checksum as a qualifier, but that does not guarantee either that you cannot have two purls. If you want to treat two different PURLs as the same thing, this is something for a system to handle, much like a URL crawler may have rules to treat two pages as being the same (in practice, FWIW, this is not based on exact content but on approximate high similarity in search engines).
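A sketch of what "something for a system to handle" could look like when PURLs carry the `checksum` qualifier (a comma-separated list of `algorithm:hex` entries). The function names are hypothetical, the parsing is simplified, and the three-valued result (True, False, or None for "cannot tell") is a design choice of this example, not something the spec prescribes.

```python
from typing import Dict, Optional
from urllib.parse import unquote


def checksums(purl: str) -> Dict[str, str]:
    """Extract {algorithm: hex digest} from the 'checksum' qualifier, if any."""
    query = purl.split("#", 1)[0].partition("?")[2]
    result: Dict[str, str] = {}
    for item in query.split("&"):
        key, _, value = item.partition("=")
        if key.lower() != "checksum":
            continue
        for entry in unquote(value).split(","):
            algo, sep, digest = entry.partition(":")
            if sep:
                result[algo.lower()] = digest.lower()
    return result


def same_content(purl_a: str, purl_b: str) -> Optional[bool]:
    """Compare two PURLs by shared checksum algorithms; None means unknown."""
    a, b = checksums(purl_a), checksums(purl_b)
    shared = a.keys() & b.keys()
    if not shared:
        return None  # no common algorithm, so the PURLs alone cannot answer
    return all(a[algo] == b[algo] for algo in shared)
```

Even when this returns True, the two PURL strings remain distinct; the equivalence lives in the consuming system, not in the PURL.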

Another example: pkg:deb/kdenlive and pkg:generic/kdenlive_etc_etc?download_url=<link to deb package> are the same package. The pURL tells you how the package was downloaded but doesn't indicate that it is the same package. My opinion is that pURL does the former extremely well, but to do the latter, we need some formality around how to craft a pURL.

The formalism exists: here the Debian package is from Debian, so using a generic package type is misleading and incorrect, yet always possible.

nishakm commented 1 year ago

@pombredanne: "A PURL is a locator and a mostly unique way to identify a package." I understand PURL was never meant to be a unique identifier. However, many tools and advisory databases use it as a unique (though not globally unique) identifier. Furthermore, many in the PURL community see no problem with using it as an identifier. This makes it hard for tools to understand whether one PURL means the same thing as another PURL. If one organization crafts a PURL one way and another crafts a PURL a different way for a package that is basically the same, then this breaks interoperability.

If using PURL as an identification mechanism is not its core use case, then it should not be promoted as such.

I think @mlieberman85's suggestion on adding documentation on how one could indicate whether the intent of the PURL is to identify the package and not as a package location is a reasonable first step, but it doesn't solve the issue of PURLs not being the same across organizations or tools.

Re: Debian and other centralized packaging systems: Due to the standardized nature of the packaging systems, the location and the identity do have a chance to merge. So I will concede this point. However, it would be nice to organize the PURL types documentation by central package ecosystems rather than in alphabetical order, to show that uniformity within each ecosystem.

pombredanne commented 12 months ago

@mlieberman85 I looked into the guac models at https://github.com/guacsec/guac/blob/068951468803a87d41592fd281c4e41d97fb16a6/pkg/assembler/graphql/model/nodes.go and "ontology" at https://docs.guac.sh/guac-ontology-definition/

If I get things correctly, your main identifiers are an artifact checksum and package nodes keyed by PURL (as a tree). There are a few things to consider, but hey, I do not know much about the model's planned usage!

Track multiple PURLs for a package, because there can be more than one
Or ensure you track all the qualifiers for all the variants (say multiple Debian arch.)
Or supplement your Package graph with Artifact for content-based "unicity"
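The three options above could be combined roughly as in the following sketch: an index that keys artifacts by content digest and records every PURL observed for each one. The class `ArtifactIndex` and its methods are hypothetical and are not GUAC's actual model; the digests in the usage note are placeholders.

```python
from collections import defaultdict
from typing import Dict, Set


class ArtifactIndex:
    """Track which PURLs were observed for which artifact digests."""

    def __init__(self) -> None:
        self._purls_by_digest: Dict[str, Set[str]] = defaultdict(set)
        self._digests_by_purl: Dict[str, Set[str]] = defaultdict(set)

    def record(self, digest: str, purl: str) -> None:
        """Register that this PURL (qualifiers included) was seen for this digest."""
        self._purls_by_digest[digest].add(purl)
        self._digests_by_purl[purl].add(digest)

    def aliases(self, purl: str) -> Set[str]:
        """Other PURLs seen for any artifact that this PURL was seen for."""
        found: Set[str] = set()
        for digest in self._digests_by_purl.get(purl, ()):
            found |= self._purls_by_digest[digest]
        return found - {purl}
```

For example, recording `sha256:aaa` under both `pkg:deb/debian/curl@7.50.3-1?arch=i386` and a `pkg:generic` PURL makes each an alias of the other, while the `arch=amd64` variant recorded under a different digest stays separate.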

Side note: I also see a model for a "Source" node, which made me think. For instance, I see no conceptual difference between a Git checkout at a revision and a tarball of mostly the same content, say as an original code archive for a Debian package. The file tree and archive may not be bit-for-bit identical, but would be the same content if you diff them, abstracting minor things such as spaces, permissions and dates. I tend to prefer using a proper PURL for this rather than making your own up, but this is minor. Or consider instead the SPDX spec bits for VCS URLs at https://spdx.github.io/spdx-spec/v2.3/package-information/#77-package-download-location-field

nishakm commented 12 months ago

@pombredanne Here's an example of the way OSV uses pURLs: https://storage.googleapis.com/cve-osv-conversion/osv-output/CVE-2023-38545.json I wonder if you have a recommendation of how the pURL pkg:apk/alpine/curl?arch=source can include alpine:v3.15, or even something from the result of apk info curl.

pombredanne commented 12 months ago

@nishakm I think that with things like:

Another example: pkg:deb/kdenlive and pkg:generic/kdenlive_etc_etc?download_url=<link to deb package> are the same package.

... you may be focusing too much on edge cases. Here the Debian project ensures that names and versions are unique within their realm. There is nothing more to it than that: PURL just extends and builds upon the identifier and naming coordination of each package type's ecosystem.

You also wrote:

However, it would be nice to organize the PURL types documentation by central package ecosystems rather than in alphabetical order to show that uniformity within each ecosystem.

I am not sure I get what you suggest... can you elaborate?

pombredanne commented 12 months ago

@nishakm you wrote:

@pombredanne Here's an example of the way OSV uses pURLs: https://storage.googleapis.com/cve-osv-conversion/osv-output/CVE-2023-38545.json I wonder if you have a recommendation of how the pURL pkg:apk/alpine/curl?arch=source can include alpine:v3.15, or even something from the result of apk info curl.

From a quick look, the way OSV handles it needs review. Would you mind opening a separate issue for this topic?

pombredanne commented 12 months ago

@nishakm re: alpine, the alpine "release" stream would need to be clarified in the spec: https://github.com/package-url/purl-spec/blob/master/PURL-TYPES.rst#apk (which has another wart: the apk of Alpine is NOT the apk of openwrt at all. AFAIK they are entirely different packaging formats and not the same type).

pombredanne commented 12 months ago

Maybe another way to discuss unique identifier vs. locator is what I expanded on in https://github.com/package-url/purl-spec/issues/257 ... A PURL is like an address

Copied from https://github.com/package-url/purl-spec/issues/257#issuecomment-1794429383

Here is a possible analogy that may not be too shabby! Say the PURL spec is like the spec for an address book of people and places. 🧑‍🤝‍🧑 🏙️

Each package type is like a country or state and defines how you can identify and locate a place reasonably uniquely. Uniquely enough that the post can deliver the mail. In a city with well-defined streets and street numbers, you get a precise location with the street name and number and maybe an apartment number. In some cases you may want the address for a single person with their name, or for the whole household. If someone is off the grid in the bayou or on some isolated mountain, crafting a proper address may be more hairy and fuzzy. Worst case, I may need GPS coordinates for these edge cases. I may also have many different ways to write an address or a name. Heck, some folks also live in orbit on the ISS and GPS will not work there!

mlieberman85 commented 11 months ago

@mlieberman85 I looked into the guac models at https://github.com/guacsec/guac/blob/068951468803a87d41592fd281c4e41d97fb16a6/pkg/assembler/graphql/model/nodes.go and "ontology" at https://docs.guac.sh/guac-ontology-definition/

If I get things correctly, your main identifiers are an artifact checksum and package nodes keyed by PURL (as a tree). There are a few things to consider, but hey, I do not know much about the model's planned usage!

Track multiple PURLs for a package, because there can be more than one

Or ensure you track all the qualifiers for all the variants (say multiple Debian arch.)

Or supplement your Package graph with Artifact for content-based "unicity"

Side note: I also see a model for a "Source" node, which made me think. For instance, I see no conceptual difference between a Git checkout at a revision and a tarball of mostly the same content, say as an original code archive for a Debian package. The file tree and archive may not be bit-for-bit identical, but would be the same content if you diff them, abstracting minor things such as spaces, permissions and dates. I tend to prefer using a proper PURL for this rather than making your own up, but this is minor. Or consider instead the SPDX spec bits for VCS URLs at https://spdx.github.io/spdx-spec/v2.3/package-information/#77-package-download-location-field

This helps. I think one thing that we are also really trying to clarify is "intentionality." It can be difficult to understand the intent behind a given purl. Take the following as a contrived example:

pkg:deb/foo and pkg:deb/foo@1.0 -- Based on the spec today, how the first one should be interpreted appears to be ecosystem dependent. Does pkg:deb/foo, without @1.0, mean latest? From a temporal perspective, the first case might point to 1.0, but only temporarily. GUAC's use case is trying both to eliminate ambiguity and to highlight where there are unknowns, so that the end user can determine what action to take. This can be difficult to discern, and some basic guidelines, even if they are ecosystem dependent, would be helpful.
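One way a tool might surface such unknowns instead of guessing is sketched below. The function name `unknowns` and the choice of which fields to report (version and the `checksum` qualifier) are assumptions for illustration; the sketch deliberately does not decide whether a missing version means "latest".

```python
from typing import List


def unknowns(purl: str) -> List[str]:
    """List what cannot be determined from the PURL string alone."""
    core, _, query = purl.split("#", 1)[0].partition("?")
    qualifier_keys = {
        item.partition("=")[0].lower() for item in query.split("&") if item
    }

    missing: List[str] = []
    if "@" not in core:
        missing.append("version")  # could mean "any", "latest", or just unrecorded
    if "checksum" not in qualifier_keys:
        missing.append("checksum")  # no content-based pin on a specific artifact
    return missing
```

Here `pkg:deb/foo` reports both `version` and `checksum` as unknown, while `pkg:deb/foo@1.0` reports only `checksum`, leaving it to the end user to decide what action to take.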

package-url / purl-spec

Documentation around using PURLs as unique identifiers #242