Decoupling Location from Identity - Is this in the scope of purl?

package-url / purl-spec

A minimal specification for purl aka. a package "mostly universal" URL, join the discussion at https://gitter.im/package-url/Lobby

https://github.com/package-url/purl-spec

Decoupling Location from Identity - Is this in the scope of purl? #127

Closed SteveLasker closed 2 years ago

SteveLasker commented 2 years ago

I'm opening this issue as a question, as the readme states purl is scoped to:

A purl or package URL is an attempt to standardize existing approaches to reliably identify and locate software packages.

While I recognize that assuming a location for an artifact is a known pattern, it has also been a challenge for users who wish to take ownership of the content they depend upon. The reality is that even common/shared/OSS artifacts must be pulled from multiple locations, which makes an individual location a problematic concept.

A detailed post, with the context of the problem

Separating Identity From Location

TLDR:

From an SBoM community (CycloneDX and SPDX as examples), there's a desire to assure a reference within an SBoM points to a very specific artifact. It could be a container image, helm chart, wasm or other types where SBoMs are relevant. There are two dimensions to this decoupling:

1. Users want to take control of the public content they depend upon, and promote it to their private registry. In many/most cases, they want those private environments to be locked down, with no access to the internet. In this case, they need to import the container image, the sbom, scan results, signatures and other types with the image so they're all local to the private network/environment. Any egress is blocked by network restrictions.
2. The same content is becoming available on multiple registries. For instance, the exact same debian image can be pulled from docker hub or ecr public, with others "coming soon".

For 1, you might be willing to say "this is the debian image from docker.io", however, it's currently in my private registry. As long as the image is in the same repository as the SBoM, it can be resolved, and the URL part of the identifier is ignored as the debian image is said to be unique as it was in docker.io. Mirrors could also be resolved, maybe.

For 2, it's far more challenging. If the exact same debian image is pushed to docker hub, ecr public, github, mcr and quay, what would the URL be? Should the debian owner have to pick one? Whether the user pulls the debian image from hub, ecr, or their private registry, the SBoM should be able to resolve the debian image, independently from where they got the image.

The proposal in #123 focuses on decoupling location from identity. Location is an optional hint in the oci-artifact purl PR. What we've been trying to understand is whether purl, the specification, can decouple identity from location, or is purl always about identity & location?

If purl is always about location, then it makes consuming public content, in a secured & reliable manner, problematic as the same content will be available from multiple locations, and users want to pull the content into their private networks.

Is it possible to amend purl's scope to assure unique identity, and make location an optional parameter so it could be used reliably for SBoM and security scan result pointers?

stevespringett commented 2 years ago

If purl is always about location, then it makes consuming public content, in a secured & reliable manner, problematic as the same content will be available from multiple locations, and users want to pull the content into their private networks.

Exactly. Which is why CycloneDX is heavily focused on security use cases, provenance being one of them. It's important to know where something was retrieved from, even if it was an internal mirror. When software is built/assembled, I'm not aware of any use case where the same artifact is retrieved from multiple repos and used. Just because something CAN be retrieved from multiple sources, doesn't mean it was. This is also where CycloneDX and SPDX vary dramatically in scope. As a pure BOM format, CycloneDX cares about what actually transpired, whereas SPDX (which I would not classify as an SBOM format, but it can be used for SBOM use cases) describes what something COULD be. A look at SPDX external references is all that's required for that to become obvious.

Internal repo servers (most of them) do not support:

OWASP SCVS 4.2 - Package repository contents are congruent to an authoritative point of origin for open source components
OWASP SCVS 6.1 - Point of origin is verifiable for source code and binary components
OWASP SCVS 6.2 - Chain of custody is auditable for source code and binary components

So although I can specify the internal repo from which I retrieved something, many repo servers do not provide the full transparency necessary to achieve these basic requirements.

Is it possible to amend purl's scope to assure unique identity, and make location an optional parameter so it could be used reliably for SBoM and security scan result pointers?

Purl is already heavily used in SBOM use cases today with 100K+ CycloneDX adopters - most of which utilize purl. So I think we have to better understand what specific SBOM use case is not being addressed today. As far as security scan result pointers, I would think location would be highly important here. The NVD is mostly irrelevant for identifying vulnerabilities in libraries today. Many SCA vendors either use purl directly or have some proprietary alternative which takes identity, location, and other metadata into account when identifying known vulnerabilities in components. The NVD became mostly irrelevant because CPE could only describe vendor, name, and version. Purl goes down to the module level which is much more granular than what we had previously (and supported by the likes of Sonatype and Snyk). But we have the opportunity to further improve on that by incorporating location into the equation. For example, if I have a Java component that's published to Jitpack and the same artifact that's published to Maven Central and the one on Maven Central is the only one affected by a known vulnerability, that's really interesting information and a competitive advantage for the source of vulnerability intelligence that can go down to that level.

I do think however, there's an opportunity for an organization to "opt out" of using location by supporting a way to specify no default repo and no repo url. This might be useful for private repos. If an organization wants to practice security through obscurity, this would provide them a way to achieve that, but I would recommend this be an opt-in feature as we would not want to cripple location for the majority in favor of the few.

nishakm commented 2 years ago

If purl is always about location, then it makes consuming public content, in a secured & reliable manner, problematic as the same content will be available from multiple locations, and users want to pull the content into their private networks.

Exactly. Which is why CycloneDX is heavily focused on security use cases, provenance being one of them. It's important to know where something was retrieved from, even if it was an internal mirror. When software is built/assembled, I'm not aware of any use case where the same artifact is retrieved from multiple repos and used. Just because something CAN be retrieved from multiple sources, doesn't mean it was.

I'm curious how CycloneDX users will be able to find the correct endpoint within their intranet if they are using a docker installation configured to use a mirror, or a go installation that uses an internal proxy. What if they invoke a CLI tool which invokes the tools which do the fetching after several hops?

So although I can specify the internal repo from which I retrieved something, many repo servers do not provide the full transparency necessary to achieve these basic requirements.

I don't think cloud native repos as they exist right now (mostly backed by S3 buckets) provide that kind of transparency either. All the user sees is a front-facing API with no visibility into where exactly the artifact comes from. In fact, in most cloud native environments, folks don't care where the artifact is located as long as its integrity can be verified and it is signed.

rnjudge commented 2 years ago

I do think however, there's an opportunity for an organization to "opt out" of using location by supporting a way to specify no default repo and no repo url. This might be useful for private repos. If an organization wants to practice security through obscurity, this would provide them a way to achieve that, but I would recommend this be an opt-in feature as we would not want to cripple location for the majority in favor of the few.

As far as I can tell, this is what @iamwillbar was suggesting by making the repository identifier "strongly recommended" in his comment on the OCI proposal.

stevespringett commented 2 years ago

I'm curious how CycloneDX users will be able to find the correct endpoint within their intranet if they are using a docker installation configured to use a mirror, or a go installation that uses an internal proxy. What if they invoke a CLI tool which invokes the tools which do the fetching after several hops?

@nishakm The answer is in the question. Since CycloneDX has a data model optimized for highly automated pipelines, it’s elementary to enhance, correct, or merge SBOMs during the execution of the pipeline. Inspecting the configuration to discover use of a mirror and correcting purls in the SBOM is quite simple.
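As a rough illustration of the kind of pipeline step described above, the following Python sketch adds a `repository_url` qualifier to every purl in a CycloneDX-style JSON document once a mirror has been discovered. The function name `annotate_with_mirror` is hypothetical, the only field it relies on is the `components[].purl` field, and the qualifier value is appended unencoded to match the examples in this thread; a real tool would percent-encode it as needed.

```python
import copy


def annotate_with_mirror(bom: dict, mirror_url: str) -> dict:
    """Return a copy of the BOM whose purls record the mirror as repository_url.

    Hypothetical helper: purls that already carry a repository_url are left alone.
    """
    updated = copy.deepcopy(bom)  # do not mutate the caller's SBOM
    for component in updated.get("components", []):
        purl = component.get("purl")
        if not purl or "repository_url=" in purl:
            continue
        # Keep an optional "#subpath" at the very end of the purl.
        head, hash_sep, subpath = purl.partition("#")
        joiner = "&" if "?" in head else "?"
        component["purl"] = f"{head}{joiner}repository_url={mirror_url}{hash_sep}{subpath}"
    return updated


bom = {"components": [{"purl": "pkg:docker/bitnami/redis@6.0.15-debian-10-r67"}]}
fixed = annotate_with_mirror(bom, "https://quay.io/")
print(fixed["components"][0]["purl"])
```

How the mirror is discovered (docker daemon configuration, a Go proxy setting, and so on) is outside this sketch and depends on the tool in use.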

I believe Maven is one of only a few dependency management systems that also provide information on what repository each and every artifact was retrieved from. Most package managers are immature by comparison. But we should not see the immaturity of other systems as a reason to diminish the default behavior of purl.

In fact, in most cloud native environments, folks don't care where the artifact is located as long as its integrity can be verified and it is signed.

You’ve just described how SolarWinds happened - blind trust in something without transparency or methods to validate. We should not be interested in promoting practices that support continued use of bad practices. We need to support efforts that promote further transparency, even if it’s difficult for some ecosystems to achieve today.

As far as I can tell, this is what @iamwillbar was suggesting by making the repository identifier "strongly recommended" in his comment on the OCI proposal.

@rnjudge I could support the addition of a way to opt out or otherwise specify the location is unknown or not disclosed. I am not in favor of making location strongly recommended for the core purl spec as that one change would alter the meaning of every purl being used today. It’s a small, but breaking change.

Pinging @pombredanne for feedback.

coderpatros commented 2 years ago

Given @SteveLasker specifically references SBOM use cases I think this is a non-starter.

Unless maybe if you are only using purls in SBOMs for intellectual property use cases?

Where a package was retrieved from is important for software supply chain security use cases.

The component might be the same on disk. But the provenance is quite different. And, if you are trying to look at supply chain risk, this information is important.

I think it would be more beneficial to identify what Steve L thinks is missing from the existing purl format. If it's just a case of being able to remove the location information surely that can be done by the consumer when parsing?
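The consumer-side approach suggested above could look roughly like the Python sketch below, which drops location qualifiers from a purl string and leaves everything else untouched. The helper name `strip_location` and the choice of which qualifiers count as "location" are assumptions made for illustration.

```python
# Qualifiers treated as location hints in this sketch (an assumption, not a spec rule).
LOCATION_QUALIFIERS = {"repository_url", "download_url"}


def strip_location(purl: str) -> str:
    """Return the purl without location qualifiers (hypothetical helper)."""
    # Split off the optional "#subpath" first, then the optional "?qualifiers".
    rest, hash_sep, subpath = purl.partition("#")
    base, _, query = rest.partition("?")
    kept = [
        pair
        for pair in query.split("&")
        if pair and pair.partition("=")[0].lower() not in LOCATION_QUALIFIERS
    ]
    result = base
    if kept:
        result += "?" + "&".join(kept)
    if hash_sep:
        result += "#" + subpath
    return result


print(strip_location(
    "pkg:maven/org.mvel/mvel2@2.4.9.Final?repository_url=https://repository.jboss.org/"
))
```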

nishakm commented 2 years ago

I think I understand @stevespringett's concern that the problem is differentiating what the location is from what the location could be. Therefore, I think this isn't a specification problem but a cloud native problem i.e. the notion of "it doesn't matter how the artifact got here as long as its checksum matches the published checksum and it is signed".

Even in the highly automated environments existing now, the client tools do not report the endpoints they are hitting in order to fetch an artifact. So something like the docker purl may tell you that an artifact was fetched using "docker like" ways, but not the actual endpoint. As such, the tools that generate a CycloneDX SBOM will not provide the true location. Just whatever the user has entered in their CLI like docker pull domain/repo:tag.

SteveLasker commented 2 years ago

Let’s tease apart a few things as I’m not suggesting this is problematic for all references. A source code repo is somewhat interesting if it can be disclosed. Most OSS projects can; most products won’t.

What does location provide? Is it part of the identity? Or forensic information to analyze when something goes wrong?
When is the SBoM generated, and can it be modified? If the SBoM is generated at the point of creation, you know the unique identity (hash or digest in oci artifact terms), but you don’t know which endpoint it may be pulled from.
Small companies may distribute their artifact on docker hub, ecr, github and others. Which registry url would be used?
Large companies like Microsoft build the artifacts on internal registries and distribute on mcr.microsoft.com, mcr.microsoft.cn (china) and a few air-gap clouds whose domains I can’t even disclose.
If the location is part of the identity, then it doesn’t matter what the url is, and is this the best way to manage identity?
If the purl is fixed at the time of SBoM creation and if the registry is the location, then what should be done in the above case where the same digest (hash) is published on multiple registries?
When a consumer pulls the artifact, how do they know where to find the SBoM? If they have the SBoM, how do they find the artifact it references?
When a user pulls both the artifact and the SBoM into their environment, and they can’t reach the endpoints they were originally published on, what should they do? Must they create another SBoM just to track its movement from one location to another?

I realize this sounds like a chain of custody situation, and while true, and helpful for forensics, it’s not the optimal, or even a feasible, approach for normal flows.

The beauty of digital bits is we can encode them, generate digests (hashes) of them and sign them with independently verifiable signatures. As long as they remain the same, it doesn’t matter where they were. We know they weren’t tampered with and we know who attests to them with a signature.

This is how SolarWinds was “quickly” found to not be a distribution attack as the dlls were signed and they matched the digests generated from the build environment.

So, I get location is interesting from a forensics perspective. In many cases that internal, proprietary information can’t or shouldn’t be disclosed.

There is an issue with how to discover the SBoM from the point of an artifact that may not know it has an SBoM. When you have the SBoM, we need a way to know it’s referring to this very specific artifact.

pombredanne commented 2 years ago

@SteveLasker can you provide a concrete example of a purl that would be problematic? I cannot think of any.

Say you have a private Maven or Docker registry, and for the sake of arguments the same packages are available also in the public, default repository for this package type. For instance:

pkg:maven/org.mvel/mvel2@2.4.9.Final from https://repo1.maven.org/maven2/org/mvel/mvel2/2.4.9.Final/
pkg:docker/bitnami/redis@6.0.15-debian-10-r67 from https://hub.docker.com/layers/bitnami/redis/6.0.15-debian-10-r67/images/sha256-b65e55bbeb52644e10e05dd8396db85049eec3610a733258c27393ed1f60c431?context=explore
for the image I could use pkg:docker/bitnami/redis@sha256:b65e55bbeb52644e10e05dd8396db85049eec3610a733258c27393ed1f60c431 too

Say that my "private" image registry is at https://quay.io/ and the package at https://quay.io/repository/bitnami/redis/manifest/sha256:b65e55bbeb52644e10e05dd8396db85049eec3610a733258c27393ed1f60c431

And my "private" Maven repo is at https://repository.jboss.org/nexus/content/repositories/ea and the package at https://repository.jboss.org/nexus/content/repositories/ea/org/mvel/mvel2/2.4.9.Final/

Based on that I could:

use pkg:maven/org.mvel/mvel2@2.4.9.Final and pkg:docker/bitnami/redis@6.0.15-debian-10-r67 or pkg:docker/bitnami/redis@sha256:b65e55bbeb52644e10e05dd8396db85049eec3610a733258c27393ed1f60c431 and configure my tools and systems to use my "private repos" above. Which I would ALWAYS need to do somehow to use my internal private repos or registries (UNLESS there would be some transparent hidden network-wide internal proxy to the same effect?)
OR use pkg:maven/org.mvel/mvel2@2.4.9.Final?repository_url=https://repository.jboss.org/ .. and pkg:docker/bitnami/redis@6.0.15-debian-10-r67?repository_url=https://quay.io/

Either way works. When things are private, feel free to handle it as you like. The fact that there is default public repository for a package type means that this default does not need to show up in the URL and can be "transparently" overridden.

So I am not sure there is any issue here?

As a recap: a purl is a URL and is a locator, and all URLs are also URIs. Therefore a purl is also an identifier. The fact there is a default location for a type as opposed to something always hardcoded in the purl string means that you can also think of a purl as a pure identifier for private purposes. The global uniqueness of this identifier is something that's handled by default by the default public package repositories of each type. If you happen to use a content identifier (say a sha256) instead, that's fine too. If you do not publish your packages on the default public repo and you do not provide a way to locate it with a qualifier, that's OK too. Rather useless as no one will be able to find it, but that's OK too.
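The "default location unless overridden" behavior described in this comment can be sketched in a few lines of Python. The `DEFAULT_REPOSITORIES` table below is illustrative only, filled with URLs that appear in this thread rather than taken from the purl type definitions, and the helper names are hypothetical.

```python
from typing import Dict, Optional
from urllib.parse import unquote

# Illustrative defaults only; the authoritative values live in the purl type definitions.
DEFAULT_REPOSITORIES = {
    "maven": "https://repo1.maven.org/maven2",
    "docker": "https://hub.docker.com",
}


def purl_type(purl: str) -> str:
    """Extract the type component, e.g. 'maven' from 'pkg:maven/...'."""
    scheme, _, rest = purl.partition(":")
    if scheme != "pkg":
        raise ValueError(f"not a purl: {purl!r}")
    return rest.lstrip("/").split("/", 1)[0].lower()


def qualifiers(purl: str) -> Dict[str, str]:
    """Parse the '?key=value&...' part of a purl into a dict."""
    query = purl.partition("#")[0].partition("?")[2]
    parsed = {}
    for pair in query.split("&"):
        key, sep, value = pair.partition("=")
        if sep:
            parsed[key.lower()] = unquote(value)
    return parsed


def resolve_repository(purl: str) -> Optional[str]:
    """Prefer an explicit repository_url, else the type default, else None."""
    override = qualifiers(purl).get("repository_url")
    if override:
        return override
    return DEFAULT_REPOSITORIES.get(purl_type(purl))


print(resolve_repository("pkg:maven/org.mvel/mvel2@2.4.9.Final"))
print(resolve_repository(
    "pkg:docker/bitnami/redis@6.0.15-debian-10-r67?repository_url=https://quay.io/"
))
```

A type with no entry in the table and no qualifier resolves to `None`, which corresponds to the "identify but not locate" case.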

pombredanne commented 2 years ago

@SteveLasker now your question is in the context of #123 and the context there is that you may not have a canonical, default reference repository location for a new OCI type. I see no issue having the default location be optional for a given purl type. This will be weird and problematic as someone with just a purl will not be able to get the package; and therefore this is less useful; short of a purl type-provided default repository URL location or a repository_url qualifier for a specific purl you will only be able to identify but not locate.

In the end, when there may be a need to get to the package code, you would always need some repo or registry location of sorts at runtime and/or fetch time to effectively retrieve the package archives. It can stay private.

In recap, a package type default repository location or a repository_url qualifier is useful and desirable to locate, but not essential to identify, especially if the identity is "strongly" content-defined like when you use sha256 as version. I have no problem with this. Weird but OK.

iamwillbar commented 2 years ago

Would this be a reasonable set of rules based on OCI's requirements:

A purl type MUST define all the components to uniquely identify a package - for some ecosystems the location may be required as part of this
A purl type SHOULD define a default location unless the package's identity is location independent (for example, ecosystems using hash-based content addresses)
A purl type SHOULD define a repository_url to override the default location (for ecosystems that require a location) or to provide a hint for where the package could be located (for ecosystems that don't require a location)
A purl type for ecosystems that don't require a location MUST provide sufficient information to verify that the located content is the correct content (for example, the version is a hash-based content address)
A purl type for ecosystems that do require a location MAY provide a way to verify that the located content is the correct content (for example, a content hash as an optional qualifier) [I initially felt this was SHOULD but downgraded to be closer to the spec as it stands today]

The intent of these rules is:

Location based ecosystems always have a location (either default or overridden)
Location independent ecosystems always have a way of verifying the content is the intended content
Location independent ecosystems can hint at a location
Location based ecosystems can provide a way of verifying the content is the intended content

Does this resonate with people?
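One way to read the proposed rules is as a checklist that a purl type definition either passes or fails. The Python sketch below encodes that reading; the `PurlTypeDefinition` record and its field names are hypothetical, the first rule (unique identification) is not mechanically checkable here, and the final MAY rule produces no finding by design.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PurlTypeDefinition:
    """Hypothetical summary of a purl type, for illustrating the proposed rules."""
    name: str
    location_independent: bool            # identity is content-addressed
    default_repository: Optional[str] = None
    supports_repository_url: bool = False
    content_verifiable: bool = False      # e.g. the version is a content hash


def check_type_definition(defn: PurlTypeDefinition) -> List[str]:
    """Return the proposed rules that the definition does not satisfy."""
    findings = []
    if not defn.location_independent and defn.default_repository is None:
        findings.append("SHOULD define a default location")
    if not defn.supports_repository_url:
        findings.append("SHOULD define repository_url")
    if defn.location_independent and not defn.content_verifiable:
        findings.append("MUST provide a way to verify located content")
    return findings


oci = PurlTypeDefinition("oci", True, None, True, True)
maven = PurlTypeDefinition("maven", False, "https://repo1.maven.org/maven2", True)
print(check_type_definition(oci), check_type_definition(maven))
```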

iamwillbar commented 2 years ago

@stevespringett / @coderpatros I'm curious why location matters from a supply chain security perspective if you have a trusted content hash. If you can't trust the content hash, then adding location doesn't make it any more (or less) trusted. If you trust the content hash, then adding location doesn't make it any more (or less) trusted. Whatever trust you give to a content hash should be independent of location because it's the same content.

Extending on this, if you have sufficient provenance and pedigree information to say that a given content hash is trusted, from then on the location should be irrelevant. Inversely, if you have information that a content hash can't be trusted (or insufficient information to say you can trust it) then again the location should be irrelevant.

In the SolarWinds example, there was originally belief that a content hash was trusted, and new information came to light that a content hash shouldn't be trusted. Adding location wouldn't have mitigated or changed that outcome because it was the underlying content that became untrusted, not the location it was stored in. In fact, the IoCs provided were content hashes, independent of location.

coderpatros commented 2 years ago

@iamwillbar at a point in time a component that has been brought into some assembled piece of software, and where it was pulled from, may be "trusted".

But that package repository/mirror/whatever is part of your supply chain. And not everyone in the supply chain validates hashes/signatures along the way. So understanding where something came from can be useful.

Especially as the "same" component can be different, with a different hash, depending on where it was retrieved from. For example, nuget adds a signature to packages when they are uploaded. Some of those packages are also published as github release artifacts, distributed as part of an SDK, etc. Not knowing where it was retrieved from makes this situation very problematic.

Signatures don't solve the problem either. They are only good assuming the signing keys, or release process, haven't been compromised.

iamwillbar commented 2 years ago

@coderpatros I completely agree that repositories, mirrors, etc. are part of the supply chain, but that's independent of whether purl must include a location to establish trust. In the specific OCI case that spawned this discussion the version is a sha256 hash of the content and it can be mirrored to any number of locations and that identity doesn't change. If the content is tampered with or changed intentionally that changes the identity of the package and consumers wouldn't inadvertently retrieve the new package. Likewise, information like vulnerabilities, pedigree, etc. can be attached to the content hash and used independent of the location because the identity of the package is intrinsically linked to its contents. Unnecessarily scoping information to the location may result in relevant information being missed because it's deemed not relevant.

This isn't to say that a location can't be provided as a hint of where you might be able to retrieve the image, that's perfectly valid, but having a location doesn't change the identity or trustworthiness of a content-addressed package.
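The claim that a content-addressed version pins the content regardless of where it was fetched can be illustrated with a short Python check. This is a simplified sketch: it hashes a raw byte string, whereas an OCI digest is computed over a manifest, and the function name `matches_digest` is made up for the example.

```python
import hashlib


def matches_digest(content: bytes, version: str) -> bool:
    """Check content against a 'sha256:<hex>' version string (hypothetical helper)."""
    algorithm, _, expected = version.partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported digest algorithm: {algorithm!r}")
    return hashlib.sha256(content).hexdigest() == expected.lower()


# The same bytes verify no matter which registry or mirror served them.
blob = b"example artifact bytes"
version = "sha256:" + hashlib.sha256(blob).hexdigest()
print(matches_digest(blob, version))
print(matches_digest(b"tampered bytes", version))
```

The check says nothing about whether the content was trustworthy in the first place, which is the separate concern raised elsewhere in this thread.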

coderpatros commented 2 years ago

Yeah, I just don't get how removing information helps. Wouldn't you just parse the purl to extract what you want for particular use cases? Or use the component hash from the SBOM?

iamwillbar commented 2 years ago

@coderpatros the proposal isn't to remove the concept of location but to acknowledge that for some ecosystems location does not make sense because it's not integral to the identity of the package. We're trying to define a new purl type where the concept of "location" doesn't make much sense, there is no default repository, content is often deployed to multiple repositories with no one of those being canonical, content can be moved between repositories and its identity doesn't change and it can be proven the content isn't tampered with.

For any ecosystem where two repositories could serve different content for the same identifier then location should be mandatory for the purl and I'd additionally recommend that a content hash be provided where possible. For ecosystems where the identity is intrinsically linked to the content regardless of location the location should be optional (but can be provided as a hint for retrieval but not as part of identity comparison).
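The comparison rule in the last paragraph (location counts toward identity for location-based types, but is only a hint for content-addressed ones) might be sketched as follows in Python. The set `CONTENT_ADDRESSED_TYPES` and both helper names are assumptions for illustration; `oci` is the type proposed in #123, not an established one at the time of this thread.

```python
from typing import Tuple

# Types whose identity is taken to be location independent (assumed for this sketch).
CONTENT_ADDRESSED_TYPES = {"oci"}


def identity_key(purl: str) -> Tuple[str, Tuple[str, ...]]:
    """Build a comparison key; repository_url is dropped for content-addressed types."""
    head = purl.partition("#")[0]
    base, _, query = head.partition("?")
    ptype = base.partition(":")[2].partition("/")[0].lower()
    quals = sorted(pair for pair in query.split("&") if pair)
    if ptype in CONTENT_ADDRESSED_TYPES:
        quals = [q for q in quals if not q.startswith("repository_url=")]
    return base, tuple(quals)


def same_package(a: str, b: str) -> bool:
    return identity_key(a) == identity_key(b)


print(same_package(
    "pkg:oci/debian@sha256:abc?repository_url=docker.io/library/debian",
    "pkg:oci/debian@sha256:abc?repository_url=public.ecr.aws/debian",
))
print(same_package(
    "pkg:maven/org.mvel/mvel2@2.4.9.Final",
    "pkg:maven/org.mvel/mvel2@2.4.9.Final?repository_url=https://repository.jboss.org/",
))
```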

tianon commented 2 years ago

I'm an outsider to purl (so please weigh this input accordingly :sweat_smile:), but in reviewing purl it doesn't seem like the OCI use case is really much (if any) different from say, hosting a Git repo at GitHub vs Bitbucket vs self-hosted -- the commit hash is going to be identical, the underlying data bits are identical, but the location is completely different (and as such, the purl reference is too).

nishakm commented 2 years ago

If I may provide another use case for security not based on location (and I am not, by any means, a security expert): zero trust systems do not track location but identities like owners and maintainers. In this case, the location may change through the supply chain, but the SBOM or something else can track signatures and attestations by owners.

iamwillbar commented 2 years ago

@tianon you're right that is a fitting example for the relationship between identity and location (and in fact was/is being discussed in #59). If we take these three (fictional) examples:

pkg:github/package-url/purl-spec@244fd47e07d1004
pkg:github/package-url/purl-spec-fork@244fd47e07d1004
pkg:bitbucket/package-url/purl-spec@244fd47e07d1004

We know that this is the same commit because we know that the SHA1 hash of a Git commit is based on the commit and the state of the Git tree. I can push that same content to any number of repositories, and it is the same content. This isn't obvious from these examples, though, because it requires an understanding of Git's internals and the knowledge that GitHub and Bitbucket are both Git-based repositories.

If I want to know where the software is located it's important to know the github/package-url/purl-spec, github/package-url/purl-spec-fork, bitbucket/package-url/purl-spec portion. If I want to describe a specific piece of software (for example, to describe a vulnerability in it, or to describe its dependencies) then the location isn't relevant and it's the fact that it's Git commit 244fd47e07d1004 (or potentially the underlying tree id) that has the vulnerabilities or dependencies that is the most important.

One way to solve this would be to consider github and bitbucket subclasses of a generic git type, in this model the git type would behave like the oci type that's being proposed in that it would be location agnostic. The git type would have no default location and would provide an optional repository_url which could be used to provide a location hint.

pkg:git@244fd47e07d1004
pkg:git@244fd47e07d1004?repository_url=github.com/package-url/purl-spec

The github and bitbucket subclasses would behave like macros that can be expanded to a git purl type:

pkg:github/package-url/purl-spec@244fd47e07d1004 -> pkg:git@244fd47e07d1004?repository_url=github.com/package-url/purl-spec
pkg:bitbucket/package-url/purl-spec@244fd47e07d1004 -> pkg:git@244fd47e07d1004?repository_url=bitbucket.org/package-url/purl-spec

Since these macros can be easily converted to a common base class you can compare to see if they refer to the same software but you still have the option of knowing the suggested location of the software.
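The macro expansion described above could be sketched as below. Note that `pkg:git@<commit>` is the hypothetical form proposed in this comment, not an existing purl type, and the host table and function names are likewise assumptions for illustration.

```python
# Hosts behind the "macro" types; illustrative mapping, not part of any spec.
GIT_HOSTS = {"github": "github.com", "bitbucket": "bitbucket.org"}


def expand_to_git(purl: str) -> str:
    """Expand pkg:github/... or pkg:bitbucket/... into the proposed generic git form."""
    scheme, _, rest = purl.partition(":")
    path, _, commit = rest.partition("@")
    ptype, _, repo = path.partition("/")
    if scheme != "pkg" or ptype not in GIT_HOSTS or not repo or not commit:
        raise ValueError(f"cannot expand: {purl!r}")
    return f"pkg:git@{commit}?repository_url={GIT_HOSTS[ptype]}/{repo}"


def same_commit(a: str, b: str) -> bool:
    """Compare two macro purls by commit only, ignoring the location hint."""
    def commit_of(expanded: str) -> str:
        return expanded.partition("?")[0].partition("@")[2]
    return commit_of(expand_to_git(a)) == commit_of(expand_to_git(b))


print(expand_to_git("pkg:github/package-url/purl-spec@244fd47e07d1004"))
print(same_commit(
    "pkg:github/package-url/purl-spec@244fd47e07d1004",
    "pkg:bitbucket/package-url/purl-spec@244fd47e07d1004",
))
```

Comparing truncated commit hashes, as in these fictional examples, is only as reliable as the truncation allows; a real comparison would want full hashes.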

stevespringett commented 2 years ago

I'm curious why location matters from a supply chain security perspective

@iamwillbar

Signing keys get compromised all the time. If an adversary also has control over the repo (via lateral movement) in which artifacts are published and retrieved from, location matters. It would be important to know if I retrieved an artifact from a repo that was not compromised vs one that was. In both cases, signature verification would pass. Any org relying solely on signing verification is placing entirely too much trust in the PKI and surrounding infra. They will eventually be compromised.
Embargos and other organizational or political tools that prohibit the use of technology to a given country or region.
Project risk can also be evaluated based on location. If I know a location where something was retrieved from, I may be able to determine if any contributors are associated with nation state adversaries, known threat actors, or are a major contributor from embargoed countries.
And of course forensics which would need to reconstruct the software in play, configuration, and the location where things were retrieved from.

These are just the ones I can think of. I'm sure there are others...

I'm failing to find any good arguments for decoupling location from identity.

iamwillbar commented 2 years ago

@stevespringett we're not talking about signing or PKI at all though, we're talking about a content hash... if the content hash is in the purl (which is the proposal for oci) the content can't be changed without the purl becoming invalid (or continuing to point at the unmodified content). So if you retrieved the content from a compromised vs known good repository you are getting the same content because the content hash is the same, so in that scenario the compromise has no impact on the artifact being retrieved. The location doesn't improve our ability to know if the package is compromised for ecosystems that are based on content hashes.

No one is recommending that location be removed, just identifying that location is not fundamental to all ecosystems. Purl should reflect the realities of the ecosystems it is trying to represent, rather than trying to impose requirements on them.

On your other points, I don't know that a location of 'hub.docker.com' does anything to address the threats you outline. It doesn't tell us anything about the contributors, physical location, provenance, or pedigree. It may be interesting for forensics but the content hash itself verifies that the package is unchanged in comparison to the purl.

stevespringett commented 2 years ago

we're not talking about signing or PKI at all though, we're talking about a content hash... if the content hash is in the purl (which is the proposal for oci) the content can't be changed without the purl becoming invalid (or continuing to point at the unmodified content).

I understand that. But the ask to decouple location from identity will affect every purl type, not just oci. That's a breaking change to the spec.

No one is recommending that location be removed, just identifying that location is not fundamental to all ecosystems

Agreed. And most ecosystems have a default repo, and the ones that do not clearly state they do not in the purl type definition. Golang is a good example which reads: There is no default package repository: this is implied in the namespace using the go get command conventions. Why is this approach not good enough for oci? Why is oci so special that it needs to introduce breaking changes to all purl types? I do not understand this logic.

On your other points, I don't know that a location of 'hub.docker.com' does anything to address the threats you outline.

That's a very specific example and you're likely correct, it likely will not. But we are talking about the core purl spec here, not a specific type. If you look at any package on https://packagist.org/ you can absolutely perform that type of analysis.

iamwillbar commented 2 years ago

@stevespringett I don't think @SteveLasker is suggesting that location is removed from all purls, I think he's encouraging purl to acknowledge that there are (and will be) purl types where location is not a fundamental part of identity and should be optional. For purl types where location is required to establish identity (which is true for most purl types that exist today) it should continue to be there.

For golang, deb, and rpm there's an implied location based on the namespace or distro, for generic there's a specific location in download_url.

If we're saying that it's OK for a purl type to have no default repository and to not require a repository_url or other location (except as an optional hint) as long as that uniquely identifies the package then I think that's what @SteveLasker is looking for.

nishakm commented 2 years ago

One way to solve this would be to consider github and bitbucket subclasses of a generic git type, in this model the git type would behave like the oci type that's being proposed in that it would be location agnostic. The git type would have no default location and would provide an optional repository_url which could be used to provide a location hint.

@iamwillbar I did submit a proposal for having "generic" purls in #126. Can this be a pattern that can be used for artifacts that don't follow the conventional centralized public repository pattern? This could also be a way for CycloneDX to not use such purls if they choose to.

nishakm commented 2 years ago

I did also bring up the use case of zero trust security which doesn't check endpoints but identities and signatures. If a client can verify the artifact's digest and signature, is there any need to check the location?

coderpatros commented 2 years ago

It's not up to me, but I would advise extreme caution in supporting this for things like the git example above. Git uses SHA-1 for commits, but that hash is not intended for security use cases, which is why truncating it for convenience within a particular repo is common practice.

Expanding on the git example above that drops location information...

pkg:github/package-url/purl-spec@4860cee is not the same as pkg:github/coderpatros/this-is-not-the-purl-you-are-looking-for@4860cee

Changing that purl to something like pkg:git@4860cee would make purl useless. I know it is a contrived example. But a more sophisticated and resourced adversary shouldn't have much trouble creating collisions for other, longer, truncated hashes.
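As a rough illustration of why short prefixes are weak identifiers, the sketch below runs a brute-force birthday search for two different inputs whose SHA-1 digests share the same 7-character hex prefix. It hashes arbitrary byte strings, not real git commit objects, and the `blob-N` inputs are invented for the example.

```python
import hashlib


def short_hash(data: bytes, length: int = 7) -> str:
    """Return the first `length` hex characters of the SHA-1 digest."""
    return hashlib.sha1(data).hexdigest()[:length]


def find_prefix_collision(length: int = 7):
    """Search sequential inputs until two of them share a truncated hash."""
    seen = {}
    counter = 0
    while True:
        data = f"blob-{counter}".encode()  # hypothetical input, not a git object
        prefix = short_hash(data, length)
        if prefix in seen:
            return seen[prefix], data, prefix
        seen[prefix] = data
        counter += 1


first, second, prefix = find_prefix_collision(7)
print(first, second, prefix)
```

A 7-character prefix carries only 28 bits, so a collision is expected after a few tens of thousands of attempts. Within a single repository that is tolerable; as a global identifier with no location it is not.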

nishakm commented 2 years ago

Changing that purl to something like pkg:git@4860cee would make purl useless. I know it is a contrived example. But a more sophisticated and resourced adversary shouldn't have much trouble creating collisions for other, longer, truncated hashes.

We are asking if the purl-spec maintainers are willing to allow for a pattern that describes "non-centralized" locations or "moving" locations. Some examples that come to mind for me:

Source code from a gitlab or perforce instance running in a DMZ shared between supplier and customer
Source code moving from one internal perforce instance to another internal gitlab instance
Source code hosted on github.com which is actually a read-only mirror of another SCM

In the end, it is the same source code, probably coming from the same people, but just moved from one hosting mechanism to another.

Personally, I don't think relying on "common knowledge" to triangulate a location is a good security practice. As you know, locating any of these artifacts, including the ones CycloneDX is using now, also relies on user configuration which purl does not capture. Maybe trying to figure out how to accurately describe "artifact movement" is something in scope for the package-url folks?

SteveLasker commented 2 years ago

@stevespringett, I completely agree that location is required to find the package if you need to find the package reference.

The root of this issue is:

At the time the SBoM is created, do you know the location the package will be pulled from?

The location is required, but it would be provided at runtime, for that particular environment:

public --> wabbit-networks-shared-internal --> alpha-team --> staging-for-public--> wabbit-networks-public-registry
       \-> wabbit-networks-shared-internal --> delta-team / 
public --> acme-rockets-shared-internal --> dev-team-a --> staging-for-prod-env-foo --> prod-env-foo
       \-> acme-rockets-shared-internal --> dev-team-b --> staging-for-prod-env-bar --> prod-env-bar

By separating identity from location, you can use the unique identity to match the intended package when the location is provided dynamically, at runtime, for that environment. The SBoM has the identity.

It's not that location isn't important, it's that it's not known when the purl is persisted.
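A minimal sketch of that idea, assuming the identity is a name plus a content digest and the registry and repository path are supplied by each environment. The hostnames, repository paths, and digest value are placeholders, not real endpoints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactIdentity:
    """What would be persisted in the SBoM: no registry, no path."""
    name: str
    digest: str


def resolve(identity: ArtifactIdentity, registry: str, repository: str) -> str:
    """Combine the persisted identity with a location known only at runtime."""
    return f"{registry}/{repository}@{identity.digest}"


# Placeholder digest for illustration only.
debian = ArtifactIdentity(name="debian", digest="sha256:" + "ab" * 32)

# Each environment supplies its own location (hypothetical hostnames and paths).
environments = {
    "public": ("docker.io", "library/debian"),
    "acme-rockets-shared-internal": ("registry.acme-rockets.example", "mirror/debian"),
    "prod-env-foo": ("registry.prod-foo.example", "base/debian"),
}

for env, (registry, repository) in environments.items():
    print(env, "->", resolve(debian, registry, repository))
```

The digest is the same in every resolved reference; only the registry and repository path change from one environment to the next.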

iamwillbar commented 2 years ago

@coderpatros completely agree on your comments re: Git, I wasn't entirely clear, my intent was more to show that the oci content addressing vs location isn't unique.

stevespringett commented 2 years ago

At the time the SBoM is created, do you know the location the package will be pulled from?

In some ecosystems, yes, that information is known and exposed to build-time plugins. In most ecosystems, this information is not exposed today. It depends on the maturity of the ecosystem. As newer ecosystems become more mature, I would expect location information to become more widely available.

The scope of purl is to identify and locate a software package.

Wouldn't a urn:pkg:... syntax be more appropriate if only the identity is wanted and location is either ignored or not applicable?

As stated in the other ticket, I would be open to the idea of a reserved word for repository_url. Something along the lines of repository_url=unspecified, which would override any default and tell the consumer that a repo is unknown, undisclosed, or simply not applicable.

However, decoupling location from identity, as the title of this ticket states, fundamentally changes what a purl is. Purl is useful because it includes location. I can see the need to only care about the identity part. Many SCA vendors use purl for identity only today but intend to advance their capabilities to include location in the future.
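For illustration, here is how a consumer might treat the reserved-word idea above. `repository_url=unspecified` is only the suggestion from this comment, not part of the purl spec, and the helper names and default repository passed in are assumptions for the example. The qualifier parsing is deliberately simplified and is not a full purl parser.

```python
from typing import Optional
from urllib.parse import parse_qsl

# Reserved word as proposed in this thread; hypothetical, not in the purl spec.
UNSPECIFIED = "unspecified"


def qualifiers(purl: str) -> dict:
    """Extract the qualifiers of a purl string (simplified parsing)."""
    without_subpath = purl.split("#", 1)[0]
    if "?" not in without_subpath:
        return {}
    query = without_subpath.split("?", 1)[1]
    return dict(parse_qsl(query))


def effective_repository(purl: str, type_default: Optional[str]) -> Optional[str]:
    """Return the repository to use, or None when the location is unknown."""
    value = qualifiers(purl).get("repository_url")
    if value is None:
        return type_default   # fall back to the type's default, if it has one
    if value == UNSPECIFIED:
        return None           # explicitly override any default
    return value


default = "https://registry.npmjs.org"  # default supplied by the caller
print(effective_repository("pkg:npm/foo@1.0.0", default))
print(effective_repository("pkg:npm/foo@1.0.0?repository_url=unspecified", default))
print(effective_repository("pkg:npm/foo@1.0.0?repository_url=registry.example.com", default))
```

Under this sketch, a type with no default repository (passing `None` as `type_default`) behaves the same whether the qualifier is omitted or set to the reserved word.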

iamwillbar commented 2 years ago

@stevespringett I'm curious why the reserved word is needed. As you pointed out earlier, golang (and others) don't have a default repository and don't require a repository_url (although one can still be optionally provided). Can we just codify that approach and allow each ecosystem to define whether location is required or not (again, providing very strong guidance on when it is OK to omit a location)?

tianon commented 2 years ago

On your other points, I don't know a location of 'hub.docker.com' does anything to address the threats you outline.

In my experience, it's not the fact that it's on "Docker Hub" that specifically provides useful data alone about whether or not something is trustworthy, but more specifically the combination of "Docker Hub" and "specific Docker Hub user or organization which I trust" that does so (or even, maintainer of this particular image / repository within a larger organization).

Project risk can also be evaluated based on location. If I know a location where something was retrieved from, I may be able to determine if any contributors are associated with nation state adversaries, known threat actors, or are a major contributor from embargoed countries.

For example, pulling docker.io/user-i-trust/foo:bar is going to be very different from docker.io/known-malicious-user/foo:bar (very similarly to GitHub and any other "public" hosting site), so from the perspective of a long-time Docker user and OCI member/maintainer, I really don't see any way we can reasonably conclude that OCI is a special case here?

The only thing that really sets the OCI objects apart from these other package types (from what I can see) is that they're designed to have an explicit content-addressable digest that is commonly used to refer to and fetch them, and that digest remains unchanged (by design) when the content is moved from one registry to another. However, you cannot ask a registry for said content without also knowing the full repository path from which to fetch it (docker.io/foo/bar@sha256:deadbeef).

SteveLasker commented 2 years ago

Just to build on what @tianon notes above. Location is important, but it's dynamic, based on where the content is at that point, and the registry/path is/should be passed for that specific environment.

The difference between docker.io/user-i-trust/foo:bar and docker.io/known-malicious-user/foo:bar will be the entity that signed the content.

There will be a collection of "docker certified" content where you can trust Docker to have done some amount of verification and vetting. No different than how you trust content from Apple (or not).

This goes back to the point I made above where location is a dynamic thing.

But I'll also cover the point that trust should not be placed in any one thing. Security is never good enough when there's only one barrier. There must be multiple elements. The digest/hash and the signature are used to assure that the artifact is what it was, and who is attesting to it.

It depends on the maturity of the ecosystem. As newer ecosystems become more mature, I would expect location information to become more widely available.

This is actually the point I'm trying to interject here. I'd suggest we should be investing in establishing identity, independent from the location, because mature ecosystems will embrace the idea that content must be promoted into environments that are secure, and from within those environments, users can't and shouldn't be capable of going back. Knowing how the content got from point a to z isn't important for verification, as it's relatively easy to circumvent. It's more important to be able to verify something's unique identity, and owning entity, regardless of how the content got to z.

When something goes wrong, it is interesting to know when/where it got mutated. But, how does storing that information in an SBoM solve that?

coderpatros commented 2 years ago

Maybe an aligned package URI spec could cater for this?

SteveLasker commented 2 years ago

Closing, as #123 accounts for the location being optional, which makes purl super useful for identifying an artifact, independent from the location. Purls are persisted, and location is dynamic, depending on where the artifact may be at any point in time. Thanks everyone for the great discussion, Steve