opencontainers / artifacts

OCI Artifacts
https://opencontainers.org
Apache License 2.0

Provide clear definition of what is an "artifact" #32

Closed Silvanoc closed 1 year ago

Silvanoc commented 3 years ago

There's no clear definition about what's an artifact, although the meaning can be found when reading the specification. It's nevertheless meaningful having a clear definition, because the term "artifact" is extremely overloaded.

I tend to call the files of an artifact "artifacts" themselves and then started calling the artifact "artifact set", until realizing the nonsense. But it should illustrate the need for a clear statement of what is an artifact.

What probably also makes it misleading is the plural in the name: artifacts. There's the OCI Image Specification (notice the singular in the name) and the OCI ArtifactS Specification (notice the plural), which sounds like it specifies how to use a registry to store an "artifact set or bundle" containing multiple artifacts.

SteveLasker commented 3 years ago

Hi @Silvanoc, Just pruning issues. I agree artifact(s), plural is a good question, not sure what action we'd take. To your point, you could push a collection of artifacts (files) as one or more blobs/layers, within a single manifest. The approach is more about pushing|pulling an artifact to a registry.

Does the latest update to the distribution-spec help? Open to suggestions on what we'd add.

Silvanoc commented 3 years ago

IMO the naming issue cannot be fixed now, and it's not so important that it needs fixing. It simply contributes to the confusion: what is an artifact? Each single file you can store? Each set of files referenced by a manifest?

If you look at the ORAS documentation, you realize what it means (roughly speaking, a set of files referenced by a manifest). This shows that a definition in the definitions section is needed. I can try to contribute one, but since I'm not a native speaker, it might result in a weird formulation :slightly_smiling_face:

SteveLasker commented 3 years ago

Thanks @Silvanoc, let me see if I can make some proposed tweaks.

SteveLasker commented 2 years ago

@Silvanoc, if you'd like to make a PR, as this was your suggestion, I or @mikebrow can help with some translation.

SteveLasker commented 2 years ago

To put some thoughts around "what is an artifact"?

Let me try and put something here as an iterative draft that will turn into a pr:

An artifact is an object stored in an OCI Distribution registry that a user reasons over. Some examples include:

container image
helm chart
opa policy

The above are independent artifacts. A new set of reference artifacts are also evolving:

sbom
scan result
signature

The user interacts with these artifacts by named references, either through a tag (registry.io/namespace/repo:tag) or a digest (registry.io/namespace/repo@sha256:abc123). The artifact has a lifecycle the user can manage, including push, pull, delete, and copy.
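To make the two reference forms concrete, here is a minimal Python sketch that splits a reference string into its parts. The `Reference` type and `parse_reference` function are hypothetical names for illustration, and the sketch assumes the first path component is always the registry host (it does not apply any client-side defaults such as an implied registry or tag).

```python
from typing import NamedTuple, Optional


class Reference(NamedTuple):
    registry: str
    repository: str
    tag: Optional[str]
    digest: Optional[str]


def parse_reference(ref: str) -> Reference:
    """Split 'registry/namespace/repo[:tag][@digest]' into its parts.

    Assumption: the first component is the registry host; no defaults
    (implied registry, implied 'latest' tag) are applied.
    """
    digest = None
    if "@" in ref:
        ref, digest = ref.split("@", 1)
    # Everything before the first slash is the registry (may be host:port).
    registry, _, path = ref.partition("/")
    tag = None
    # Any colon left in the path separates the repository from the tag.
    if ":" in path:
        path, tag = path.rsplit(":", 1)
    return Reference(registry, path, tag, digest)


by_tag = parse_reference("registry.io/namespace/repo:tag")
by_digest = parse_reference("registry.io/namespace/repo@sha256:abc123")
print(by_tag)
print(by_digest)
```

Real clients use stricter grammars for each component; this only illustrates which part of the string plays which role.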

The artifact is backed by a manifest. When a user requests a tag or digest, they are requesting a manifest, by which the client can negotiate how to fetch the blob(s) that represent that artifact.

Each of those artifacts is represented as one or more blobs. A layer, as defined in the oci image spec, is an ordinal collection of blobs. The blobs of the artifact represent an optimized way to deliver the content, by which a user needs to interact with that artifact. The distribution spec 1.0 represents the ability to store individual artifacts through the oci image manifest. The oci index is a means to define multi-arch representations of that single artifact.
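As a rough illustration of "a manifest that points at blobs", the following Python sketch builds a small manifest-shaped dictionary and lists the blob digests a client would then fetch. The blob contents, the `application/vnd.example.data` media type, and the helper names are made up for the example; only the general descriptor shape (mediaType, digest, size) follows the image spec.

```python
import hashlib
import json
from typing import Dict, List


def digest_of(content: bytes) -> str:
    """Content address of a blob: the algorithm name plus the hex digest."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


def descriptor(content: bytes, media_type: str) -> Dict:
    return {"mediaType": media_type, "digest": digest_of(content), "size": len(content)}


# Made-up blob contents standing in for real config and file data.
config = b"{}"
blob_a = b"hello"
blob_b = b"world"

manifest = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "config": descriptor(config, "application/vnd.oci.image.config.v1+json"),
    "layers": [
        descriptor(blob_a, "application/vnd.example.data"),  # hypothetical type
        descriptor(blob_b, "application/vnd.example.data"),  # hypothetical type
    ],
}


def blobs_to_fetch(manifest: Dict) -> List[str]:
    """Digests of every blob the manifest refers to, config first."""
    return [d["digest"] for d in [manifest["config"], *manifest["layers"]]]


# The manifest is itself addressed by the digest of its serialized bytes.
manifest_digest = digest_of(json.dumps(manifest).encode())
print(manifest_digest)
print(blobs_to_fetch(manifest))
```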

The ORAS Artifacts-spec takes the concept of artifacts to a next level. The focus of a Secure Supply Chain has prompted the need to add signatures and SBoMs as attestations to content stored in an OCI Distribution-spec based registry. These "attestations" are also considered reference type artifacts, meaning they also have a lifecycle that can be pushed, discovered, pulled, copied and deleted. The subtle difference with reference types is they are considered extensions to the artifact they reference. They don't necessarily have an independent lifecycle.

The reference type artifact may not have any blobs. A user may simply wish to add annotations to an existing artifact.

In practical code terms, an artifact = a manifest. A manifest may be an oci image manifest, an oci index, or an oras artifact manifest.
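A small Python sketch of that idea: telling the three manifest kinds apart by their `mediaType` field. The two OCI media types are the ones used by the image spec; the ORAS string is my assumption of the media type from the ORAS artifacts spec drafts and may differ between versions. `manifest_kind` is a hypothetical helper.

```python
from typing import Dict

# Assumption: the ORAS media type below reflects the draft ORAS artifacts spec.
MANIFEST_KINDS = {
    "application/vnd.oci.image.manifest.v1+json": "oci image manifest",
    "application/vnd.oci.image.index.v1+json": "oci index",
    "application/vnd.cncf.oras.artifact.manifest.v1+json": "oras artifact manifest",
}


def manifest_kind(manifest: Dict) -> str:
    """Name the kind of manifest, or 'unknown' if the media type is not listed."""
    return MANIFEST_KINDS.get(manifest.get("mediaType", ""), "unknown")


print(manifest_kind({"mediaType": "application/vnd.oci.image.index.v1+json"}))
print(manifest_kind({"mediaType": "text/plain"}))
```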

I'm sure this needs more tweaking, but wanted to put it out for discussion.

Silvanoc commented 2 years ago

@Silvanoc, if you'd like to make a PR, as this was your suggestion, I or @mikebrow can help with some translation.

Sorry, I was trying to get some clarity to be able to answer with a 'yes' or a 'no', but time flies!

An artifact is an object stored in an OCI Distribution registry that a user reasons over. Some examples include:

container image
helm chart
opa policy

The above are independent artifacts. A new set of reference artifacts are also evolving:

sbom
scan result
signature

The user interacts with these artifacts by named references.

The word "artifact" appears 4x in the above quoted text. But what's an "artifact" within the scope of the specification? I've found right now in ORAS a new term to make it all even more confusing :slightly_smiling_face: "subject artifact".

Let me refer to the ORAS quick-start documentation to better understand what I mean.

In that guide, an SBOM and a signature are associated with a container image. But what if I want to associate an SBOM and signature to an archive? Are the archive, SBOM and signature "artifacts"? Is "repository" then the name for the whole? And what's the name for "repo":"tag"?

IMO the fact that different terms are used for things without a clear separation and without a definition (e.g. the variables REPO and IMAGE and the --subject) is a symptom of fuzzy terminology, obvious for insiders but confusing for outsiders.

Perhaps it's obvious for a native speaker what you mean by "that a user reasons over", but not for me :grimacing:

SteveLasker commented 2 years ago

Thanks @Silvanoc, time does fly, and I figured I’d just give it a try. The multilingual aspect is a good point. The definition of names has been a challenge amongst the group. I wrote this a while back: https://stevelasker.blog/2020/02/17/registry-namespace-repo-names/ This artifacts repo has a subset, as we couldn’t get consensus: https://github.com/opencontainers/artifacts/blob/main/definitions-terms.md

To your question,

But what if I want to associate an SBOM and signature to an archive?

I believe you mean a tar archive as a blob? A tar archive [layer | blob] is a good thing to define. Personally, I believe a blob, by itself is just a blob, without a definition. Meaning, the manifest is what makes a blob an artifact. An artifact may only be one blob. But, blobs without a manifest are likely deleted in most registries, although the spec doesn’t define lifecycle or garbage collection, nearly every production registry does implement lifecycle management. Those that don’t implement lifecycle management wouldn’t be opposed to having it, they may just not use it.
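As a toy model of that lifecycle behaviour (the spec itself does not define garbage collection, so this is only one way a registry might reason about it), the Python sketch below finds blobs that no manifest refers to. The function names and the shortened placeholder digests are hypothetical.

```python
from typing import Dict, Iterable, Set


def referenced_blobs(manifests: Iterable[Dict]) -> Set[str]:
    """Collect every blob digest that at least one manifest points to."""
    live: Set[str] = set()
    for manifest in manifests:
        if "config" in manifest:
            live.add(manifest["config"]["digest"])
        for layer in manifest.get("layers", []):
            live.add(layer["digest"])
    return live


def unreferenced_blobs(stored: Set[str], manifests: Iterable[Dict]) -> Set[str]:
    """Blobs that no manifest refers to: candidates for garbage collection."""
    return stored - referenced_blobs(manifests)


# Placeholder digests, shortened for readability.
stored = {"sha256:aaa", "sha256:bbb", "sha256:ccc"}
manifests = [{"layers": [{"digest": "sha256:aaa"}, {"digest": "sha256:bbb"}]}]
print(unreferenced_blobs(stored, manifests))
```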

This is a bit of a tangent, but a blob without a manifest might get an inferred manifest, just like pushing an artifact without a tag becomes :latest

subject: This is a new term, and one that has been changed multiple times. As defined in the oras artifacts spec, it is a way to define a reference between two artifacts in a registry. We did initially call it reference, but it wasn’t clear if it was a parent:child relationship or child:parent relationship. The term reference didn’t imply direction. Most recently, it was changed from subjectManifest to subject. The oras artifacts spec today limits references to other manifests. But, we do see a future where an sbom or signature could be associated with a blob.
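To illustrate the direction of the relationship, here is a hedged Python sketch in which the referring artifact carries a `subject` field that points at the artifact it describes. The field names (`artifactType`, `blobs`, `subject`, `annotations`) and the media type are my assumptions based on the draft oras artifacts spec; the subject descriptor is reduced to a digest, and all digests, artifact types, and the `referrers` helper are placeholders.

```python
from typing import Dict, List

# Assumption: media type taken from the draft ORAS artifacts spec.
ORAS_ARTIFACT_MANIFEST = "application/vnd.cncf.oras.artifact.manifest.v1+json"

image_digest = "sha256:image"  # placeholder digest of the subject manifest

sbom = {
    "mediaType": ORAS_ARTIFACT_MANIFEST,
    "artifactType": "application/vnd.example.sbom.v1",  # hypothetical type
    "blobs": [{"mediaType": "application/json", "digest": "sha256:sbomblob", "size": 1}],
    "subject": {"digest": image_digest},
}

# A reference type artifact without blobs: it only adds annotations.
note = {
    "mediaType": ORAS_ARTIFACT_MANIFEST,
    "artifactType": "application/vnd.example.note.v1",  # hypothetical type
    "blobs": [],
    "subject": {"digest": image_digest},
    "annotations": {"example.reviewed-by": "someone"},
}

unrelated = {"mediaType": "application/vnd.oci.image.manifest.v1+json", "layers": []}


def referrers(manifests: List[Dict], subject_digest: str) -> List[Dict]:
    """Manifests whose subject field points at the given digest."""
    return [m for m in manifests if m.get("subject", {}).get("digest") == subject_digest]


for m in referrers([sbom, note, unrelated], image_digest):
    print(m["artifactType"])
```

The subject manifest itself is untouched; everything that extends it is found by looking for manifests that name it as their subject.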

Do any of the above references help that I could use to clarify these better?

Silvanoc commented 2 years ago

I don't have the feeling that we are on the same page :slightly_smiling_face: Perhaps I'm not making my point clear enough.

The more I look at other related projects trying to get some clarity, the more I've the impression that the whole terminology throughout all OCI-specifications is not very consistent and can be hard for outsiders. Let me illustrate with references to different related projects what I mean to make sure that we agree on the problem space before moving to the solution space (which isn't easy). I think that I'll give it a try for the solution space with PRs.

ORAS

See the ORAS pushing artifacts with multiple files documentation:

Just as container images support multiple "layers" represented as blobs, ORAS supports pushing multiple layers.

pushing multiple layers? Although the artifacts specification and ORAS "misuse"/"reuse" the layers, I suppose that Blob (as defined by the distribution specification) is the right term. But since we are talking about multiple files here, then probably something like ORAS supports pushing multiple files would be more correct.

OCI distribution specification

I'll refer to the terms found in the definitions of the OCI-distribution specification with bold letters and starting with a capital letter.

I've noticed that a bit more than 1 year ago the specification was modified to make it content agnostic (decoupling it from images); still, the most widespread use of the distribution specification is container images. Therefore I've checked the definitions against container images.

When I pull a container image like docker.io/library/debian:10, what's the name of the different parts here? I know the answers to some of the following questions, but I'll question anyway everything that isn't well defined:

To make it even more confusing, if we compare the above analysis with Docker's documentation of docker tag, we can see a conflict: for Docker, library would be the repository name (instead of the namespace) and debian the image name (instead of the repository).

Here the discussion is about how to address the content or how to name the different parts of an artifact name, but not about the names of the individual parts of an artifact. Anyway, I think this comment has become confusing enough to close it here and leave the names of the individual parts of an artifact for another comment...

SteveLasker commented 2 years ago

Thanks @Silvanoc, I completely agree with your points about the evolved naming and it is confusing. I view this thread as a means to clarify these. There are inconsistencies across the various specs as they've been written at different times, where the use-cases have evolved. But, to be fair to the original designers of the distribution spec, the intent of the distribution-spec was to distribute anything through a CAS model. The work done by the OCI Artifacts and ORAS Artifact-spec maintainers was possible because of this forward-thinking design of the distribution-spec.

I actually think this discussion is super helpful to help create some clarity.

Some additional thoughts:

pushing multiple layers? Although the artifacts specification and ORAS "misuse"/"reuse" the layers, I suppose that Blob (as defined by the distribution specification) is the right term. But since we are talking about multiple files here, then probably something like ORAS supports pushing multiple files would be more correct.

Can you explain why you think ORAS has a misuse? The distribution spec supports storing blobs. A blob can be anything. In most cases, it's a tar collection of 1 or more files. But, it doesn't actually need to be a tar. What ORAS does is support multiple blobs. Each blob can be 1 or more files. I think I see some places where we can better clarify that, or happy to see some PRs around clarifying that as well.

The image spec uses the term layers, as they are ordinal overlays. The distribution-spec doesn't actually care if they're ordinal or not. That's a concept specific to the image-spec. For instance, the helm spec creates two blobs. One for the chart, and one for the provenance file. These aren't ordinal and can be pulled separately. That's the beauty of the distribution spec, as it's far more generic.
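A minimal Python sketch of "non-ordinal blobs that can be pulled separately": the client picks a blob by media type rather than by position. The Helm media types are written from memory and should be treated as assumptions, the digests and sizes are placeholders, and `select_blob` is a hypothetical helper.

```python
from typing import Dict, Optional

# Assumption: these media type strings match what Helm uses for OCI storage.
CHART_CONTENT = "application/vnd.cncf.helm.chart.content.v1.tar+gzip"
CHART_PROVENANCE = "application/vnd.cncf.helm.chart.provenance.v1.prov"

chart_manifest = {
    "schemaVersion": 2,
    "config": {"mediaType": "application/vnd.cncf.helm.config.v1+json",
               "digest": "sha256:config", "size": 1},  # placeholder digest
    "layers": [
        {"mediaType": CHART_CONTENT, "digest": "sha256:chart", "size": 1},
        {"mediaType": CHART_PROVENANCE, "digest": "sha256:prov", "size": 1},
    ],
}


def select_blob(manifest: Dict, media_type: str) -> Optional[Dict]:
    """Return the first blob descriptor with the given media type, if any."""
    for blob in manifest.get("layers", []):
        if blob["mediaType"] == media_type:
            return blob
    return None


# Only the provenance file is needed here, so the chart blob is never touched.
print(select_blob(chart_manifest, CHART_PROVENANCE))
```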

10 is the Tag, which according to the definition is a Manifest identifier (do I as a user want to get a manifest or an artifact/image?). Tags are mandatory, latest being the default one.

Minor clarification: tags aren't actually mandatory. The distribution spec does support pushing and pulling manifests by digest only. This is one of the things the oras artifacts-spec takes advantage of. Tags are a way to make human readable references or to have a higher level artifact viewed.
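For illustration, the distribution spec's manifest endpoint has the form `/v2/<name>/manifests/<reference>`, where `<reference>` may be either a tag or a digest. The Python sketch below only builds the URL and makes no network call; the registry host, repository name, and digest are placeholders, and using `https` is an assumption.

```python
def manifest_url(registry: str, name: str, reference: str) -> str:
    """URL of a manifest; `reference` is either a tag or a digest."""
    return f"https://{registry}/v2/{name}/manifests/{reference}"


# Placeholder host, repository, and digest.
print(manifest_url("registry.example.com", "namespace/repo", "v1"))
print(manifest_url("registry.example.com", "namespace/repo", "sha256:abc123"))
```

Both calls address a manifest in the same repository; the second one works even if no tag exists for it.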

As to, "what is an artifact": I'd suggest it's the thing a user wants to focus on. All the details around blobs, layers, annotations, signatures, scan results, sboms are supporting information, to that primary artifact. The end-user wants to push, discover, verify, pull, delete an artifact.

What do you think?

Silvanoc commented 2 years ago

I completely agree with your points about the evolved naming and it is confusing. I view this thread as a means to clarify these. There are inconsistencies across the various specs as they've been written at different times, where the use-cases have evolved. But, to be fair to the original designers of the distribution spec, the intent of the distribution-spec was to distribute anything through a CAS model. The work done by the OCI Artifacts and ORAS Artifact-spec maintainers was possible because of this forward-thinking design of the distribution-spec.

As a developer myself I can fully understand it. Keeping consistency among projects written at different times is very difficult to accomplish.

I don't intend to blame anybody, only to bring the inconsistencies to the surface, since as an engineer/architect I know how important clarity and terminology are.

I actually think this discussion is super helpful to help create some clarity.

I hope so and I'm glad that you see it so.

Can you explain why you think ORAS has a misuse?

Perhaps "misuse" is too negative. What I mean is that the distribution specification hasn't been written from scratch, but derived from software that was originally written to handle container images and not artifacts. The generalization from "specification for the distribution of container images" to "specification for the distribution of artifacts" isn't easy to accomplish while the names existing for historical reasons (e.g. layers) are kept. And I see keeping backwards compatibility as a very good decision.

Minor clarification: tags aren't actually mandatory.

You're right. I rather meant either a digest or a tag (the tag being 'latest' if nothing is specified). The tag or digest in the end gives versioning support (not necessarily as numerical versions).

As to, "what is an artifact": I'd suggest it's the thing a user wants to focus on. All the details around blobs, layers, annotations, signatures, scan results, sboms are supporting information, to that primary artifact. The end-user wants to push, discover, verify, pull, delete an artifact.

I agree on this. My initial intention was to focus only on what is an "artifact", but trying to understand the whole I've fallen down the rabbit hole...

Let me try to write a couple of small PRs with formulation proposals here and there.

Silvanoc commented 2 years ago

Although I've touched varied concepts and definitions in this issue, I've provided PR #50 focusing on its original goal. I don't expect it to be accepted on the first try, but I'd prefer to move the discussion from this thread to that concrete proposal to better focus it.

I might open separate issues for the different inconsistencies I've found (many of them on other projects) trying to understand what is an artifact.

SteveLasker commented 2 years ago

And I see keeping backwards compatibility as a very good decision.

Agreed. Can you call out where you think backwards compatibility is a challenge?

I agree on this. My initial intention was to focus only on what is an "artifact", but trying to understand the whole I've fallen down the rabbit hole... Let me try to write a couple of small PRs with formulation proposals here and there.

I think the discussion really helps surface some things that can be clarified.

As for the overall design, I'd pull @stevvooe, @dmcgowan, @vbatts, @mikebrow in as some of the original folks that could add more context.

mikebrow commented 2 years ago

wrt this repo.. opencontainers/artifacts currently means "an OCI repository for Artifact Guidance Documents"

mikebrow commented 2 years ago

wrt the question what is an artifact .. that is defined in the distribution spec: https://github.com/opencontainers/distribution-spec/blob/main/spec.md#definitions

Silvanoc commented 2 years ago

@mikebrow I found that definition some days ago and I failed trying to find it again the day after :disappointed:

I think it's very good and a reference to it could be added here.

I still miss the following definitions (probably in the distribution specification):

  1. A name for what you use to unequivocally call an artifact, something like my-registry:port/silvanoc/something:tag (which could be called a Fully Qualified Artifactory Name or similar)
  2. A definition for what is a repository, being used first in the Pulling manifests section of the distribution spec. I understand what is meant, but I think a definition in the list would be helpful.
  3. A definition for what is a reference, being used first in the Pulling manifests section of the distribution spec (where it's explained also, but I'd add it to the definition list)
  4. A definition for what is a name/namespace, being used first in the Pulling manifests section of the distribution spec

For those of you involved in the specifications, it is obvious what is meant by those terms. But not for outsiders.

For me (outsider, but with some knowledge about the implementation details) terms like "repository", "namespace", "reference",... aren't clear until I read through the specification.

Consumers of the implementations (in fact I started wondering how to call things while playing around with ORAS) are missing those definitions. Instead of having the implementations define them in possibly diverging ways, I'd rather fix them in the specification and let the implementations refer to them.

I can make PRs with those definitions on the distribution spec, if you agree they would be useful.

mikebrow commented 2 years ago

yes I missed some.. we can work them in your pr.. Will be away for a couple weeks; will review and help define them when I get back. Cheers!

Silvanoc commented 2 years ago

I find the artifact definition of the distro spec very good, but it probably needs to be changed to accommodate the artifacts specification, by removing the need of having a config file (as already remarked by @SteveLasker in this comment).

Additionally it's unclear if 'image indexes' are somehow covered by any of the definitions. It'll become a more obvious problem once the future scope of the artifacts spec has become present.

vbatts commented 2 years ago

On 05/11/21 09:52 -0700, Silvano Cirujano Cuesta wrote:

I find the artifact definition of the distro spec very good, but it probably needs to be changed to accommodate the artifacts specification, by removing the need of having a config file (as already remarked by @SteveLasker in this comment).

Additionally it's unclear if 'image indexes' are somehow covered by any of the definitions. It'll become a more obvious problem once the future scope of the artifacts spec has become present.

The name image-index was never good. "manifest-list" was much better.

mikebrow commented 1 year ago

Mission for artifacts is moving to the image and distribution specifications.. and this repo is being archived. If you believe more is needed please reopen in image or distribution!