oras-project / oras

OCI registry client - managing content like artifacts, images, packages
https://oras.land
Apache License 2.0

Simplify ORAS default experience #178

Closed SteveLasker closed 1 year ago

SteveLasker commented 4 years ago

The ORAS client is intended as a reference implementation of the ORAS libraries, which can be used to implement a targeted client that persists content into an OCI Artifacts-enabled registry.

For example:

mything push registry.example.com/mything:v1 mythingfile.blah

The current default experience (v0.8.1) of the ORAS client generates annotations to assist with missing metadata. While it's interesting to demonstrate the use of annotations, it leaves the default experience a little messy.

The goal of this proposed change would be:

Current Default Experience

ORAS assumes files are passed and persisted directly into the registry as a blob. Since the files are directly persisted, without tar, the only way to know what file name should be created for oras pull is to create an annotation on the layer:

The new norm

A default ORAS push should be clean, without annotations, and a default layer.mediaType of application/tar:

Likewise, the pull should not require any additional parameters:

Change of behavior

This does constitute a change in behavior. However, we will make a version change, and we have not yet achieved v1.0 status. Users that require the previous behavior can either use previous versions or use additional flags provided.

Change to the go libraries: TODO: further analysis

Default files as tar

The ORAS client already supports .tar for directories (it actually uses .tar+gzip). By using .tar as the default, ORAS can remove the "org.opencontainers.image.title" annotation, as the tar file format maintains the file names.

Once .tar is the default, passing a directory will also remove the "io.deis.oras.content.unpack": "true" annotation, as this would be the default behavior.
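A minimal Go sketch of why the tar default makes the title annotation redundant. This uses only the standard library; the helper names are illustrative, not the oras-go API:

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
)

// tarSingleFile wraps one file in an in-memory tar archive, roughly
// the way a default `oras push` might under this proposal.
func tarSingleFile(name string, content []byte) []byte {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	tw.WriteHeader(&tar.Header{Name: name, Mode: 0o644, Size: int64(len(content))})
	tw.Write(content)
	tw.Close()
	return buf.Bytes()
}

// firstEntry reads the first entry back; the filename comes from the
// tar header itself, so no org.opencontainers.image.title annotation
// is needed to reconstruct it on pull.
func firstEntry(archive []byte) (string, []byte) {
	tr := tar.NewReader(bytes.NewReader(archive))
	hdr, _ := tr.Next()
	data, _ := io.ReadAll(tr)
	return hdr.Name, data
}

func main() {
	blob := tarSingleFile("mythingfile.blah", []byte("hello, artifact"))
	name, data := firstEntry(blob)
	fmt.Println(name, string(data)) // mythingfile.blah hello, artifact
}
```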

Maintaining backwards capability

Note, this is backwards capability, not backwards compatibility, as we wish to change the default experience while maintaining blob persistence as files. However, we do see value in persisting files as raw files. A collection of files passed in as a directory reference would need to be grouped into a tar file.

A file:// parameter will be added to indicate that the file will be persisted without change. The following example maintains backwards capability for artifact1.txt, while the new default behavior is applied to artifact2.txt and subDir/. Also note, the default for file:// would convert to application/tar; however, the user can still pass in their desired mediaType.

Note "application/vnd.oci.image.layer.v1.tar" SHALL be reserved for the actual runtime OCI image spec.

deitch commented 4 years ago

Essentially, this makes every layer 100% compatible with existing OCI image layers? Since they all are tar or tar+gzip blobs?

Funny, I did a lot of work to get some VM images stored in a registry, then used the annotations to figure out filenames, purpose, format, etc., largely based on oras's scheme. In the end, I also had to support something a little more legacy-driven, and ended up not far from here.

So how would we handle indications of content type? If everything is application/tar, and there are no annotations except in the special case, how do I know if the file is a docx, or qcow2 disk image, etc.? Or is the assumption that this is just for the CLI, to simplify it, but the library use case would continue to support annotations and custom media types?

I get how this might simplify the initial CLI experience, but it also weakens the ability of the CLI to showcase the power of oras.

On the other hand, if we really are looking to simplify it, why not set the media-type to application/vnd.oci.image.layer.v1.tar which is compatible with just about all existing registries?

SteveLasker commented 4 years ago

So how would we handle indications of content type? If everything is application/tar

Hopefully, the blobs are implementation details. Think of how a file system breaks up a large file. You save the file as .docx, as they can be large. The underlying implementation breaks it up. For OCI Artifacts, we break things up in some logical groupings, while file systems break things up dynamically for where they fit. But, in either case, the user just saves the thing as application/vnd.microsoft.word.config.v10 or .docx and that's what they care about. In this case, if they're all .tar files, it's ok as the digests make them unique and they are associated with the manifest that references them. Now, that's the common scenario we've been thinking about, but I keep reading into your questions that you're working on something a bit more expansive and interesting.

As for why we don't suggest using application/vnd.oci.image.layer.v1.tar, it's because the image-spec actually has a layered filesystem between the layers. layer1 overlays files that are in layer0. That's not the case with all artifact types. I write about it a bit here: Layer Content Format

deitch commented 4 years ago

But, in either case, the user just saves the thing as application/vnd.microsoft.word.config.v10 or .docx and that's what they care about. In this case, if they're all .tar files, it's ok as the digests make them unique and they are associated with the manifest that references them.

That is the part I don't get. As long as the media-type is application/vnd.microsoft.word.config.v10, then it is in the manifest, and I know it. I can wrap that up in a tar (works for me), optionally compress. But if I just use application/tar, then how do I know that the (one or more) files in the tar archive are supposed to be a Microsoft Word doc, and not, e.g., an ELF binary named oras? That is the purpose of media-type; this proposal appears to lose it.

    {
      "mediaType": "application/tar",
      "digest": "sha256:603ea6780f25eb11c6c733af50b10f7f45ef4121d044f3eea011fa9a128884e4",
      "size": 21
    },

As for why we don't suggest using application/vnd.oci.image.layer.v1.tar, it's because the image-spec actually has a layered filesystem between the layers. layer1 overlays files that are in layer0.

Fair enough, but easy enough to use without violating anything.

awakecoding commented 3 years ago

@SteveLasker @deitch I am late to the discussion here, but I am looking for the best way to specify the filename or directory name that corresponds to each layer for my own implementation. The "org.opencontainers.image.title" annotation appears to do exactly that, but I am not particularly interested in replacing the annotation with tarballs, especially for small files.

I am looking at using OCI Artifacts to store general purpose files: video files, but also ISO files or VHDX at some point in the future. While compression can be great, it also has the problem of changing the digest for the compressed artifact. For instance, if your intent is to reference a non-public Windows ISO file such that all you have to do is download and import the same ISO file in your local cache without redistributing it directly, this could cause issues.

I believe compression should be optional or configurable for individual artifacts. There is also the issue of potentially using zip files instead of tar files: how would this work in this case? As much as I'd like tarballs to be the norm on Windows, zip files are unfortunately still king.

SteveLasker commented 3 years ago

@deitch

That is the part I don't get. As long as the media-type is application/vnd.microsoft.word.config.v10, then it is in the manifest,

There's a difference between the artifactType (oci.image/manifest.config.mediaType) and the contents of the layers|blobs. Today, the ORAS CLI focuses on a single file or a folder. In the single-file scenario, ORAS needs a way to state the filename, not just the extension. Just because the artifact is registry.com/foo:v1, the file in it may be mypresentations.pptx. To do this with oras push, when an individual file is specified, we needed the filename. Yet, if I specify oras push with a directory that has one file, ORAS tars the directory and maintains the directory and the filename. So, why not just use tar all the time? If the ORAS client used the tar format, even for the single file, then we no longer need the annotation, as the tar format maintains the filenames :)

@awakecoding

...but I am not particularly interested of replacing the annotation by tarballs, especially for small files.

tar threw me off at first as well. Tar isn't compression; rather, it's a "tape archive" format for a collection of files. tar+gzip adds compression. So, we can use tar to simply say: take this collection of files (including a collection of 1) and place them in a tar, which gets pushed as a blob. Since tar maintains the exact format, you won't lose anything to compression for digests or otherwise.

As for other compression formats, this is where we could do a better job of factoring the ORAS CLI and the ORAS libraries. The idea was: let's generate a set of common libraries that folks can use for their own thing CLI, such as helm, wasm, singularity, ... But, for prototyping and general use, what if we had an ORAS CLI that used the libraries to do the most common steps?

awakecoding commented 3 years ago

@SteveLasker since .tar alone doesn't provide compression but only bundles multiple files into a single file, it should normally result in identical hashes for the same content. However, the digest of a file is different from the digest of the same file inside a .tar archive, even without compression. While using tar everywhere would make the manifest look cleaner, it appears that this would only move the problem elsewhere. This simple modification alone makes it impossible to refer to unmodified artifacts distributed externally, unless those artifacts already use tar. I have in mind .zip files, but also .iso, .vhdx, etc.
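This digest mismatch can be shown directly with Go's standard library: wrapping bytes in a tar stream prepends a 512-byte header (and pads to a 512-byte boundary), so the blob digest no longer matches the digest of the original file. A sketch, not oras code:

```go
package main

import (
	"archive/tar"
	"bytes"
	"crypto/sha256"
	"fmt"
)

// wrapInTar puts raw content into an uncompressed tar archive.
func wrapInTar(name string, content []byte) []byte {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	tw.WriteHeader(&tar.Header{Name: name, Mode: 0o644, Size: int64(len(content))})
	tw.Write(content)
	tw.Close()
	return buf.Bytes()
}

func sha256Hex(b []byte) string {
	return fmt.Sprintf("%x", sha256.Sum256(b))
}

func main() {
	raw := []byte("pretend this is a large ISO image")
	wrapped := wrapInTar("disk.iso", raw)
	// Even without compression, the archive bytes differ from the raw
	// bytes, so sha256sum on the original file no longer matches the
	// blob digest recorded in the manifest.
	fmt.Println(sha256Hex(raw) == sha256Hex(wrapped)) // false
}
```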

The filename itself is a suggestion, such that you can always decide to pull it under a different name. However, a lot of artifacts already have fairly good names that you just don't want to override. For instance, if I push "en_windows_server_2019_updated_feb_2021_x64_dvd_277a6bfe.iso" as a single-file artifact, I definitely want to remember the filename, and probably keep exactly that name when pulling it again, but I can always rename it to something else (that would be up to command-line options).

Regarding directories, one could decide to push all files individually, or compress the entire directory. The compression strategy should ideally be adaptable to common compression formats as much as possible. tar is very useful and I see why it was used, but we also have to keep in mind zip files, and other formats like 7-zip.

Directories are a bit more tricky, because we want to annotate more than one thing:

One thing that drives most people crazy when extracting archives is the dreaded issue of the presence of a root directory or not. If the archive contains a root directory, then "extract here" just extracts a nice, clean directory. If the archive has no embedded directory and you use "extract here", you have now polluted your download directory with tons of files.

This is why I suggest that we also annotate the root directory within the archive, and encourage using no embedded root directory at all ("."). This should help correctly annotate any type of archive for predictable decompression behaviour: avoiding extraction of MyArchive.zip into "MyArchive/MyArchive/MyFile.txt" while still getting "MyArchive/MyFile.txt" or "MyCustomDirectory/MyFile.txt".

deitch commented 3 years ago

@SteveLasker I edited your comment to add a newline, as it was mixing your quote of what I wrote with your response. If I got it wrong, go ahead and change it.

deitch commented 3 years ago

@SteveLasker if I understood you correctly, you are saying that once we use tar, which by definition includes filenames in tar headers, then we don't need the annotation. It is the moral equivalent (actually pretty close, technically) of storing not just the contents of a file (e.g. "these are some contents"), but the contents plus the title (e.g. `file.txt: "these are some contents"`).

That makes sense. It could have some interesting (and, I believe, positive) ramifications for some of the stores that expect a file; then again, it might make things even easier. Rather than that whole discussion last week about whether the default behaviour should be to error or to ignore when an artifact missing a title is passed to a filestore, we could just handle tars and nothing else.

Of course, this does make this much closer to the default registry behaviour. Once everything is in a tar, what is different from a regular image, other than how we expand it (and not even then)?

awakecoding commented 3 years ago

@deitch @SteveLasker I agree that a "tar by default" approach keeps the manifest small and makes the whole thing much simpler to deal with. I'm all for adding such a mode of operation, and I'll let you guys decide if it should become the default, but how should we handle the non-tar modes?

Pretty much all of the use cases I'm looking into would not use the tar approach, and work towards describing the contents of the artifacts in different ways to support multiple modes of operations for different use cases. There is one use case in particular that bothers me with the tar approach: large file artifacts would still be wrapped in a tarball, forcing an extraction of the file after pulling the blob. Some known published artifacts (ISOs in particular) are recognized by the hash of the original distributed file, making it much harder to recognize the same non-distributable ISO file in separate registries if it's the hash of a compressed tarball.

I would like to experiment with adding a lot of annotations for my own implementation of OCI Artifacts, to describe the following:

Are annotations the best way to add new data that would be accepted by most OCI registries? I would love to add new JSON fields but these look like a "don't touch" area. Even for annotations, those need to be vendor-prefixed. I'm thinking of going that route, but it would unfortunately mean that I start becoming incompatible with the rest of OCI Artifacts implementations unless my extensions somehow get accepted at some time in the future, and get renamed under the "official" OCI vendor prefix.

SteveLasker commented 3 years ago

@deitch

you are saying that once we use tar, which by definition includes filenames in tar headers,

Yup, Keep It Silly Simple

Once everything is in a tar, what is different from a regular image, other than how we expand it (and not even then)?

A container image has very specific meanings and behavior to be a runtime image. If a container tool/runtime sees an oci.image, the expectation would be it could run it. There's a config with entry points, platform expectations, etc. Are we expecting the oci.image format to be just a collection of files? If so, is the entrypoint required, and is the layer collection expected to be ordinal and an overlay?

I should also clarify:

@awakecoding

Pretty much all of the use cases I'm looking into would not use the tar approach ...There is one use case in particular that bothers me with the tar approach: large file artifacts would still be wrapped in a tarball

Would you expect to use the oras cli, the oras go libraries or some other runtime like powershell or rust?

I'm trying to understand if we're talking about the behavior, capabilities, or overall design.

The other thing to remember is an artifact author can do what makes most sense to them for their artifact type. The registry is just a box of blobs (defined by the manifest) that make up some content. The registry shouldn't care about the details, rather it can store the blobs, optionally de-dupe blobs, and manage garbage collection.

So, in your case, you can absolutely put anything in the blob. It's really up to the spec of your artifact type to decide. The registry knows to accept and deliver blobs.

Are annotations the best way to add new data that would be accepted by most OCI registries?

If this helps, I wrote this post a while back regarding the use of annotations and config: OCI Artifact Authoring: Annotations & Config.json

The manifest schemas are very difficult to edit, as registries are codified around specific manifests. They provide a range of flexibility that should give you what you need between config, annotations, blobs/layers, and the means to define these within your own oci.image.manifest (manifest.config.mediaType) or the pending oci.artifact.manifest (manifest.artifactType).

awakecoding commented 3 years ago

@SteveLasker at this point I think I'm just going to develop my own tooling in PowerShell and Rust under the "Sogar" project banner, such that I can have full control over the manifests and how to deal with them. Ideally, I would like to avoid diverging too much from ORAS or at least keeping some level of compatibility with it, such that artifacts pushed by ORAS can be pulled by Sogar in the same way and vice versa. My main concern is that once I decide to just go in my own direction, there will be little interoperability between the two projects because of incompatible manifests.

This is fine since it will allow me to experiment with all the different ways I could annotate the artifacts to cover a wide range of use cases outside of the "tar-first" approach. Since this issue is for ORAS specifically, I would consider my comments out-of-scope for the current PR, so just ignore them and move along. Still, do you want ORAS to be general purpose and adaptable, or just one way to deal with OCI Artifacts? I'm just curious what the scope of the project is.

shizhMSFT commented 3 years ago

@SteveLasker Putting everything in a tarball, especially one file per tarball, probably needs more discussion in terms of engineering and security.

Artifact Digest

The digest in the manifest is no longer the digest of the file inside of the tarball. It makes it harder for the user to verify that the registry has the correct content. Precisely, the user can no longer simply run sha256sum artifact.txt and compare the digests.

Another problem with tar is that tarballs contain more metadata than is actually needed (see Header). What kind of metadata should be included in the tarball header by default? If we include the ModTime, then we might have two tarballs with the same content but different digests.

The following two tags, 1.0 and 1.1, have the same content of artifact.txt but different digests:

oras push demo42.azurecr.io/artifacts/test:1.0 artifact.txt
touch artifact.txt
oras push demo42.azurecr.io/artifacts/test:1.1 artifact.txt

It's not good for server-side deduplication.

Artifact Ordering

What if two layers contain artifacts with the same name? For example, we have two tarballs, each containing a file named artifact.txt, with the manifest shown below.

{
  "schemaVersion": 2,
  "config": {
    "mediaType": "application/vnd.unknown.config",
    "digest": "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    "size": 0
  },
  "layers": [
    {
      "mediaType": "application/tar",
      "digest": "sha256:603ea6780f25eb11c6c733af50b10f7f45ef4121d044f3eea011fa9a128884e4",
      "size": 21
    },
    {
      "mediaType": "application/tar",
      "digest": "sha256:9dfd755ae58e0e6559633a5ab6fabd76e6470a59ae5462603e93307887fca4fe",
      "size": 26
    }
  ]
}

Which artifact.txt will finally be written to the disk? The one in 603ea6 or the one in 9dfd75 or neither?

Actually, this is a malicious artifact that the client should reject. However, with the above manifest, the client simply cannot verify that unless it cracks open all the tar files.
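That "cracking" step might look like this in Go: the client has to stream every layer's tar headers before it can notice the collision. The helper names are illustrative, not the oras-go API:

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
)

// singleFileTar builds a one-file tar layer for the demonstration.
func singleFileTar(name, content string) []byte {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	tw.WriteHeader(&tar.Header{Name: name, Mode: 0o644, Size: int64(len(content))})
	io.WriteString(tw, content)
	tw.Close()
	return buf.Bytes()
}

// duplicateNames walks every entry of every layer and reports names
// seen more than once -- the only way to detect the conflict when the
// manifest itself carries no filenames.
func duplicateNames(layers [][]byte) []string {
	seen := map[string]bool{}
	var dups []string
	for _, layer := range layers {
		tr := tar.NewReader(bytes.NewReader(layer))
		for {
			hdr, err := tr.Next()
			if err != nil { // io.EOF ends the layer
				break
			}
			if seen[hdr.Name] {
				dups = append(dups, hdr.Name)
			}
			seen[hdr.Name] = true
		}
	}
	return dups
}

func main() {
	layers := [][]byte{
		singleFileTar("artifact.txt", "first payload"),
		singleFileTar("artifact.txt", "second, longer payload"),
	}
	fmt.Println(duplicateNames(layers)) // [artifact.txt]
}
```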

However, with the current manifest format:

{
  "schemaVersion": 2,
  "config": {
    "mediaType": "application/vnd.acme.rocket.config",
    "digest": "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    "size": 0
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar",
      "digest": "sha256:603ea6780f25eb11c6c733af50b10f7f45ef4121d044f3eea011fa9a128884e4",
      "size": 21,
      "annotations": {
        "org.opencontainers.image.title": "artifact.txt"
      }
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar",
      "digest": "sha256:9dfd755ae58e0e6559633a5ab6fabd76e6470a59ae5462603e93307887fca4fe",
      "size": 26,
      "annotations": {
        "org.opencontainers.image.title": "artifact.txt"
      }
    }
  ]
}

It is clear that this manifest is malformed and should be rejected since it has two artifacts named artifact.txt.

Client Security

In general, compressed files lead to more security vulnerabilities when being decompressed. oras is capable of doing security checks during decompression to defend against attacks like the path traversal attack. However, not all implementers in all languages have this capability.

Note: tar is considered a compression method with a compression ratio of less than 1.
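One such security check is rejecting entry names that would escape the extraction root (the classic path-traversal or "zip slip" attack). A minimal sketch of the defense, not the actual oras implementation:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// safeExtractPath joins a tar entry name onto the extraction root and
// rejects absolute names or names whose ".." components would climb
// out of the root.
func safeExtractPath(root, name string) (string, error) {
	if path.IsAbs(name) {
		return "", fmt.Errorf("absolute entry name %q rejected", name)
	}
	joined := path.Join(root, name) // Join also cleans ".." components
	if joined != root && !strings.HasPrefix(joined, root+"/") {
		return "", fmt.Errorf("path traversal in entry %q rejected", name)
	}
	return joined, nil
}

func main() {
	for _, name := range []string{"docs/readme.md", "../../etc/passwd", "/etc/shadow"} {
		p, err := safeExtractPath("out", name)
		fmt.Println(name, "->", p, err)
	}
}
```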

SteveLasker commented 3 years ago

Thanks @shizhMSFT, the focus on security is always important. What I'm suggesting here doesn't limit what the underlying ORAS libraries would be capable of. What I'm suggesting is that we split the ORAS binary from the libraries. The underlying ORAS libraries have the functionality to store files directly as blobs, and they can optionally use the annotation, or they could tar a collection of one or more files. But even there, we shouldn't force the use of a specific annotation, as it's really up to the artifact author to decide. The ORAS libraries would have no opinion on the experience, just the raw capability.

Then, the ORAS binary has an opinionated view, with a focus on simplicity of the experience and cleanliness of the manifest and annotations used.

deitch commented 3 years ago

The key element for me, here, is Steve's most recent comment. This is not about the library, which (as I understand it) will continue to maintain all of its capabilities. It can:

This is just about the CLI having one default behaviour that is not quite as secure or as powerful, and that chooses behaviours we might not want for many use cases, but that gets people started simply.

I understand and can get behind that. I am a little concerned about losing some regular users' advanced usage. For example, I often use the oras CLI to pull or look at things, essentially a useful debugging tool (although I also use my own ocidist to inspect stuff in registries, equally often), or to push things to a registry quickly, essentially checking behaviour for something I might be building. It also is useful to use in scripting. Why write a binary if oras can do what I need?

Could we make the suggested behaviour the default, but still be able (with options) to do some of the (now-defined-as) more "advanced" behaviours?

Hades32 commented 2 years ago

We're currently using the oras CLI and would really like not to have to write our own CLI on top of the lib. The way it currently is isn't perfect (no file-mode persistence, no compression), but it allows clients to easily discover what a repo contains and download individual files as needed. This proposed change would make these use cases hard or impossible.

You wrote above

The ORAS client is intended as a reference implementation of the ORAS libraries

In my mind a reference implementation should showcase all available features. So, while making the default experience easy is always a good idea, I think it would be an error to remove functionality.

SteveLasker commented 2 years ago

Hi @Hades32, thanks for the feedback. There's no loss of functionality proposed. Just flipping some defaults so you don't have to set the values for a default experience.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stalled for 30 days with no activity.