opencontainers / image-spec

OCI Image Format
https://www.opencontainers.org/
Apache License 2.0
3.54k stars 658 forks source link

Formalize support for zstd compression: v1.1.0 ? #803

Open thaJeztah opened 4 years ago

thaJeztah commented 4 years ago

While reviewing https://github.com/moby/moby/pull/40820, I noticed that support for zstd was merged in master (proposal: https://github.com/opencontainers/image-spec/issues/787, implementation in https://github.com/opencontainers/image-spec/pull/788 and https://github.com/opencontainers/image-spec/pull/790), and some runtimes started implementing this;

However, the current (v1.0.1) image-spec does not yet list zstd as a supported compression, which means that not all runtimes may support these images, and the ones that do are relying on a non-finalized specification, which limits interoperability (something that I think this specification was created for in the first place).

I think the current status is not desirable; not only does it limit interoperability (as mentioned), it will also cause complications Golang projects using this specification as a dependency; go modules will default to the latest tagged release, and some distributions (thinking of Debian) are quite strict about the use of unreleased versions. Golang project that want to support zstd would either have to "force" go mod to use a non-released version of the specification, or work around the issue by using a custom implementation (similar to the approach that containerd took: https://github.com/containerd/containerd/pull/3649).

In addition to the above, concerns were raised about the growing list of media-types (https://github.com/opencontainers/image-spec/issues/791), and suggestions were made to make this list more flexible.

The Image Manifest Property Descriptions, currently describes:

Implementations MUST support at least the following media types:

  • application/vnd.oci.image.layer.v1.tar
  • application/vnd.oci.image.layer.v1.tar+gzip
  • application/vnd.oci.image.layer.nondistributable.v1.tar
  • application/vnd.oci.image.layer.nondistributable.v1.tar+gzip

Followed by:

...An encountered mediaType that is unknown to the implementation MUST be ignored.

This part is a bit ambiguous (perhaps that's just my interpretation of it though);

What's the way forward with this?

  1. Tag current master as v1.1.0, only defining +zstd as a possible compression format for layers, but no requirement for implementations of the v1.1.0 specification to support them
  2. Add the +zstd compression format to the list of required media types, and tag v1.1.0; projects implementing v1.1.0 of the specification MUST support zstd layers, or otherwise implement v1.0.x
  3. Wait for the discussion about "generic" layer types (https://github.com/opencontainers/image-spec/issues/791, https://github.com/opencontainers/image-spec/issues/799) to be completed before tagging v1.1.0
  4. Do a v1.1.0 release (1. or 2.), and leave 3. for a future (v1.2.0) release of the specification.

On a side-note, I noticed that the vnd.oci.image.manifest.v1+json was registered, but other mediatypes, including media-types for image layers are not; should they be?

SteveLasker commented 3 years ago

I get the inheritance, FROM model. But, why wouldn't it just push the newly compressed layers (both new layers and original layers that can now be compressed), to up-level the entire thing? What benefit do you get by having a mixed mode?

jonjohnsonjr commented 3 years ago

What benefit do you get by having a mixed mode?

This allows for very cheap composition of images that happen to use different types of compression, e.g. if you want to rebase a gzip-compressed image onto a new zstd-compressed base image, which is a major selling point of buildpacks (cc @sclevine is there a better doc than this?), you can achieve this with some minor editing of manifests. If you required each layer to have the same compression scheme, you'd have to recompress the old layers or the new base to match, which would be expensive and defeat the goals of cheap rebasing.

I think we're mostly stuck on how a single tag on an artifact can reflect multiple things. Does anyone disagree that a unique tag, with a revised mediaType version, could support anything new?

I don't think we're stuck here, because this is exactly what an image index does. I'd be interested in how a new mediaType would be functionally different from an image index other than breaking all existing clients.

SteveLasker commented 3 years ago

The buildpack angle is definitely interesting, and something we're hoping can solve the quick-servicing events. But, do we see zstd as the new standard? Or, is there a benefit of keeping gzip around as well? If we're trying to make a move to zstd for better performance, then shouldn't we help the entire ecosystem move forward? Once re-compressed with zstd, wouldn't you get even better performance.

Stuck on single tags: This goes back to what goals we're trying to solve. If I want to add zstd format:

  • must I update the existing gzip manifest and have a single tag?
  • must I create a new manifest that supports zstd, and use the index to split based on what the user wants? In either case, if we want a single tag to support both gzip and zstd, hello-world:latest would need to either point to an index (with a single tag) that splits to individual manifests. Or, a single manifest, that supports both compression formats.
thaJeztah commented 3 years ago

if we want a single tag to support both gzip and zstd, hello-world:latest would need to either point to an index (with a single tag) that splits to individual manifests.

The hello-world image is already a manifest list (multi-arch); adding alternative compressions would mean having multiple variants for the same os/arch (existing clients would pick the first match; newer clients can pick more specific, based on extra info/annotations describing the compression)

(at least, I think that's roughly what's being proposed/discussed above)

SteveLasker commented 3 years ago

Back to the mediaType, I am suggesting zstd would use a different mediaType. Certainly for the layers, but'd also suggest this is where we could do a version bump on the manifest as well. This way we know if the index supports a manifest that supports zstd. Just as an index can identify if a specific platform/arch combo exists.

jonjohnsonjr commented 3 years ago

The hello-world image is already a manifest list (multi-arch); adding alternative compressions would mean having multiple variants for the same os/arch (existing clients would pick the first match; newer clients can pick more specific, based on extra info/annotations describing the compression)

(at least, I think that's roughly what's being proposed/discussed above)

Just as an index can identify if a specific platform/arch combo exists.

This is exactly what I'm proposing, yes. One of the approaches (adding a field to Descriptor) would even extend down to the image manifest / layer level, if we go that way. See below.

something we're hoping can solve the quick-servicing events

What are quick-servicing events and how do they solve rebasing?

Or, is there a benefit of keeping gzip around as well? If we're trying to make a move to zstd for better performance, then shouldn't we help the entire ecosystem move forward? Once re-compressed with zstd, wouldn't you get even better performance.

Yes, the benefit is not breaking every single existing client in use today. I'm fine with moving towards zstd, but there is a delicate sequencing of events that need to happen, else this will be a complete disaster, e.g. python 3.

Roughly:

  1. Decide on the mechanism for adding zstd support with gzip fallback. This is where we are.
  2. Socialize this plan with registry operators (mostly on this thread, but there are many others).
  3. Make sure existing registries allow pushing these new manifests. (This might be zero work or tons of work, depending on (1).)
  4. Make sure clients don't break when pulling these new manifests.
  5. Start publishing images in this new (mixed) format. Update clients to understand zstd and prefer those blobs for mixed images.
  6. Wait for adoption of new clients and for old clients to drop off ~entirely. This part will take O(years). This is where we are for schema 1. (We should probably instrument registries in a way that we can figure out when gzipped blobs stop getting downloaded entirely and agree on some exit criteria that is healthy for the ecosystem.)
  7. Stop publishing mixed manifests, only publish zstd images.

Back to the mediaType, I am suggesting zstd would use a different mediaType.

Of course, this is already merged. They should be application/vnd.oci.image.layer.v1.tar+zstd. I poked the facebook folks to register the +zstd suffix a while ago, so we're not blocked on that.

Certainly for the layers, but'd also suggest this is where we could do a version bump on the manifest as well.

I'd be okay with a backward-compatible version bump (e.g. what I'm proposing below). I'm leaning on the kubernetes versioning guidelines a bit here, but we should be able to add fields without completely breaking everything, as long as old clients ignore fields they don't expect (this might not be safe to do -- for example, GCR will probably reject what I'm proposing, not sure if any clients would break from this).

Do we need to do a version bump for strictly additive changes?

This way we know if the index supports a manifest that supports zstd.

I don't know if it makes sense to distinguish a manifest supporting zstd vs gzip if we're going to allow mixed compression (I think we should). Looking at my example you can imagine that instead of doing this at the manifest list level, we could have alternate descriptors on an image manifest at the layer level. This is if you want to support zstd but also be backward-compatible with gzip.

Concrete example, using debian@sha256:5b5fa7e155b1f19dffb996ea64e55520b80d5bd7a8fdb5aed1acabd217e9ed59:

Before:

{
   "schemaVersion": 2,
   "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
   "config": {
      "mediaType": "application/vnd.docker.container.image.v1+json",
      "size": 1463,
      "digest": "sha256:6d6b00c22231693c9b87e79986d562874446bf10182206e4621e23ca8dfa8e1c"
   },
   "layers": [
      {
         "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
         "size": 50397728,
         "digest": "sha256:6c33745f49b41daad28b7b192c447938452ea4de9fe8c7cc3edf1433b1366946"
      }
   ]
}

After:

{
   "schemaVersion": 2,
   "config": {
      "mediaType": "application/vnd.oci.image.config.v1+json",
      "size": 1463,
      "digest": "sha256:6d6b00c22231693c9b87e79986d562874446bf10182206e4621e23ca8dfa8e1c"
   },
   "layers": [
      {
         "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
         "size": 50397728,
         "digest": "sha256:6c33745f49b41daad28b7b192c447938452ea4de9fe8c7cc3edf1433b1366946",
         "descriptors": [
           {
             "mediaType": "application/vnd.oci.image.layer.v1.tar+zstd",
             "size": 41414997,
             "digest": "sha256:157a9fb3cd509e4de73599dcdd51a32b5f8e9cfa356516560dd08f298a4db072"
           }
         ],
         "annotations": {
           "zstd-alternative": "sha256:157a9fb3cd509e4de73599dcdd51a32b5f8e9cfa356516560dd08f298a4db072"
         }
      }
   ]
}

(It might make sense to have platform.features on layer descriptors, or we can just rely on annotations. They're functionally equivalent to me.)

Of course, maybe it does make sense to partition them at the manifest list level and never support mixed compression, in which case my original example would work, I think, but per the rebasing use case, mixed compression is useful.

cpuguy83 commented 3 years ago

@SteveLasker

zstd doesn't need to have a different media type. A manifest is a manifest regardless of the content it holds, and it also doesn't solve the problem of older clients.

We are looking for a way to progressively upgrade clients of existing images. A client that understands how to decode zstd should be able to look at debian:buster and know how to get the zstd version, and at the same time not break clients who don't know anything about zstd.

giuseppe commented 3 years ago

Until we reach an agreement on how to support multiple compression variants, could we at least move forward and expect such images to be made explicit (e.g. by a tag)?

Even when multiple compression formats are supported in the same manifest, I don't think that we should enforce that each image is available in multiple formats and there will be zstd-only images that will anyway fail with old clients.

vbatts commented 3 years ago

holy wow. This thread got off the rails. Let's recap.

I think we should enable folks to try different mediatype objects, even if that only means a different compression. If we choose to disagree about this, it will only make the work-around implementations nastier, and thereby our own OCI Artifacts effort will suffer from it.

jonjohnsonjr commented 3 years ago

but this is the same issue we have with the prospect of digests using...

There is some spec language around this, but I agree we're in a similar spot regardless. A lot of this is hardcoded in older clients, and I really doubt that most modern clients would work as expected even if they purport to support other digest algorithms.

I think we should enable folks to try different mediatype objects, even if that only means a different compression. If we choose to disagree about this, it will only make the work-around implementations nastier, and thereby our own OCI Artifacts effort will suffer from it.

I agree with this, but I think what was merged just hand-waved the problem away in https://github.com/opencontainers/image-spec/pull/759 without actually addressing the backward-compatibility concern.

If we want to somehow make an image that works for both gzip-assuming and zstd-supporting clients, just having both kinds of layers and the "ignore stuff you don't understand" clause is insufficient, and we don't currently have a strategy that does work. I don't know if this issue is the right place to figure that out, but I agree with @giuseppe here:

I don't think that we should enforce that each image is available in multiple formats and there will be zstd-only images that will anyway fail with old clients.

The "ignore stuff" PR is probably not fine for this either? Even if old clients just ignored all the zstd layers, there will be no layers, and the resulting filesystem will be empty. Should clients just fail outright if they don't understand zstd? I think it's fine to push zstd-only images, especially if you control the clients that will be pulling them, but I don't think the spec language is firm enough to write a client that behaves reasonably, here, in basically any situation...

tych0 commented 3 years ago

The "ignore stuff" PR is probably not fine for this either? Even if old clients just ignored all the zstd layers, there will be no layers, and the resulting filesystem will be empty.

Are you talking about #816? I tried to make it clarify exactly this point that runtimes cannot ignore layers they don't understand, and must fail:

Container runtimes cannot satisfy the bit about:

The final filesystem layout MUST match the result of applying the layers to an empty directory.

by ignoring layer types, so the spec is slightly in conflict with itself here, by requiring runtimes both MUST apply all the layers and MUST ignore layers that they cannot understand.

jonjohnsonjr commented 3 years ago

Are you talking about #816?

Ah, I missed this one, thanks.

Specifically this bit:

However, image transfer tools can safely ignore mime types they don't understand (indeed, skopeo does this today);

Assuming this is for skopeo copying things, I don't think it is safe to ignore mime types we don't understand? E.g., if I want to actually copy an image to a different registry, I need to copy all the referenced blobs, or else the image is incomplete. I don't need to try to parse them, but I just can't just ignore them... right?

One approach would be to assume any unrecognized media types are blobs, and copy them from /v2/.../blobs/, which should work most of the time. However, one place this strategy wouldn't work is if we specify completely new manifest media types (e.g. the artifacts stuff). Clients do actually need to know how to parse those if they're going to do anything reasonable, and assuming they're just blobs will of course fail. If that fails, do we just ignore the failure? That seems bad. (Here lies a lot of my hesitation around defining a new artifact type.)

Probably makes some sense to discuss this more on your PR, and this gets into distribution stuff as well...

tych0 commented 3 years ago

I just can't just ignore them... right?

Right, so maybe the commit message needs re-wording. The actual spec text and example from the commit matches what you're saying, though:

https://github.com/opencontainers/image-spec/pull/816/files#diff-7333951212fd75b50f5be1023afc9d5d41f0533adc240407c43d4bfb42c6019aR65

jonjohnsonjr commented 2 years ago

FYI https://github.com/opencontainers/image-spec/pull/880 may be interesting to folks.

almson commented 2 years ago

Please add zstd. Pulling images is so slow because of decompression.

zvonkok commented 5 months ago

/cc @zvonkok

sudo-bmitch commented 5 months ago

For those that aren't supporting zstd today, what is preventing it? Of the tools that don't support it yet and what is blocking the PR to add support?

For those that require this feature, do we have metrics showing that decompression is a slower step than the network speed to download the blob? Without that, then decompressing during the download (instead of waiting for the download to complete) would provide a performance improvement without zstd.

giuseppe commented 5 months ago

For those that require this feature, do we have metrics showing that decompression is a slower step than the network speed to download the blob? Without that, then decompressing during the download (instead of waiting for the download to complete) would provide a performance improvement without zstd.

For Podman/CRI-O, one zstd feature we are interested in using is "skippable frames" so we can embed some metadata in the compressed stream.

In addition to the faster decompression, zstd also requires less CPU time.

On my machine, using pigz for comparison as it is much faster than GNU gzip, I get:

$ dd if=/dev/urandom bs=1G count=1 | pigz - > /tmp/blob.gz
$ dd if=/dev/urandom bs=1G count=1 | zstd - > /tmp/blob.zstd

$ \time -v pigz -d < /tmp/blob.gz > /dev/null
    Command being timed: "pigz -d"
    User time (seconds): 0.27
    System time (seconds): 0.44
    Percent of CPU this job got: 200%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.35

$ \time -v zstd -d < /tmp/blob.zstd > /dev/null
    Command being timed: "zstd -d"
    User time (seconds): 0.14
    System time (seconds): 0.15
    Percent of CPU this job got: 206%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.14
sudo-bmitch commented 5 months ago

For Podman/CRI-O, one zstd feature we are interested in using is "skippable frames" so we can embed some metadata in the compressed stream.

What is preventing Podman/CRI-O from supporting these features today? Is there a PR blocked because of OCI?

In addition to the faster decompression, zstd also requires less CPU time.

Is this from content that was pulled from a registry, or content stored locally in a compressed state?

giuseppe commented 5 months ago

For Podman/CRI-O, one zstd feature we are interested in using is "skippable frames" so we can embed some metadata in the compressed stream.

What is preventing Podman/CRI-O from supporting these features today? Is there a PR blocked because of OCI?

nothing really :-) We are planning to use zstd by default on Fedora 41: https://fedoraproject.org/wiki/Changes/zstd:chunked

In addition to the faster decompression, zstd also requires less CPU time.

Is this from content that was pulled from a registry, or content stored locally in a compressed state?

stored locally in a compressed state

sudo-bmitch commented 5 months ago

We are planning to use zstd by default on Fedora 41: https://fedoraproject.org/wiki/Changes/zstd:chunked

Do we need to differentiate between zstd and zstd+chunked?

giuseppe commented 5 months ago

We are planning to use zstd by default on Fedora 41: https://fedoraproject.org/wiki/Changes/zstd:chunked

Do we need to differentiate between zstd and zstd+chunked?

No, it is still a valid zstd file. Clients that do not use the additional metadata will simply ignore it.

septatrix commented 4 months ago

We are planning to use zstd by default on Fedora 41: https://fedoraproject.org/wiki/Changes/zstd:chunked

Do we need to differentiate between zstd and zstd+chunked?

No, it is still a valid zstd file. Clients that do not use the additional metadata will simply ignore it.

What is the media type of zstd:chunked? If it is application/vnd.oci.image.layer.v1.tar+zstd:chunked it should still become part of the spec such that tools will know that they can use it (even if they just treat it as a normal zstd file). Otherwise tools will always need to upload images with both formats even though they are identical.

giuseppe commented 4 months ago

application/vnd.oci.image.layer.v1.tar+zstd:chunked

it is application/vnd.oci.image.layer.v1.tar+zstd

septatrix commented 4 months ago

application/vnd.oci.image.layer.v1.tar+zstd:chunked

it is application/vnd.oci.image.layer.v1.tar+zstd

How does a client then know if the layer is chunked? Does it always need to fetch the header and somehow determine if it is chunked using magic bytes?

giuseppe commented 4 months ago

How does a client then know if the layer is chunked? Does it always need to fetch the header and somehow determine if it is chunked using magic bytes?

it could do that, or use the annotations we added for that layer, e.g.:

        {
            "MIMEType": "application/vnd.oci.image.layer.v1.tar+zstd",
            "Digest": "sha256:9efd019b05bc504fcc4d0e244f3c996eb2c739f3274ab5cc746e0f421044c041",
            "Size": 113639090,
            "Annotations": {
                "io.github.containers.zstd-chunked.manifest-checksum": "sha256:f67017010afe34d9a5df4c1f65c6ff7ac7a452b57e7febea91d80ed9da51841e",
                "io.github.containers.zstd-chunked.manifest-position": "111910713:1066869:6231165:1"
            }
        }

so it can immediately fetch the TOC and validate it against the checksum that is recorded in the manifest as well

sudo-bmitch commented 2 months ago

I just spent a bit of time rereading this thread (it's a long one) and a lot has happened since the original issue.

  1. This didn't get resolved before the v1.1.0 release, but that doesn't mean we can't sort things out for a future release.
  2. The "MUST ignore" text confused (gestures wildly) everyone, so it was replaced with "Implementations storing or copying image manifests MUST NOT error on encountering a mediaType that is unknown to the implementation."
  3. We currently say "Implementations MUST support at least the following media types..." with only a list of uncompressed and gzip media types. I don't know if we are ready to make zstd a MUST, but adding a SHOULD list would make sense to me.
  4. In the image-index.md, we now have the text "If multiple manifests match a client or runtime's requirements, the first matching entry SHOULD be used" which perhaps should be adjusted slightly to clarify the runtime cannot distinguish between the entries. This was designed to give a bit of forward compatibility where issues like this could be resolved with an added entry in the index and an annotation in the descriptor in the index as a flag to the runtime. Some of this might be resolved with the Image Compatibility Working Group which is taking a break right now while folks attempt possible implementations in their free time.
  5. A lot of discussion on supporting multiple compression types has moved towards solving this with transport level compression, which the registry could pre-cache. But everyone is hesitant to go that direction knowing that older clients and registries would fallback to uncompressed layers, a very bad intermediate state.
  6. I'd expect a manifest with mixed compression layers to be possible, if not common, since there's value to maintaining the base image layers without recompressing them. That allows the layers to be mounted across repositories, reduces storage costs, and enables clients to dedup layers more efficiently.

Given everything that has happened, what is left to complete before resolving this issue?

Anything that I'm missing? If not, I can work on a PR for the manifest.md layers text so we can close this out.

tianon commented 2 months ago

(adding the meaningful bit of my thoughts from today's call)

A lot of discussion on supporting multiple compression types has moved towards solving this with transport level compression, which the registry could pre-cache. But everyone is hesitant to go that direction knowing that older clients and registries would fallback to uncompressed layers, a very bad intermediate state.

IMO, the intermediate state of "many older tools don't support the layers at all" (zstd) is worse than "many older tools will have increased storage, but everything will be functional" (uncompressed / transport compression) :eyes:

One of these is a really viable intermediate state to allow actual transition (that happens to also solve other interesting problems like layer digest vs DiffID) while the other is a complete non-starter (at least, from my own perspective as a large publisher of a number of popular images).

tianon commented 2 months ago

(oh, and I'm +1 on adding zstd to the "SHOULD support" list)

slonopotamus commented 2 months ago

2 cents: the real world (containerd, buildkit, docker, buildah, etc) already supports zstd. The fact that image spec lags behind just introduces a discrepancy. From technical POV, zstd is just unlimately superior to gzip in all aspects. In order to continue to make sense, image spec should declare zstd as a supported format.

tianon commented 2 months ago

The image spec was updated to support it, back in https://github.com/opencontainers/image-spec/pull/788, which was part of every RC and the GA of v1.1.0.

This issue is more about saying that implementations "SHOULD" support it, which is mostly semantics because, as you've said, all the major tools already do.