opencontainers / distribution-spec

OCI Distribution Specification

Allow URL pointers in manifest list #362

Open Snaipe opened 1 year ago

Snaipe commented 1 year ago

Unfortunately, the current manifest-list specification is extremely hostile towards registry implementations with dynamically-generated content. We would like to propose relaxing the manifest-list format to allow manifest list entries with no digest, pointing to other manifests via a URL.

Motivation

We have an in-development registry, used at our company with great success so far, that generates images on the fly. This avoids needing storage and garbage collection for these images. It works because builds of said images are highly reproducible, and we can refer to them uniquely by using git commit hashes as their tags.

Using manifests v2 has so far been possible and rather easy: a GET /v2/<name>/manifests/<reference> looks up the image in a cache and, on a miss, builds a new image and returns the manifest of said built image; everything from there on uses the blobs endpoint as normal.
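A minimal sketch of that handler in Go, assuming hypothetical cacheLookup and buildImage hooks into our cache and build system (the names are invented for illustration):

package registry

import "net/http"

type manifest struct {
	MediaType string // e.g. application/vnd.oci.image.manifest.v1+json
	Raw       []byte // the serialized manifest JSON
}

// Hypothetical hooks into the distributed cache and the build system.
var (
	cacheLookup func(name, ref string) (manifest, bool)
	buildImage  func(name, ref string) manifest
)

// serveManifest handles GET /v2/<name>/manifests/<reference>: serve the
// cached manifest on a hit, otherwise build the image on the fly and
// return the manifest of the freshly built image. The blobs it points to
// are then fetched via /v2/<name>/blobs/<digest> as usual.
func serveManifest(w http.ResponseWriter, name, ref string) {
	m, ok := cacheLookup(name, ref)
	if !ok {
		m = buildImage(name, ref) // cache miss: kick off a build and wait
	}
	w.Header().Set("Content-Type", m.MediaType)
	w.Write(m.Raw)
}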

We started hitting problems when we wanted to support multi-platform images. The problem is that we have to know upfront the digests for every architecture in order to return a manifest list, which requires us to start a build for every supported architecture immediately and wait for all of them to finish. This also means that users of the x86_64 image are necessarily impacted by the arm64 image build (and vice versa), which causes unnecessary churn.

For now, we are working around this problem by having per-node-platform deployments with respective per-platform image tags, but this is not ideal, as we could just as well maintain one deployment for all platforms.

Proposal

It would be considerably more useful if the manifest list could relax the requirement on the digest. It doesn't sound crazy to us to allow a registry to return manifest list entries containing a URL (possibly relative) and omitting the digest entirely. For instance:

GET /v2/example/image/manifests/tagname
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
  "manifests": [
    {
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "url": "/v2/example/image/manifests/tagname-amd64-linux",
      "platform": {
        "architecture": "amd64",
        "os": "linux"
      }
    },
    {
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "url": "/v2/example/image/manifests/tagname-arm64-linux",
      "platform": {
        "architecture": "arm64",
        "os": "linux"
      }
    }
  ]
}

In a sense, these manifest lists would be sparser: lists of pointers to content that is not yet known upfront.

imjasonh commented 1 year ago

Using manifests v2 has so far been possible and rather easy: a GET /v2/<name>/manifests/<reference> looks up the image in a cache and, on a miss, builds a new image and returns the manifest of said built image; everything from there on uses the blobs endpoint as normal.

This sounds like something I've done before, though my attempt at just-in-time image manifest generation was mainly as a toy / testing tool, and not anything I'd want to productionize. The issues you've hit around having to block serving the manifest until all the builds are done, so you can report the built-thing's digest, are more or less why this is untenable. So I agree with you there.

However...

Having manifests point to exactly the contents of their sub-manifests and blobs is arguably the main benefit of OCI manifests. It's a feature, not a bug. Having manifests point to mutable references-by-tag instead of immutable digests would be a pretty dramatic breaking change to the spec, and would likely violate a lot of important expectations.

Today, a manifest can be guaranteed to point to exactly some specific contents; if those contents change, a new manifest can point to them, but then that's a new manifest, with a new digest. Being able to update a manifest in-place without changing its digest is a non-goal of OCI.

As for adding a url field to the descriptor type, as you've proposed, you can aaaaalmost do this today with the existing urls field, though with some caveats.
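For reference, the urls field lives on the descriptor type. A trimmed sketch of its shape, following the Go bindings in the image-spec repo (note that digest and size remain REQUIRED even when urls is present -- which is precisely what a build-on-demand registry cannot supply upfront):

package spec

import digest "github.com/opencontainers/go-digest"

// Descriptor is a trimmed sketch of the OCI descriptor
// (specs-go/v1/descriptor.go in the image-spec repo).
type Descriptor struct {
	MediaType string        `json:"mediaType"`
	Digest    digest.Digest `json:"digest"`         // REQUIRED: content address
	Size      int64         `json:"size"`           // REQUIRED: content size in bytes
	URLs      []string      `json:"urls,omitempty"` // OPTIONAL: alternate locations for the *same* content
	Platform  *Platform     `json:"platform,omitempty"`
}

// Platform is the os/architecture selector used in index entries.
type Platform struct {
	Architecture string `json:"architecture"`
	OS           string `json:"os"`
}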

Snaipe commented 1 year ago

I agree with your points about manifests; having image manifests whose contents are all immutable, by virtue of being addressable by their content hashes under the /blobs API, is fine and gives clients a strong caching story to avoid re-downloading binary content over the wire. That part is all fine.

What we're finding dubious is upholding the same guarantees for manifest lists. In a normal multi-platform image download flow, the client asks for /v2/<repo>/manifests/<tag>, which returns a manifest list of all platforms; the client ignores everything but the manifest for the platform that interests it, pulls that manifest by digest, then pulls the image config and layers by digest as well.
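For concreteness, the client-side selection step looks roughly like this (a sketch with minimal local types, not any particular client's implementation):

package client

// Minimal local shapes for the fields used below.
type platform struct {
	Architecture string `json:"architecture"`
	OS           string `json:"os"`
}

type entry struct {
	MediaType string    `json:"mediaType"`
	Digest    string    `json:"digest"`
	Platform  *platform `json:"platform,omitempty"`
}

type index struct {
	Manifests []entry `json:"manifests"`
}

// pickPlatform scans the manifest list and keeps only the entry for the
// local platform; everything after this point is fetched by digest.
func pickPlatform(idx index, os, arch string) (entry, bool) {
	for _, e := range idx.Manifests {
		if e.Platform != nil && e.Platform.OS == os && e.Platform.Architecture == arch {
			return e, true
		}
	}
	return entry{}, false
}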

In principle, the platform-specific image manifest, being addressed by hash, could be cached locally, which would avoid an extra fetch should a different manifest list be published with the same image. In reality, this just never happens: new manifest lists get pushed in response to a new version of the software, and all the image manifests they point to are going to be different. So optimizing around that use case certainly makes us raise an eyebrow.

This is compounded by the fact that when pulling an image by repo+tag, the response is already indeterminate and no caching is possible: the client has to ask the server for the manifest or manifest list of said repo+tag combination every time, because the meaning of the tag could have changed. I certainly don't see any problem with that indeterminateness applying to the per-platform images when it already applies to the initial query, especially when forcing it to be deterministic has only minute caching advantages.

As for adding a url field to the descriptor type, as you've proposed, you can aaaaalmost do this today with the existing urls field, though with some caveats.

No, I didn't ask that; sorry if it came across that way -- we're completely fine with it remaining as-is. To clarify, we're asking about adding a URL field to the manifests entry in the manifest list spec.

sudo-bmitch commented 1 year ago

What we're finding dubious is upholding the same guarantees for manifest lists. In a normal multi-platform image download flow, the client asks for /v2/<repo>/manifests/<tag>, which returns a manifest list of all platforms; the client ignores everything but the manifest for the platform that interests it, pulls that manifest by digest, then pulls the image config and layers by digest as well.

In principle, the platform-specific image manifest, being addressed by hash, could be cached locally, which would avoid an extra fetch should a different manifest list be published with the same image. In reality, this just never happens: new manifest lists get pushed in response to a new version of the software, and all the image manifests they point to are going to be different. So optimizing around that use case certainly makes us raise an eyebrow.

I don't quite follow. Runtimes today do check the digest of the manifest to know if it's changed, and I believe they also check the digest of the manifest list.

This is compounded by the fact that when pulling an image by repo+tag, the response is already indeterminate and no caching is possible: the client has to ask the server for the manifest or manifest list of said repo+tag combination every time, because the meaning of the tag could have changed.

They can run a HEAD request on the tag to verify it has not changed.
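For example, a minimal digest check in Go (a sketch assuming anonymous access, and that the registry sets the widely implemented but optional Docker-Content-Digest header):

package main

import (
	"fmt"
	"net/http"
)

// checkTagDigest issues a HEAD request for a tag and returns the digest
// the registry currently associates with it, so a client can detect when
// the tag has moved without downloading the manifest body.
func checkTagDigest(registry, repo, tag string) (string, error) {
	url := fmt.Sprintf("https://%s/v2/%s/manifests/%s", registry, repo, tag)
	req, err := http.NewRequest(http.MethodHead, url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Accept", "application/vnd.docker.distribution.manifest.list.v2+json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	return resp.Header.Get("Docker-Content-Digest"), nil
}

func main() {
	d, err := checkTagDigest("registry.example.com", "example/image", "tagname")
	if err != nil {
		panic(err)
	}
	fmt.Println("tag currently resolves to:", d)
}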

These comments are also specific to runtimes pulling images. There are many other use cases. Image signing tooling can sign the manifest list, and the digest of that manifest list should uniquely identify all the content being signed, without any of that content being mutable. Similar for vulnerability scanners that indicate a particular digest is safe to run in production. I work with image mirroring tooling, and it depends on the digest of the manifest list to know if the underlying content has changed and needs to be copied. I believe each of these cases would be broken by allowing the same manifest list digest to refer to mutable underlying content.

sudo-bmitch commented 1 year ago

As for adding a url field to the descriptor type, as you've proposed, you can aaaaalmost do this today with the existing urls field, though with some caveats.

No, I didn't ask that; sorry if it came across that way -- we're completely fine with it remaining as-is. To clarify, we're asking about adding a URL field to the manifests entry in the manifest list spec.

Aren't those the same? The entries in the manifest list are descriptors.

imjasonh commented 1 year ago

No, I didn't ask that; sorry if it came across that way -- we're completely fine with it remaining as-is. To clarify, we're asking about adding a URL field to the manifests entry in the manifest list spec.

Sorry I misunderstood. The OCI index type is specified here and states that items in manifests include fields from the descriptor type:

Each object in manifests includes a set of descriptor properties with the following additional properties and restrictions:

The link you posted describes the Docker-typed vnd.docker.distribution.manifest.list.v2+json, which isn't governed by OCI, and has slightly different wording than OCI's equivalent.

imjasonh commented 1 year ago

In reality, this just never happens: new manifest lists get pushed in response to a new version of the software, and all the image manifests they point to are going to be different. So optimizing around that use case certainly makes us raise an eyebrow.

In practice there are many multi-platform images that don't update atomically when new software is released. The ubuntu manifest list for example gets updated for each new version on a platform-by-platform basis as each platform's image is built. You can observe this when a new version is released, and tags get a series of updates, sometimes hours apart.

I don't consider the content-addressable nature of manifests to be mainly for the purposes of caching -- though as you say that's definitely a huge benefit with blobs, and much less so for relatively tiny manifests. The benefit to me is that you and I can both look at a manifest by digest and know that we're looking at the same thing, and (garbage collection and deletion aside) that will continue to be the same thing forever. If a manifest can point to a mutable URL, we lose that guarantee.

Snaipe commented 1 year ago

I don't quite follow. Runtimes today do check the digest of the manifest to know if it's changed, and I believe they also check the digest of the manifest list.

What I mean is that the client has to initiate some request to know whether the meaning of a repo+tag has changed or not, even if it's a HEAD request.

They can run a HEAD request on the tag to verify it has not changed.

Which doesn't change the fact that the client needs a working internet connection to lift the indeterminateness of the human-readable repo:tag string.

These comments are also specific to runtimes pulling images. There are many other use cases. Image signing tooling can sign the manifest list, and the digest of that manifest list should uniquely identify all the content being signed, without any of that content being mutable. Similar for vulnerability scanners that indicate a particular digest is safe to run in production.

I don't think this necessarily breaks. Signing a manifest list with URL pointers just means that the manifest list is valid as far as the pointers therein are concerned: the signature at least validates that the specified platforms are included and can be found at other endpoints of the registry. That said, this would require each image manifest to also be signed individually.

I work with image mirroring tooling, and it depends on the digest of the manifest list to know if the underlying content has changed and needs to be copied. I believe each of these cases would be broken by allowing the same manifest list digest to refer to mutable underlying content.

This would break only insofar as the client would need to make a HEAD request on the specified URL to determine whether it, too, has changed.

To be fair, I think most of the problem here is that the original post was kind of broad in what was allowed. I think allowing any URL isn't the best move; I was initially imagining that only URL pointers to the registry itself would be followed, but that doesn't necessarily guarantee what the endpoint implements. Maybe what should be allowed instead are just repo:tag names (see the sketch below), which would mean that, as far as image mirroring is concerned, this stays compatible with registries serving dynamic content.
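To make that narrowed proposal concrete, such an index entry might look like the following (entirely hypothetical; the reference field and its semantics are not part of any spec today):

package spec

// tagPointerEntry is a hypothetical index entry that names another
// manifest in the same registry by repo:tag instead of by digest.
// Neither the field nor the semantics exist in the OCI image spec.
type tagPointerEntry struct {
	MediaType string    `json:"mediaType"`
	Reference string    `json:"reference"` // e.g. "example/image:tagname-arm64-linux" (invented)
	Platform  *Platform `json:"platform,omitempty"`
}

// Platform is the usual os/architecture selector.
type Platform struct {
	Architecture string `json:"architecture"`
	OS           string `json:"os"`
}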

Aren't those the same? The entries in the manifest list are descriptors.

I had assumed they weren't, since they don't have exactly the same members. They seem similar in function otherwise.

The link you posted describes the Docker-typed vnd.docker.distribution.manifest.list.v2+json, which isn't governed by OCI, and has slightly different wording than OCI's equivalent.

Ah, sorry about the confusion. Are these different specifications, or is the OCI spec the place where new development will occur going forward?

imjasonh commented 1 year ago

Ah, sorry about the confusion. Are these different specifications, or is the OCI spec the place where new development will occur going forward?

They're different specs. It's impossible to say whether they'll converge at some point in the future, but I personally hope so, if only to avoid exactly the kind of confusion we're hitting here 😆.

Snaipe commented 1 year ago

In practice there are many multi-platform images that don't update atomically when new software is released. The ubuntu manifest list for example gets updated for each new version on a platform-by-platform basis as each platform's image is built. You can observe this when a new version is released, and tags get a series of updates, sometimes hours apart.

I stand corrected; I hadn't seen how this worked. I guess the rationale is that the manifest list of e.g. latest doesn't contain consistent versions, but instead contains the actual latest version as built for each architecture?

I don't consider the content-addressable nature of manifests to be mainly for the purposes of caching -- though as you say that's definitely a huge benefit with blobs, and much less so for relatively tiny manifests. The benefit to me is that you and I can both look at a manifest by digest and know that we're looking at the same thing, and (garbage collection and deletion aside) that will continue to be the same thing forever. If a manifest can point to a mutable URL, we lose that guarantee.

That guarantee is only lost for manifest lists/indices, and only for registries that make use of it, while allowing registries to scale a lot better. The reason we went with a dynamic registry was that the docker registry implementation just could not deal with the level of churn introduced by building and pushing every commit for every repo, and being locked every so often for garbage collection. In contrast, we've had zero need for maintenance with our dynamic registry for the past year.

Realistically, what happens if indices (and only indices) allow pointing to other manifests by repo:tag name?

Running the scenario out loud, it seems to us that lifting the indeterminateness happens mostly at the same stage as before, which doesn't seem too bad.

Overall, the most important guarantee, which is that the manifest returned at that time for foobar:v1.0.0-amd64 is fully determined, is still upheld.

It doesn't seem to change much whether the client initially knows what foobar:v1.0.0 means or not, besides needing to perform one or two more operations, even in the case where everything is fully cached.

So, yes, an extra request might be necessary, but it seems like a fair compromise, with few drawbacks, at the level of an index.

sudo-bmitch commented 1 year ago

To oversimplify this, I think you're looking at the index as an abstraction that's part of the tag, while the rest of OCI treats the index as part of the DAG. Changing how that object is treated has security, functionality, and usability issues that I'm not comfortable with.

A lot of code, my own included, has assumptions that if the index digest matches, then so does all the child content. That impacts the ability to cache results, mirror content, sign content, and approve an image to be deployed in an organization.

Another big issue is that it's now possible to create loops in the DAG, which can break code like garbage collection and deep-copy tooling. The parent index can point to a child index that does not exist yet. And the child index can now point back to the parent, since the parent's digest no longer depends on the child's contents and can therefore be known in advance.

Snaipe commented 1 year ago

To oversimplify this, I think you're looking at the index as an abstraction that's part of the tag, while the rest of OCI treats the index as part of the DAG. Changing how that object is treated has security, functionality, and usability issues that I'm not comfortable with.

Perhaps, though this still ends up leaving us in a bit of a tough spot, because there is no way we can follow the spec (not because we don't like it, but because there is no workaround) to distribute our images. I'm also unsure what these issues around security, functionality, and usability are.

At some level, there has to be some intermediate between fully determined content and fully indeterminate content. This is why we have repo:tag names in the first place: podman pull alpine@sha256:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03 is worse UX than podman pull alpine:3.16.

So, realistically, what can we do today, using the OCI spec, to do what we need to do here?

To rephrase the problem:

Today, podman pull our-registry.example/repo:v1.2.3 and podman pull our-registry.example/repo:f572d396fae9206628714fb2ce00f72e94f2258f are equivalent through our registry, where f572d396fae9206628714fb2ce00f72e94f2258f is the git commit pointed at by v1.2.3. Pulling these images causes our registry to look up the image in a distributed cache. If there is a cache hit, the image is returned. If there is a cache miss, our build system kicks off a build of the underlying software, which builds a Linux distribution from source on our build cluster, and returns an image from that build.

These builds are reproducible, but we don't really care if they aren't, for two reasons:

  1. Pulling an image by commit hash just lets users pull development images, which are assumed not to be stable.
  2. When the time comes to tag a release for a repo, we can bless a specific build so that it gets immortalized in cold storage.

This guarantees stability when it matters while providing flexibility for development.

Now, today, running podman pull our-registry.example/repo:f572d396fae9206628714fb2ce00f72e94f2258f returns an x86_64 image. We have recently acquired aarch64 servers for our build farm to build aarch64 images on the fly.

The issue is that, in order to provide a consistent interface, we have to return an index for requests to /v2/<repo>/manifests/<tag> containing all possible builds for all possible architectures of a repo. This is rather troubling, because it ties together the builds of all architectures, which may have different side effects, and forces the overall index build to tend towards the worst build of the set. In other words, if the x86_64 build takes 4 minutes and the arm64 build takes 4 hours, the index build has to take 4 hours, which is very confusing for our x86_64 customers. The worst-case scenario is one build taking unbounded time, which I've personally seen happen due to build-system bugs, making the entire index unavailable.

Now, we can of course provide per-architecture suffixes, e.g. v1.2.3-x86_64, but this has all of the caveats of pre-index/list images and forces all deployments to be duplicated per platform, so it's not really a viable option for us longer term.

So, the way we see it, these are our options:

  1. Work with the spec and add URL pointers to indices, which breaks this hard dependency.
  2. Work with the spec and pass platform specifics to the initial GET /v2/<repo>/manifests/<tag>, either via query params, request headers (could be part of content negotiation), or another endpoint (see the sketch after this list). This honestly might be simpler, and makes for thinner requests, since only the manifest that interests the client is transferred.
  3. Write an HTTP(S) proxy for the registry and provide one DNS name per platform, then ship configuration to point each cluster node and customer machine to their platform's proxy (e.g. an x86_64 node would proxy through proxy.x86_64.example.com, which would host a proxy that recognizes our registry hostname and returns only x86_64 images). This seems possible in theory, but hard to deploy and maintain, as the proxy would govern more than just registry pulls, which means configuring upfront which hosts are going to point at our registry.
  4. Add a non-standard extension to our registry and ship binaries to pull our images on our clusters and our customers' machines, which, thanks to Go, has been rather easy to do. This is our last resort, though it's more likely to happen than option 3, even if we'd rather not do it.
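For illustration, option 2 could look like this on the wire (the platform query parameter is invented here; no registry implements it):

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical: ask the registry directly for the linux/arm64 manifest,
	// skipping the index entirely. The "platform" parameter is not part of
	// the distribution spec; it is shown purely to illustrate option 2.
	req, err := http.NewRequest(http.MethodGet,
		"https://our-registry.example/v2/repo/manifests/v1.2.3?platform=linux/arm64", nil)
	if err != nil {
		panic(err)
	}
	// Only single-platform manifest types in Accept, so a registry that
	// doesn't understand the hint can still return a per-platform manifest.
	req.Header.Set("Accept", "application/vnd.oci.image.manifest.v1+json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("got:", resp.Header.Get("Content-Type"))
}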

Are there any other possibilities?

sudo-bmitch commented 1 year ago

I'm also unsure what these issues around security, functionality, and usability are.

Vendors may sign the index digest and depend on the content not changing for security. Functionality and usability of mirroring and caching may break if the contents are mutable. And there's a strong possibility of breaking tooling if loops can be created in the DAG.

At some level, there has to be some intermediate between fully determined content and fully indeterminate content.

That boundary is tags. Moving that into the index manifest breaks logic written into software today.

So, realistically, what can we do today, using the OCI spec, to do what we need to do here?

I don't know who else is doing build-on-pull for multi-platform images. If you don't want to build all the platforms, then separate tags are the only method that works with registries today. You could also degrade gracefully: attempt to build a multi-platform image, but if it takes too long, return an index with only the platforms built before the timeout. Clients would then need to know to try pulling the image again if their platform wasn't available. That experience isn't far from how some builds create multi-platform images today, adding new platforms to the index as they finish, so pulls get a different index as builds complete.
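A sketch of that graceful-degradation idea in Go (build is a hypothetical hook into the registry's build system; the names are invented):

package registry

import "time"

type built struct {
	platform string
	digest   string
}

// partialIndex kicks off all platform builds concurrently and returns the
// results of those that complete before the deadline; the caller would
// assemble an index from whatever subset finished in time.
func partialIndex(platforms []string, build func(string) string, deadline time.Duration) []built {
	results := make(chan built, len(platforms))
	for _, p := range platforms {
		go func(p string) {
			results <- built{platform: p, digest: build(p)}
		}(p)
	}
	timeout := time.After(deadline)
	var done []built
	for range platforms {
		select {
		case b := <-results:
			done = append(done, b)
		case <-timeout:
			return done // deadline hit: return whatever finished in time
		}
	}
	return done
}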

If changing the spec is on the table, you could try changing the client to remove the accept header for multi-platform manifests and include an extra hint for which platform manifest it wants. Existing registries wouldn't recognize the hint and would likely default to linux/amd64. You would want to handle that scenario, knowing that pulls may be sent to both new and old registries from the same client.

sudo-bmitch commented 1 year ago

Following up because I see one project implementing this by leveraging authentication, with an extra field added to the username to trigger the registry to change which image is returned: https://app.metahub-registry.org/

Snaipe commented 1 year ago

Interesting; I guess we could go that way: have usernames be JSON-encoded documents specifying which image "flavor" to get when pulling, and configure the hosts and clients to run podman login -u '{"architecture":"amd64","os":"linux","features":["sse4"]}' registry.example.com