opencontainers / distribution-spec

OCI Distribution Specification
https://opencontainers.org
Apache License 2.0

Clarify supported digest algorithms for manifests #494

Open andaaron opened 7 months ago

andaaron commented 7 months ago

The image spec mentions multiple registered digest algorithms (https://github.com/opencontainers/image-spec/blob/v1.1.0-rc5/descriptor.md#digests), out of which SHA256 is the canonical one.

The distribution spec mentions these registered digest algorithms can be used to reference manifests when: A) pulling - https://github.com/opencontainers/distribution-spec/blob/v1.1.0-rc3/spec.md#pulling-manifests

The Docker-Content-Digest header, if present on the response, returns the canonical digest of the uploaded blob which MAY differ from the provided digest. If the digest does differ, it MAY be the case that the hashing algorithms used do not match. See [Content Digests](https://github.com/opencontainers/image-spec/blob/v1.0.1/descriptor.md#digests) [apdx-3](https://github.com/opencontainers/distribution-spec/blob/v1.1.0-rc3/spec.md#appendix) for information on how to detect the hashing algorithm in use. Most clients MAY ignore the value, but if it is used, the client MUST verify the value against the uploaded blob data.

B) pushing - https://github.com/opencontainers/distribution-spec/blob/v1.1.0-rc3/spec.md#pushing-manifests

The <location> is a pullable manifest URL. The Docker-Content-Digest header returns the canonical digest of the uploaded blob, and MUST be equal to the client provided digest. Clients MAY ignore the value but if it is used, the client SHOULD verify the value against the uploaded blob data.

Case A) seems relatively clear to me, but in case B) the expected behavior of the registry is not entirely clear, as the digest (and implicitly the algorithm) may or may not be provided by the client.

I assume the "client provided digest" in "MUST be equal to the client provided digest" is the value of the reference. The clarifications needed are:

  1. In case the reference is a digest, and it is using a non-canonical digest algorithm, how can it be equal to the Docker-Content-Digest header value, which always uses the canonical digest algorithm?
  2. In case the reference is a tag, is there a way for the server to know what algorithm it should use to track this manifest?
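To make question 1 concrete, here is a minimal Python sketch (not taken from either spec) showing that the same manifest bytes produce different digest strings under sha256 and sha512, so a header that always carries the canonical algorithm cannot equal a sha512 reference. The `digest` helper and the sample manifest bytes are hypothetical.

```python
import hashlib


def digest(algorithm: str, data: bytes) -> str:
    """Format a digest as <algorithm>:<hex>, the form used in OCI descriptors."""
    return f"{algorithm}:{hashlib.new(algorithm, data).hexdigest()}"


manifest = b'{"schemaVersion": 2}'  # hypothetical manifest body

reference = digest("sha512", manifest)  # digest the client used as the reference
canonical = digest("sha256", manifest)  # what a canonical-only header would carry

# Same content, but the two strings can never compare equal.
print(reference == canonical)  # False
```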

If this topic has already been discussed or documented, please provide a link, I could not find relevant issues here.

Thank you

sudo-bmitch commented 7 months ago

I assume the MUST be equal to the client provided digest is the value of the reference. The clarifications needed are:

  1. In case the reference is a digest, and it is using a non-canonical digest algorithm, how can it be equal to the Docker-Content-Digest header value which is always using the canonical digest algorithm?

I'd consider removing "canonical" from the Docker-Content-Digest references. That would allow it to match the client provided value.

  2. In case the reference is a tag, is there a way for the server to know what algorithm it should use to track this manifest?

I don't think there is, clients get the registry server preferred value.

  • Is the registry supposed to only use the canonical algorithm in this case? That could cause issues if the client pushes another manifest referencing the initial manifest by a digest computed using a non-canonical algorithm.

  • Is the registry supposed to compute all possible hashes, using all registered algorithms, of a given manifest (or blob in general) in order to make sure any possible future references can be successfully resolved?

I think this hints at some of the many waiting issues we'll face whenever someone tries to use anything other than sha256. I expect a lot of things will break the day someone tries to switch to sha512. In addition to parts of the spec not covering all the scenarios, I expect registries, runtimes, and other client tooling, are all lacking full support.

rchincha commented 7 months ago

cc: @ktarplee

ktarplee commented 7 months ago

I am going to assume that we want to be able to copy an image from one registry to another without the manifest ID changing. That is, the digest of the manifest does not change during a copy, and thus none of the digest algorithms for the blobs may change either. We have that property now and I am proposing we keep it. This implies a few things:

  1. The digest algorithm cannot be forced to change for blobs or manifest by the registry. Of course if the source image is using sha512 and the destination registry does not support that digest algorithm, then the copy fails unless the tool wants to act like skopeo by converting the digest algorithm (that is exactly what happens when uploading to Zot to be OCI compliant, therefore the digests change).
  2. The client gets to choose which algorithm will be used in all cases. Any case I can think of where the registry picks the digest will break (1). So we need the API to allow the client to provide the digest of the uploaded artifact (or at least the desired digest algorithm) for all upload requests and download requests. Downloads come for free, so the issue is really the uploads, as the issue description points out.
  3. The registry should only need to know the names (digests or tags) of the content that the client has previously provided. Therefore, if the client did not push a blob by a sha512 digest, then they should not be able to pull it by that same sha512 digest. We need this property for blobs for efficiency reasons. The registry might still choose to always use a canonical digest (sha256) for duplicate detection, but that should not mean the content is also available under that canonical digest. This property is broken in current OCI registries because sha256 manifest digests are always available even if the client did not push the manifest by a sha256 digest. I think that would need to change, so I support removing "canonical" from the Docker-Content-Digest references and replacing it with the digest algorithm provided by the client. So Docker-Content-Digest always uses the client's digest algorithm, and the request fails otherwise.

In regards to the question "In case the reference is a tag, is there a way for the server to know what algorithm it should use to track this manifest?". I think the solution has to allow the client to provide the digest algorithm but since this is a manifest I think it should be the entire digest. Given the above argument, I would say that both of the suggested options are not desirable:

- Is the registry supposed to only use the canonical algorithm in this case? That could cause issues if the client pushes another manifest referencing the initial manifest by a digest computed using a non-canonical algorithm.
- Is the registry supposed to compute all possible hashes, using all registered algorithms, of a given manifest (or blob in general) in order to make sure any possible future references can be successfully resolved?

I am proposing that the client provide the expected digest (while uploading a manifest by tag) by either adding a header to the request Docker-Content-Digest: sha256:deedbeef... or to the query parameters ?digest=sha256:deedbeef.... The registry can validate that the manifest matches the digest and returns that same digest provided by the client (not necessarily the canonical digest) in the response header Docker-Content-Digest. If the client does not provide a digest when uploading a tag then the canonical digest is used (so everything is backwards compatible).

The key here is that the client must always provide the digest algorithm. The registry can never pick one it wants unless the client gives up its right to specify the digest algorithm for the manifest or blob.
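The proposed rule for pushing a manifest by tag can be sketched as follows. This is only an illustration of the proposal in this comment, not current spec behavior: `response_digest`, `DigestError`, and the `SUPPORTED` set are hypothetical names, and how the digest reaches the registry (header or `?digest=` query parameter) is left out.

```python
import hashlib
from typing import Optional

CANONICAL = "sha256"
SUPPORTED = {"sha256", "sha512"}  # assumption: algorithms this registry accepts


class DigestError(ValueError):
    """Raised when the client digest is unsupported or does not match the body."""


def response_digest(body: bytes, client_digest: Optional[str] = None) -> str:
    """Return the value the registry would put in Docker-Content-Digest."""
    if client_digest is None:
        # No digest supplied: fall back to the canonical algorithm,
        # which keeps existing clients working unchanged.
        return f"{CANONICAL}:{hashlib.new(CANONICAL, body).hexdigest()}"

    algorithm, sep, encoded = client_digest.partition(":")
    if not sep or algorithm not in SUPPORTED:
        raise DigestError(f"unsupported digest: {client_digest}")
    if hashlib.new(algorithm, body).hexdigest() != encoded:
        raise DigestError("manifest does not match the provided digest")

    # Echo the client's digest back, in the client's algorithm.
    return client_digest
```

With this shape the registry never picks an algorithm on the client's behalf unless the client omits the digest.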

ktarplee commented 7 months ago

Another aspect of this to think through is the referrers API. When the list of referrer descriptors is returned, what digest algorithm should be used? Does it return duplicate content (i.e. two descriptors for the same content but with different algorithms) or just one? Is the registry expected to de-duplicate those references? Do you limit it to just the descriptors using the same digest algorithm (I don't think so)?

flowchart TD
    A[Manifest A\nsha256:3...] -->|subject\nsha256:0...| C[Manifest C\nsha256:0...\nsha384:1...\nsha512:2...]
    B[Manifest B\nsha256:4...\nsha512:5...] --> |subject\nsha512:2...| C

In the above diagram the digests in the manifests are the digests that the client used to upload that manifest in that registry.

Imagine a client makes a request to /v2/<name>/referrers/sha256:0.... What should be returned by the registry?

Imagine a client makes a request to /v2/<name>/referrers/sha384:1.... What should be returned by the registry?

One solution/rule is that the referrers API should return all known references to objects referred to by the provided descriptor. So in the above example, both manifest A and B should be returned when querying for sha256:0... or sha384:1... or sha512:2.... This is harder for registries to implement because it requires them to realize that sha256:0..., sha384:1... and sha512:2... are actually the same manifest.

Alternatively, the rule can be that only the manifests with an exactly matching subject are returned. So in the example above, the referrers of sha256:0... would be manifest A, and the referrers of sha512:2... would be manifest B. And there are no referrers for sha384:1... even though they are the same actual manifest. In this case, registries can effectively treat the manifests as unrelated entities. We do lose some functionality in this case by not picking up all referrers to a manifest. I slightly prefer this approach.
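The difference between the two rules can be shown with a small Python sketch of the diagram above. Everything here is illustrative: the manifest bytes, the `referrers` list, and both lookup functions are hypothetical, and a real registry would return descriptors rather than names.

```python
import hashlib


def digest(algorithm: str, data: bytes) -> str:
    return f"{algorithm}:{hashlib.new(algorithm, data).hexdigest()}"


manifest_c = b"manifest C"  # stand-in for the real manifest bytes
c256 = digest("sha256", manifest_c)
c384 = digest("sha384", manifest_c)
c512 = digest("sha512", manifest_c)

# (referrer name, subject digest exactly as written in the referrer manifest)
referrers = [("A", c256), ("B", c512)]


def referrers_exact(subject: str) -> list:
    """Second rule: only subjects that match the queried digest string."""
    return [name for name, s in referrers if s == subject]


def referrers_all_aliases(subject: str, alias_groups: list) -> list:
    """First rule: the registry must know which digests name the same content."""
    group = next((g for g in alias_groups if subject in g), {subject})
    return [name for name, s in referrers if s in group]


print(referrers_exact(c256))  # ['A']
print(referrers_exact(c384))  # []
print(referrers_all_aliases(c384, [{c256, c384, c512}]))  # ['A', 'B']
```

The exact-match rule needs no alias bookkeeping, which is the implementation cost the first rule adds.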

sudo-bmitch commented 6 months ago

I am proposing that the client provide the expected digest (while uploading a manifest by tag) by either adding a header to the request Docker-Content-Digest: sha256:deedbeef... or to the query parameters ?digest=sha256:deedbeef....

The query parameter would align nicely with the blob put. That would be my preference.

Alternatively, the rule can be that only the manifests with an exactly matching subject are returned. So in the example above, the referrers of sha256:0... would be manifest A, and the referrers of sha512:2... would be manifest B. And there are no referrers for sha384:1... even though they are the same actual manifest. In this case, registries can effectively treat the manifests as unrelated entities. We do lose some functionality in this case by not picking up all referrers to a manifest. I slightly prefer this approach.

I tend to prefer that as well. Registries could treat each digest algorithm as a separate list of entries in the blob store, so pushing the same manifest with two different digest algorithms would be two separate CAS entries. That also aligns with the storage model of the OCI Layout. Without that, registries would need to compute multiple hashes for every item received, and make the content available by multiple CAS names, which I expect is problematic and creates a significant overhead on large registries to support new algorithms. I doubt registries want to recompute the digest on all of their content given the understandable push back we saw for including referrers responses that were previously pushed by the fallback tag when registries enable the referrers API.
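A rough Python sketch of that storage model, assuming a store keyed by algorithm and encoded digest in the same way the OCI Layout uses `blobs/<alg>/<encoded>` paths. The `Store` class and its methods are hypothetical and keep everything in memory.

```python
import hashlib
from typing import Dict, Optional, Tuple


class Store:
    """Content-addressable store with one entry per (algorithm, encoded) pair."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], bytes] = {}

    def push(self, client_digest: str, data: bytes) -> None:
        algorithm, _, encoded = client_digest.partition(":")
        # Only the client's algorithm is computed; no other hashes are derived.
        if hashlib.new(algorithm, data).hexdigest() != encoded:
            raise ValueError("content does not match digest")
        self._entries[(algorithm, encoded)] = data

    def pull(self, requested: str) -> Optional[bytes]:
        algorithm, _, encoded = requested.partition(":")
        return self._entries.get((algorithm, encoded))

    def __len__(self) -> int:
        return len(self._entries)


store = Store()
data = b"same manifest bytes"
store.push("sha256:" + hashlib.sha256(data).hexdigest(), data)
store.push("sha512:" + hashlib.sha512(data).hexdigest(), data)

print(len(store))  # 2: the same content occupies two separate CAS entries
print(store.pull("sha384:" + hashlib.sha384(data).hexdigest()))  # None: never pushed
```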

sudo-bmitch commented 1 day ago

In addition to supporting the push of manifests by tag with a non-canonical digest algorithm, I think we need similar support when a blob is pushed with the digest only being provided after the content is pushed (in the POST, PATCH, PUT). For that scenario, the client would only know the algorithm when the POST and PATCH requests are being run. Perhaps a ?digest-algorithm=sha512 URL parameter should be used in those scenarios?
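One reason the registry would want the algorithm up front is that it can then hash chunks as they arrive instead of re-reading the blob at the end. A minimal Python sketch of that idea follows; `UploadSession` is hypothetical, and the `digest_algorithm` argument stands in for the `?digest-algorithm=` parameter floated above, which is not part of the spec.

```python
import hashlib


class UploadSession:
    """Tracks one chunked blob upload (POST, then PATCH..., then PUT)."""

    def __init__(self, digest_algorithm: str = "sha256") -> None:
        # POST: the algorithm is fixed here, before any content arrives.
        self.algorithm = digest_algorithm
        self._hasher = hashlib.new(digest_algorithm)

    def patch(self, chunk: bytes) -> None:
        # PATCH: hash each chunk incrementally as it is received.
        self._hasher.update(chunk)

    def put(self, final_digest: str) -> bool:
        # PUT: the full digest only arrives now; compare it with the running hash.
        return final_digest == f"{self.algorithm}:{self._hasher.hexdigest()}"


session = UploadSession("sha512")
session.patch(b"hello ")
session.patch(b"world")
print(session.put("sha512:" + hashlib.sha512(b"hello world").hexdigest()))  # True
```

Without the hint, a session that defaulted to sha256 would have to re-hash the stored content when a sha512 digest shows up in the final PUT.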