oras-project / artifacts-spec

Apache License 2.0
61 stars 30 forks source link

Is the `mediaType` field necessary in the artifact manifest? #42

Closed michaelb990 closed 3 years ago

michaelb990 commented 3 years ago

The mediaType field is present in the artifact manifest spec, presumably copied over from the OCI image manifest or image index specs. According to those specs, it's a field that is kept around for backwards compatibility / reserved for use. I'm not sure that reasoning applies here, so I wonder if we should either:

ref: https://github.com/oras-project/artifacts-spec/blob/main/artifact-manifest.md#oras-artifact-manifest-properties

SteveLasker commented 3 years ago

Thanks Michael, this does warrant some more clarifications. The updated PR removes the copied "reserved" wording, and focuses on the usage.

The argument against including the mediaType was it's stored externally to the storage of the manifest. Before attempting to parse the document, the mediaType is known. The mediaType is pushed in the header when posting to a registry.

However, this doesn't solve the standalone evaluation of the manifest as an offline document. For most container runtime scenarios, the manifest is generated on the fly, as part of registry push, and not kept around, or built offline. The notary v2 work clarified the need for offline, ephemeral clients where all the content is built and validated before being pushed to a registry. The new graph of linked artifacts (through the subject property) changed those needs. To create the following offline graph, you need local manifests. Having them in the document assures the values are preserved when you post the documents to the registry.

  image
  ┣signature
  ┣sbom
  ┣─sbom-signature
  ┣gitbom
  ┣─gitbom-signature
  ┣scan-reusult
  ┕─scan-result-signature

The concern wasn't that it's invalid to have the mediaType in the document, rather it's duplicative. You could make a similar argument that a digest and size are redundant. But, they're used for verification purposes. We could add detail to a new artifact-authors.md document that explains the mediaType must match (header and content) or the manifest is invalid.

I had several conversations with Stephen Day and Mike Brown about this topic, and we agreed to keep mediaType. But, you might notice schemaVersion was removed, which was also part of that same conversation. So, the text should read the mediaType should reflect this double verification. If the mediaType stored externally doesn't match what's in the document, that's an error, just as the size and digest should match.

jonjohnsonjr commented 3 years ago

The concern wasn't that it's invalid to have the mediaType in the document, rather it's duplicative.

There isn't a concern about being duplicative, but inconsistent: https://github.com/opencontainers/image-spec/pull/411#issuecomment-261106466

So, the text should read the mediaType should reflect this double verification. If the mediaType stored externally doesn't match what's in the document, that's an error, just as the size and digest should match.

Folks involved seem to think it was a mistake to include, so I would recommend against this.

You could make a similar argument that a digest and size are redundant.

This doesn't make sense. How are they redundant?

sajayantony commented 3 years ago

If we navigate from a descriptor down to the actual manifest is makes sense to remove the mediaType. For e.g. in an index if the descriptor does have the mediaType of the manifest then definitely its redundant. The scenarios called out in the opencontainers/image-spec#411 of out-of-band transfer of mediaTypes is also valid concern. I remember having a conversation with @vbatts about self describing artifacts and if we needed these fields. The artifactType was was the interesting bit and if we removed that then validation tools wouldn't know if its should handle that manifest if it was given an array of descriptors and then we end up back into annotation land.

For a CAS engine also makes sense since we navigate from a ref ->descriptor either from the /referrers or get-manifest and request with ACCEPT which is the known mediaType.

Before we go down this path of removing the media type - I would like to understand how the manifest handler will disambiguate between different types of PUT. We need to specify what the content-type would be even if it isn't present in the manifest itself

https://github.com/distribution/distribution/blob/677772e08d6465bcdd09070d664bcd3ed64964a6/registry/handlers/manifests.go#L305

    mediaType := r.Header.Get("Content-Type")
    manifest, desc, err := distribution.UnmarshalManifest(mediaType, jsonBuf.Bytes())
    if err != nil {
        imh.Errors = append(imh.Errors, v2.ErrorCodeManifestInvalid.WithDetail(err))
        return
    }

So if we remove the mediaType from the body we should call out the mediaType that would be used by distribution so that this new type of manifest can be put and handled.

Does that sound reasonable?

jonjohnsonjr commented 3 years ago

So if we remove the mediaType from the body we should call out the mediaType that would be used by distribution so that this new type of manifest can be put and handled.

Does that sound reasonable?

This seems reasonable to me. Other than examples in READMEs, there aren't really any contexts where you would encounter a document without sufficient metadata to interpret it. E.g., registries and clients currently communicate this information via the Content-Type header, and the on-disk OCI Image Layout representation solves this by hard-coding the type of index.json, which contains descriptors for the rest of the content in the image layout.

FWIW, I agree that self-describing artifacts have some nice properties, but I don't think it's reasonable to overturn a decision without at least understanding why it was made and the pros/cons of that decision. If in doubt, I'd bias towards consistency.

michaelb990 commented 3 years ago

Seems reasonable to me, as well. I certainly didn't mean to imply that the mediaType should be removed entirely... My goal was to try and figure out why we have a field that seems to only exist for backwards compatibility reasons related to the Docker => OCI manifest spec process.

+1 to @jonjohnsonjr's point on index.json. I imagine that clients that need a local representation of an image and its related artifacts would use something similar to the Image Layout representation to do so.

Given that I believe both points are addressed, are there other concerns that we should discuss or should we go ahead and remove the field?

sajayantony commented 3 years ago

/cc @shizhMSFT - do you have any inputs here since this might affect client tooling.

SteveLasker commented 3 years ago

The manifest upgrade scenario, where pulling a v1 manifest can be pushed as a v2 manifest is the most compelling scenario to remove it. Scenario: Pull a v1 based manifest:

{
  "mediaType": "application/vnd.cncf.oras.artifact.manifest.v1+json",
  "artifactType": "org.cncf.notary.v2",
  "blobs": [],
  "subject": {
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "digest": "sha256:73c803930ea3ba1e54bc25c2bdc53edd0284c62ed651fe7b00369da519a3c333",
    "size": 16724
  },
  "annotations": {
    "foo": "bar"
  }
}

The above manifest has a digest and size based on the content, which includes: "mediaType": "application/vnd.cncf.oras.artifact.manifest.v1+json" If a client understands a v2 version of the manifest, and the registry supports a v2 manifest, the client would/should/could attempt to upgrade. However, changing the value of the mediaType would change the digest and size and all references (subject pointers) to the original manifest, and those references would no longer be valid. 😞

I've been mostly concerned around offline scenarios, where we shouldn't need to construct a "store" that knows how to store different mediaTypes. A client should be able to store the files in a single directory and know how to parse them. While it's possible to infer, based on the schema, I was hoping to avoid this requirement.

The upgrade/conversion is far more interesting, and with more digging into the code (registry, oras, notation), even for references, this doesn't look like it's a problem. So, I'm ok removing mediaType from the persistence.

The only question I have is whether we remove it before cutting draft1, or make this change in draft2.

shizhMSFT commented 3 years ago

/cc @shizhMSFT - do you have any inputs here since this might affect client tooling.

Clients should be fine with the mediaType removal. As in Docker Registry HTTP API V2 #Manifest, GET Manifest always returns the media type of the manifest through the Content-Type HTTP response header.

Servers using distribution are impacted with mediaType removal since the media type is not persisted alongside the manifest blob. As shown in the code, the implemention cannot get the media type from the blob storage, and it is a guesswork to determine the media type of the manifest. For instance, there is no effective way to determine whether a manifest is an OCI Index or an OCI Manifest. It is possible to push an OCI Manifest but server thinks it is an OCI Index, and it is possible to push an OCI Index but server thinks it is an OCI Manifest.

To improve the server implementation, it is necessary to refine the blob store interface and update its implementation accordingly.

shizhMSFT commented 3 years ago

LGTM to remove the mediaType field.

aviral26 commented 3 years ago

Artifacts that are not self-describing would introduce some challenges in the GET handler: https://github.com/distribution/distribution/blob/main/registry/storage/manifeststore.go#L74

One possible solution would be to use artifactType before deserializing to the type. Any other suggestions?

SteveLasker commented 3 years ago

Resolved with #44