Relaxing referrers API requirements

opencontainers / distribution-spec

OCI Distribution Specification

https://opencontainers.org

Relaxing referrers API requirements #357

Closed: jonjohnsonjr closed this issue 1 year ago

jonjohnsonjr commented 1 year ago

Under Listing Referrers:

The descriptors MUST include an artifactType field that is set to the value of artifactType for an artifact manifest if present, or the configuration descriptor's mediaType for an image manifest.

The descriptors MUST include annotations from the image or artifact manifest.

These requirements are rather burdensome for registry operators, particularly the annotations projection. Can we soften them to a MAY?
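To make the two quoted requirements concrete, here is a minimal sketch of what a registry would have to do per referrer. The function name `referrer_descriptor` and the fallback media type are assumptions for illustration, not anything defined by the spec or by an existing registry.

```python
import hashlib
import json


def referrer_descriptor(manifest_bytes: bytes) -> dict:
    """Build a referrers-list descriptor for one pushed manifest (illustrative only)."""
    manifest = json.loads(manifest_bytes)
    descriptor = {
        # Assumption: fall back to the image manifest media type if the field is absent.
        "mediaType": manifest.get(
            "mediaType", "application/vnd.oci.image.manifest.v1+json"
        ),
        "digest": "sha256:" + hashlib.sha256(manifest_bytes).hexdigest(),
        "size": len(manifest_bytes),
    }
    # artifactType: the manifest's own artifactType if present, otherwise
    # the config descriptor's mediaType (the image manifest case).
    artifact_type = manifest.get("artifactType") or manifest.get("config", {}).get(
        "mediaType"
    )
    if artifact_type:
        descriptor["artifactType"] = artifact_type
    # The annotations projection: copied verbatim, so its size is unbounded.
    if manifest.get("annotations"):
        descriptor["annotations"] = manifest["annotations"]
    return descriptor
```

The annotations copy is the part with no size bound, which is why it is the more expensive of the two requirements to serve.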

vbatts commented 1 year ago

hm. The artifactType field would be reasonable to expect, but the annotations one does seem like a SHOULD

sudo-bmitch commented 1 year ago

The scenario we've kept in mind for this is a signing tool that needs to sift through hundreds of signatures to deploy an image. Perhaps there's a tool generating lots of short-lived signatures, like the TUF time-stamping signature or a regular vulnerability scan.

If the registry doesn't include the annotations, then clients must pull every manifest to do client side filtering. And if registries may pick which annotations they include, then clients have to assume the annotation they need wasn't included vs missing from the artifact. So even a partial annotation list would be as useful as no list.

The resulting change to the API calls, per image being verified, goes from pulling a single matching manifest to pulling every possible artifact manifest match. Thinking of sites like Docker Hub, it would be trivial to exceed someone's rate limit by pulling and verifying the signature for a single image.
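The difference in request count can be sketched from the client's side. This is a rough model, not a real client: `find_matches`, `fetch_manifest`, and the `annotations_guaranteed` flag are hypothetical names introduced here to stand for "the spec says MUST" versus "the spec says MAY".

```python
from typing import Callable, Dict, List, Tuple


def find_matches(
    referrers: List[Dict],
    key: str,
    value: str,
    fetch_manifest: Callable[[str], Dict],
    annotations_guaranteed: bool,
) -> Tuple[List[str], int]:
    """Return (matching digests, number of extra manifest fetches)."""
    matches: List[str] = []
    fetches = 0
    for desc in referrers:
        if annotations_guaranteed:
            # A missing annotation really is missing from the artifact.
            annotations = desc.get("annotations", {})
        else:
            # The listing cannot be trusted to be complete, so every
            # candidate manifest has to be pulled and inspected.
            annotations = fetch_manifest(desc["digest"]).get("annotations", {})
            fetches += 1
        if annotations.get(key) == value:
            matches.append(desc["digest"])
    return matches, fetches
```

With guaranteed annotations the extra fetch count stays at zero and the client pulls only the matches it wants; without the guarantee it grows linearly with the number of referrers.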

jonjohnsonjr commented 1 year ago

Summarizing some of the call from today:

For the registry to return annotations for each descriptor, we need to parse them out of the artifact manifest. We already have to parse the artifact manifest on PUT to interpret subject and blobs, so we could also pull out annotations at that point, but then we'd have to store them somewhere. That's not great because they can be arbitrarily large.

Alternatively, we can pull out annotations when constructing the referrers response, but that requires fetching and parsing every artifact manifest. Not great either.

Instead of projecting annotations onto the referrers response, what if we just rely on the data field? For small artifacts, the registry can fetch the manifest, encode it as base64, and stuff it in the data field so clients have access to the annotations (and any other fields) efficiently. For large artifacts, the registry can decide not to embed it and require clients to fetch it themselves.
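A rough sketch of that idea, with the registry side and the client side together. The 4096-byte threshold and the helper names `with_embedded_manifest` and `manifest_from_descriptor` are assumptions made up for this example; a real registry would pick its own limit.

```python
import base64
import hashlib
import json
from typing import Optional

MAX_EMBED_BYTES = 4096  # hypothetical cut-off, chosen only for illustration


def with_embedded_manifest(
    descriptor: dict, manifest_bytes: bytes, max_bytes: int = MAX_EMBED_BYTES
) -> dict:
    """Registry side: embed small manifests in the descriptor's data field."""
    out = dict(descriptor)
    if len(manifest_bytes) <= max_bytes:
        out["data"] = base64.b64encode(manifest_bytes).decode("ascii")
    return out


def manifest_from_descriptor(descriptor: dict) -> Optional[dict]:
    """Client side: decode the embedded manifest, or None if it must be fetched."""
    data = descriptor.get("data")
    if data is None:
        return None
    raw = base64.b64decode(data)
    # Embedded content should still be checked against the descriptor's digest.
    if "sha256:" + hashlib.sha256(raw).hexdigest() != descriptor["digest"]:
        raise ValueError("embedded data does not match descriptor digest")
    return json.loads(raw)
```

A `None` result tells the client to fall back to a normal manifest GET for that descriptor.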

sudo-bmitch commented 1 year ago

I'm not completely opposed to using the data field, but I worry that it has some risks. The ones that come to mind:

"the registry can decide not to embed it": as soon as a registry may do something, I worry it will just skip the data field for everything regardless of size, unless there's a hard requirement.
Since the data field is significantly larger than just the annotations, I worry we'll trigger pagination sooner, increasing the number of API calls.
As usage of the data field increases in other scenarios (embedding small artifact blobs in the manifest of the artifact), the listing responses will grow in size quickly, or drop the data field, limiting the usefulness of the data field in the artifacts themselves.

My biggest concern is turning two small API calls (one for the listing, and another for the manifest) into a lot of API calls, which limits the usefulness of the API and results in implementations using their own workarounds (pushing a custom tag with their own Index and annotated descriptors).

Thinking through how registries implement this, what if instead of indexing the annotations and other descriptor content, they generated the Index manifest response every time there's a push of a manifest with a subject field. Then the DB is tracking the digest of that Index (the API response) rather than generating the content on demand. On a manifest push, either the registry uses the existing response and appends a descriptor, or regenerates the response by reprocessing the content of each manifest (slower but maybe helpful in avoiding or recovering from race conditions).
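The append-on-push variant might look roughly like the following. This is an in-memory sketch under heavy assumptions: `ReferrersStore` and its methods are invented for illustration, there is no locking, and the regenerate-from-scratch path mentioned above is left out.

```python
import hashlib
import json


class ReferrersStore:
    """Tracks one pre-built Index response per subject (illustrative only)."""

    MEDIA_TYPE = "application/vnd.oci.image.index.v1+json"

    def __init__(self) -> None:
        self.blobs = {}      # digest -> stored Index bytes
        self.index_for = {}  # subject digest -> digest of its current Index

    def _empty_index(self) -> dict:
        return {"schemaVersion": 2, "mediaType": self.MEDIA_TYPE, "manifests": []}

    def on_manifest_push(self, subject_digest: str, descriptor: dict) -> str:
        """Append the new referrer's descriptor and store the updated response."""
        current = self.index_for.get(subject_digest)
        index = json.loads(self.blobs[current]) if current else self._empty_index()
        # Re-pushing the same manifest must not duplicate its descriptor.
        if all(d["digest"] != descriptor["digest"] for d in index["manifests"]):
            index["manifests"].append(descriptor)
        body = json.dumps(index, sort_keys=True).encode()
        digest = "sha256:" + hashlib.sha256(body).hexdigest()
        self.blobs[digest] = body
        self.index_for[subject_digest] = digest
        return digest

    def referrers_response(self, subject_digest: str) -> bytes:
        """Serve the stored response without touching any artifact manifest."""
        current = self.index_for.get(subject_digest)
        if current is None:
            return json.dumps(self._empty_index(), sort_keys=True).encode()
        return self.blobs[current]
```

The read path becomes a single blob lookup; the cost moves to the write path, where concurrent pushes against the same subject would need some form of serialization.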

mikebrow commented 1 year ago

GET /v2//referrers/?subjectAnnotations=all/none/4k ?

michaelb990 commented 1 year ago

From the call, it sounded like we have several options under consideration.

Option 1 (stay the course): have registries uplift annotations into the referrers API response.
Option 2: use the data field instead and return the entire manifest.
Option 3: just return the descriptor without annotations or data, instead have clients make additional requests to get the manifests for each artifact they care about (only filtering by artifactType).

Personally, I'm open to changing this if we need to, but I would like to push for consistency. In other words, I don't want to relax everything to MAYs & SHOULDs and move on. I'd like to get to a place that provides clear guidance on how the implementation should work. If we expect clients to rely on annotations being uplifted, let's keep it a MUST. If we want to use data, let's remove the annotations uplift altogether. And finally, if we're going to strip it all out, are we okay making clients make multiple round trips in the case of multiple artifacts of the same artifactType? In the WG, one of our goals was to reduce round trips, but that doesn't necessarily mean it's the only consideration we should have. Adopting several of these options at once seems like it will cause a lot of additional (and unnecessary) complexity.

jonjohnsonjr commented 1 year ago

I think for sophisticated clients, the data field is almost always better than embedding annotations because you can avoid an additional roundtrip, and the overhead of the rest of an artifact is minimal (~200 characters per embedded descriptor * ~1.33 overhead for base64) when compared to annotations.
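The ~1.33 factor is just the base64 expansion of 3 input bytes into 4 output characters, rounded up to a padded group. A quick check of the arithmetic, using the 200-character figure from the comment above (the helper name `base64_len` is made up for this sketch):

```python
import base64
import math


def base64_len(n: int) -> int:
    """Length of the padded base64 encoding of n bytes."""
    return 4 * math.ceil(n / 3)


# 200 bytes -> 4 * ceil(200 / 3) = 4 * 67 = 268 characters
encoded = base64.b64encode(b"x" * 200)
assert base64_len(200) == len(encoded) == 268
```

So roughly 200 characters of non-annotation manifest content cost about 268 characters once embedded in the data field.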

If we would be willing to drop the annotations hoisting altogether and rely on the data field, I think we'd have better performance characteristics overall.

OTOH, this has two drawbacks I can think of:

this makes it harder to use for ad hoc querying with tools like jq, and
if an artifact embeds large content itself via the data field, it would balloon the referrers response size.

The first drawback is unfortunate, but I don't know how important that is to folks.

The second drawback seems minor, honestly. There are huge benefits to this double-embedding if the innermost content is small enough because the overhead of each HTTP request saved would outweigh the downside unless you're dealing with a huge number of artifacts and don't have any ability to filter/sort them.

Overall, I'm pretty unhappy with the referrers API, but I will give up on this for the sake of everyone else.