taskcluster / taskcluster-rfcs

Taskcluster team planning

artifact metadata #156

Closed escapewindow closed 3 years ago

escapewindow commented 4 years ago

Artifact Metadata

The goal is to provide Artifact Integrity guarantees from the point the worker uploads an artifact to the point where someone downloads that artifact for use. We can do this by:

  1. adding SHA metadata to artifacts in the Queue,
  2. ensuring that this metadata can't be modified once it's written, and
  3. providing a download tool that queries the Queue for an artifact's location and SHA, downloads the artifact, and verifies the downloaded SHA matches the SHA provided by the Queue.

Adding artifact metadata to the Queue

First, we add a metadata dictionary to the S3ArtifactRequest type. Using a dictionary allows for flexibility of usage. The initial known keys would include:

```go
ContentLength int64  `json:"contentLength"`
ContentSha256 string `json:"contentSha256"`
ContentSha512 string `json:"contentSha512"`
```

The sha256 field is required for Artifact Integrity. Releng has use cases for all 3 fields, so I'm proposing all 3.
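For concreteness, here's a minimal sketch of how a worker might compute these fields before uploading. This isn't existing worker code; the helper name and example path are made up.

```go
// Hypothetical helper: compute the proposed metadata fields for an
// artifact file before calling createArtifact. Illustrative only.
package main

import (
	"crypto/sha256"
	"crypto/sha512"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// ArtifactMetadata mirrors the proposed metadata dictionary keys.
type ArtifactMetadata struct {
	ContentLength int64  `json:"contentLength"`
	ContentSha256 string `json:"contentSha256"`
	ContentSha512 string `json:"contentSha512"`
}

func computeMetadata(path string) (ArtifactMetadata, error) {
	f, err := os.Open(path)
	if err != nil {
		return ArtifactMetadata{}, err
	}
	defer f.Close()

	h256 := sha256.New()
	h512 := sha512.New()
	// Hash the file as it exists on disk, in a single pass.
	n, err := io.Copy(io.MultiWriter(h256, h512), f)
	if err != nil {
		return ArtifactMetadata{}, err
	}
	return ArtifactMetadata{
		ContentLength: n,
		ContentSha256: hex.EncodeToString(h256.Sum(nil)),
		ContentSha512: hex.EncodeToString(h512.Sum(nil)),
	}, nil
}

func main() {
	meta, err := computeMetadata("public/build/target.tar.gz")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", meta)
}
```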

A future entry may be ContentSha256WorkerSignature, once we solve worker identity.

(Optionally we could also add a metadata dictionary to the ErrorArtifactRequest (error summary?) and RedirectArtifactRequest (live log socket info?) types, but it's not clear if we want or need those at this time.)

We could add a Queue.getArtifactInfo endpoint that returns the URL and metadata.
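If we do add that endpoint, the response could look something like the following. This is only a guess at the shape; the endpoint name comes from the suggestion above and none of the fields are settled.

```go
// Sketch of a possible Queue.getArtifactInfo response body; purely
// illustrative, since the endpoint is only a suggestion at this point.
type ArtifactInfo struct {
	StorageType string                 `json:"storageType"` // e.g. "s3"
	Name        string                 `json:"name"`        // artifact name within the task
	URL         string                 `json:"url"`         // where to fetch the artifact from
	Metadata    map[string]interface{} `json:"metadata"`    // contentLength, contentSha256, contentSha512, ...
}
```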

Ensuring that metadata can't be modified once it's written

I'm under the impression this will Just Work, given the nature of the Queue.

Providing a download tool

This is probably a thin wrapper around the taskcluster client library that gets the artifact's metadata, downloads the artifact, and verifies any shas. We should allow for both optional and required metadata fields, and fail out if any required information is missing or if a sha doesn't match. We should be sure to measure the shas and filesizes on the right artifact state (e.g. after combining a multipart artifact, and not compressed unless the original artifact was compressed).

This tool should be usable both as a commandline tool and as a library that the workers can use.
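Here's a rough sketch of the verify step, assuming the expected sha256 has already been fetched from the Queue. The function name and command shape are made up, not the actual tool.

```go
// Sketch of the download-and-verify step; illustrative only.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"os"
)

// downloadAndVerify streams the artifact to dest while hashing it, then
// fails out if the sha256 recorded in the Queue doesn't match what we got.
func downloadAndVerify(url, dest, expectedSha256 string) error {
	if expectedSha256 == "" {
		// Required metadata is missing: fail rather than trust the artifact.
		return fmt.Errorf("no contentSha256 recorded for this artifact")
	}
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	out, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer out.Close()

	h := sha256.New()
	// Hash exactly the bytes written to disk.
	if _, err := io.Copy(io.MultiWriter(out, h), resp.Body); err != nil {
		return err
	}
	actual := hex.EncodeToString(h.Sum(nil))
	if actual != expectedSha256 {
		os.Remove(dest)
		return fmt.Errorf("sha256 mismatch: expected %s, got %s", expectedSha256, actual)
	}
	return nil
}

func main() {
	// Usage sketch: verify-download <url> <dest> <sha256>
	if err := downloadAndVerify(os.Args[1], os.Args[2], os.Args[3]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```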

Once we implement worker signatures in artifact metadata, the download tool will verify those signatures as well.

Object Service

The future object service should be compatible with this proposal.


I can create an RFC once we come to an initial consensus here.

djmitche commented 4 years ago

@taskcluster/services-reviewers please share your feedback!

escapewindow commented 4 years ago

ContentLength int64 `json:"contentLength"`

I realized this may be ambiguous. There's the gzipped content length, the multipart upload content lengths, and the filesize on disk. We care about the filesize on disk, so perhaps filesize would be a better name.

jvehent commented 4 years ago

I'm under the impression this will Just Work, given the nature of the Queue.

Can someone confirm this assumption?

djmitche commented 4 years ago

Can someone confirm this assumption?

It's something we should flesh out a little here. Two parts:

jvehent commented 4 years ago

The backend storage (postgres) would be configured to not allow updates that would modify the data (so only select, insert, delete, not update)

Right, I remember the discussion point now. This isn't an immutable data structure so much as an append-only database enforced via Grants. 👍
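To make that concrete, the setup being described is roughly the following. This is only a sketch; the table, role, and connection names are hypothetical, not the Queue's actual schema.

```go
// Sketch of the "append-only via grants" idea: the service's database role
// can read, insert, and delete artifact rows, but never UPDATE them.
// Table, role, and connection details here are made up.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // postgres driver
)

func main() {
	db, err := sql.Open("postgres", "postgres://admin@localhost/taskcluster?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	stmts := []string{
		`REVOKE ALL ON TABLE queue_artifacts FROM queue_service`,
		// No UPDATE in the grant: once an artifact row (and its metadata)
		// is written, the service role cannot modify it.
		`GRANT SELECT, INSERT, DELETE ON TABLE queue_artifacts TO queue_service`,
	}
	for _, stmt := range stmts {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatal(err)
		}
	}
}
```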

escapewindow commented 4 years ago

I'm going to start working on restructuring this issue into an RFC.