taskcluster / taskcluster-rfcs

Taskcluster team planning
Mozilla Public License 2.0

rfc 0158 - artifact metadata #158

Closed escapewindow closed 4 years ago

escapewindow commented 4 years ago

Fixes #156 .

djmitche commented 4 years ago

@ajvb it'd be great to hear from you here, too (I just can't flag you for review)

escapewindow commented 4 years ago

> Wasn't this implemented already as the blob storage type?
>
> See https://github.com/taskcluster/taskcluster-lib-artifact-go#overview

Yes, but AIUI the blob storage type was deprecated. Because everything is S3Artifacts and is likely to stay that way for the foreseeable future, let's add the SHA metadata there.

@petemoore does this answer your question?

djmitche commented 4 years ago

> My recollection is that blob storage was fully implemented

It was partially implemented and never deployed, and had some unaddressed design issues that would have been difficult to address without breaking changes. It remains in version control as a source of inspiration and maybe some code. It was removed from master so that it didn't accidentally come into use, and to minimize the number of migration paths we'll need to support when we build the object service.

escapewindow commented 4 years ago

> My recollection is that blob storage was fully implemented, with clients developed for both node.js and go (and possibly even python). It also had some nice features, such as supporting concurrent multipart uploads in addition to SHA validation, but at some point the code was removed from the Queue, so it is no longer available server-side. I don't have context on why it was removed, but in any case it is likely to have bitrotted by now if we were to add it back. I see it as a shame; it was a good library, and we had tests demonstrating that it functioned well. No doubt there is some context I am missing, though.

It sounded good, and I would have happily used it if it weren't deprecated. (And if it worked properly :)

> It would be useful to describe in the RFC how backward compatibility would be maintained for existing artifacts (will we retrospectively add SHAs to existing artifacts, or just make them optional?)

I think they should be optional, especially for older artifacts. A cluster-wide setting for new S3Artifacts could be useful (`requireS3ArtifactMetadata`?). A worker setting may be sufficient here: we could set it on all firefoxci cluster worker pools, and possibly enforce it via ci-admin checks.

Other questions:

> * Are there any existing API endpoints that need updating that already provide artifact data?

The RFC currently specifies only the new `Queue.getArtifactInfo` endpoint. I think we will update the S3Artifact request and response schemas to allow extra optional metadata, but I believe that would work with the existing `createArtifact` endpoint.
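As a sketch of what that might look like, a `createArtifact` request could carry the new fields alongside the existing S3 artifact properties. The `metadata` key and its contents are assumptions from this RFC, not the current API; the digest shown is the sha256 of an empty file, paired with a matching filesize of 0, purely for illustration:

```json
{
  "storageType": "s3",
  "expires": "2021-07-15T12:00:00.000Z",
  "contentType": "application/gzip",
  "metadata": {
    "sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    "filesize": 0
  }
}
```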

To take full advantage of this metadata, we would essentially need to create a download tool that verifies the metadata in the queue matches the artifact we downloaded, and use it everywhere.
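A minimal sketch of the verification step such a download tool would perform, assuming the queue metadata supplies a sha256 hex digest (the function and parameter names here are hypothetical, not part of any Taskcluster client):

```python
import hashlib


def verify_artifact(path, expected_sha256):
    """Check a downloaded artifact against the sha256 recorded in the queue.

    Hypothetical sketch: `expected_sha256` would come from the proposed
    Queue.getArtifactInfo response; the exact response shape is an assumption.
    Raises ValueError on mismatch so callers can fail the task loudly.
    """
    h = hashlib.sha256()
    # Stream the file in chunks so large artifacts don't need to fit in memory.
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    actual = h.hexdigest()
    if actual != expected_sha256:
        raise ValueError(
            f"sha256 mismatch: expected {expected_sha256}, got {actual}"
        )
    return True
```

The same pattern would extend to sha512 and a filesize comparison; the key design point is that verification happens client-side, after download, against metadata the worker recorded at upload time.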

> * Is the metadata dictionary string -> string only?

Could be. The sha256/sha512 entries would be string -> string. The filesize entry could be an int or a stringified int; this line in the RFC currently says filesize is an int.

> * Will there be a json schema for the metadata dictionary?

Yes, for all supported metadata. Currently that's just the three entries here, all of them optional; again, I'm thinking we can enforce which ones we set per worker.
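For illustration, a JSON schema along these lines could cover the three entries discussed above, all optional. The exact schema is an assumption, not something the RFC has finalized:

```json
{
  "type": "object",
  "properties": {
    "sha256": { "type": "string", "pattern": "^[a-f0-9]{64}$" },
    "sha512": { "type": "string", "pattern": "^[a-f0-9]{128}$" },
    "filesize": { "type": "integer", "minimum": 0 }
  },
  "additionalProperties": false
}
```

With all properties optional at the schema level, per-worker enforcement (e.g. the hypothetical `requireS3ArtifactMetadata` setting above) would decide which entries are actually required in a given cluster.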

Does that make sense?

escapewindow commented 4 years ago

> LGTM
>
> I'd like to see the detailed breakdown of how this replaces part of CoT (https://github.com/taskcluster/taskcluster-rfcs/pull/158/files#r454092603) within the RFC so that it is easy to refer back to later

How's https://github.com/taskcluster/taskcluster-rfcs/pull/158/commits/e11edc22a8400e4b974364f9f313b5fc25745c32 ?

petemoore commented 4 years ago

> > My recollection is that blob storage was fully implemented
>
> It was partially implemented and never deployed, and had some unaddressed design issues that would have been difficult to address without breaking changes. It remains in version control as a source of inspiration and maybe some code. It was removed from master so that it didn't accidentally come into use, and to minimize the number of migration paths we'll need to support when we build the object service.

@djmitche can you please provide more substantive context on the unaddressed issues? To the best of my knowledge, the following two libraries were implemented and working:

https://github.com/taskcluster/taskcluster-lib-artifact https://github.com/taskcluster/taskcluster-lib-artifact-go

They interfaced with the server-side implementation, which suggests the server-side implementation was also working. Certainly, it was reviewed and merged to master, and at the time John considered it to be feature complete. I was also successfully using it, and there was a suite of unit and integration tests that suggested it was working and feature complete.

The only missing piece I am aware of is that the workers were not updated to use the new artifact upload/download library. However, it sounds like you have insight into issues that I am not aware of, so I am interested to understand those better, especially if addressing them would be a smaller task than rewriting everything from scratch.

It feels like Aki is going to have a lot of work to do to reimplement everything, so some very specific details about missing functionality would be useful, and a link to the server-side implementation that you deleted may help simplify things for the reimplementation.

@escapewindow Hopefully the links above for the client side libraries and upload/download tools will save you some time and effort too.

escapewindow commented 4 years ago

+1. If blob artifacts work, can be rolled out everywhere for FirefoxCI, and we can provide guarantees that the hashes can't be altered post-upload, then it looks like that would be sufficient for my purposes.

escapewindow commented 4 years ago

As I understand it:

Current workers don't upload multipart and don't verify that uploads completed without corruption. Outside of Chain of Trust and other non-platform-supported options, we don't verify download hashes either.

BlobArtifacts, if fully implemented and rolled out, would speed up uploads by supporting multipart uploading, which may result in run-time cost savings. We would verify uploads at upload time. We could mark the uploading task as failed or exception on upload failure, rather than moving bustage downstream to the downloading task(s). We could also obsolete this RFC, since BlobArtifacts require sha256 hashes for each artifact and provide a download tool.

However, we would need to spend some engineering effort to complete BlobArtifact support, and that effort may exceed the effort of adding extra information to the existing S3Artifact type. BlobArtifacts are also S3-specific; supporting other cloud providers would take significant additional work.

We may need to prioritize either completing BlobArtifacts or building the object service. The object service would support multiple cloud platforms, provide cross-region and cross-cloud mirroring, and improve artifact maintenance and culling; these would result in storage and bandwidth cost savings.

escapewindow commented 4 years ago

Should we merge this?

djmitche commented 4 years ago

Yep, thanks for the bump.