taskcluster / taskcluster-rfcs

Taskcluster team planning
Mozilla Public License 2.0

Object service (second round) #120

Closed jhford closed 6 years ago

jhford commented 6 years ago

The discussion in #116 is getting really hard to follow. This is a new RFC to continue after the first couple rounds of feedback. If things from that PR aren't addressed here, it was missed due to confusion, so please mention it again here and we can continue the discussion.

jhford commented 6 years ago

From #116, responding to @djmitche:

A few editing things inline. There's nothing in here I don't like, just stuff I don't see addressed!

Remaining concerns from my perspective (reiterating some from my previous comment):

Are the object service's API methods idempotent? Could/should we make them so?

I don't see why we can't or shouldn't, so I can clarify that we will make them so.

How does the object service say "yep, I already have that file" if deduplication were implemented?

If we deduplicate or do content-addressable storage, I think the creation endpoint should send a response with an empty list of requests to run, but otherwise follow the same flow. The later complete endpoint would also be safe to run again; neither endpoint would change the object itself, other than maybe some metadata about expiration.

Regarding content-addressable storage and deduplication, I think we'd need a layer on top of the actual blob. The stored blob would have an absolute expiration, and we'd need a second layer of names, with each name having its own associated expiration time. The absolute expiration is the latest of the names' expirations. We want to make sure that just because two files are identical, one cannot extend the lifetime of the other's name.
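A minimal sketch of what that two-layer model could look like (all names, fields, and dates here are illustrative, nothing final):

const blob = {
  contentSha256: 'abcd1234',
  absoluteExpiration: null, // derived below; never extended by an unrelated upload
};

const names = [
  { name: 'public/build/target.zip', contentSha256: 'abcd1234', expires: new Date('2019-01-01') },
  { name: 'public/build/other.zip', contentSha256: 'abcd1234', expires: new Date('2019-06-01') },
];

// The blob must outlive every name that references it, so its absolute
// expiration is the latest expiration among those names.
blob.absoluteExpiration = new Date(Math.max(
  ...names
    .filter(n => n.contentSha256 === blob.contentSha256)
    .map(n => n.expires.getTime()),
));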

The DELETE-caches endpoint doesn't make much sense in a deduplicated world -- the DELETE operation is really only deleting the name, not (necessarily) the underlying data. I know deduplication is not an initial feature here, but I think it is wise to build an API that is compatible with deduplication.

But even in a deduplicated world, we'd likely only transfer a single copy of each object to caches. In the case where we have an object with hash abcd123 and names A and B, a cache clear request for A would delete the cached copy of the object with hash abcd123, thus clearing the cache for A and B, since any underlying issue affecting A would also be affecting B.

What are the timing constraints on an upload?

We don't have any offhand right now; I think we should set those based on real-world testing.

What happens if the post-upload method isn't called? I assume that we'll reflect what S3 does and time that upload out after some long duration. What is that duration?

If the post-upload method is not called, a request to get the artifact would return a 400-series error, since the state of the upload is not known. The way S3 artifacts work today, multipart upload attempts can be configured to expire after a set time period, and single-part uploads which are never completed could be found in the database for cleanup fairly easily.
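In very rough pseudo-handler form (the lookup helper, the state field, and the specific 400-series status are only illustrative):

// Illustrative only: db.getObject, obj.state, and db.signedGetUrl are made-up names.
async function getObject(db, name) {
  const obj = await db.getObject(name);
  if (!obj || obj.state !== 'complete') {
    // The upload was started but the post-upload (complete) call never arrived,
    // so the state of the data is unknown; refuse with a client error.
    return { status: 409, body: { code: 'ObjectNotComplete', name } };
  }
  return { status: 302, headers: { location: await db.signedGetUrl(obj) } };
}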

Does this service just distinguish "public" and "private" or does it have other levels of access control?

I think we should use a single level: private. The idea would be that all requests to this service are signed. The request to queue is the only request that should be user facing, and by gating all access to artifacts through the queue, we're ensuring that the S3 buckets are used only in ways that we support. This would mean that a request for an artifact stored in us-west-2 from a worker in us-west-2 would:

By doing this, we ensure that the only way to access artifacts publicly is through the Queue's API. This is dependent on performance of the object stores. Doing that means that we don't need to worry about people reverse-engineering our S3 storage names, and can make changes to the scheme without breaking compatibility.

I'm not sure how signing should work with this service, but I'd like to make it so that signed requests to this service must be initiated within 30 minutes of signing, and that they be valid only for a specific method of a specific endpoint.
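Purely as a sketch, assuming a shared secret between the queue and this service (none of these names are final):

const crypto = require('crypto');

// Sign a URL that is valid for 30 minutes and bound to one method + path.
function signUrl(secret, method, path, issuedAt = Date.now()) {
  const expires = issuedAt + 30 * 60 * 1000; // usable for at most 30 minutes
  const payload = `${method.toUpperCase()} ${path} ${expires}`; // bind method + endpoint
  const sig = crypto.createHmac('sha256', secret).update(payload).digest('hex');
  return `${path}?expires=${expires}&sig=${sig}`;
}

// Verify the signature and the 30-minute window on the object service side.
function verifyUrl(secret, method, path, expires, sig) {
  if (Date.now() > Number(expires)) {
    return false; // past the 30-minute window
  }
  const payload = `${method.toUpperCase()} ${path} ${expires}`;
  const expected = crypto.createHmac('sha256', secret).update(payload).digest('hex');
  const a = Buffer.from(sig, 'hex');
  const b = Buffer.from(expected, 'hex');
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}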

For private artifacts, how can we remove the authentication information from the URL?

Based on the above, all of the distinction between public and private would happen in the queue. The only time authentication information would be given out would be during 302 redirects, and those URLs would expire after a delay. In order to support redirects, we'd need to use signed URLs instead of the Authorization header: we'd end up with two redirects, and the Authorization header value would need to change at each hop, which I'm pretty sure isn't possible to do while preserving interoperability with standard HTTP clients.

What's the reason for wanting to drop this from the redirected URLs?

How are requests to the object service authenticated?

I haven't figured that out yet, but I'd like to sign URLs in the queue and validate them in this service, then have this service generate the signed URLs for the underlying services.

jhford commented 6 years ago

@djmitche Yah, that's a fair point. I'd like to be safe by default, so maybe we could do something like a public=true query parameter on artifact creation? Since we're exposing that level of visibility control, I wonder whether we'd also want something which can switch an object between public and not public.

djmitche commented 6 years ago

a public=true query parameter for artifact creation

This would be for future-proofing later deduplication? It wouldn't hurt -- the model is simple and already proven in tooltool.

Jonas noted that quite likely non-Mozilla users of Taskcluster will want all of their artifacts to be private.

jhford commented 6 years ago

This would be for future-proofing later deduplication? It wouldn't hurt -- the model is simple and already proven in tooltool.

Sorry, this is regarding having URLs which are public without any auth stuff, specifically to support the ctrl+c/v usecase you mentioned. I'd rather that all objects in this service are 100% private 100% of the time, and leave public urls as a Queue feature.

Jonas noted that quite likely non-Mozilla users of Taskcluster will want all of their artifacts to be private.

Yep, no argument here, and that was sort of what I was leaning towards anyway, aside from the ctrl+c/v usecase you mentioned.

djmitche commented 6 years ago

Sorry, this is regarding having URLs which are public without any auth stuff, specifically to support the ctrl+c/v usecase you mentioned. I'd rather that all objects in this service are 100% private 100% of the time, and leave public urls as a Queue feature.

I don't understand how that query argument would work, then - at what URL would the GET redirect chain terminate?

jhford commented 6 years ago

I've just pushed a new draft of the document. @djmitche @petemoore @jonasfj please take a look when you have a chance.

@djmitche it seems the outcome of our discussion about signed URLs for the artifacts ultimately being stored was that services which automatically post links to artifacts should start posting queue.taskcluster.net URLs instead.

We also found that the wrapper proof-of-concept that was written would be acceptable, with the main concern being how to distinguish browser vs. API users. If needed, we could use that.

From a security standpoint, we should probably disallow serving HTML from artifacts to browsers. We should be able to set a CSP which forces a sandboxed environment for any HTML served from our domains, so that scripts aren't executed.
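For example, something along these lines on object responses (the exact policy string is not decided, and res is just a Node http.ServerResponse):

// The CSP sandbox directive blocks script execution and other active content
// for any HTML we end up serving from our domains.
function setArtifactSecurityHeaders(res) {
  res.setHeader('Content-Security-Policy', "sandbox; default-src 'none'");
  res.setHeader('X-Content-Type-Options', 'nosniff');
}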

jhford commented 6 years ago

@jonasfj and I had a meeting today about the upload-info endpoint and we feel like we've come to a good solution. We're thinking that what we should do is:

With this, we don't need to worry about having the extra HTTP request, we don't need to worry about caching the /upload-info endpoint, and in general we just don't need to worry.

djmitche commented 6 years ago

So basically the 400 response fills the role of the /upload-info response, but only when the suggested partsize doesn't work. That will need care to make sure that lib-artifact's fallback behavior is well-tested, but that's doable.
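Roughly the client-side fallback I have in mind, as a sketch only (DEFAULT_PART_SIZE, suggestedPartSize, api.createObject, and makeParts are all placeholder names):

const DEFAULT_PART_SIZE = 32 * 1024 * 1024; // placeholder preferred part size

// Propose parts at our preferred size; if the service rejects them with a
// 400-style response carrying corrected constraints, re-chunk and retry once.
async function createUpload(api, metadata, makeParts) {
  let parts = makeParts(metadata.contentLength, DEFAULT_PART_SIZE);
  let res = await api.createObject({ ...metadata, parts });
  if (res.status === 400 && res.body && res.body.suggestedPartSize) {
    parts = makeParts(metadata.contentLength, res.body.suggestedPartSize);
    res = await api.createObject({ ...metadata, parts });
  }
  return res; // on success, contains the signed requests to run against the backend
}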

jhford commented 6 years ago

So basically the 400 response fills the role of the /upload-info response, but only when the suggested partsize doesn't work. That will need care to make sure that lib-artifact's fallback behavior is well-tested, but that's doable.

Yep, it's one of the complexities for sure, but I think it's better than doing the extra request if it's not completely needed.

In general we should aim for a nice REST API, but for this service, because of the number of moving parts, we really should push our artifact-handling library and CLI tool as the supported way for users to interact with artifacts, rather than the underlying REST API.

jhford commented 6 years ago

I believe that all input has been addressed at this point. I'm going to move this into its final comment period until at least Tuesday. Please let me know if there's anything still open for debate.

catlee commented 6 years ago

This may be out of scope for the service itself, but one thing I don't see addressed is how to manage objects over time.

Given past experience with needing to clean up old data from NFS shares or S3 buckets, it's super useful to have some metadata associated with the objects indicating e.g. where they came from, who created them, and if they're still actively used.

If there's anything that can be added to the object service to make long term management of the content easier, then we should definitely consider it!

jhford commented 6 years ago

@catlee

This may be out of scope for the service itself, but one thing I don't see addressed is how to manage objects over time.

The request body on creation

{
  origin: '10.10.10.10', // or a backend identifier such as 's3_us-west-2'
  contentType: 'application/json',
  contentLength: 1234,
  contentSha256: 'abcdef1234',
  transferLength: 1234,
  transferSha256: 'abcdef1234',
  contentEncoding: 'identity', // e.g. 'identity' or 'gzip'
  expiration: new Date(),
  parts: [{
    sha256: 'abcd1234',
    size: 1024,
  }, {
    sha256: 'abcd1234',
    size: 1024,
  }],
}

has an expiration parameter. This parameter would be used either to set an expiration policy (e.g. on S3) or by a cleaner process that goes through and deletes things that have lived past their expiration date.
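The cleaner process could be as simple as this sketch (db.findExpiredObjects, backend.deleteObject, and db.removeObject are made-up names for illustration):

async function expireObjects(db, backend, now = new Date()) {
  const expired = await db.findExpiredObjects(now); // everything past its expiration
  for (const obj of expired) {
    await backend.deleteObject(obj); // remove the stored bytes first
    await db.removeObject(obj.name); // then drop the metadata row
  }
}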

I should add a section to the document which explains this. Thanks for raising this concern!

jhford commented 6 years ago

@indygreg @ccooper does this comment and the specific comment replies address your concerns?

I have high-level concerns about how much the specification leans on advanced features of HTTP while skirting specification compliance. That could come back to bite you. Especially if consumers of this service will be "advanced" user agents (like browsers) that have expectations about spec conformance.

There's no expectation that a browser would directly perform an upload ever. While we're defining the API at the HTTP level, our supported interface will be libraries and a CLI tool. This is because of the complexity required in having uploaders upload directly to backends instead of having to write an ingestion layer. Even if a browser were to do an upload, it would be required to do it using things like XHR and a full implementation of the uploading flow.

I agree that we should aim for spec compliance, but the only place where full HTTP spec compliance would matter to a browser or other advanced client is on object retrieval, which I think we're doing correctly.

One area where I'd like to see more attention is that of end-to-end verification. When uploading, the SHA-256 and content length are specified. That's great! But what measures are in place to ensure content integrity after upload?

In the S3 API, that's already addressed. We generate signatures which are only valid for payload bodies that have the exact SHA-256s we've supplied in the request body to this API.

In other words, if you have a Zip file with hash abcdef123, and you ask to upload that zip file, the actual calls to S3 will fail if the request body you provide doesn't hash exactly to abcdef123. Within the context of S3, it is not possible to upload incorrect data (barring S3 breaking their defined API). This is implemented using the standard AWS request signing procedure.
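For reference, the canonical request that gets signed under AWS SigV4 includes the hashed payload, which is what binds the signature to the exact bytes; a simplified illustration, with placeholder values throughout:

const canonicalHeaders =
  'host:some-bucket.s3.us-west-2.amazonaws.com\n' +
  'x-amz-content-sha256:abcdef123\n';

const canonicalRequest = [
  'PUT',                          // the method the signature is valid for
  '/some-object',                 // canonical URI
  'partNumber=1&uploadId=u-456',  // canonical query string (placeholders)
  canonicalHeaders,               // canonical headers, each terminated by \n
  'host;x-amz-content-sha256',    // signed headers
  'abcdef123',                    // hashed payload: the sha256 we were given at creation
].join('\n');

// Sending a body whose SHA-256 differs from the hashed payload above produces a
// different signature, so S3 rejects the request.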

This lets us have our workers upload directly to S3 without having to implement our own artifact ingestion system, which is really important to being able to scale our service.

We will always attempt to support the same for other services/backends, but in the cases where that's not possible, it would be handled in the complete upload endpoint (PATCH in the current draft) as needed. Worst case, the complete upload endpoint for services which do not support these features could fetch and verify the uploaded object before marking the object as complete.

I'd really like it if the storage service stored the content length and hash and then exposed these on retrieval so any consuming agent could validate the integrity of the data. Detecting data corruption can be very useful and it would be great if the GET response contained the hash so a consumer could optionally perform integrity checking.

That is what this service is. It does everything you're wanting, and it already does so in the current blob artifact in the queue, if you'd like to see the basis for this design. S3 does not provide this information, and the only time they handle SHA-256 computations is in their auth layer, which is never persisted in a user-visible way. Even their ETag values for single-part uploads are only an MD5 hash, and their docs state never to rely on them being a hash of the object.

On the subject of content integrity checking, a single hash is better than no hash. But if you are a security conscious consumer, you should validate the integrity of data before doing any processing.

Yes, and that's how taskcluster-lib-artifact[-go] are implemented. These libraries are the fully supported interface to artifacts. It's not 100% deployed yet, but will be.

Otherwise malformed data could exploit security vulnerabilities in code that processes that data. When you have large inputs, this means buffering large amounts of data (to memory or disk) so you can compute the hash before doing anything with it. This is wasteful of resources and reduces the time before data can be processed. A workaround is to compute the rolling hash and then send N hashes of the data. The consumer can buffer up to the rolling hash amount (say 32 MB) and process that data sooner without having to wait for the full payload to arrive. This is super useful for achieving streaming processing in security sensitive contexts. I hope we could somehow incorporate a rolling hash into the object service to achieve these desirable properties.

Yes, but in the normal case where there is no corruption and no exploit, we're going to have to have the whole resource on disk anyway. We'd never request downloading a resource just to verify that it's valid (except possibly inside this service). That would actually be a mistake to implement since the first request could send different bytes to the ones which get saved to disk on the second request.

If you were downloading an xz file and wanted to stream it directly to an uncompressed version, the better approach would be to use something like tee (or in Go an io.MultiWriter) to send each received byte to a SHA256 calculator and an XZ decompressor. This is actually how the taskcluster/taskcluster-lib-artifact-go works to handle this exact case, with content-encoding gzip decompression.
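In Node terms, the equivalent of that tee / io.MultiWriter approach looks roughly like this (filenames and the expected hash are placeholders, and it uses gzip rather than xz since that's the content-encoding case we handle):

const crypto = require('crypto');
const fs = require('fs');
const zlib = require('zlib');
const { pipeline } = require('stream');

function downloadAndVerify(src, dest, expectedTransferSha256, callback) {
  const hash = crypto.createHash('sha256');
  const input = fs.createReadStream(src);
  input.on('data', chunk => hash.update(chunk)); // every transferred byte goes to the hash...
  pipeline(input, zlib.createGunzip(), fs.createWriteStream(dest), err => {
    // ...while also flowing through the decompressor to disk
    if (err) {
      return callback(err);
    }
    const actual = hash.digest('hex');
    callback(actual === expectedTransferSha256 ? null : new Error(`hash mismatch: ${actual}`));
  });
}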

If you really didn't want to send the data through an XZ decompressor before validating the hash, you'd by definition also have to write the full file to disk to ensure that the file you hash is exactly the file you decompress.

If we're experiencing hashes which do not match, wasted data transfer is the absolute least of our concerns. I think that doing rolling hashes is an interesting idea, but in this instance it feels like a premature optimization.

jhford commented 6 years ago

@djmitche Github won't let me reply inline for some reason

I don't think we use it without a body, though -- that could conceivably cause issues with HTTP implementations.

That sentence was written before updating the method used. I've slightly reworked it to state that a zero-length body should be used.

I think the confusion here is that it looks like "abcd123", "u-123", and "123" are related.

Very possible. I've changed the example to reduce the reuse of 123.

jhford commented 6 years ago

@indygreg when you have a chance, can you let me know if there are any other issues?

If there's not, let's call this decided.