Support direct access to S3, etc., through URL schemes.

Applications that can access S3 or other cloud storage providers directly could be sped up (sometimes by a lot) by avoiding the staging that is currently required to make files available locally. However, this must be done in a way that still lets reflow track dependencies for cache key construction and invalidation. With S3, reflow could produce signed URLs to avoid plumbing credentials through (and to control access to external resources more tightly). To guard against files changing underneath the application, the URL should also include the assumed e-tag, and applications should try to honor it (e.g., by failing if there is an e-tag mismatch).
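As a rough illustration of the e-tag check on the application side, here is a minimal Python sketch. How the assumed e-tag would be carried in the URL is not specified above; the sketch assumes, purely for illustration, that it travels in the URL fragment as `#etag=...` (fragments are not sent to the server, so the signature is unaffected). The names `split_assumed_etag`, `check_etag`, and `ETagMismatch` are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


class ETagMismatch(Exception):
    """Raised when the object's e-tag differs from the one the URL assumed."""


def split_assumed_etag(rendered):
    """Split a rendered URL into (fetch_url, assumed_etag).

    Assumption: the e-tag is carried as a fragment, e.g. '...#etag=abc'.
    Returns None for the e-tag if the URL carries none.
    """
    parts = urlsplit(rendered)
    etag = parse_qs(parts.fragment).get("etag", [None])[0]
    fetch_url = urlunsplit(parts._replace(fragment=""))
    return fetch_url, etag


def check_etag(assumed, observed):
    """Fail loudly if the observed e-tag does not match the assumed one."""
    if assumed is None:
        return  # nothing to verify
    # HTTP ETag header values are usually quoted; compare without quotes.
    if assumed.strip('"') != observed.strip('"'):
        raise ETagMismatch(f"expected {assumed!r}, got {observed!r}")
```

An application would fetch `fetch_url`, read the `ETag` response header, and call `check_etag` before trusting the data. Where the storage provider supports conditional requests, sending the assumed e-tag in an `If-Match` header would let the server reject the read instead.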

There are a number of possible ways to provide this functionality.

1. We could provide a builtin function, url, which renders a signed URL from a file or directory where supported. This could return a tuple (string, bool) indicating whether a URL could be constructed.
2. We could make it an option on execs: exec(..., urls := ["s3", "gcs"]), indicating the set of storage providers that the exec supports natively. If a file can be accessed directly through a supported storage provider, a URL is rendered instead of a local file path. If, for whatever reason, a URL cannot be rendered (e.g., signing failed, or the storage provider doesn't support external access), then the file (or directory) is staged and the file path is rendered instead.
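To make the two shapes concrete, here is a small Python model of the semantics described above. It is only a sketch, not reflow code: `File`, `SigningError`, and the `sign` and `stage` callables are hypothetical stand-ins for whatever reflow would use internally.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set, Tuple
from urllib.parse import urlsplit


class SigningError(Exception):
    """Hypothetical error raised when a URL cannot be signed."""


@dataclass
class File:
    # Remote origin such as "s3://bucket/key"; None for an exec output.
    source: Optional[str]


def url(f: File, sign: Callable[[str], str]) -> Tuple[str, bool]:
    """The builtin: return (signed_url, True), or ("", False) if impossible."""
    if f.source is None:
        return "", False
    try:
        return sign(f.source), True
    except SigningError:
        return "", False


def render(f: File, supported: Set[str],
           sign: Callable[[str], str],
           stage: Callable[[File], str]) -> str:
    """The exec option: prefer a URL, fall back to staging a local path."""
    if f.source is not None and urlsplit(f.source).scheme in supported:
        rendered, ok = url(f, sign)
        if ok:
            return rendered
    # Unsupported provider or signing failure: stage and render the path.
    return stage(f)
```

With the builtin, the caller sees the boolean and decides what to do; with the exec option, the fallback to staging is implicit and the exec receives either a URL or a path.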

I think the second provides better ergonomics, though it poses some challenges if you want to mix URL and local access. We could also provide some ways of overriding this.

I can think of three conditions for the url builtin:

1. The file is only available in a cloud bucket, and not locally on the alloc or in the reflow cache.
2. The file is not available in a cloud bucket, because it is an output from some other exec.
   2.1. The file is available in the same alloc as this exec.
   2.2. The file is not available in the alloc, but is available in the reflow cache, so it can be accessed directly from there through a signed URL.
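The conditions above can be written down as a small classifier, which also makes it easy to see which combinations they leave uncovered. This is a hypothetical Python sketch; the three boolean inputs and the `Availability` names are illustrative and do not correspond to any reflow API.

```python
from enum import Enum
from typing import Optional


class Availability(Enum):
    BUCKET_ONLY = "1"    # only in a cloud bucket
    SAME_ALLOC = "2.1"   # exec output, present in this exec's alloc
    CACHE_ONLY = "2.2"   # exec output, only in the reflow cache


def classify(in_bucket: bool, in_alloc: bool,
             in_cache: bool) -> Optional[Availability]:
    """Map a file's locations onto the listed conditions.

    Returns None for combinations the list does not cover, such as a
    bucket file that also happens to be present on the alloc.
    """
    if in_bucket and not in_alloc and not in_cache:
        return Availability.BUCKET_ONLY
    if not in_bucket and in_alloc:
        return Availability.SAME_ALLOC
    if not in_bucket and not in_alloc and in_cache:
        return Availability.CACHE_ONLY
    return None
```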

I expect it to be rare for a file with a remote URL to already be available on the alloc because another exec was using it. But in such situations, if reflow does not handle this transparently, option 1 gives users more control to optimize for the most common resource allocation and execution scenarios.

grailbio/reflow #62