grailbio / reflow

A language and runtime for distributed, incremental data processing in the cloud
Apache License 2.0

Support direct access to s3, etc., through URL schemes. #62

Open mariusae opened 6 years ago

mariusae commented 6 years ago

Applications that can read directly from S3 or other cloud storage providers could run faster (sometimes by a lot) if they skipped the staging step that is currently required to make files available locally. However, this must be done in a way that still lets reflow track dependencies for cache key construction and invalidation. With S3, reflow could produce signed URLs, which avoids plumbing credentials through and gives tighter control over access to external resources. To guard against files changing underneath the application, the URL should also include the assumed e-tag, and applications should try to honor it (e.g., by failing if there is an e-tag mismatch).
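The e-tag idea above can be sketched roughly as follows. This is only an illustration, not reflow's implementation: it assumes a hypothetical convention of carrying the assumed e-tag in the URL fragment (`#etag=...`), chosen here because fragments are not sent to the server and so would not disturb a signature over the rest of the URL. The names `attach_etag`, `split_etag`, `verify_etag`, and `ETagMismatch` are made up for this example.

```python
from typing import Optional, Tuple
from urllib.parse import urldefrag

# Hypothetical fragment convention; neither reflow nor S3 defines this.
ETAG_PREFIX = "etag="


class ETagMismatch(Exception):
    """Raised when the object's e-tag differs from the one the URL assumed."""


def _normalize(etag: str) -> str:
    # ETag headers are usually quoted, e.g. '"abc123"'; compare unquoted values.
    return etag.strip().strip('"')


def attach_etag(signed_url: str, etag: str) -> str:
    """Return signed_url with the assumed e-tag recorded in its fragment."""
    base, _ = urldefrag(signed_url)
    return base + "#" + ETAG_PREFIX + _normalize(etag)


def split_etag(url: str) -> Tuple[str, Optional[str]]:
    """Split a URL into (url without fragment, assumed e-tag or None)."""
    base, fragment = urldefrag(url)
    if fragment.startswith(ETAG_PREFIX):
        return base, fragment[len(ETAG_PREFIX):]
    return base, None


def verify_etag(url: str, response_etag: str) -> str:
    """Fail if the e-tag reported by the store differs from the assumed one.

    Returns the fragment-free URL so the caller can keep using it.
    """
    base, expected = split_etag(url)
    if expected is not None and _normalize(response_etag) != expected:
        raise ETagMismatch(
            "expected e-tag %r, got %r" % (expected, _normalize(response_etag))
        )
    return base
```

An application would call `verify_etag` with the `ETag` header from its first response and abort on `ETagMismatch`, which is the "fail on mismatch" behavior suggested above.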

There are a number of possible ways to provide this functionality.

I think the second provides better ergonomics, though it poses some challenges if you want to mix URL and local access. We could also provide ways to override this.

siddharthab commented 6 years ago

I can think of three conditions for the url builtin:

  1. The file is only available in a cloud bucket and not locally on the alloc, or in the reflow cache.
  2. The file is not available in a cloud bucket, because it is an output from some other exec.
     2.1. The file is available in the same alloc as this exec.
     2.2. The file is not available in the alloc, but is available in the reflow cache, so it can be accessed directly from there through a signed url.

I expect it to be rare for a remote URL-based file to already be available on the alloc because another exec was using it. But in such situations, if reflow does not handle it transparently, then option 1 will give users more control to optimize for the most common resource allocation and execution scenarios.
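To make the conditions above concrete, here is a minimal sketch of how a resolver might choose what to hand to an exec. It is not reflow code. `FileLocations` and `resolve` are hypothetical names, and the preference order (local alloc copy, then reflow cache, then bucket) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileLocations:
    """Where one file can currently be found (all fields hypothetical)."""
    bucket_url: Optional[str] = None  # condition 1: signed URL into a cloud bucket
    alloc_path: Optional[str] = None  # condition 2.1: path on the same alloc
    cache_url: Optional[str] = None   # condition 2.2: signed URL into the reflow cache


def resolve(loc: FileLocations) -> str:
    """Pick the reference to give the exec, preferring local access."""
    if loc.alloc_path is not None:
        # Already on the alloc: no remote access needed.
        return "file://" + loc.alloc_path
    if loc.cache_url is not None:
        # Output of another exec, reachable only through the cache.
        return loc.cache_url
    if loc.bucket_url is not None:
        # Only available in the original cloud bucket.
        return loc.bucket_url
    raise LookupError("file is not available in any known location")
```

Handling the rare overlap case transparently would amount to a rule like the first branch; leaving it to users would instead mean exposing the choice explicitly.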