ropensci / targets

Function-oriented Make-like declarative workflows for R
https://docs.ropensci.org/targets/

Platform-agnostic S3 protocol #769

Closed: wlandau closed this issue 2 years ago

wlandau commented 2 years ago

@davidkretch, you mentioned today that paws could support the S3 protocol on other cloud platforms like GCP. Would you or @adambanker be willing to walk me through that? It would really help keep targets down to a maintainable size as the number of cloud platforms increases.

targets currently uses paws to manage AWS S3 data through these basic utility functions, the most important aspects of which are version IDs and multipart uploads. If this same code can work on e.g. GCP, that would be amazing. I am willing to refactor it to let the user supply a paws::s3() object.
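
Roughly, those utilities boil down to paws calls like the following (a sketch only, not the actual internals; the bucket and key names are placeholders):

svc <- paws::s3()

# Upload an object; a versioning-enabled bucket assigns a version ID to record.
out <- svc$put_object(
  Bucket = "my-example-bucket",
  Key = "objects/x",
  Body = charToRaw("example data")
)

# Look the version ID up again later with a HEAD request.
head <- svc$head_object(Bucket = "my-example-bucket", Key = "objects/x")
head$VersionId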

@MarkEdmondson1234, I apologize if this invalidates your PRs #722 and #748.

(Note to self: if this works out, I should rename "aws" to "s3" in the code base, functions, and arguments, with smooth deprecation of course.)

wlandau commented 2 years ago

ref #720

wlandau commented 2 years ago

Is it something to do with the endpoint string mentioned at https://paws-r.github.io/docs/s3/?

MarkEdmondson1234 commented 2 years ago

I'm not sure this will be less work, as there will be many edge cases. paws uses a similar approach to the Google discovery API to generate endpoints, which are exposed in gargle via the build functions that I guess would be involved: https://gargle.r-lib.org/reference/request_develop.html
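
For comparison, a loose sketch (untested here; the bucket name and scope are assumptions) of what calling the GCS JSON API through gargle's build functions might look like:

library(gargle)

# Authenticate with a Google token rather than S3-style HMAC keys.
token <- token_fetch(scopes = "https://www.googleapis.com/auth/devstorage.read_only")

# Build and send a request against the GCS JSON API to list objects in a bucket.
req <- request_build(
  method = "GET",
  path = "storage/v1/b/{bucket}/o",
  params = list(bucket = "my-example-bucket"),
  token = token,
  base_url = "https://storage.googleapis.com"
)
resp <- request_make(req)
response_process(resp)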

For user-facing functions I favour putting sugar on top to make them easier to use, but it would be great to have a universal cloud bucket package. This was discussed previously in cloudyr but didn't turn into anything.

davidkretch commented 2 years ago

Yes, you can use Google Cloud Storage through Paws by specifying an endpoint, and you must also create an access key in Google Cloud Storage -> Settings -> Interoperability. I think that is probably one limitation that a native package would not have.

Connecting to Google Cloud Storage looks like the following, which I just tested successfully. I can't say what the edge cases will be, however.

gcs <- paws::s3(config = list(
  endpoint = "https://storage.googleapis.com",
  region = "auto",
  credentials = list(
    creds = list(
      access_key_id = "GOOGABCDEFGHIJKLMNOP",
      secret_access_key = "abcdefghijklmnopqrstuvwxyz"
    )
  )
))
gcs$list_buckets()
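
Note that the access key and secret here are the HMAC interoperability credentials from that settings page, not a service account JSON key. As a hedged follow-up (bucket and key names are placeholders), the same client should also handle the object-level calls targets needs:

gcs$put_object(
  Bucket = "my-example-bucket",
  Key = "x",
  Body = charToRaw("example data")
)
gcs$head_object(Bucket = "my-example-bucket", Key = "x")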

davidkretch commented 2 years ago

In addition, Microsoft Azure unfortunately does not support the S3 API at all, so that is not an option.

wlandau commented 2 years ago

Thanks Mark and David, very helpful to know the level at which S3 on GCP is and is not magic. In your opinion, to what degree are cloud services converging on S3 as a common standard? If I make S3 the only cloud storage in targets, how likely is that limitation to resolve itself in the long run?

davidkretch commented 2 years ago

I think the same abstractions will likely work for any of the cloud blob storage providers, so I think it's safe to plan for S3 (+ Google Cloud Storage) now. But I think Azure will eventually take extra work to support, either natively or behind a future S3-to-Azure translation layer. As far as I can tell, Microsoft doesn't plan to support the S3 API.

My evidence for that is: 1) Google Cloud Storage already supports the S3 API, and 2) while Azure Blob Storage does not support the S3 API, people have used a software proxy to communicate with their Azure buckets using the S3 API, so it must be possible to translate the S3 API's operations into equivalent Azure operations.

wlandau commented 2 years ago

paws::s3() on GCP almost works, except I cannot get version IDs for objects in version-enabled buckets. Example HEAD output:

$Location
[1] "http://storage.googleapis.com/targets-test-bucket-aaaabbbbcccc/x"

$Bucket
[1] "targets-test-bucket-aaaabbbbcccc"

$Key
[1] "x"

$Expiration
character(0)

$ETag
[1] "\"1b7b109a0572ae5c55551f673d3417c7-1\""

$ServerSideEncryption
character(0)

$VersionId
character(0)

$SSEKMSKeyId
character(0)

$BucketKeyEnabled
logical(0)

$RequestCharged
character(0)

wlandau commented 2 years ago

But the "generation" ID is somewhere in the object metadata, right? @davidkretch, is there a way to tell paws to return all the object metadata and not just the metadata that the package thinks is relevant to AWS?

wlandau commented 2 years ago

I am still interested in discussing a solution, but I am closing this issue because it seems outside the control of targets. With #803, it should be easier to add GCP as a special case using @MarkEdmondson1234's utility functions.