ropensci / targets

Function-oriented Make-like declarative workflows for R
https://docs.ropensci.org/targets/

Trouble declaring the region of an AWS bucket #681

Closed: caewok closed this issue 3 years ago

caewok commented 3 years ago

Description

I am using a Backblaze S3-compatible bucket. For this to work, two things need to happen:

  1. The aws.s3 package must be the newest version, 0.3.22, which can be installed with install.packages("aws.s3", repos = "http://rforge.net/", type = "source").
  2. Either options("cloudyr.aws.default_region" = "") must be set or region = "" must be passed to functions like get_bucket.

The issue I am running into is that the package's tar_resources_aws() function ignores the cloudyr setting and instead adds an erroneous region to the S3 request.

Reproducible example

Take the S3 example from the targets documentation. Assume we have a bucket, which I will call here "LoanModeling," and the following _targets.R file:

library(targets)

tar_option_set(
  resources = tar_resources(
    aws = tar_resources_aws(bucket = "LoanModeling")
  )
)

write_mean <- function(data) {
  tmp <- tempfile()
  writeLines(as.character(mean(data)), tmp)
  tmp
}

list(
  tar_target(data, rnorm(5), format = "aws_qs"),
  tar_target(mean_file, write_mean(data), format = "aws_file")
)

With aws.s3 installed, I can access this Backblaze bucket using either:

get_bucket("LoanModeling", region = "")

or

options("cloudyr.aws.default_region" = "") 
get_bucket("LoanModeling")

But if I call tar_make() for the _targets.R file, I get an error:

tar_make()
• start target data
List of 3
 $ Code    : chr "NoSuchBucket"
 $ Message : chr "The specified bucket does not exist: us-east-1"
 $ Resource: chr "us-east-1"
 - attr(*, "headers"): List of 6 ("1368658afb3f672b", "adZ9uNWvQbnBvjXffbhM=", cache-control "max-age=0, no-cache, no-store", content-type "application/xml", content-length "208", date "Sat, 30 Oct 2021 01:36:49 GMT")
 - attr(*, "class"): chr "aws_error"
 - request (truncated): "GET\n/LoanModeling/\nlocation=\nhost:us-east-1.s3.us-west-002.backblazeb2.com\nx-amz-content-sha256:e3b0c44298f"
 - string to sign (truncated): "AWS4-HMAC-SHA256\n20211030T013650Z\n20211030/us-east-1/s3/aws4_request\n764b78731047038bd97785044b7322be68a981a"
 - authorization (truncated): "AWS4-HMAC-SHA256 Credential=[AWS_ACCESS_KEY_ID]/20211030/us-east-1/s3/aws4_request,SignedHeaders=host;x-a"
x error target data

Notice how the request appears to be adding an erroneous us-east-1 region.

Diagnostic information

sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] targets_0.8.1 aws.s3_0.3.22

loaded via a namespace (and not attached):
[1] pillar_1.6.4 compiler_4.0.4 prettyunits_1.1.1 base64enc_0.1-3 remotes_2.4.1
[6] tools_4.0.4 testthat_3.1.0 digest_0.6.28 pkgbuild_1.2.0 pkgload_1.2.3
[11] tibble_3.1.5 memoise_2.0.0 lifecycle_1.0.1 pkgconfig_2.0.3 rlang_0.4.12
[16] igraph_1.2.7 rstudioapi_0.13 cli_3.1.0 yaml_2.2.1 curl_4.3.2
[21] xfun_0.27 fastmap_1.1.0 knitr_1.36 withr_2.4.2 httr_1.4.2
[26] xml2_1.3.2 vctrs_0.3.8 desc_1.4.0 fs_1.5.0 devtools_2.4.2
[31] tidyselect_1.1.1 rprojroot_2.0.2 data.table_1.14.2 glue_1.4.2 R6_2.5.1
[36] processx_3.5.2 fansi_0.5.0 sessioninfo_1.1.1 callr_3.7.0 purrr_0.3.4
[41] magrittr_2.0.1 codetools_0.2-18 ps_1.6.0 ellipsis_0.3.2 usethis_2.1.3
[46] aws.signature_0.6.0 renv_0.14.0 utf8_1.2.2 tinytex_0.34 cachem_1.0.6
[51] crayon_1.4.1

wlandau commented 3 years ago

Have you set the AWS_DEFAULT_REGION environment variable? If not, you could try that, or set the cloudyr.aws.default_region option in either a project-level .Rprofile file or _targets.R. (Although the latter might not work with tar_make_future() or tar_make_clustermq() when storage = "worker", because the workers would need the option too.)
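
For example, a minimal sketch (untested on Backblaze) that sets both near the top of _targets.R or in a project-level .Rprofile:

Sys.setenv(AWS_DEFAULT_REGION = "")         # environment variable route
options("cloudyr.aws.default_region" = "")  # cloudyr option route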

wlandau commented 3 years ago

To be clear, there is currently no way to enter the region into tar_resources_aws(), so this is more of a limitation than a bug. But maybe targets should allow different buckets to have different regions via resources.

caewok commented 3 years ago

If I set the AWS_DEFAULT_REGION environment variable, nothing changes.

If I set options("cloudyr.aws.default_region" = "") in a .Rprofile, then get_bucket("LoanModeling") works (as expected), but tar_make() still does not work. The error changes, however: either targets or an underlying package is pulling us-west-002 out of the endpoint and setting it as the region. The error is:

$ Message : chr "The specified bucket does not exist: us-west-002"

The issue is that the correct endpoint is s3.us-west-002.backblazeb2.com, but tar_make() is using us-west-002.s3.us-west-002.backblazeb2.com.

If I add options("cloudyr.aws.default_region" = "") to _targets.R, I get the same issue as above, with it trying to use us-west-002 as a region.

wlandau commented 3 years ago

@caewok, please have a look at #682. The new region argument to tar_resources_aws() is now forwarded to aws.s3::object_exists() and other aws.s3 functions, and you can set tar_resources_aws(region = ""). Hopefully that solves your issue.

For future reference, most of the code in targets for interacting with AWS is at https://github.com/ropensci/targets/blob/main/R/class_aws.R and https://github.com/ropensci/targets/blob/main/R/class_aws_file.R.

caewok commented 3 years ago

Thanks for the quick reply! I pulled the new update from GitHub; unfortunately, it still does not work. (I confirmed the installed version is 0.8.1.9000.) When I run tar_make(), I still get:

• start target data
List of 3
 $ Code    : chr "NoSuchBucket"
 $ Message : chr "The specified bucket does not exist: us-west-002"
 $ Resource: chr "us-west-002"
 - attr(*, "headers"): List of 6 ("ab1a04c1c1008704", "adQduc2uDbutvLHfobg0=", cache-control "max-age=0, no-cache, no-store", content-type "application/xml", content-length "212", date "Fri, 05 Nov 2021 00:02:49 GMT")
 - attr(*, "class"): chr "aws_error"
 - request (truncated): "PUT\n/LoanModeling/_targets/objects/data\n\nhost:us-west-002.s3.us-west-002.backblazeb2.com\nx-amz-acl:private\"
 - string to sign (truncated): "AWS4-HMAC-SHA256\n20211105T000249Z\n20211105/us-west-002/s3/aws4_request\nfe887c358f886efcdf178c649e94dd86b68b6"
 - authorization (truncated): "AWS4-HMAC-SHA256 Credential=[AWS_ACCESS_KEY_ID]/20211105/us-west-002/s3/aws4_request,SignedHeaders=host;x"

The top of my _targets.R file specifies that the region should be left blank:

tar_option_set(
  resources = tar_resources(
    aws = tar_resources_aws(bucket = "LoanModeling", region = "")
  )
)

But the error message suggests that something is still overriding region = "" with region = "us-west-002". I don't have a good explanation for why this is happening. It could be that region <- store$resources$aws$region %|||% store$resources$region in store_produce_aws_path() is doing it. Or, more likely, targets is passing the bucket and region to aws.s3 in a way that causes aws.s3 to substitute the region and construct the URL incorrectly: "us-west-002.s3.us-west-002.backblazeb2.com" instead of "s3.us-west-002.backblazeb2.com".

As before, get_bucket("LoanModeling") works because options("cloudyr.aws.default_region" = "") is set. Putting options("cloudyr.aws.default_region" = "") into the _targets.R file still has no effect.

caewok commented 3 years ago

I have a theory on this:

store_upload_object.tar_aws and related functions call region <- store_aws_region(store$file$path). But if I understand correctly, the path to be parsed in this case comes from targets:::store_produce_aws_metabucket("LoanModeling", "") and will therefore be "bucket=LoanModeling:region=". Calling targets:::store_aws_region("bucket=LoanModeling:region=") then returns NULL, causing the region to revert to some default like "us-west-002" instead of "". So some part of that chain, probably store_produce_aws_metabucket, needs to handle the case where region is set to "".
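
A rough sketch of what I suspect is happening (this is a hypothetical parser, not the actual targets internals):

# Hypothetical: a regex that requires at least one character after "region="
# would drop an empty region entirely and return NULL.
store_aws_region_sketch <- function(path) {
  match <- regmatches(path, regexpr("region=[^:]+", path))
  if (length(match) == 0) {
    return(NULL)  # "region=" with nothing after it falls through to a default
  }
  sub("^region=", "", match)
}

store_aws_region_sketch("bucket=LoanModeling:region=us-east-1")  # "us-east-1"
store_aws_region_sketch("bucket=LoanModeling:region=")           # NULL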

wlandau commented 3 years ago

Yeah, that's probably right. Would you have a look at e19624651fde0f86b86a853746f361158b8d780f? I can only create conventional AWS buckets because I do not have a Backblaze subscription.

caewok commented 3 years ago

I tried e196246, but there was no change. I think we are on the right track, though. The next piece of the puzzle appears to be the calls to aws.s3::object_exists, aws.s3::head_object, aws.s3::put_object, and aws.s3::save_object: those calls all set check_region = TRUE, while the default is FALSE.

For example, if I run this code, I get back the expected TRUE value:

aws.s3::object_exists(
  object = "[Object path in s3 bucket]",
  bucket = "LoanModeling",
  region = "",
  check_region = FALSE
  )

While setting check_region to TRUE throws a 404 error and returns FALSE:

aws.s3::object_exists(
  object = "[Object path in s3 bucket]",
  bucket = "LoanModeling",
  region = "",
  check_region = TRUE
  )

check_region appears to be used by aws.s3::s3HTTP. The documentation for that parameter says:

check_region | A logical indicating whether to check the value of region against the apparent bucket region. This is useful for avoiding (often confusing) out-of-region errors. Default is FALSE.

And the region parameter says:

region | A character string containing the AWS region. Ignored if region can be inferred from bucket. If missing, an attempt is made to locate it from credentials. Defaults to “us-east-1” if all else fails. Should be set to "" when using non-AWS endpoints that don't include regions (and base_url must be set).

So I think what is happening is that setting check_region to TRUE overrides the region parameter. Even if you pass "" correctly to region, you also have to set check_region to FALSE. Otherwise, s3HTTP goes searching for a valid region, sees "us-west-002" in the URL, and assumes (incorrectly) that it should be the region.

I assume you have good reasons for normally setting check_region = TRUE. So perhaps set it to FALSE only when region is ""?
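
Something along these lines, perhaps (hypothetical, just to state the rule I mean):

check_region <- !identical(region, "")  # skip the region check only for a blank region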

wlandau commented 3 years ago

check_region = TRUE came from https://github.com/ropensci/targets/issues/400. I think we can set check_region to TRUE if and only if region is NULL. (IMO aws.s3 should do this already, but maintenance of that package has slowed down.)
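
In code terms, roughly this wherever targets builds the aws.s3 calls (a sketch of the rule, not the actual patch; the object key and bucket are just illustrative):

region <- NULL                        # or "" for non-AWS endpoints like Backblaze
aws.s3::object_exists(
  object = "_targets/objects/data",   # illustrative object key
  bucket = "LoanModeling",
  region = region,
  check_region = is.null(region)      # auto-detect the region only when none was supplied
)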

wlandau commented 3 years ago

Please try 7f724cd79c220baccf7c7fc9aa85f28c5d123985.

caewok commented 3 years ago

Yep, that fixed it! Thanks!