rstudio / pins-r

Pin, Discover and Share Resources
https://pins.rstudio.com

Populate cache from existing S3 object #188

Closed glynnfoster closed 3 years ago

glynnfoster commented 4 years ago

We have a use case where we'd like to run a model on existing user-supplied data that's hosted in S3. Currently the only way to create a cache is to download the file with something like aws.s3 and then pin it back to the S3 board. It would be useful to be able to populate the cache (i.e. populate the metadata) from an existing object rather than duplicating it (we treat user-supplied data as read-only). See https://community.rstudio.com/t/pins-and-s3-use-case-best-practice/54557 for more details. Cheers!

javierluraschi commented 4 years ago

It's also possible to retrieve arbitrary paths from a bucket, as in:

pin_get("https://rpins-video.s3-us-west-2.amazonaws.com/wines/data.csv", board = "s3")

I'll make one improvement to this so that it also supports unrelated buckets. This probably also needs to be better documented, since it's not a well-known feature.

glynnfoster commented 4 years ago

Great, thanks for the tip @javierluraschi, I hadn't appreciated that. I assume the value of cache (TRUE or FALSE) controls whether it tries to download the file again (based on the S3 ETag digest?).
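
A rough sketch of the kind of ETag comparison being asked about, assuming the cache stores the ETag seen at the previous download. `should_download()` is a hypothetical helper for illustration only; it is not a pins function, and pins' actual caching logic may differ.

```r
# Hypothetical helper (not part of pins): decide whether a cached file
# should be re-downloaded, given the stored and the current ETag.
should_download <- function(old_etag, new_etag) {
  # Treat NULL, NA, or an empty string as "no ETag available".
  missing_etag <- function(x) is.null(x) || is.na(x) || !nzchar(x)

  # Without both ETags the cache cannot be shown to be fresh, so download.
  if (missing_etag(old_etag) || missing_etag(new_etag)) {
    return(TRUE)
  }

  # Otherwise download only when the remote object has changed.
  !identical(old_etag, new_etag)
}

should_download("", "")              # TRUE: nothing to compare
should_download("abc123", "abc123")  # FALSE: cache is still valid
should_download("abc123", "def456")  # TRUE: remote object changed
```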

javierluraschi commented 4 years ago

Actually... I'm rethinking how this works... currently pin_get() works but is a bit clunky with URLs since you can't assign the name... I think we might want to do this instead:

pin("https://rpins-video.s3-us-west-2.amazonaws.com/wines/data.csv", source = "s3", name = "wines")

Let me work on this today and properly document this functionality....

glynnfoster commented 4 years ago

All good. I think for the most part what I want to download is from the same bucket, so using this sort of logic:

aws.signature::use_credentials()

# Register a board hosted in S3
pins::board_register_s3(name = 'test-bucket', bucket = 'test-bucket.montoux.com')
# Pull down our read-only data that has no metadata associated with it
new_data <- pins::pin_get('original_sources/sample_data.csv', cache = TRUE, board = 'test-bucket')
d <- read.csv(file = new_data)
# Do modelling
# Save the resulting data frame back into S3
pins::pin(model_result, name = 'model_result', description = 'Result data', board = 'test-bucket')

javierluraschi commented 4 years ago

Alright, let's keep it working the way it currently does.

That said, there are a few fixes for pin_get(<url>) that are desired:

glynnfoster commented 4 years ago

Hey @javierluraschi, I'm noticing some strange behaviour that I haven't been able to debug. We have an S3 bucket called example.s3.montoux.com. I can connect to this board and see the registered pins, but can't seem to pull down a file outside of those pins:

> print(s3_url)
[1] "poc_pipeline/customer_data/27-11-2019/yrt-total.csv"
> print(customer_s3_bucket)
[1] "example.s3.montoux.com"
> csv_file = pins::pin_get(s3_url, board=customer_s3_bucket)
Checking 'change_age' header (time, change age, max age): 1583795981.38289, 1583795981.38285, 0
Checking 'etag' (old, new): ,
Downloading http://example.s3.montoux.com.s3.amazonaws.com/poc_pipeline/customer_data/27-11-2019/data.txt to /var/folders/ys/zjk794b149nc3j_8jbg74vv00000gn/T//Rtmpd2IAAn/filec3e625e99f46/data.txt
No encoding supplied: defaulting to UTF-8.
<?xml version="1.0" encoding="UTF-8"?>
<Error>
  <Code>NoSuchBucket</Code>
  <Message>The specified bucket does not exist</Message>
  <BucketName>example</BucketName>
  <RequestId>BB43A9A6FEC79401</RequestId>
  <HostId>3rrW6L2TfiYIYCQtA90GBN+N2/EpO4DTHmJ/dHJRV5m5UQ5/LoZ2s8A4fvdVeO9a3Q/9dLHXWag=</HostId>
</Error>
Checking 'change_age' header (time, change age, max age): 1583795981.54595, 1583795981.54592, 0
Checking 'etag' (old, new): ,
Downloading http://example.s3.montoux.com.s3.amazonaws.com/poc_pipeline/customer_data/27-11-2019/yrt-total.csv to /var/folders/ys/zjk794b149nc3j_8jbg74vv00000gn/T//Rtmpd2IAAn/filec3e6cd7c01c/yrt-total.csv
No encoding supplied: defaulting to UTF-8.
<?xml version="1.0" encoding="UTF-8"?>
<Error>
  <Code>NoSuchBucket</Code>
  <Message>The specified bucket does not exist</Message>
  <BucketName>example</BucketName>
  <RequestId>99B58935B0379E9D</RequestId>
  <HostId>kl4VoE0rPTYD7XynH0B1Qicc3Mnls396cjHgWE2TF+LiGlpzcIcBYZGY+vftvYGIg4db18vvgJE=</HostId>
</Error>
Error in pin_download(path, name, board$name, extract = identical(extract,  :
  Client error: (404) Not Found. Failed to download remote file: http://example.s3.montoux.com.s3.amazonaws.com/poc_pipeline/customer_data/27-11-2019/yrt-total.csv

As you'll see in the output above, the BucketName ends up being example rather than example.s3.montoux.com. Perhaps bucket names containing special characters (such as dots) aren't handled correctly?
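
One possible reading of that output, offered as an assumption rather than a confirmed diagnosis: the download URL uses the virtual-hosted style (bucket name as a hostname prefix), and a bucket name that itself contains `.s3.` makes the hostname ambiguous. The helpers below are hypothetical and not part of pins. They only illustrate how a naive split on the first `.s3.` yields `example`, and how a path-style URL avoids putting the bucket name in the hostname at all.

```r
# Hypothetical helpers (not part of pins) for the two S3 URL styles.
s3_virtual_hosted_url <- function(bucket, key) {
  paste0("http://", bucket, ".s3.amazonaws.com/", key)
}

s3_path_style_url <- function(bucket, key) {
  paste0("http://s3.amazonaws.com/", bucket, "/", key)
}

# Naive parse: everything before the FIRST ".s3." is taken as the bucket.
naive_bucket_from_host <- function(host) {
  sub("\\.s3\\..*$", "", host)
}

# Stricter parse: strip only the known endpoint suffix from the end.
strict_bucket_from_host <- function(host) {
  sub("\\.s3\\.amazonaws\\.com$", "", host)
}

bucket <- "example.s3.montoux.com"
key <- "poc_pipeline/customer_data/27-11-2019/yrt-total.csv"
host <- paste0(bucket, ".s3.amazonaws.com")

naive_bucket_from_host(host)    # "example"
strict_bucket_from_host(host)   # "example.s3.montoux.com"
s3_path_style_url(bucket, key)  # bucket appears in the path, not the host
```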

glynnfoster commented 4 years ago

FYI, I just rolled back to the original S3 board code and it worked fine again (https://github.com/rstudio/pins/blob/46b5d37ecabab93465cd5b515834ff178dc9a25b/R/board_s3.R)

glynnfoster commented 4 years ago

I think I'm still a bit confused about how this should work with S3 boards. Here's some debug output showing some strange behaviour, which leads me to wonder whether I've completely misunderstood how this should work.

> pins::board_register_s3(name='foobar', bucket='foobar.montoux.com')
> pins::pin_get('https://s3-ap-southeast-2.amazonaws.com/foobar.montoux.com/Montoux+Files/Chatswood/Data+Science/SAT/chatswood-sat-10_0-ALL.csv')
Pin not found, pins available in registry: sample-csv-pins
Pin not found, pins available in registry: 
Checking 'change_age' header (time, change age, max age): 1589766170.00496, 1589766170.00487, 0
Checking 'etag' (old, new): , 
Downloading https://s3-ap-southeast-2.amazonaws.com/foobar.montoux.com/Montoux+Files/Chatswood/Data+Science/SAT/data.txt to /var/folders/ys/zjk794b149nc3j_8jbg74vv00000gn/T//RtmpAgcIjz/file1706d32565b63/data.txt
Checking 'change_age' header (time, change age, max age): 1589766170.34868, 1589766170.34865, 0
Checking 'etag' (old, new): , 
Downloading https://s3-ap-southeast-2.amazonaws.com/foobar.montoux.com/Montoux+Files/Chatswood/Data+Science/SAT/chatswood-sat-10_0-ALL.csv to /var/folders/ys/zjk794b149nc3j_8jbg74vv00000gn/T//RtmpAgcIjz/file1706d4389da66/chatswood-sat-10_0-ALL.csv
Found pin https://s3-ap-southeast-2.amazonaws.com/foobar.montoux.com/Montoux+Files/Chatswood/Data+Science/SAT/chatswood-sat-10_0-ALL.csv in board cigna-nz
Error: $ operator is invalid for atomic vectors

and what I see in my cache:

$ tree
.
├── foobar
│   ├── data.txt
│   ├── data.txt.lock
│   ├── https:
│   │   └── s3-ap-southeast-2.amazonaws.com
│   │       └── foobar.montoux.com
│   │           └── Montoux+Files
│   │               └── Chatswood
│   │                   └── Data+Science
│   │                       └── SAT
│   │                           └── chatswood-sat-10_0-ALL.csv
│   │                               └── data.txt
│   └── s3-ap-southeast-2.amazonaws.com
│       └── foobar.montoux.com
│           └── Montoux+Files
│               └── Chatswood
│                   └── Data+Science
│                       └── SAT
│                           └── chatswood-sat-10_0-ALL.csv
│                               ├── chatswood-sat-10_0-ALL.csv
│                               └── data.txt
└── local
    ├── data.txt
    ├── data.txt.lock
    └── sample-csv-pins
        ├── data.csv
        ├── data.rds
        └── data.txt

18 directories, 10 files
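
The `$ operator is invalid for atomic vectors` error in the output above is a general R error, raised when `$` is applied to an atomic vector (for example a character string) instead of a list. Which object triggers it inside pins is not visible in the log. The sketch below only reproduces the error with a hypothetical `entry` value and shows one defensive way to handle both shapes.

```r
# Hypothetical value: a plain character path where a list was expected.
entry <- "original_sources/sample_data.csv"

# Applying `$` to an atomic vector raises the error seen in the log.
msg <- tryCatch(entry$path, error = function(e) conditionMessage(e))
print(msg)

# Defensive accessor (illustrative only): accept either a list with a
# `path` element or a bare character path.
entry_path <- function(x) {
  if (is.list(x)) x$path else as.character(x)
}

entry_path(entry)
entry_path(list(path = "original_sources/sample_data.csv"))
```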

I think I'm hoping for the following behaviour:

Does that align with the typical use case? It seems like some of the cloudyr packages are being maintained, which is why pins seems like such a great fit.

javierluraschi commented 3 years ago

https://github.com/rstudio/pins/issues/229 was resolved, so I think we can close this. If not, please reopen the issue. Thanks!

github-actions[bot] commented 1 year ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.