Closed glynnfoster closed 3 years ago
It's also possible to retrieve arbitrary paths from a bucket, as in:
pin_get("https://rpins-video.s3-us-west-2.amazonaws.com/wines/data.csv", board = "s3")
I'll make one improvement over this to support even unrelated buckets. This probably also needs to be better documented, since it's not a well-known feature.
Great, thanks for the tip @javierluraschi, I didn't appreciate that. I assume the value of cache (TRUE or FALSE) controls whether it tries to download the file again or not (based on the S3 ETag digest?)
Actually... I'm rethinking how this works... currently pin_get() works but is a bit clunky with URLs since you can't assign the name... I think we might want to do instead:
pin("https://rpins-video.s3-us-west-2.amazonaws.com/wines/data.csv", source = "s3", name = "wines")
Let me work on this today and properly document this functionality....
All good. I think for the most part what I want to download is from the same bucket, so using this sort of logic:
aws.signature::use_credentials()
# Register a board hosted in S3
pins::board_register_s3(name='test-bucket', bucket = 'test-bucket.montoux.com')
# Pull down our read-only data that hasn't any metadata associated with it
new_data = pins::pin_get('original_sources/sample_data.csv', cache=TRUE, board='test-bucket')
d <- read.csv(file = new_data)
# Do modelling
# Save the resulting dataframe back into S3
pins::pin(model_result, name='model_result', description='Result data', board='test-bucket')
Right, alright, let's keep it working the way it's currently working.
That said, there are a few fixes for pin_get(<url>) that are desired:
Hey @javierluraschi, I'm noticing some strange behaviour that I haven't been able to debug successfully. We have an S3 bucket called example.s3.montoux.com and I can connect to this board and see the registered pins, but I can't seem to pull down a file outside of those:
> print(s3_url)
[1] "poc_pipeline/customer_data/27-11-2019/yrt-total.csv"
> print(customer_s3_bucket)
[1] "example.s3.montoux.com"
> csv_file = pins::pin_get(s3_url, board=customer_s3_bucket)
Checking 'change_age' header (time, change age, max age): 1583795981.38289, 1583795981.38285, 0
Checking 'etag' (old, new): ,
Downloading http://example.s3.montoux.com.s3.amazonaws.com/poc_pipeline/customer_data/27-11-2019/data.txt to /var/folders/ys/zjk794b149nc3j_8jbg74vv00000gn/T//Rtmpd2IAAn/filec3e625e99f46/data.txt
No encoding supplied: defaulting to UTF-8.
<?xml version="1.0" encoding="UTF-8"?>
<Error>
<Code>NoSuchBucket</Code>
<Message>The specified bucket does not exist</Message>
<BucketName>example</BucketName>
<RequestId>BB43A9A6FEC79401</RequestId>
<HostId>3rrW6L2TfiYIYCQtA90GBN+N2/EpO4DTHmJ/dHJRV5m5UQ5/LoZ2s8A4fvdVeO9a3Q/9dLHXWag=</HostId>
</Error>
Checking 'change_age' header (time, change age, max age): 1583795981.54595, 1583795981.54592, 0
Checking 'etag' (old, new): ,
Downloading http://example.s3.montoux.com.s3.amazonaws.com/poc_pipeline/customer_data/27-11-2019/yrt-total.csv to /var/folders/ys/zjk794b149nc3j_8jbg74vv00000gn/T//Rtmpd2IAAn/filec3e6cd7c01c/yrt-total.csv
No encoding supplied: defaulting to UTF-8.
<?xml version="1.0" encoding="UTF-8"?>
<Error>
<Code>NoSuchBucket</Code>
<Message>The specified bucket does not exist</Message>
<BucketName>example</BucketName>
<RequestId>99B58935B0379E9D</RequestId>
<HostId>kl4VoE0rPTYD7XynH0B1Qicc3Mnls396cjHgWE2TF+LiGlpzcIcBYZGY+vftvYGIg4db18vvgJE=</HostId>
</Error>
Error in pin_download(path, name, board$name, extract = identical(extract, :
Client error: (404) Not Found. Failed to download remote file: http://example.s3.montoux.com.s3.amazonaws.com/poc_pipeline/customer_data/27-11-2019/yrt-total.csv
As you'll see in the above, the BucketName ends up being example rather than example.s3.montoux.com. Perhaps it's not handling bucket names with special characters?
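For reference, S3 accepts two URL layouts: virtual-hosted style, where the bucket name is part of the hostname, and path style, where the bucket name is the first path segment. The failing download above uses the virtual-hosted form, which puts the dotted bucket name in front of .s3.amazonaws.com. Below is a minimal base R sketch of the two forms. The helper names s3_virtual_hosted_url() and s3_path_style_url() are hypothetical and not part of pins, and whether the URL layout is the actual cause of the NoSuchBucket error is only an assumption.

```r
# Hypothetical helpers (not part of pins) that build the two S3 URL layouts.

# Virtual-hosted style: the bucket name becomes part of the hostname.
s3_virtual_hosted_url <- function(bucket, key) {
  paste0("http://", bucket, ".s3.amazonaws.com/", key)
}

# Path style: the bucket name is the first segment of the path, so dots in
# the bucket name never end up in the hostname.
s3_path_style_url <- function(bucket, key) {
  paste0("https://s3.amazonaws.com/", bucket, "/", key)
}

bucket <- "example.s3.montoux.com"
key <- "poc_pipeline/customer_data/27-11-2019/yrt-total.csv"

virtual_url <- s3_virtual_hosted_url(bucket, key)
path_url <- s3_path_style_url(bucket, key)

print(virtual_url)
print(path_url)
```

The first URL reproduces the one shown in the download log above; the second keeps the full bucket name intact as a path segment.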
FYI, I just rolled back to the original S3 board code and it worked fine again (https://github.com/rstudio/pins/blob/46b5d37ecabab93465cd5b515834ff178dc9a25b/R/board_s3.R)
I think I'm still a bit confused about how this should be working with S3 boards. Here's some debug output, but I'm getting some strange behaviour that's leading me to wonder if I've just completely misunderstood how this should work.
> pins::board_register_s3(name='foobar', bucket='foobar.montoux.com')
> pins::pin_get('https://s3-ap-southeast-2.amazonaws.com/foobar.montoux.com/Montoux+Files/Chatswood/Data+Science/SAT/chatswood-sat-10_0-ALL.csv')
Pin not found, pins available in registry: sample-csv-pins
Pin not found, pins available in registry:
Checking 'change_age' header (time, change age, max age): 1589766170.00496, 1589766170.00487, 0
Checking 'etag' (old, new): ,
Downloading https://s3-ap-southeast-2.amazonaws.com/foobar.montoux.com/Montoux+Files/Chatswood/Data+Science/SAT/data.txt to /var/folders/ys/zjk794b149nc3j_8jbg74vv00000gn/T//RtmpAgcIjz/file1706d32565b63/data.txt
Checking 'change_age' header (time, change age, max age): 1589766170.34868, 1589766170.34865, 0
Checking 'etag' (old, new): ,
Downloading https://s3-ap-southeast-2.amazonaws.com/foobar.montoux.com/Montoux+Files/Chatswood/Data+Science/SAT/chatswood-sat-10_0-ALL.csv to /var/folders/ys/zjk794b149nc3j_8jbg74vv00000gn/T//RtmpAgcIjz/file1706d4389da66/chatswood-sat-10_0-ALL.csv
Found pin https://s3-ap-southeast-2.amazonaws.com/foobar.montoux.com/Montoux+Files/Chatswood/Data+Science/SAT/chatswood-sat-10_0-ALL.csv in board cigna-nz
Error: $ operator is invalid for atomic vectors
and what I see in my cache:
$ tree
.
├── foobar
│   ├── data.txt
│   ├── data.txt.lock
│   ├── https:
│   │   └── s3-ap-southeast-2.amazonaws.com
│   │       └── foobar.montoux.com
│   │           └── Montoux+Files
│   │               └── Chatswood
│   │                   └── Data+Science
│   │                       └── SAT
│   │                           └── chatswood-sat-10_0-ALL.csv
│   │                               └── data.txt
│   └── s3-ap-southeast-2.amazonaws.com
│       └── foobar.montoux.com
│           └── Montoux+Files
│               └── Chatswood
│                   └── Data+Science
│                       └── SAT
│                           └── chatswood-sat-10_0-ALL.csv
│                               ├── chatswood-sat-10_0-ALL.csv
│                               └── data.txt
└── local
    ├── data.txt
    ├── data.txt.lock
    └── sample-csv-pins
        ├── data.csv
        ├── data.rds
        └── data.txt

18 directories, 10 files
I think I'm hoping for the following behaviour:
pin_get
but using a full URL is ok too I guess), be able to assign them convenience namespin
Does that align with the typical use case? It seems like some of the cloudr packages are being maintained, and hence why pins seems such a great fit.
https://github.com/rstudio/pins/issues/229 was resolved so I think we can close this, if not, please reopen issue. Thanks!
We have a use case where we'd like to run a model on existing user-supplied data that's hosted in S3. Currently the only way to create a cache is to download the file with something like aws.s3 and then pin it back to the S3 board. It would be interesting to be able to populate the cache (i.e. populate the metadata) from an existing object rather than duplicating it (we treat user-supplied data as read-only). See https://community.rstudio.com/t/pins-and-s3-use-case-best-practice/54557 for more details. Cheers!