Closed cpcloud closed 1 year ago
cc @machow if you wouldn't mind taking a look here!
@cpcloud thanks--this looks really handy!
This is more an issue with me not having completed the implementation of type="file"
in pins, but it looks like it might not work with cloud boards. I wonder if it might be useful for pin_read()
with type="file"
to return the fsspec file directly, rather than the path? Basically pins wraps the fsspec file object in a fsspec cache class.
Here's how I tested against s3:
# see .env.dev for configuration
from dotenv import load_dotenv
load_dotenv()
from pins.tests.helpers import BoardBuilder
bb = BoardBuilder("s3")
board = bb.create_tmp_board("pins/tests/pins-compat")
board.pin_write("abcd.x.y.z", "cool_pin", type="file")
board.pin_read("cool_pin")
WDYT of reading type="file" returning the fsspec file object (which may point to the cache on disk)? Worst case scenario, the R pins library has functions called pin_upload()
and pin_download()
that implement exactly this behavior (download file and return its path on disk; https://github.com/rstudio/pins-python/issues/32)
I'm away for the week so may be a bit slow to respond, but am happy to dig in when I get back!
I ran the tests on a dev branch, since I can't remember how to get secrets into an external PR (or it's weirdly hard?),
elif meta.type == "file":
# TODO: update to handle multiple files
> return [str(Path(fs.open(path_to_file).name).absolute())]
E AttributeError: 'AzureBlobFile' object has no attribute 'name'
https://github.com/rstudio/pins-python/actions/runs/5579519223/jobs/10195188387
I realize this process is nuts, since it hits a ton of buckets, and am definitely happy to do a lot of the work here if it helps!
@machow Thanks!
I tested this with gcs and even rolled it into an ibis PR so I don't think it's a general cloud boards issue.
I'm a bit -1 on returning the result of open
since that can leak resources without a with
statement, which an end user could do, but then pin_read
would some times be a context manager and sometimes not which feels a bit awkward IMO.
My end goal is to be able get all the on disk paths for a given pin, since that is what we're passing to the various readers in ibis.
I think I may be misunderstanding what kind of thing path_to_file
is. Is it a str
that refers to path on disk, or can it be an object in cloud storage, or can it either?
When you construct a board, it creates the relevant fsspec filesystem (eg s3, gcs, local), so I think path_to_file
is read using that filesystem (fs in the load function). I can't remember if the protocol gets included in the path, or if the filesystem itself determines where path is read (eg s3).
I'm a bit -1 on returning the result of open since that can leak resources
Fair 😅! I think I've come around to just doing what this PR does, since..
cc @juliasilge wdyt? For context..
https://pins.rstudio.com/articles/pins.html#reading-and-writing-files
(I have to set this down but will pick up Monday when I'm back!)
I am also something of a -1 on pin_read
returning a path or fsspec object in the pins context, just because of how hard it is to reason about what a function or method returns when the return type depends on an argument. Like @cpcloud this strikes me as not great:
then
pin_read
would some times be a context manager and sometimes not
On the R side, this type of task is handled by pin_upload()
/ pin_download()
; notice that pin_download()
returns a vector of paths instead of the contents of the pin like pin_read()
.
Is it doable to implement pin_upload
/ pin_download
methods, perhaps returning the fsspec file object, instead of adding type = "file"
?
This article might be a relevant thing to take a look at, for how we recommend folks handle files they want to read/write in a custom way: https://pins.rstudio.com/articles/managing-custom-formats.html
Hm, I was hoping my use case isn't that custom: I don't want a DataFrame out of pin_x
, I want the set of cached on-disk paths that make up the dataset. Should I be using pin_download
for that?
Hm, I was hoping my use case isn't that custom
Oh for sure, maybe I just made things more confusing by showing what folks can use pin_upload
/ pin_download
for.
I want the set of cached on-disk paths that make up the dataset. Should I be using
pin_download
for that?
My opinion is yes, for consistency across our language implementations of pins. Neither is implemented yet though for Python:
What do you all think of these methods returning the fsspec file directly, rather than the path, like @machow mentioned?
For my use case it would be less than ideal to get back an fsspec path, unless I can get a filename from that to the cached path on disk.
I'd like pins to be able to handle the downloading, caching, and checking for new data without having to deal with whatever cloud file system thing that ultimately backs the pin.
Sorry, to be clear what I mean is my ideal signature is this:
def pin_download(...) -> list[Path]: # or list[str]
def pin_download(...) -> list[Path]: # or list[str]
This is extremely consistent with the R pin_download
so to me, this sounds like the best way to go. 👍
I think this all makes sense! Just to be sure to recap:
pin_read()
return the paths.pin_read()
to using pin_download()
/ pin_upload()
.@cpcloud do you mind if I pick up the PR from here? (should be able to make the changes tomorrow). Since pin_download()
and pin_upload()
aren't implemented, I'll need to make some quick tweaks, but shouldn't be too bad!
Sgtm, please take it over!
Just wanted to chime in and say the proposed API would fit our needs perfectly as well, so thanks for working on this 🙏
Alright, tests are passing with basic pin_upload()
/ pin_download()
behavior.
Currently,
pin_upload()
just calls pin_write()
, which always writes the file name as {pin_name}.{suffixes}
pin_write()
into a separate method, so we can preserve the exact file namepin_download()
checks to ensure there's a cache to download to, errors otherwise.pin_read()
fails as expected for type="file"
(Will also add pin_download/upload to docs)
@isabelizimm when you get the chance, do you mind taking a look? I think the pin_download()
/ pin_upload()
behavior should be implemented! (Though we still need to document them)
@isabelizimm thanks for looking through the PR! I can chuck the methods into the docs, once https://github.com/rstudio/pins-python/pull/207 is merged!
@machow @isabelizimm Thanks for pushing this through ❤️
Thanks again for working on this!
Is there any ETA on when a release with this change will be published?
I know from experience it is annoying when people ask this 😄, but we have a use case it would be perfect for it and I am trying to figure out if I should wait for a release or pull these changes in sooner manually.
Great question @jimhester! We're planning on a release by the end of August; it's on the to-do list for @machow and I as I transition into the maintainer role. 😄
Following up here-- @jimhester pins 0.8.2 has been released!
This PR adds support for
pin_write
withtype="file"
.