rstudio / pins-python

https://rstudio.github.io/pins-python/
MIT License
52 stars 12 forks source link

feat: add support for `pin_write` with `type="file"` #201

Closed cpcloud closed 1 year ago

cpcloud commented 1 year ago

This PR adds support for pin_write with type="file".

cpcloud commented 1 year ago

cc @machow if you wouldn't mind taking a look here!

machow commented 1 year ago

@cpcloud thanks--this looks really handy!

This is more an issue with me not having completed the implementation of type="file" in pins, but it looks like it might not work with cloud boards. I wonder if it might be useful for pin_read() with type="file" to return the fsspec file directly, rather than the path? Basically pins wraps the fsspec file object in a fsspec cache class.

Here's how I tested against s3:

# see .env.dev for configuration
from dotenv import load_dotenv
load_dotenv()

from pins.tests.helpers import BoardBuilder
bb = BoardBuilder("s3")

board = bb.create_tmp_board("pins/tests/pins-compat")

board.pin_write("abcd.x.y.z", "cool_pin", type="file")
board.pin_read("cool_pin")

WDYT of reading type="file" returning the fsspec file object (which may point to the cache on disk)? Worst case scenario, the R pins library has functions called pin_upload() and pin_download() that implement exactly this behavior (download file and return its path on disk; https://github.com/rstudio/pins-python/issues/32)

I'm away for the week so may be a bit slow to respond, but am happy to dig in when I get back!

machow commented 1 year ago

I ran the tests on a dev branch, since I can't remember how to get secrets into an external PR (or it's weirdly hard?),

        elif meta.type == "file":
            # TODO: update to handle multiple files
>           return [str(Path(fs.open(path_to_file).name).absolute())]
E           AttributeError: 'AzureBlobFile' object has no attribute 'name'

https://github.com/rstudio/pins-python/actions/runs/5579519223/jobs/10195188387

I realize this process is nuts, since it hits a ton of buckets, and am definitely happy to do a lot of the work here if it helps!

cpcloud commented 1 year ago

@machow Thanks!

I tested this with gcs and even rolled it into an ibis PR so I don't think it's a general cloud boards issue.

I'm a bit -1 on returning the result of open since that can leak resources without a with statement, which an end user could do, but then pin_read would some times be a context manager and sometimes not which feels a bit awkward IMO.

My end goal is to be able get all the on disk paths for a given pin, since that is what we're passing to the various readers in ibis.

I think I may be misunderstanding what kind of thing path_to_file is. Is it a str that refers to path on disk, or can it be an object in cloud storage, or can it either?

machow commented 1 year ago

When you construct a board, it creates the relevant fsspec filesystem (eg s3, gcs, local), so I think path_to_file is read using that filesystem (fs in the load function). I can't remember if the protocol gets included in the path, or if the filesystem itself determines where path is read (eg s3).

I'm a bit -1 on returning the result of open since that can leak resources

Fair 😅! I think I've come around to just doing what this PR does, since..

cc @juliasilge wdyt? For context..

https://pins.rstudio.com/articles/pins.html#reading-and-writing-files

(I have to set this down but will pick up Monday when I'm back!)

juliasilge commented 1 year ago

I am also something of a -1 on pin_read returning a path or fsspec object in the pins context, just because of how hard it is to reason about what a function or method returns when the return type depends on an argument. Like @cpcloud this strikes me as not great:

then pin_read would some times be a context manager and sometimes not

On the R side, this type of task is handled by pin_upload() / pin_download(); notice that pin_download() returns a vector of paths instead of the contents of the pin like pin_read().

Is it doable to implement pin_upload / pin_download methods, perhaps returning the fsspec file object, instead of adding type = "file"?

juliasilge commented 1 year ago

This article might be a relevant thing to take a look at, for how we recommend folks handle files they want to read/write in a custom way: https://pins.rstudio.com/articles/managing-custom-formats.html

cpcloud commented 1 year ago

Hm, I was hoping my use case isn't that custom: I don't want a DataFrame out of pin_x, I want the set of cached on-disk paths that make up the dataset. Should I be using pin_download for that?

juliasilge commented 1 year ago

Hm, I was hoping my use case isn't that custom

Oh for sure, maybe I just made things more confusing by showing what folks can use pin_upload / pin_download for.

I want the set of cached on-disk paths that make up the dataset. Should I be using pin_download for that?

My opinion is yes, for consistency across our language implementations of pins. Neither is implemented yet though for Python:

https://github.com/rstudio/pins-python/blob/7ded06943c6ea3236e5da4443a3869073d66aeb1/pins/boards.py#L347

What do you all think of these methods returning the fsspec file directly, rather than the path, like @machow mentioned?

cpcloud commented 1 year ago

For my use case it would be less than ideal to get back an fsspec path, unless I can get a filename from that to the cached path on disk.

I'd like pins to be able to handle the downloading, caching, and checking for new data without having to deal with whatever cloud file system thing that ultimately backs the pin.

cpcloud commented 1 year ago

Sorry, to be clear what I mean is my ideal signature is this:

def pin_download(...) -> list[Path]: # or list[str]

juliasilge commented 1 year ago
def pin_download(...) -> list[Path]: # or list[str]

This is extremely consistent with the R pin_download so to me, this sounds like the best way to go. 👍

machow commented 1 year ago

I think this all makes sense! Just to be sure to recap:

@cpcloud do you mind if I pick up the PR from here? (should be able to make the changes tomorrow). Since pin_download() and pin_upload() aren't implemented, I'll need to make some quick tweaks, but shouldn't be too bad!

cpcloud commented 1 year ago

Sgtm, please take it over!

jimhester commented 1 year ago

Just wanted to chime in and say the proposed API would fit our needs perfectly as well, so thanks for working on this 🙏

machow commented 1 year ago

Alright, tests are passing with basic pin_upload() / pin_download() behavior.

Currently,

(Will also add pin_download/upload to docs)

machow commented 1 year ago

@isabelizimm when you get the chance, do you mind taking a look? I think the pin_download() / pin_upload() behavior should be implemented! (Though we still need to document them)

machow commented 1 year ago

@isabelizimm thanks for looking through the PR! I can chuck the methods into the docs, once https://github.com/rstudio/pins-python/pull/207 is merged!

cpcloud commented 1 year ago

@machow @isabelizimm Thanks for pushing this through ❤️

jimhester commented 1 year ago

Thanks again for working on this!

Is there any ETA on when a release with this change will be published?

I know from experience it is annoying when people ask this 😄, but we have a use case it would be perfect for it and I am trying to figure out if I should wait for a release or pull these changes in sooner manually.

isabelizimm commented 1 year ago

Great question @jimhester! We're planning on a release by the end of August; it's on the to-do list for @machow and I as I transition into the maintainer role. 😄

isabelizimm commented 1 year ago

Following up here-- @jimhester pins 0.8.2 has been released!