rstudio / pins-python

https://rstudio.github.io/pins-python/
MIT License
46 stars 10 forks source link

feat: add support for `pin_write` with `type="file"` #201

Closed cpcloud closed 11 months ago

cpcloud commented 11 months ago

This PR adds support for pin_write with type="file".

cpcloud commented 11 months ago

cc @machow if you wouldn't mind taking a look here!

machow commented 11 months ago

@cpcloud thanks--this looks really handy!

This is more an issue with me not having completed the implementation of type="file" in pins, but it looks like it might not work with cloud boards. I wonder if it might be useful for pin_read() with type="file" to return the fsspec file directly, rather than the path? Basically pins wraps the fsspec file object in a fsspec cache class.

Here's how I tested against s3:

# see .env.dev for configuration
from dotenv import load_dotenv
load_dotenv()

from pins.tests.helpers import BoardBuilder
bb = BoardBuilder("s3")

board = bb.create_tmp_board("pins/tests/pins-compat")

board.pin_write("abcd.x.y.z", "cool_pin", type="file")
board.pin_read("cool_pin")

WDYT of reading type="file" returning the fsspec file object (which may point to the cache on disk)? Worst case scenario, the R pins library has functions called pin_upload() and pin_download() that implement exactly this behavior (download file and return its path on disk; https://github.com/rstudio/pins-python/issues/32)

I'm away for the week so may be a bit slow to respond, but am happy to dig in when I get back!

machow commented 11 months ago

I ran the tests on a dev branch, since I can't remember how to get secrets into an external PR (or it's weirdly hard?),

        elif meta.type == "file":
            # TODO: update to handle multiple files
>           return [str(Path(fs.open(path_to_file).name).absolute())]
E           AttributeError: 'AzureBlobFile' object has no attribute 'name'

https://github.com/rstudio/pins-python/actions/runs/5579519223/jobs/10195188387

I realize this process is nuts, since it hits a ton of buckets, and am definitely happy to do a lot of the work here if it helps!

cpcloud commented 11 months ago

@machow Thanks!

I tested this with gcs and even rolled it into an ibis PR so I don't think it's a general cloud boards issue.

I'm a bit -1 on returning the result of open since that can leak resources without a with statement, which an end user could do, but then pin_read would some times be a context manager and sometimes not which feels a bit awkward IMO.

My end goal is to be able get all the on disk paths for a given pin, since that is what we're passing to the various readers in ibis.

I think I may be misunderstanding what kind of thing path_to_file is. Is it a str that refers to path on disk, or can it be an object in cloud storage, or can it either?

machow commented 11 months ago

When you construct a board, it creates the relevant fsspec filesystem (eg s3, gcs, local), so I think path_to_file is read using that filesystem (fs in the load function). I can't remember if the protocol gets included in the path, or if the filesystem itself determines where path is read (eg s3).

I'm a bit -1 on returning the result of open since that can leak resources

Fair 😅! I think I've come around to just doing what this PR does, since..

cc @juliasilge wdyt? For context..

https://pins.rstudio.com/articles/pins.html#reading-and-writing-files

(I have to set this down but will pick up Monday when I'm back!)

juliasilge commented 11 months ago

I am also something of a -1 on pin_read returning a path or fsspec object in the pins context, just because of how hard it is to reason about what a function or method returns when the return type depends on an argument. Like @cpcloud this strikes me as not great:

then pin_read would some times be a context manager and sometimes not

On the R side, this type of task is handled by pin_upload() / pin_download(); notice that pin_download() returns a vector of paths instead of the contents of the pin like pin_read().

Is it doable to implement pin_upload / pin_download methods, perhaps returning the fsspec file object, instead of adding type = "file"?

juliasilge commented 11 months ago

This article might be a relevant thing to take a look at, for how we recommend folks handle files they want to read/write in a custom way: https://pins.rstudio.com/articles/managing-custom-formats.html

cpcloud commented 11 months ago

Hm, I was hoping my use case isn't that custom: I don't want a DataFrame out of pin_x, I want the set of cached on-disk paths that make up the dataset. Should I be using pin_download for that?

juliasilge commented 11 months ago

Hm, I was hoping my use case isn't that custom

Oh for sure, maybe I just made things more confusing by showing what folks can use pin_upload / pin_download for.

I want the set of cached on-disk paths that make up the dataset. Should I be using pin_download for that?

My opinion is yes, for consistency across our language implementations of pins. Neither is implemented yet though for Python:

https://github.com/rstudio/pins-python/blob/7ded06943c6ea3236e5da4443a3869073d66aeb1/pins/boards.py#L347

What do you all think of these methods returning the fsspec file directly, rather than the path, like @machow mentioned?

cpcloud commented 11 months ago

For my use case it would be less than ideal to get back an fsspec path, unless I can get a filename from that to the cached path on disk.

I'd like pins to be able to handle the downloading, caching, and checking for new data without having to deal with whatever cloud file system thing that ultimately backs the pin.

cpcloud commented 11 months ago

Sorry, to be clear what I mean is my ideal signature is this:

def pin_download(...) -> list[Path]: # or list[str]

juliasilge commented 11 months ago
def pin_download(...) -> list[Path]: # or list[str]

This is extremely consistent with the R pin_download so to me, this sounds like the best way to go. 👍

machow commented 11 months ago

I think this all makes sense! Just to be sure to recap:

@cpcloud do you mind if I pick up the PR from here? (should be able to make the changes tomorrow). Since pin_download() and pin_upload() aren't implemented, I'll need to make some quick tweaks, but shouldn't be too bad!

cpcloud commented 11 months ago

Sgtm, please take it over!

jimhester commented 11 months ago

Just wanted to chime in and say the proposed API would fit our needs perfectly as well, so thanks for working on this 🙏

machow commented 11 months ago

Alright, tests are passing with basic pin_upload() / pin_download() behavior.

Currently,

(Will also add pin_download/upload to docs)

machow commented 11 months ago

@isabelizimm when you get the chance, do you mind taking a look? I think the pin_download() / pin_upload() behavior should be implemented! (Though we still need to document them)

machow commented 11 months ago

@isabelizimm thanks for looking through the PR! I can chuck the methods into the docs, once https://github.com/rstudio/pins-python/pull/207 is merged!

cpcloud commented 11 months ago

@machow @isabelizimm Thanks for pushing this through ❤️

jimhester commented 10 months ago

Thanks again for working on this!

Is there any ETA on when a release with this change will be published?

I know from experience it is annoying when people ask this 😄, but we have a use case it would be perfect for it and I am trying to figure out if I should wait for a release or pull these changes in sooner manually.

isabelizimm commented 10 months ago

Great question @jimhester! We're planning on a release by the end of August; it's on the to-do list for @machow and I as I transition into the maintainer role. 😄

isabelizimm commented 10 months ago

Following up here-- @jimhester pins 0.8.2 has been released!