rstudio / pins-python

https://rstudio.github.io/pins-python/

Export packages used by fsspec to interact with backends #190

Closed isabelizimm closed 1 year ago

isabelizimm commented 1 year ago

👋 When using pins + vetiver to deploy a Dockerfile, it is necessary to collect all the requirements to authenticate to a board. Right now, it is not possible to determine what packages are used to provide that authentication.

Some more context from: https://github.com/rstudio/vetiver-python/issues/165

It looks like this is happening since vetiver is unaware of the backend pins is connecting to, so it is not tracking the requirements necessary to authenticate. Looking at the equivalents of these packages in R, pins-r has a required_pkgs method for boards, and vetiver-r aggregates those with the modeling requirements when people run vetiver_prepare_docker().

machow commented 1 year ago

Thanks for raising--can you clarify what you mean by requirements? The requirement in pins is that this code returns a backend:

import fsspec

# gs for google cloud storage, but swap in protocol for any target backend
fsspec.filesystem("gs")

We test against specific packages (gcsfs, s3fs, etc.), but these packages don't need to be installed for pins to run (for example, you don't need s3fs to connect to S3, just some package that provides the fsspec entrypoint for s3).

machow commented 1 year ago

It seems like there are two key differences from pins R:

We could include the names of the packages we use / test against, but it seems like at that point, it might be better to keep it in vetiver? Or we could store it in pins' options.extras_require in setup.cfg, and read that from the metadata?
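As a rough sketch of the "read that from the metadata" idea: extras declared in setup.cfg end up as Requires-Dist entries carrying an `extra == "..."` marker, which `importlib.metadata.requires()` returns as plain strings. The helper below groups requirement names by extra. The function name and the sample entries (including the extra names) are hypothetical, not taken from pins.

```python
import re
from collections import defaultdict


def extras_from_requires(requires: list[str]) -> dict[str, list[str]]:
    """Group requirement names by the extra named in their environment marker."""
    extras: dict[str, list[str]] = defaultdict(list)
    for req in requires:
        marker = re.search(r"""extra\s*==\s*['"]([^'"]+)['"]""", req)
        name = re.match(r"[A-Za-z0-9._-]+", req.strip())
        if marker and name:  # requirements without an extra marker are skipped
            extras[marker.group(1)].append(name.group(0))
    return dict(extras)


# Hypothetical entries, shaped like the strings metadata.requires(...) returns
sample = [
    "fsspec>=2022.2.0",
    'gcsfs; extra == "gcs"',
    "s3fs ; extra == 'aws'",
    "adlfs; extra == 'azure'",
]
print(extras_from_requires(sample))
```

Against an installed package, the input would be something like `metadata.requires("pins") or []`, since `requires()` returns None when no requirements are recorded.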

Alternatively, we can do some metadata digging to figure it out, based on the package an fsspec backend object comes from, but it could get gnarly.

Metadata digging

For example, suppose I run pip install plum-dispatch. This gives me a package called plum. I can sort of reverse from the name "plum" back to plum-dispatch distribution info:

from importlib import metadata

# "plum-dispatch"
dist_name = metadata.packages_distributions()["plum"][0]
meta = metadata.distribution(dist_name)

meta.files

However, because the distribution may have come from somewhere besides PyPI, it's not really guaranteed to be pip installable. If you pip install a package from github, inside the meta.files you'll see a file called direct_url.json that describes where it came from. But I have no idea how consistent that is across tools etc... (and could be very wrong about all of this).

edit: If you want install recommendations, fsspec.registry.known_implementations seems like a good place to look?

isabelizimm commented 1 year ago

Thanks for this very comprehensive write-up!

can you clarify what you mean by requirements?

So, the particular instance that inspired this is: when vetiver creates a Dockerfile that connects to a pins board, it generates a requirements.txt from the vetiver model's required_pkgs metadata, listing the packages needed to communicate with that board and deploy the stored model. Right now, the requirements.txt will include pins but not, say, gcsfs, since it is unknown which packages are necessary to authenticate to the board.

When the Dockerfile is run without gcsfs in its requirements.txt, users receive errors saying they need to install gcsfs. I was hoping we would be able to have something like board.required_pkgs that would include gcsfs to install this package without users needing to manipulate the requirements.txt themselves.

edit: If you want install recommendations, fsspec.registry.known_implementations seems like a good place to look?

The known_implementations lead was super helpful. It looks like the protocol from board.fs.protocol could be matched up to fsspec.registry.known_implementations pretty easily! I'm happy to add this into vetiver if it seems too niche of a use case to have in pins.
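One way that matching could look is sketched below. This is not the code that ended up in vetiver. It assumes each registry entry has a "class" key holding a dotted import path, and it takes the top-level module of that path as the package to install, which may differ from the pip distribution name. The function name and the sample registry are illustrative; with fsspec installed you would pass `fsspec.registry.known_implementations` instead.

```python
from typing import Mapping, Sequence, Union


def packages_for_protocol(
    protocol: Union[str, Sequence[str]],
    known: Mapping[str, Mapping[str, str]],
) -> set[str]:
    """Guess the top-level packages that implement the given fsspec protocol(s)."""
    # a filesystem's protocol may be a single string or a tuple of aliases
    protocols = [protocol] if isinstance(protocol, str) else list(protocol)
    packages = set()
    for proto in protocols:
        entry = known.get(proto)
        if entry is None:
            continue
        top_level = entry["class"].split(".")[0]
        if top_level != "fsspec":  # simplification: treat built-in backends as needing nothing extra
            packages.add(top_level)
    return packages


# Sample entries mirroring the assumed shape of fsspec.registry.known_implementations
sample_known = {
    "file": {"class": "fsspec.implementations.local.LocalFileSystem"},
    "gs": {"class": "gcsfs.GCSFileSystem", "err": "Please install gcsfs"},
    "gcs": {"class": "gcsfs.GCSFileSystem", "err": "Please install gcsfs"},
    "s3": {"class": "s3fs.S3FileSystem", "err": "Install s3fs to access S3"},
}
print(packages_for_protocol(("gcs", "gs"), sample_known))  # {'gcsfs'}
```

The resulting names could then be appended to the generated requirements.txt alongside pins.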

isabelizimm commented 1 year ago

I tried out a possible implementation in the vetiver package, and it doesn't feel too clunky over there! It is pretty similar to board_deparse, but returning values from fsspec.registry.known_implementations. I'm okay closing this out and having the functionality in vetiver instead :D

https://github.com/rstudio/vetiver-python/pull/166

isabelizimm commented 1 year ago

Closing this, as the change will be made in vetiver.