rstudio / vetiver-python

Version, share, deploy, and monitor models.
https://rstudio.github.io/vetiver-python/stable/
MIT License

tracking requirements in `required_pkgs` #140

Open isabelizimm opened 1 year ago

isabelizimm commented 1 year ago

This conversation is starting to get lost in #126, so bringing it over here :)

From @juliasilge

Both start out with a base level of just the packages directly required to make a prediction. This is reasonably likely to work and may be enough, especially in R, where updating to the latest version is basically always the right move. Then both R and Python will have an option to escalate to more robust package version tracking.

In R, we're going straight to renv, since it is the tool most people are familiar with for this type of task and one whose development we have input into. So there are two levels, both familiar to R users: package names only, plus opt-in to full renv. In Python, the closest equivalent to renv (Pipfile.lock) can seem like overkill and may be less familiar to many practitioners. Instead we can use pip-tools to generate a requirements.txt that is pinned to specific versions and covers the whole dependency graph. So there are two levels here too, chosen to be more comfortable for Python users: package names only, plus opt-in to the pip-tools pinned requirements.

The general idea would be that instead of required_pkgs, there would be an argument called requirements or requirements_txt. The default would be what required_pkgs does currently: give the names of the minimal packages required to make predictions at a model's endpoint. There could be another argument that makes these minimal requirements more robust. The top-level requirements would include the version (i.e., vetiver==0.1.8 and scikit-learn==1.2.0), and pip-tools would be used to find compatible versions of the second-level dependencies. (The issue with just doing pip freeze is that it will include everything in the environment and, maybe more annoyingly, is not a guarantee that the environment can be recreated.)

So,

my_vetiver_model.requirements

could output something like:

vetiver
scikit-learn

or something like below, where it is generated from a pinned vetiver==0.1.8 and scikit-learn==1.2.0:

...
requests==2.28.1
    # via
    #   pins
    #   vetiver
rfc3986[idna2008]==1.5.0
    # via httpx
rsconnect-python==1.13.0
    # via vetiver
scikit-learn==1.2.0
    # via
    #   -r /var/folders/5w/dhznpltj14n3nxr4fybjj8_w0000gn/T/tmp8p4nsqtj.in
    #   vetiver
scipy==1.9.3
    # via scikit-learn
...
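
A minimal sketch of the two levels described above, assuming pip-tools is installed and its pip-compile command is on the PATH. The function names top_level_requirements and compile_requirements are hypothetical and not part of vetiver; only the first function runs without pip-tools.

```python
import subprocess
import tempfile
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path


def top_level_requirements(pkgs, pin=False, get_version=version):
    """Return requirement lines for the packages needed to predict.

    pin=False gives names only (what required_pkgs does today).
    pin=True gives name==installed_version, the input for pip-compile.
    Hypothetical helper, not a vetiver API.
    """
    lines = []
    for name in pkgs:
        if not pin:
            lines.append(name)
            continue
        try:
            lines.append(f"{name}=={get_version(name)}")
        except PackageNotFoundError:
            # Package is not installed locally: fall back to the bare name.
            lines.append(name)
    return lines


def compile_requirements(pinned_lines):
    """Resolve the full dependency graph with pip-tools.

    Assumes the pip-compile command is available; raises if it fails.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "requirements.in"
        out = Path(tmp) / "requirements.txt"
        src.write_text("\n".join(pinned_lines) + "\n")
        subprocess.run(
            ["pip-compile", str(src), "--output-file", str(out)],
            check=True,
        )
        return out.read_text()
```

Calling top_level_requirements(["vetiver", "scikit-learn"]) returns the bare names, and passing the pin=True result to compile_requirements would produce an annotated file like the excerpt above. Unlike pip freeze, only packages reachable from the top-level pins end up in the output.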

CC: @machow @juliasilge

juliasilge commented 1 year ago

This is related to rstudio/vetiver-r#154

machow commented 1 year ago

Thanks for this! It feels like the end goal can be tricky to parse from the language-specific details here. For example, when comparing the R and Python programs, I noticed that for Docker deployment, they differ in how they pin package versions for users.

Can we write out somewhere the high-level rules, and the cases they apply to (without any mention of technical solutions)? It'd help to hear the end result users should expect to see in terms of versions installed.

Here's a rough example (which may be wrong):

Rules for write_dockerfile

isabelizimm commented 1 year ago

Ah, that's a good way to establish what needs to happen! I think you have it mostly right, but this will mainly happen at writing/reading pins:

Rules for write_dockerfile

The purpose of this is to check that the versions of the model package and vetiver are the same at pin read as when the model was originally written to the pin.

This should be mostly invisible to users. If people are interested in looking at this file, they are able to do so via board.pin_meta.
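
The check described above might look roughly like the following. This is a sketch only: the helper name check_versions and the shape of the recorded metadata (a plain dict of package name to version) are assumptions, not how vetiver actually stores or reads it.

```python
from importlib.metadata import PackageNotFoundError, version


def check_versions(recorded, get_version=version):
    """Compare versions recorded at pin write with what is installed now.

    recorded: dict like {"vetiver": "0.1.8"} (hypothetical metadata shape).
    Returns {name: (recorded_version, installed_version)} for each mismatch;
    installed_version is None when the package is missing.
    """
    mismatches = {}
    for name, wanted in recorded.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

An empty result means the environment at pin read matches the one at pin write. A non-empty result could be surfaced as a warning rather than an error, which would keep the check mostly invisible to users.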