Environment caching

knthmn commented 3 years ago

I have a number of scripts for my personal use and I found pip-run very useful for managing my dependencies. However, it still takes several seconds for pip to install the packages, even though it already uses pip's download cache.

I propose adding an option (e.g. --cache) where the directory containing the packages is not deleted after the program exits, and it can be reused as long as the dependencies list does not change. Similar tools (e.g. kotlin-main-kts) also have caching that makes rerunning a script practically free.

There are several things that need to be considered:

The cached environments need to be managed, since they take up disk space (e.g. numpy takes up around 70MiB). A simple solution would be to keep using /tmp and let the system manage it.
Identification of a cached environment. I think using the hash of the dependency list should work.

I think it is best to keep the functionality simple, and in the worst case just destroy and recreate the environment. But I do expect it to be able to avoid recreation of the environment if I run the same script twice.
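
As a rough illustration of the proposal above (not pip-run's actual implementation), the following Python sketch keys an environment on a hash of the requirement strings and installs only on a cache miss. The names `env_key` and `get_or_create_env` and the `install` callback are hypothetical; a real installer would probably shell out to pip.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def env_key(requirements: Iterable[str]) -> str:
    """Hash the requirement strings exactly as the user stated them."""
    # Sort so that argument order does not produce a different environment.
    normalized = sorted(req.strip() for req in requirements)
    return hashlib.sha256("\n".join(normalized).encode("utf-8")).hexdigest()


def get_or_create_env(
    root: Path,
    requirements: Iterable[str],
    install: Callable[[Path, list], None],
) -> Path:
    """Return a directory for these requirements, installing only on a miss."""
    reqs = list(requirements)
    target = root / env_key(reqs)
    if not target.is_dir():
        target.mkdir(parents=True)
        install(target, reqs)  # hypothetical hook, e.g. a wrapper around pip
    return target


if __name__ == "__main__":
    calls = []
    with tempfile.TemporaryDirectory() as tmp:
        record = lambda target, reqs: calls.append(reqs)  # stand-in installer
        first = get_or_create_env(Path(tmp), ["requests", "numpy"], record)
        second = get_or_create_env(Path(tmp), ["numpy", "requests"], record)
        print(first == second, len(calls))
```

In this sketch a failed install would leave a partial directory behind; destroying and recreating it, as suggested above, would be the simple recovery.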

knthmn commented 3 years ago

Please tell me your opinion, if you like the idea I can work on it.

jaraco commented 3 years ago

Hi @knthmn. Thanks for the proposal. I've long bemoaned the sluggish performance, but put up with it for the simplicity of the implementation. In particular, one of the big advantages of pip-run is that it's stateless, so it leaves little behind to be cleaned up.

I've similarly thought about ways pip-run could somehow optimize the performance.

The design you describe aligns closely with what I would expect. I'd tweak it slightly thus:

Instead of installing to the temp dir, if re-usable installs are indicated, I'd use a well-known 'cache' directory ($XDG_CACHE_HOME/pip-run/XXXX), rather than relying on temp.
Rather than --cache, I'd suggest --reuse or --persist, which more closely aligns with the pip-run usage and is more distinct from pip install parameters.
That said, I'm uncertain that a command-line parameter will be the most useful. I'd like for users not to have to add a parameter to every invocation. Perhaps it would be better to have an environment indicator (env var, config file, directory presence, ...) to signal that the re-use behavior should apply. Or perhaps both could be available.
Another possibility could be to expose a separate entry point for this behavior, such as pip-rerun or pip-sprint.
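
To make the options above concrete, here is a small Python sketch, under stated assumptions, of resolving the cache location and detecting an opt-in. `--persist` is the flag name proposed in this thread, and `PIP_RUN_REUSE_DEMO` is a made-up variable name used only for illustration; neither is claimed to be something pip-run supports.

```python
import os
from pathlib import Path

# Hypothetical opt-in variable for this sketch; not an actual pip-run setting.
REUSE_VAR = "PIP_RUN_REUSE_DEMO"


def cache_root(environ=os.environ) -> Path:
    """Resolve $XDG_CACHE_HOME/pip-run, defaulting to ~/.cache/pip-run."""
    # An unset or empty XDG_CACHE_HOME falls back to ~/.cache.
    base = environ.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(base) / "pip-run"


def reuse_requested(argv, environ=os.environ) -> bool:
    """True if either the proposed flag or the environment indicator is set."""
    return "--persist" in argv or environ.get(REUSE_VAR, "") not in ("", "0")


if __name__ == "__main__":
    print(cache_root())
    print(reuse_requested(["--persist", "requests"]))
```

Supporting both signals keeps a shebang'd script explicit while letting a user opt in once for every invocation.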

I do think producing a hash of the dependency list may prove more difficult than it sounds on two dimensions:

First, across invocations, available dependencies can change. For example, pip-run requests today may pull one version, whereas pip-run requests tomorrow could pull a newer (or even older in case of a yank/delete) version. One of the advantages of pip-run is that its invocation is fairly independent of the environment. Adding this state could make it more difficult to anticipate what the behavior of pip-run would be for a given user's environment.
Second, pip-run doesn't resolve dependencies, but only passes them to pip install. So pip-run -r requirements.txt is very different depending on the context. I wouldn't expect pip-run to have the same behavior if the contents of that file changed or if the current directory changed.

I'd like some thoughts on how you'd propose to address those concerns.

If we can come up with some reasonable behaviors for these concerns and come up with an implementation that is fairly clean (doesn't introduce too many touch points), I'd be inclined to move forward with it.

knthmn commented 3 years ago

Thank you for the response.

I agree on using $XDG_CACHE_HOME and --persist
I assumed the user would have shebang'd the script anyway. However, I agree that more choices should be given to the user. I am personally not a fan of multiple entry points (like useradd and usermod) since I think they pollute the command namespace and hurt discoverability.

As for the identity of an environment:

I didn't state it clearly: I was thinking of identifying environments by the requirements given by the user, not the resolved list of dependencies. Thus if it is invoked with requests, then it would keep using the same environment whether requests is updated or not. This roughly corresponds to pip install -r requirements.txt without specifying the exact version.
A user can pin the dependency with requests==<some_version> to prevent their scripts from breaking. This would create a separate environment from the one for requests, even if they happen to resolve to the same version.

I wanted to have this because if each script has its own environment, the total size can balloon fairly quickly. I have also thought of linking the dependencies but it sounds too complicated and might not even work.

jaraco commented 2 years ago

I wonder - does pipx run achieve what you're seeking here? I wonder if it's better to let pipx handle more permanent environments and pip-run to always handle ephemeral ones.

pfmoore commented 2 years ago

I would also like this feature. For my use cases, I would be fine with identifying environments by the requirements as stated by the user. I agree that managing environments is hard, but to be honest, that's why I want the tool to handle it for me. I'd be perfectly OK with a simple initial implementation, with improvements being added based on real-world experience. (I can list off things I'd like to have, but I don't know in advance how necessary they are, and if the list is too long I imagine it would simply make the feature look too complex to accept).

For my own use cases (typically pip-run SCRIPT with the dependencies defined in the script) I don't see how pipx run would help, as it needs the script to be packaged into a standard distribution. Although thinking a bit "outside the box", if there was a tool that built a simple wheel from a script (with embedded dependencies), without needing a full "project directory" then that might make "packaging the script" simple enough to work smoothly with pipx run. I'll be back, I'm going to see if I can make something like that 🙂

agoose77 commented 2 years ago

@pfmoore that actually isn't a terrible idea - what if we could extend pipx with a plugin system that lets the user define different kinds of spec (not sure if this should be --spec). E.g.

# TOML below
# dependencies: ["numpy", ...]

import numpy

The pipx part is that these plugins would just take a spec and build a wheel, which is then given to pipx, meaning that the syntax is a per-plugin thing.

jaraco commented 1 year ago

@knthmn I updated your comment to replace -q requests with simply requests. The -q is a separate, unrelated parameter that just means "be quiet when installing" (and in pip-run 9, is unnecessary).

I still don't think there's a good answer to the concern about how to cache an environment if requirements are passed as a requirements file. Let me illustrate:

 draft $ cat > requirements.txt
tempora
 draft $ pip-run -q -r requirements.txt -- -c pass
 draft $ cat > requirements.txt
numpy
 draft $ pip-run -q -r requirements.txt -- -c pass
 draft $

In those two invocations, the set of dependencies available in each environment is very different, but the parameters to pip-run are identical. Moreover, the behavior of pip-run is identical. The fact that requirements.txt changed between invocations only affects the underlying pip install call.

So the question is - if pip-run is to cache environments based on the inputs from the user, how does it distinguish the first invocation from the second? I see a few options:

Treat the invocations as the same and let the second invocation get the cached environment from the first. This model will provide a fast experience but sacrifice accuracy anytime requirement files are used (numpy would be missing in the second invocation).
Only cache environments when requirements files aren't used. This approach will result in performance improvements for other environments where specific dependencies are declared (including script-inline dependencies).
Parse out all requirements.txt files (any files to -r/--requirement) and integrate those requirements with other specified requirements. As pip-run doesn't currently parse requirement files, it will need to implement that behavior and the integration logic, duplicating as closely as possible the behavior found (but not exposed) in pip.
Include the hash of any requirements file as part of the cache key (so any change to a requirements file would result in a new environment).

I'm leaning toward the last option.
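
A minimal sketch of that last option, assuming the cache key is built from the arguments passed through to pip plus the bytes of each requirements file. `cache_key` is a hypothetical helper, not part of pip-run; it handles only the separated `-r FILE` and `--requirement FILE` forms and does not follow requirements files nested inside other requirements files.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(pip_args, cwd=".") -> str:
    """Hash the pip arguments plus the contents of any requirements file."""
    digest = hashlib.sha256()
    args = list(pip_args)
    for index, arg in enumerate(args):
        digest.update(arg.encode("utf-8") + b"\0")
        if arg in ("-r", "--requirement") and index + 1 < len(args):
            # Relative paths resolve against cwd, so the same file name in
            # another directory contributes that directory's file contents.
            req_file = Path(cwd) / args[index + 1]
            digest.update(req_file.read_bytes() + b"\0")
    return digest.hexdigest()


if __name__ == "__main__":
    # Mirrors the session above: same arguments, different file contents.
    with tempfile.TemporaryDirectory() as tmp:
        req = Path(tmp) / "requirements.txt"
        req.write_text("tempora\n")
        first = cache_key(["-q", "-r", "requirements.txt"], cwd=tmp)
        req.write_text("numpy\n")
        second = cache_key(["-q", "-r", "requirements.txt"], cwd=tmp)
        print(first != second)
```

Because the file contents feed the hash, the two invocations shown earlier would map to different environments even though their command lines are identical.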

jaraco / pip-run

Environment caching #52