memoizing dependency lookups with a local resolve index

cosmicexplorer commented 1 year ago

What's the problem this feature will solve?

Splitting out this comment (https://github.com/pypa/pip/pull/11478#issuecomment-1653555553) into its own issue.

There remains one further avenue to investigate from #7819: amortizing/memoizing certain subproblems across separate invocations of the resolver. While I was able to identify one compartmentalized improvement in #7729, in general I still think pip (and all other package managers right now) currently spend most of their time recomputing a lot of the same constraints against each other over and over again for most hot-path use cases; for example, after identifying the subset of wheels that are compatible with a given interpreter, that information can be recorded and cached indefinitely by the local pip client to immediately jump to the subset of compatible wheels in future resolves.

Describe the solution you'd like

My vision is for pip to eventually perform most resolves entirely locally, only calling out to pypi when it's unable to find a satisfying set and it determines that a newer release of some package might be able to do so. I think this should be feasible without needing to store anything close to the magnitude of all the package contents on pypi, because cargo does that right now, and has some useful comments (https://github.com/rust-lang/cargo/blob/263d24baca913d7ace29e2f59a3254e47272fd6b/src/cargo/sources/registry/index.rs#L25-L30):

//! One important aspect of the index is that we want to optimize the "happy
//! path" as much as possible. Whenever you type `cargo build` Cargo will
//! *always* reparse the registry and learn about dependency information. This
//! is done because Cargo needs to learn about the upstream crates.io crates
//! that you're using and ensure that the preexisting `Cargo.lock` still matches
//! the current state of the world.

As detailed in that file, cargo essentially just serializes a list of the dependency relationships to json in order to publish their index, which is exactly how I implemented the persistent dependency resolution cache in #7819 that multiplied the effect of the fast-deps technique alone (https://github.com/cosmicexplorer/pip/blob/40cec94281aa9165c689dd4bb69f953e2fd5fb06/src/pip/_internal/resolution/legacy/resolver.py#L171-L304).

According to this hypothesis, while fast-deps (and now PEP 658) gives you the dependency relationships needed to iterate the resolver over a single candidate without actually downloading the many-megabyte zip file, but pip still currently has to pay the cost of network calls to move the resolver forward to evaluate each candidate, and if the candidate is discarded it has to do that again until the resolver finds a satisfying dependency set. In this file, cargo describes generating an index over another index to amortize the graph traversal performed during a cargo resolve while also being able to incorporate any updates to the index that have occurred since the last resolve. There's nothing special cargo is doing here with rust that we can't do in python; as they note:

//! Consequently, Cargo "null builds" (the index that Cargo adds to each build
//! itself) need to be fast when accessing the index. The primary performance
//! optimization here is to avoid parsing JSON blobs from the registry if we
//! don't need them. Most secondary optimizations are centered around removing
//! allocations and such, but avoiding parsing JSON is the #1 optimization.

Alternative Solutions

In pantsbuild/pex#2175 my goal for pex was to reduce i/o and computation from compressing very large amounts of third-party code we do not control. I figured out a hack to reduce that effort, but as @jsirois points out, it's often more robust and maintainable in the long term to develop standard protocols that are designed for the task at hand than rely on clever hacks to optimize existing ones. In that case, while I found it was possible to cache traditional --layout zipapp pex files with a clever hack, I learned that the --layout packed format had been designed from the ground up to enable that cacheability. Analogously, @uranusjr spent the time to develop PEP 658 to avoid the need for us to use clever hacks like fast-deps to get the information pip requires. In both cases, we were able to leverage features of the zip file format to put off standardization for a while, but we're in a better place now that more thoughtful infrastructure has been laid down.

spack currently performs an analogous process, deserializing and processing all of the dependency constraints annotated in spack package recipes upon each spack invocation (cc @tgamblin @alalazo)). In contrast, spack's main package repository is maintained as part of the spack codebase itself, so it can always expect the luxury of loading everything from the local filesystem. Still, it demonstrates that amortizing dependency graph traversal while incorporating updates from the source of truth is an intrinsic issue to package resolution, as opposed to the problem solved by fast-deps which was a temporary workaround to slowly-evolving packaging standards.

Additional context

Pip shares Cargo's assumption that each individual release of any package is immutable, so the dependency relationships that fast-deps/PEP 658 obtains from each wheel can therefore be cached and reused indefinitely after they are first computed. My hypothesis is that we can improve resolve performance by:

(short term) mirroring every lookup of package==version#checksum => [dependency requirements] that pip extracts from pypi during a normal resolve into a local database of some form, to be reused whenever pip evaluates that installation candidate in future resolves.
- As noted in pantsbuild/pex#2175, every new cache implies a tradeoff of speedup/disk space. I suspect this will be a small enough dataset to avoid that concern, though, especially if we restrict ourselves to just indexing the packages pip has seen before instead of trying to contain all of pypi at once.
- As with cargo, there are some dependency relationships that are not immutable, like editable installs or the most recent release of a package, and we would need to be able to incorporate that kind of information too.
(longer term) optimizing the local dependency index and the new resolver in tandem.
- Maybe this is the hubris of a beginner, but I am under the impression pip would be able to perform most resolutions in the blink of an eye if it didn't need to intersperse filesystem and/or network i/o between every step of the resolver's graph walk.
- The reason PEP 658 works is that it cuts out the ceremony required to access the dependency relationships for a given installation candidate--I think if we continue to pare down the stuff we cache and retain from entire wheels to just the information we need for a resolve (similarly to discussions of what to include in pip install --report), we can turn this into more and more of a compute-bound rather than IO-bound problem.

pfmoore commented 1 year ago

This sounds like a really nice idea. I haven't thought it through in any detail yet, but maybe this is something that could be implemented as a plugin/wrapper for resolvelib? As far as I know, resolvelib makes the same assumptions (candidates are immutable, and a resolve will not change unless the set of candidates involved changes) so in theory it should be possible to write a library that implements some sort of CachingResolver on top of resolvelib's BaseResolver.

This may be over-engineering things, but I'm imagining maintainability benefits from some sort of separation of concerns like this.

cosmicexplorer commented 1 year ago

I absolutely think that this is something the resolution library can and should be owning, as the cacheability is intrinsically linked to the resolution logic. I also have a hunch that this might be a case where the relative simplicity of resolvelib makes it much easier to memoize, virtualize, and amortize subproblems of the resolve, the same way writing a parser by hand lets you slap on wonky context-sensitive stuff more easily than a parser generator.

cosmicexplorer commented 1 year ago

I think one possible result might be: resolvelib defines an interface for a "local index" oracle that would allow it to speed up some operations, and users of the library can provide an implementation. For example, Signal's client library for message en/decryption defines "store" interfaces for the types of secure information it needs to access, and each FFI backend implements those stores so that their contents remain secure https://github.com/signalapp/libsignal/blob/3b7f3173cc4431bc4c6e55f6182a37229b2db6fd/rust/protocol/src/storage/traits.rs#L45.

cosmicexplorer commented 1 year ago

This may be over-engineering things, but I'm imagining maintainability benefits from some sort of separation of concerns like this.

I mentioned in the issue that spack has the same problem, and I'm absolutely hoping we can use this opportunity to make some generalizable tool instead of hoarding all the performance wins like so many projects end up doing.

In the prior work, I was able to distinguish two types of indices or caching:

"lookup" indices, which record immutable key->value pairs.
- In pip's case, this covers the set of requirements extracted from the dist's METADATA file.
  - PEP 658 and/or fast-deps make this faster, but we're still doing at least two network calls for each install candidate and that i/o needs to be completely pipelined out of the main resolve loop if at all possible.
- These take many forms; see #7729.
- No dependencies on any other data, so very easy to shard and scale.
"amortize" indices, which would be used to short-circuit specific steps of the actual resolve logic.
- These are the kinds of complex caches that often introduce bugs in distributed networks, but luckily we don't need global coherency, just local coherency, and we don't have to worry as much about chaotic failure rates.
- This is the kind of thing that I think is made easier due to the other design choices already made for resolvelib.

cosmicexplorer commented 1 year ago

Thanks again for mentioning resolvelib, @pfmoore: I think its approach makes identifying these cache points much easier. From https://github.com/sarugaku/resolvelib/blob/main/src/resolvelib/providers.py, I can immediately see that get_dependencies() could be well-served by a key-value store or "lookup" index, while identify() and is_satisfied_by() by definition should be quick enough to compute pairwise and can be ignored. However, get_preference() and find_matches() will both:

have their value invalidated by new releases (so can't be cached in a key-value store),
are nontrivial to cache in a meaningful way (as the correct response depends on the entire transitive dependency closure).

I think it would make sense to set aside the fun part (how to amortize the state-dependent resolve logic of get_preferences() and find_matches()) for now and first make a proof of concept for a key-value store to cache get_dependencies(), which can be implemented without reference to any specific resolver. The next part will get interesting.

uranusjr commented 1 year ago

Only thing I’m wondering at the moment is whether the cache should be per-artifact instead of per-package-version since technically each wheel of the same version can have different dependencies, and there are known packages that use this “feature”. These are always an edge case but making the cache per-artifact shouldn’t add much overhead in most cases so I don’t see why that’d be a bad idea anyway. A cache for get_dependencies should be pretty easy to implement (at $work I actually made a similar thing, except on top of pip and less fine-grained since our scope is narrower).

cosmicexplorer commented 1 year ago

"per-artifact" is exactly the right way to describe it! I'll use that terminology.

since technically each wheel of the same version can have different dependencies, and there are known packages that use this “feature”

Thanks so much for pointing this out! We will also need to be especially careful about documenting the assumptions we make when we begin looking at how to amortize sections of the stateful resolve process, not just caching stateless network requests. I do really appreciate how the definition of an idempotent identity for each candidate is covered by the AbstractProvider#identify() operation in resolvelib, though.

uranusjr commented 1 year ago

Not sure Im following correctly—identify doesn’t really provide a unique identity to each candidate; it’s more like for grouping things together (which candidates correspond to which requirement) so the resolver knows what it got from the provider. In Python this is just the package name (optionally the extras attached) and does not really effectively identify a candidate in a general sense. I don’t think we can change this either since the identities of a requirement and the candidates found with it are supposed to be identical for the resolver to work.

cosmicexplorer commented 1 year ago

Thanks so much for explaining that! I assumed something totally different from the name.

cosmicexplorer commented 1 year ago

I do not believe that arrangement in resolvelib will hinder this investigation.

cosmicexplorer commented 1 year ago

....so I've been getting some pretty ridiculous results. My current branch is at https://github.com/cosmicexplorer/pip/tree/integration-branch, and it incorporates #12186, #12197, #12193, #12208, and then a couple other changes to perform the types of caching discussed above. In particular, it introduces a cache for looking up metadata from a Link (this part is very small), then another (more complex) cache for parsing links from index or find-links pages:

> time python3.8 -m pip install --dry-run --ignore-installed --report test.json --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
WARNING: pip is using lazily downloaded wheels using HTTP range requests to obtain dependency information. This experimental feature is enabled through --use-feature=fast-deps and it is not ready for production.
Collecting numpy>=1.19.5
Collecting keras==2.4.3
Collecting mtcnn
Collecting pillow>=7.0.0
Collecting bleach>=2.1.0
Collecting tensorflow-gpu==2.5.3
Collecting scipy>=0.14 (from keras==2.4.3)
Collecting pyyaml (from keras==2.4.3)
Collecting h5py (from keras==2.4.3)
Collecting numpy>=1.19.5
Collecting absl-py~=0.10 (from tensorflow-gpu==2.5.3)
Collecting astunparse~=1.6.3 (from tensorflow-gpu==2.5.3)
Collecting flatbuffers~=1.12.0 (from tensorflow-gpu==2.5.3)
Collecting google-pasta~=0.2 (from tensorflow-gpu==2.5.3)
Collecting h5py (from keras==2.4.3)
Collecting keras-preprocessing~=1.1.2 (from tensorflow-gpu==2.5.3)
Collecting opt-einsum~=3.3.0 (from tensorflow-gpu==2.5.3)
Collecting protobuf>=3.9.2 (from tensorflow-gpu==2.5.3)
Collecting six~=1.15.0 (from tensorflow-gpu==2.5.3)
Collecting termcolor~=1.1.0 (from tensorflow-gpu==2.5.3)
Collecting typing-extensions~=3.7.4 (from tensorflow-gpu==2.5.3)
Collecting wheel~=0.35 (from tensorflow-gpu==2.5.3)
Collecting wrapt~=1.12.1 (from tensorflow-gpu==2.5.3)
Collecting gast==0.4.0 (from tensorflow-gpu==2.5.3)
Collecting tensorboard~=2.5 (from tensorflow-gpu==2.5.3)
Collecting tensorflow-estimator<2.6.0,>=2.5.0 (from tensorflow-gpu==2.5.3)
Collecting keras-nightly~=2.5.0.dev (from tensorflow-gpu==2.5.3)
Collecting grpcio~=1.34.0 (from tensorflow-gpu==2.5.3)
Collecting opencv-python>=4.1.0 (from mtcnn)
Collecting webencodings (from bleach>=2.1.0)
INFO: pip is looking at multiple versions of tensorboard to determine which version is compatible with other requirements. This could take a while.
Collecting tensorboard~=2.5 (from tensorflow-gpu==2.5.3)
Collecting google-auth<3,>=1.6.3 (from tensorboard~=2.5->tensorflow-gpu==2.5.3)
Collecting google-auth-oauthlib<0.5,>=0.4.1 (from tensorboard~=2.5->tensorflow-gpu==2.5.3)
Collecting markdown>=2.6.8 (from tensorboard~=2.5->tensorflow-gpu==2.5.3)
Collecting protobuf>=3.9.2 (from tensorflow-gpu==2.5.3)
Collecting requests<3,>=2.21.0 (from tensorboard~=2.5->tensorflow-gpu==2.5.3)
Collecting setuptools>=41.0.0 (from tensorboard~=2.5->tensorflow-gpu==2.5.3)
Collecting tensorboard-data-server<0.7.0,>=0.6.0 (from tensorboard~=2.5->tensorflow-gpu==2.5.3)
Collecting tensorboard-plugin-wit>=1.6.0 (from tensorboard~=2.5->tensorflow-gpu==2.5.3)
Collecting werkzeug>=1.0.1 (from tensorboard~=2.5->tensorflow-gpu==2.5.3)
Collecting cachetools<6.0,>=2.0.0 (from google-auth<3,>=1.6.3->tensorboard~=2.5->tensorflow-gpu==2.5.3)
Collecting pyasn1-modules>=0.2.1 (from google-auth<3,>=1.6.3->tensorboard~=2.5->tensorflow-gpu==2.5.3)
Collecting rsa<5,>=3.1.4 (from google-auth<3,>=1.6.3->tensorboard~=2.5->tensorflow-gpu==2.5.3)
Collecting urllib3<2.0 (from google-auth<3,>=1.6.3->tensorboard~=2.5->tensorflow-gpu==2.5.3)
Collecting requests-oauthlib>=0.7.0 (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard~=2.5->tensorflow-gpu==2.5.3)
Collecting importlib-metadata>=4.4 (from markdown>=2.6.8->tensorboard~=2.5->tensorflow-gpu==2.5.3)
Collecting charset-normalizer<4,>=2 (from requests<3,>=2.21.0->tensorboard~=2.5->tensorflow-gpu==2.5.3)
Collecting idna<4,>=2.5 (from requests<3,>=2.21.0->tensorboard~=2.5->tensorflow-gpu==2.5.3)
Collecting certifi>=2017.4.17 (from requests<3,>=2.21.0->tensorboard~=2.5->tensorflow-gpu==2.5.3)
Collecting MarkupSafe>=2.1.1 (from werkzeug>=1.0.1->tensorboard~=2.5->tensorflow-gpu==2.5.3)
Collecting zipp>=0.5 (from importlib-metadata>=4.4->markdown>=2.6.8->tensorboard~=2.5->tensorflow-gpu==2.5.3)
Collecting pyasn1<0.6.0,>=0.4.6 (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard~=2.5->tensorflow-gpu==2.5.3)
Collecting oauthlib>=3.0.0 (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard~=2.5->tensorflow-gpu==2.5.3)
Would install Keras-2.4.3 Keras-Preprocessing-1.1.2 Markdown-3.4.4 MarkupSafe-2.1.3 Pillow-10.0.0 PyYAML-6.0.1 Werkzeug-2.3.6 absl-py-0.15.0 astunparse-1.6.3 bleach-6.0.0 cachetools-5.3.1 certifi-2023.7.22 charset-normalizer-3.2.0 flatbuffers-1.12 gast-0.4.0 google-auth-2.22.0 google-auth-oauthlib-0.4.6 google-pasta-0.2.0 grpcio-1.34.1 h5py-3.1.0 idna-3.4 importlib-metadata-6.8.0 keras-nightly-2.5.0.dev2021032900 mtcnn-0.1.1 numpy-1.19.5 oauthlib-3.2.2 opencv-python-4.8.0.76 opt-einsum-3.3.0 protobuf-3.20.3 pyasn1-0.5.0 pyasn1-modules-0.3.0 requests-2.31.0 requests-oauthlib-1.3.1 rsa-4.9 scipy-1.10.1 setuptools-68.0.0 six-1.15.0 tensorboard-2.11.2 tensorboard-data-server-0.6.1 tensorboard-plugin-wit-1.8.1 tensorflow-estimator-2.5.0 tensorflow-gpu-2.5.3 termcolor-1.1.0 typing-extensions-3.7.4.3 urllib3-1.26.16 webencodings-0.5.1 wheel-0.41.1 wrapt-1.12.1 zipp-3.16.2

real    0m4.335s
user    0m2.868s
sys     0m0.143s

The link metadata caching can be seen in https://github.com/pypa/pip/compare/main...cosmicexplorer:link-metadata-cache?expand=1, it's quite small. The work to cache link parsing is a bit larger and can be seen at https://github.com/cosmicexplorer/pip/compare/link-metadata-cache..link-parsing-cache?expand=1. I will be planning to make PRs of both of those once my other work is in.

cosmicexplorer commented 1 year ago

For a sense of scale, the link metadata caching brings that command down from >50 seconds to 12-14 seconds, and caching link parsing from index pages takes that down to 4-6 seconds. I suspect we'd want to put these both behind the same feature flag, but probably add them in separate changes.

cosmicexplorer commented 1 year ago

Ok, the work from the above has been split off into #12256 and #12257 (EDIT: and #12258!)!

cosmicexplorer commented 2 months ago

This issue is now superseded by #12921, which has specific answers to all of the above discussion!!

pypa / pip