Introduce new API in resolvelib to Optimize Pip's Dependency Resolution

pypa / pip

The Python package installer

https://pip.pypa.io/

MIT License

9.46k stars 3.01k forks source link

Introduce new API in resolvelib to Optimize Pip's Dependency Resolution #12497

Open notatallshaw opened 7 months ago

notatallshaw commented 7 months ago

What's the problem this feature will solve?

Pip's dependency resolution process, particularly in complex backtracking scenarios, currently faces performance issues due to repetitive calculations, leading to O(n^2) complexity. This issue arises when checking if a "name" is part of a cause but also hinders the implementation of more complex checks, as they would slow down simple backtracking scenarios.

Describe the solution you'd like

The solution proposes adding a new abstract method, filter_unsatisfied_names, in resolvelib, to be implemented by Pip. This method is designed to work alongside the existing get_preference method, refining the backtracking process during dependency resolution.

filter_unsatisfied_names pre-filters unsatisfied names before they are processed by get_preference.
Unlike get_preference, which is called once for each name, filter_unsatisfied_names is called only once per backtracking round.
All unsatisfied names are provided to filter_unsatisfied_names, in contrast to get_preference, which receives one name at a time.
Hence filter_unsatisfied_names is able to short-circuiting checks.
And filter_unsatisfied_names can determine if multiple names are important to backtrack on simultaneously.

Alternative Solutions

Enhancing get_preference: Improvements to this method could offer some benefits, but it would struggle to enable complex backtracks without imposing performance penalties on simple backtracks.
Implementing Caching: While caching can reduce redundant work, it might lead to increased memory usage and added complexity.

Additional context

I am not tied to the name filter_unsatisfied_names or even to this specific approach, as long as it's possible to filter unsatisfied names more than one at a time.

A PR is open on the resolvelib side, but it is unlikely to ever be merged without buy-in from Pip maintainers: https://github.com/sarugaku/resolvelib/pull/145

Code of Conduct

[X] I agree to follow the PSF Code of Conduct.

pfmoore commented 7 months ago

Two points here.

I don’t see anything in the linked resolvelib discussion suggesting that they are waiting for “buy-in from pip maintainers”. But if it helps, I’m happy to say we should use the method if it’s implemented in resolvelib.
This isn’t about naming, but I don’t know what it means for a name to be “unsatisfied”. I assume the resolvelib PR will be updated to document the meaning of this term before it’s approved.

notatallshaw commented 7 months ago

I don’t see anything in the linked resolvelib discussion suggesting that they are waiting for “buy-in from pip maintainers”. But if it helps, I’m happy to say we should use the method if it’s implemented in resolvelib.

Pip is the main driver of the resolvelib. And this tagently related issue (different problem, possibly same solution), has been waiting on feedback from Pip maintainers: https://github.com/sarugaku/resolvelib/issues/134. So I don't see how this would be any different.

This isn’t about naming, but I don’t know what it means for a name to be “unsatisfied”. I assume the resolvelib PR will be updated to document the meaning of this term before it’s approved.

I would gladly take another suggestion on a name, it's simply a reference to the existing variable unsatisfied_names: https://github.com/sarugaku/resolvelib/blob/main/src/resolvelib/resolvers.py#L422 Which is the name of all dependencies which do not yet have a satisfying solution.

Happy to fill out the docstring to explain that in more detail.

pfmoore commented 7 months ago

And this tagently related issue (different problem, possibly same solution), has been waiting on feedback from Pip maintainers

I added a comment on that issue.

As I said, it’s not about names. What I want is a user-facing explanation of the term “unsatisfied”. Something that will let future pip maintainers review and maintain this function without needing to read and understand the resolvelib code.

I have to say, I’m struggling to understand why this is so hard to get across. The whole point of having a separate library for the resolver is so that pip only needs to interact with a clear, documented API. I’m not being obtuse here -I genuinely do not know what it means for a name to be “unsatisfied”.

Which is the name of all dependencies which do not yet have a satisfying solution.

So, what? A provider implementation’s filter_unsatisfied_names method will be called during the resolve by resolvelib, and it’s passed a list of names for which the algorithm hasn’t yet found a solution. The provider should return a filtered copy of this list. How? What is the client expected to achieve? What happens, for example, if I just return the first name? That’s filtered, isn’t it? If the client “gets it wrong” will the result be an incorrect solution?

The most critical thing I want to understand is what the risk is, if a future pip maintainer makes a change that impacts this function (either directly or indirectly) and gets it wrong. And following on from that, what information is available to minimise the risk that they get it wrong in the first place.

notatallshaw commented 7 months ago

I added a comment on that issue.

Appreciate it!

I have to say, I’m struggling to understand why this is so hard to get across. The whole point of having a separate library for the resolver is so that pip only needs to interact with a clear, documented API. I’m not being obtuse here -I genuinely do not know what it means for a name to be “unsatisfied”.

I'm struggling to understand what you don't understand sorry, this is the same term used in the new resolver since it was made default in late 2020, and means the same thing, I picked that name because it was already a known term in the resolver codebase.

As I said, I'm more than happy if someone thinks there is a more descriptive name, I am not a fan of it, but I thought anyone who had to deal directly would resolution steps would at least understand it.

A provider implementation’s filter_unsatisfied_names method will be called during the resolve by resolvelib, and it’s passed a list of names for which the algorithm hasn’t yet found a solution. The provider should return a filtered copy of this list. How?

It doesn't need to be a copy, the provider can just return the list straight back and they get the same current behavior. I don't know what you mean by "How?", any way the provider wants to filter, or not filter, that list.

What is the client expected to achieve?

That the next step of the backtracking phase will only backtrack on those packages the client returned in the list.

What happens, for example, if I just return the first name?

Then the next step of the backtracking phase will only focus on that first name.

In real world performance for Pip it would affect the performance like so:

For complicated backtracks - Likely there will be a lot more resolution steps
In simple backtracks - Probably not too much difference, likely minor performance benefit from get_preference only being called once per resolution step, instead of upto n
In some random cases much much faster due to luck that being a good way to solve that set of requirements

That’s filtered, isn’t it?

Yup, if that's how the client wants to focus the resolve that is completely valid. Honestly there may well be real world use cases for this approach, something about the client wanting to force resolution in the exact order they provided it without resolvelib checking get_preference on each name,

That’s filtered, isn’t it? If the client “gets it wrong” will the result be an incorrect solution?

As long as the client returns a non-empty subset of the name list (including upto the list itself), and as long as there are no bugs in resolvelib, and ignoring arbitary limits like ResolutionTooDeep, then a correct solution will eventually be found.

The most critical thing I want to understand is what the risk is, if a future pip maintainer makes a change that impacts this function (either directly or indirectly) and gets it wrong. And following on from that, what information is available to minimise 1the risk that they get it wrong in the first place.

What's new is if they fail to honor returning a non-empty subset of the name list then resolvelib will throw an exception. Otherwise the pitfalls are exactly the same as get_preference, it could cause resolution to take a potentially impractical amount of time to complete.

pfmoore commented 7 months ago

I'm struggling to understand what you don't understand sorry, this is the same term used in the new resolver since it was made default in late 2020, and means the same thing, I picked that name because it was already a known term in the resolver codebase.

That's partly my fault - I was focusing on the term "unsatisfied" when my confusion is actually more with the idea of "filtering". The rest of my comment was hopefully a little clearer. But having said that, my point here is that while it may be a known term in the resolver codebase (by which I presume you mean resolvelib) it's not well-known in the pip codebase (where, in fact, "satisfied" is typically used in the sense that a requirement is satisfied by an already-installed package). I'm trying to make sure that we don't end up in a situation where understanding the internal workings (and terminology) of resolvelib is a prerequisite for working on pip's resolution code. Maybe the issue is that because you are so familiar with the details of the resolution algorithm, you're not seeing the distinction I'm trying to make?

It doesn't need to be a copy, the provider can just return the list straight back and they get the same current behavior. I don't know what you mean by "How?", any way the provider wants to filter, or not filter, that list.

Let me be explicit then. If the list of unsatisfied names is ["foo", "bar", "baz"], how would pip know whether it's a good idea to remove "baz" from that list?

Again, this is a genuine question - I really have no idea how I'd write a useful filter_unsatisfied_names method. It's similar to the problem with get_preference[^1], but worse, because at least with "preference" there's an intuitive idea of what it means to "prefer" looking at a particular candidate.

As long as the client returns a non-empty subset of the name list (including upto the list itself), and as long as there are no bugs in resolvelib, and ignoring arbitary limits like ResolutionTooDeep, then a correct solution will eventually be found.

Thank you. That information is important and should very definitely be made clear in the resolvelib documentation. It means that filter_unsatisfied_names is purely an optimisation, and will never change the result of the resolve() call[^2]. Until now, I wasn't aware that this was the case.

But if it's not allowed to return an empty subset, that should be explicitly noted in the documentation, as well. And I hope it will trigger an immediate exception in resolvelib, rather than just doing the wrong thing because the data is bad.

[^1]: Which is also, IMO, not explained well enough in the resolvelib docs. [^2]: Excepting, as you say, ResolutionTooDeep limits and bugs.

notatallshaw commented 7 months ago

Maybe the issue is that because you are so familiar with the details of the resolution algorithm, you're not seeing the distinction I'm trying to make?

Almost certainly this is contributing the less than ideal communication, but this discussion has been very fruitful I think, it's given me specific updates I can go back and apply to the resolvelib PR so hopefully it will be easier for Pip to maintain interaction with this API in the future.

Let me be explicit then. If the list of unsatisfied names is ["foo", "bar", "baz"], how would pip know whether it's a good idea to remove "baz" from that list?

I think the problem I have answering this question is because the genuine answer is: using a deep understanding of Python packaging, it's ecosystem, the algorithm type resolvelib employs, and SAT-algorithms.

The API I propose in the PR is almost identical to get_preference, it provides the names and a bunch of objects that reflect the current state of the resolver algorithm, and in turn those objects are themselves derived from the objects that Pip provided to resolvelib to resolve. So the answers are the same as to the question "with get_preference how would Pip know if it's a good idea to not prefer baz?".

So as examples:

You may want to prefer a pinning requirement over a non-pinning requirement
It's important to know that visiting new candidates could be highly expensive, if not cached Pip may have to download gigantic wheels or build sdists
resolvelib employs it's important to understand it's DFS during backtracking and choices must be made in that context
It's helpful to know that Python packaging specifiers are unusual and doing static analysis varies significantly from any common literature on SAT-solvers
Having an understanding of CDCL SAT helps (I find these steps and diagrams particularly intuitive: https://en.wikipedia.org/wiki/Conflict-driven_clause_learning#Example) as ideas from it can most easily be applied to resolvelib's approach

Again, this is a genuine question - I really have no idea how I'd write a useful filter_unsatisfied_names method. It's similar to the problem with get_preference1, but worse, because at least with "preference" there's an intuitive idea of what it means to "prefer" looking at a particular candidate.

Agreed the name isn't intuitively unhelpful, any suggestions would be welcome: refine_resolution? narrow_backtrack_candidates?

pfmoore commented 7 months ago

Thanks. I guess my reservation is on the pip side then, as while I see that this can help a lot with resolution issues, it’s going to be critical that pip’s definition of this method is maintainable (without needing the sort of deep knowledge you possess - the pip code base is already far too inaccessible to new contributors). Let’s wait until there is a pip PR before discussing that in detail[^1].

[^1]: But for comparison, pip’s implementation of get_preference is a good example of a clear implementation that can be understood and maintained without in-depth resolver knowledge.

As to the name, narrow_backtrack_candidates is my preference out of the ones you suggested, but it’s only a mild preference. The real key is going to be having good documentation on the resolvelib side.

notatallshaw commented 7 months ago

Let’s wait until there is a pip PR before discussing that in detail1

The PR exists, https://github.com/pypa/pip/pull/12499, I will update it and the resolvelib PR some time over the next week with the feedback from here.

I raise this PR before the resolvelib one is merged, released, and vendored exactly to iron out any issues with Pip (as the main customer of resolvelib) before I push for the resolvelib one to be merged. That PR includes 3 commits: 1) Add new API with no behavior change, 2) Move some existing optimizations over to new API to get speed ups, 3) Add new optimizations (as outlined by https://github.com/pypa/pip/issues/12498) to get massive speed up in complex backtracking cases.

pfmoore commented 7 months ago

Ah, OK. I'm afraid I don't really follow the logic in the new provider method in that PR. I'll try to take a proper look another day but my first reactions are:

The filter_unsatisfied_names method is documented in terms of "prefer backtracking on..." which is jarring to me. I'd much rather the docstring and comments be worded in terms of what the code is filtering out (and why it's doing so) and not leave it to the reader to try to work out the link from backtracking to filtering.
I can't understand the supporting functions. This is probably because they make heavy use of the PreferenceInformation data, which is in need of documentation. I'm going to hope for now that the resolvelib PR will address this, and properly review those functions in light of whatever additional documentation gets added.

Beyond that, I'll wait till I have some more time to do a proper review, rather than just a "quickly dash something off while I'm making food" disjointed ramble 🙂

notatallshaw commented 7 months ago

Yeah, based on the feedback here I'm going to take quite a bit of time updating the docstrings

notatallshaw commented 4 weeks ago

This API is now in resolvelib main, this issue can be closed once resolvelib is released and vendored.

Once it is vendored I plan to break out the 3 different optimizations I have created in https://github.com/pypa/pip/pull/12499 into their own PRs that I will post one at a time. In particular because the first one, moving prioritising backtrack causes from get_preference to narrow_requirement_selection should be relatively uncontroversial, whereas the other optimizations will require more justification.