pypa / pip

The Python package installer
https://pip.pypa.io/
MIT License
9.46k stars 3.01k forks source link

Introduce new API in resolvelib to Optimize Pip's Dependency Resolution #12497

Open notatallshaw opened 7 months ago

notatallshaw commented 7 months ago

What's the problem this feature will solve?

Pip's dependency resolution process, particularly in complex backtracking scenarios, currently faces performance issues due to repetitive calculations, leading to O(n^2) complexity. This issue arises when checking if a "name" is part of a cause but also hinders the implementation of more complex checks, as they would slow down simple backtracking scenarios.

Describe the solution you'd like

The solution proposes adding a new abstract method, filter_unsatisfied_names, in resolvelib, to be implemented by Pip. This method is designed to work alongside the existing get_preference method, refining the backtracking process during dependency resolution.

Alternative Solutions

Additional context

I am not tied to the name filter_unsatisfied_names or even to this specific approach, as long as it's possible to filter unsatisfied names more than one at a time.

A PR is open on the resolvelib side, but it is unlikely to ever be merged without buy-in from Pip maintainers: https://github.com/sarugaku/resolvelib/pull/145

Code of Conduct

pfmoore commented 7 months ago

Two points here.

  1. I don’t see anything in the linked resolvelib discussion suggesting that they are waiting for “buy-in from pip maintainers”. But if it helps, I’m happy to say we should use the method if it’s implemented in resolvelib.
  2. This isn’t about naming, but I don’t know what it means for a name to be “unsatisfied”. I assume the resolvelib PR will be updated to document the meaning of this term before it’s approved.
notatallshaw commented 7 months ago

I don’t see anything in the linked resolvelib discussion suggesting that they are waiting for “buy-in from pip maintainers”. But if it helps, I’m happy to say we should use the method if it’s implemented in resolvelib.

Pip is the main driver of the resolvelib. And this tagently related issue (different problem, possibly same solution), has been waiting on feedback from Pip maintainers: https://github.com/sarugaku/resolvelib/issues/134. So I don't see how this would be any different.

This isn’t about naming, but I don’t know what it means for a name to be “unsatisfied”. I assume the resolvelib PR will be updated to document the meaning of this term before it’s approved.

I would gladly take another suggestion on a name, it's simply a reference to the existing variable unsatisfied_names: https://github.com/sarugaku/resolvelib/blob/main/src/resolvelib/resolvers.py#L422 Which is the name of all dependencies which do not yet have a satisfying solution.

Happy to fill out the docstring to explain that in more detail.

pfmoore commented 7 months ago

And this tagently related issue (different problem, possibly same solution), has been waiting on feedback from Pip maintainers

I added a comment on that issue.

As I said, it’s not about names. What I want is a user-facing explanation of the term “unsatisfied”. Something that will let future pip maintainers review and maintain this function without needing to read and understand the resolvelib code.

I have to say, I’m struggling to understand why this is so hard to get across. The whole point of having a separate library for the resolver is so that pip only needs to interact with a clear, documented API. I’m not being obtuse here -I genuinely do not know what it means for a name to be “unsatisfied”.

Which is the name of all dependencies which do not yet have a satisfying solution.

So, what? A provider implementation’s filter_unsatisfied_names method will be called during the resolve by resolvelib, and it’s passed a list of names for which the algorithm hasn’t yet found a solution. The provider should return a filtered copy of this list. How? What is the client expected to achieve? What happens, for example, if I just return the first name? That’s filtered, isn’t it? If the client “gets it wrong” will the result be an incorrect solution?

The most critical thing I want to understand is what the risk is, if a future pip maintainer makes a change that impacts this function (either directly or indirectly) and gets it wrong. And following on from that, what information is available to minimise the risk that they get it wrong in the first place.

notatallshaw commented 7 months ago

I added a comment on that issue.

Appreciate it!

I have to say, I’m struggling to understand why this is so hard to get across. The whole point of having a separate library for the resolver is so that pip only needs to interact with a clear, documented API. I’m not being obtuse here -I genuinely do not know what it means for a name to be “unsatisfied”.

I'm struggling to understand what you don't understand sorry, this is the same term used in the new resolver since it was made default in late 2020, and means the same thing, I picked that name because it was already a known term in the resolver codebase.

As I said, I'm more than happy if someone thinks there is a more descriptive name, I am not a fan of it, but I thought anyone who had to deal directly would resolution steps would at least understand it.

A provider implementation’s filter_unsatisfied_names method will be called during the resolve by resolvelib, and it’s passed a list of names for which the algorithm hasn’t yet found a solution. The provider should return a filtered copy of this list. How?

It doesn't need to be a copy, the provider can just return the list straight back and they get the same current behavior. I don't know what you mean by "How?", any way the provider wants to filter, or not filter, that list.

What is the client expected to achieve?

That the next step of the backtracking phase will only backtrack on those packages the client returned in the list.

What happens, for example, if I just return the first name?

Then the next step of the backtracking phase will only focus on that first name.

In real world performance for Pip it would affect the performance like so:

That’s filtered, isn’t it?

Yup, if that's how the client wants to focus the resolve that is completely valid. Honestly there may well be real world use cases for this approach, something about the client wanting to force resolution in the exact order they provided it without resolvelib checking get_preference on each name,

That’s filtered, isn’t it? If the client “gets it wrong” will the result be an incorrect solution?

As long as the client returns a non-empty subset of the name list (including upto the list itself), and as long as there are no bugs in resolvelib, and ignoring arbitary limits like ResolutionTooDeep, then a correct solution will eventually be found.

The most critical thing I want to understand is what the risk is, if a future pip maintainer makes a change that impacts this function (either directly or indirectly) and gets it wrong. And following on from that, what information is available to minimise 1the risk that they get it wrong in the first place.

What's new is if they fail to honor returning a non-empty subset of the name list then resolvelib will throw an exception. Otherwise the pitfalls are exactly the same as get_preference, it could cause resolution to take a potentially impractical amount of time to complete.

pfmoore commented 7 months ago

I'm struggling to understand what you don't understand sorry, this is the same term used in the new resolver since it was made default in late 2020, and means the same thing, I picked that name because it was already a known term in the resolver codebase.

That's partly my fault - I was focusing on the term "unsatisfied" when my confusion is actually more with the idea of "filtering". The rest of my comment was hopefully a little clearer. But having said that, my point here is that while it may be a known term in the resolver codebase (by which I presume you mean resolvelib) it's not well-known in the pip codebase (where, in fact, "satisfied" is typically used in the sense that a requirement is satisfied by an already-installed package). I'm trying to make sure that we don't end up in a situation where understanding the internal workings (and terminology) of resolvelib is a prerequisite for working on pip's resolution code. Maybe the issue is that because you are so familiar with the details of the resolution algorithm, you're not seeing the distinction I'm trying to make?

It doesn't need to be a copy, the provider can just return the list straight back and they get the same current behavior. I don't know what you mean by "How?", any way the provider wants to filter, or not filter, that list.

Let me be explicit then. If the list of unsatisfied names is ["foo", "bar", "baz"], how would pip know whether it's a good idea to remove "baz" from that list?

Again, this is a genuine question - I really have no idea how I'd write a useful filter_unsatisfied_names method. It's similar to the problem with get_preference[^1], but worse, because at least with "preference" there's an intuitive idea of what it means to "prefer" looking at a particular candidate.

As long as the client returns a non-empty subset of the name list (including upto the list itself), and as long as there are no bugs in resolvelib, and ignoring arbitary limits like ResolutionTooDeep, then a correct solution will eventually be found.

Thank you. That information is important and should very definitely be made clear in the resolvelib documentation. It means that filter_unsatisfied_names is purely an optimisation, and will never change the result of the resolve() call[^2]. Until now, I wasn't aware that this was the case.

But if it's not allowed to return an empty subset, that should be explicitly noted in the documentation, as well. And I hope it will trigger an immediate exception in resolvelib, rather than just doing the wrong thing because the data is bad.

[^1]: Which is also, IMO, not explained well enough in the resolvelib docs. [^2]: Excepting, as you say, ResolutionTooDeep limits and bugs.

notatallshaw commented 7 months ago

Maybe the issue is that because you are so familiar with the details of the resolution algorithm, you're not seeing the distinction I'm trying to make?

Almost certainly this is contributing the less than ideal communication, but this discussion has been very fruitful I think, it's given me specific updates I can go back and apply to the resolvelib PR so hopefully it will be easier for Pip to maintain interaction with this API in the future.

Let me be explicit then. If the list of unsatisfied names is ["foo", "bar", "baz"], how would pip know whether it's a good idea to remove "baz" from that list?

I think the problem I have answering this question is because the genuine answer is: using a deep understanding of Python packaging, it's ecosystem, the algorithm type resolvelib employs, and SAT-algorithms.

The API I propose in the PR is almost identical to get_preference, it provides the names and a bunch of objects that reflect the current state of the resolver algorithm, and in turn those objects are themselves derived from the objects that Pip provided to resolvelib to resolve. So the answers are the same as to the question "with get_preference how would Pip know if it's a good idea to not prefer baz?".

So as examples:

Again, this is a genuine question - I really have no idea how I'd write a useful filter_unsatisfied_names method. It's similar to the problem with get_preference1, but worse, because at least with "preference" there's an intuitive idea of what it means to "prefer" looking at a particular candidate.

Agreed the name isn't intuitively unhelpful, any suggestions would be welcome: refine_resolution? narrow_backtrack_candidates?

pfmoore commented 7 months ago

Thanks. I guess my reservation is on the pip side then, as while I see that this can help a lot with resolution issues, it’s going to be critical that pip’s definition of this method is maintainable (without needing the sort of deep knowledge you possess - the pip code base is already far too inaccessible to new contributors). Let’s wait until there is a pip PR before discussing that in detail[^1].

[^1]: But for comparison, pip’s implementation of get_preference is a good example of a clear implementation that can be understood and maintained without in-depth resolver knowledge.

As to the name, narrow_backtrack_candidates is my preference out of the ones you suggested, but it’s only a mild preference. The real key is going to be having good documentation on the resolvelib side.

notatallshaw commented 7 months ago

Let’s wait until there is a pip PR before discussing that in detail1

The PR exists, https://github.com/pypa/pip/pull/12499, I will update it and the resolvelib PR some time over the next week with the feedback from here.

I raise this PR before the resolvelib one is merged, released, and vendored exactly to iron out any issues with Pip (as the main customer of resolvelib) before I push for the resolvelib one to be merged. That PR includes 3 commits: 1) Add new API with no behavior change, 2) Move some existing optimizations over to new API to get speed ups, 3) Add new optimizations (as outlined by https://github.com/pypa/pip/issues/12498) to get massive speed up in complex backtracking cases.

pfmoore commented 7 months ago

Ah, OK. I'm afraid I don't really follow the logic in the new provider method in that PR. I'll try to take a proper look another day but my first reactions are:

Beyond that, I'll wait till I have some more time to do a proper review, rather than just a "quickly dash something off while I'm making food" disjointed ramble 🙂

notatallshaw commented 7 months ago

Yeah, based on the feedback here I'm going to take quite a bit of time updating the docstrings

notatallshaw commented 4 weeks ago

This API is now in resolvelib main, this issue can be closed once resolvelib is released and vendored.

Once it is vendored I plan to break out the 3 different optimizations I have created in https://github.com/pypa/pip/pull/12499 into their own PRs that I will post one at a time. In particular because the first one, moving prioritising backtrack causes from get_preference to narrow_requirement_selection should be relatively uncontroversial, whereas the other optimizations will require more justification.