pypa / pip

The Python package installer
https://pip.pypa.io/
MIT License

Document the new resolver #10240

pfmoore closed this issue 2 years ago

pfmoore commented 3 years ago

What's the problem this feature will solve?

Lots of people are getting frustrated with long resolve times, and want pip to "work better". There's a common issue that people don't really know how the resolver works, and in particular don't appreciate why the problem is hard, so they tend to get frustrated when we can't "just solve the issue".

Describe the solution you'd like

A high-level description in the documentation (probably under "topics") of the problems the resolver faces, the unique issues that make dependency resolution for Python packages hard, and most importantly, the process we go through to find a solution.

The document doesn't need to cover everything, and absolutely shouldn't be positioned as the definitive spec, because we need to be able to change the implementation without being held to implied behaviour guarantees.

pfmoore commented 3 years ago

Assigned to myself because I'd like to take a stab at this. But my time is limited. If anyone else is interested and I've done nothing about it for a while, feel free to ping me and ask if you can take over.

pfmoore commented 3 years ago

Hmm, I must have been looking at an older checkout; it appears there's already a "Dependency Resolution" article in the "Topic Guides" section. But it's aimed at a much more "beginner" level than what I imagined (it starts with defining what dependency resolution is...).

My plan was to document much more technical details like

But I don't think that would fit with the existing article (anyone who finds the existing article useful would likely be overwhelmed by that level of detail, and anyone who wants that level of detail would be put off by the existing content). So I'm not sure where I should put such an article.

For now I'm going to dump what I write in a separate article "More Dependency Resolution". But I'm not happy about that as a final location. So suggestions appreciated... (As long as what I write remains a standalone piece - I do not want to try to merge what I write with existing content, that's something I won't have time for).

pfmoore commented 3 years ago

Do we have a policy on diagrams? I want to include some diagrams (probably using graphviz, maybe using mermaid for flowcharts). I'm currently just dumping the .dot source and a generated .png file (I might switch to .svg) in the docs directory alongside my Markdown source. Is that OK? Is there a better way?

uranusjr commented 3 years ago

IIRC there's a Sphinx plugin to embed dot syntax directly inside the doc, and the graph would be built automatically with the rest of the documentation. But that would introduce a build-time requirement on Graphviz, which is not always easy to install. I am not aware of alternative graph drawing tools.

pfmoore commented 3 years ago

Yes, that's an option (for graphviz, at least). But I didn't particularly want to add a new dependency for doc building, and I don't really mind rendering locally. Also, I rendered the current version using sketchviz, which is fun even though I'll use "proper" graphviz for the final version 🙂

If we did standardise on supporting certain sphinx plugins for diagrams etc, I'd use them, but I don't want to make that part of this PR.

I did a lot of research on embedding diagrams in Markdown/ReST a while ago, and came to the conclusion that it's a frustrating mess. There's basically nothing that's universally supported, in particular no way of keeping diagram source in the document source. So external source file, plus external image file, plus "instructions on how to render the image from the source" is the best of a bad job IMO. And don't ever bother trying to use a diagram in a GitHub comment (which was what I was originally trying to do 🙁)

(FWIW, hackmd.io supports a huge range of inline syntaxes - maths, UML diagrams, flowcharts, graphviz, all the stuff mermaid supports, even music via abc. But of course it's non-standard and doesn't work anywhere else...).

notatallshaw commented 3 years ago

FYI as a user of Pip who wants to do the right thing, I think some important sections I would like to see are:

  1. Why can Pip take so long? And why did my packages install quickly on the previous version of Pip (since the new resolution engine) but take a long time on the current version?
  2. What can I do to help Pip be faster for my dependencies? What hints can I give Pip? (e.g. do constraints help? does ordering help? if so, how?) How do I go about debugging which set of dependencies is causing the lengthy resolution time?
  3. If I truly think I've found a bug in Pip's resolution, what information do I need to give to report it?

I was thinking about point 2 earlier, and it reminded me that in large relational databases it's not uncommon to have to give hints to complex, business-driven SQL queries. The hints tell the internal optimizer about ways to speed up the query that you know of but that it can't easily figure out itself. People are happy to do this kind of thing because they accept that SQL queries can be arbitrarily hard and complex to solve. I think that if it were explained well enough that Pip faces a similar problem, and that here are the tools to help, they would start to accept it more.

pfmoore commented 3 years ago

Thanks. I hope I'm covering at least some of "why pip takes so long". I'm not planning on saying much about why things change between versions of pip - apart from the problem of keeping such information up to date, I do not want to end up in a situation where this document acts as a focal point for people complaining that we made the wrong trade-offs.

The basic message is "package resolution is fundamentally an algorithm that has to scale badly, we apply trade-offs to try to make the common cases sufficiently fast but we're never going to be able to make everything fast".

As regards what people can do to guide pip, that should be implicit in the explanation of what we do - pip can't use information it hasn't encountered yet, and I'll describe (at a high level) the order pip sees information, so "put things you know are helpful early in that order" is the basic answer. But I want this document to remain a description of what pip does, not a guide to how to use that to tune your specific case. That would be a useful document, and would be based on the information here, but I think it should be separate. Basically it should evolve over time, so the pip docs are too static a location for it, and it should be guided by community knowledge, so a community-maintained document would be better. What I will do is include a section on "implications of pip's algorithm" which makes some of these implications explicit.
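To make the points about scaling and ordering concrete, here is a toy backtracking search in Python. It is only a sketch and not pip's resolver: the package names `a`, `b` and `c`, the version numbers, and the `compatible` rule are all made up for illustration. The only conflict is that `c` works only with version 1 of `a`, and the search cannot use that fact until it reaches `c`.

```python
def compatible(name, version, other, other_version):
    """Made-up rule: 'c' only works with version 1 of 'a'; everything else is fine."""
    if {name, other} == {"a", "c"}:
        a_version = version if name == "a" else other_version
        return a_version == 1
    return True


def resolve(order, versions, pinned=None, stats=None):
    """Pin packages in the given order, newest first, backtracking on conflict."""
    pinned = {} if pinned is None else pinned
    stats = {"tried": 0} if stats is None else stats
    if len(pinned) == len(order):
        return dict(pinned), stats
    name = order[len(pinned)]  # next package that has not been pinned yet
    for version in sorted(versions[name], reverse=True):
        stats["tried"] += 1
        if all(compatible(name, version, o, ov) for o, ov in pinned.items()):
            pinned[name] = version
            result, _ = resolve(order, versions, pinned, stats)
            if result is not None:
                return result, stats
            del pinned[name]  # this choice led to a dead end: backtrack
    return None, stats


# Hypothetical index: three packages with three versions each.
versions = {name: [1, 2, 3] for name in "abc"}

# The conflicting package is seen last, then first.
late, late_stats = resolve(["a", "b", "c"], versions)
early, early_stats = resolve(["c", "a", "b"], versions)

print(late, late_stats["tried"])
print(early, early_stats["tried"])
```

Both runs end with the same pins, but in this toy the first order tries 29 candidates and the second tries 5. With more packages and more versions the gap grows quickly, which is the sense in which the search "has to scale badly" in the worst case.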

As regards reporting bugs, I'll take that into consideration. But we tend not to have a problem with people reporting actual bugs - the problem is that people report things as bugs which aren't. The fix here is to give people the information they need to know what is a bug (that's this document) and to get them to put in some effort understanding what went wrong in their case. I don't have any good answers on how to do that second one 🙁

I agree entirely about the analogy with SQL hints. It's something we would like to do (although we've not thought of it in those terms). The problem is, we've not had enough (good quality) issue reports to really work out what would be useful "tuning knobs". The biggest one so far is restricting the number of versions considered, but we have that already, in the form of constraints files (add a constraint boto >= some.version). I'm considering looking at making constraints like this apply at the finder level, so we can guarantee to apply them early, but I don't think that should actually be necessary. I don't think anyone with a "big" resolver example has ever experimented with that, though.
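As a minimal sketch of the constraints-file approach mentioned above: the snippet below writes a constraints file and builds a `pip install -c` command line without running it. The package name `examplepkg` and the lower bound `2.0` are placeholders chosen for illustration, and `build_constrained_install` is a hypothetical helper, not part of pip. The `-c` / `--constraint` option itself is pip's documented way to pass a constraints file.

```python
import sys
import tempfile
from pathlib import Path


def build_constrained_install(requirements, constraint_lines, directory):
    """Write a constraints file and return it with a pip command that uses it."""
    constraints_path = Path(directory) / "constraints.txt"
    constraints_path.write_text("\n".join(constraint_lines) + "\n", encoding="utf-8")
    command = [
        sys.executable, "-m", "pip", "install",
        "-c", str(constraints_path),  # constraints restrict versions, they do not add packages
        *requirements,
    ]
    return constraints_path, command


with tempfile.TemporaryDirectory() as tmp:
    # "examplepkg>=2.0" is a made-up constraint; substitute a bound that suits your project.
    path, command = build_constrained_install(["examplepkg"], ["examplepkg>=2.0"], tmp)
    print(path.read_text(encoding="utf-8"), end="")
    print(" ".join(command))
```

The command is only printed here. Whether a given lower bound actually shortens resolution depends on the project, so treat this as something to experiment with rather than a guaranteed speed-up.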

pradyunsg commented 3 years ago

> it appears there's already a "Dependency Resolution" article in the "Topic Guides" section. But it's aimed at a much more "beginner" level than what I imagined (it starts with defining what dependency resolution is...).

Feel free to hijack that / tack onto the end of it / whatever. :)

I only moved what @ei8fdb had written for that section, but there's no reason we can't build upon that and replace the bits we don't think are needed.

pfmoore commented 3 years ago

> Feel free to hijack that / tack onto the end of it / whatever. :)

TBH, I'd find it hard to merge the styles, and I don't feel comfortable just deleting what's there. So I'll add my stuff and let someone merge the two later if they want to.

pradyunsg commented 3 years ago

Sounds fine by me!

pfmoore commented 3 years ago

Note to myself: See https://github.com/pypa/pip/issues/10201#issuecomment-890839036, which notes that constraints are applied to the results of the finder. It's worth covering the background in this document, specifically:

We find the list of candidates once, by running the finder and applying any version constraints that are known at the time. We don't then re-fetch during the resolution process, but rather we discard what doesn't match the current requirement (which we might then backtrack through). What's "known at the time" we run the finder is basically anything from constraints files, plus the requirement that triggered the call to the finder.
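The behaviour described above can be modelled roughly as follows. This is a simplified sketch and not pip's actual code: `ToyFinder`, `matching`, the package name `examplepkg` and its version list are all hypothetical, and versions are plain tuples rather than real version objects.

```python
# Hypothetical index contents, for illustration only.
INDEX = {"examplepkg": [(1, 0), (1, 5), (2, 0), (2, 1), (3, 0)]}


class ToyFinder:
    """Fetch candidates once per package, using only what is known at that moment."""

    def __init__(self, index, constraints):
        self.index = index
        self.constraints = constraints  # name -> predicate, as if read from constraints files
        self.calls = 0                  # how many times the index was actually consulted
        self._cache = {}

    def candidates(self, name, triggering_requirement):
        if name not in self._cache:
            self.calls += 1
            constraint = self.constraints.get(name, lambda version: True)
            self._cache[name] = [
                version
                for version in self.index[name]
                if constraint(version) and triggering_requirement(version)
            ]
        return self._cache[name]  # later calls reuse the first result


def matching(candidates, requirement):
    """During resolution, discard cached candidates that fail the current requirement."""
    return [version for version in candidates if requirement(version)]


finder = ToyFinder(INDEX, {"examplepkg": lambda version: version >= (1, 5)})

# First encounter: the constraint plus the triggering requirement (< 3.0) are applied.
first = finder.candidates("examplepkg", lambda version: version < (3, 0))

# A requirement discovered later only filters the list already fetched.
later = matching(first, lambda version: version >= (2, 0))

# Asking the finder again does not re-fetch.
again = finder.candidates("examplepkg", lambda version: version >= (2, 0))

print(first, later, finder.calls)
```

In this model a constraint from a constraints file is always applied at fetch time, while a requirement met later in resolution can only narrow the list that was fetched earlier.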

(It's more complicated than that, as usual, but that's probably close enough for this level of docs).

Also add that to any "what people can do to guide pip" section.