Docs: Specify the types of requirement pip supports (internal details)

What's the problem this feature will solve? At the moment, there are a lot of subtle edge cases in what pip will accept as a "requirement" that are not well documented and/or are special cased in the code. There are a lot of adhoc terms (e.g., "direct requirement" and "unnamed requirement") used in the code, and it is sometimes difficult to track down precise meanings.

Describe the solution you'd like Ideally, there should be a section in the documentation (or somewhere) that classifies all of the various "types" of thing that pip can be asked to install, and how they map to specific "requirement types" - part of the goal here should be to establish more consistent terminology, and have a central place where these terms are explained.

As noted above, terms like "direct requirement" and "unnamed requirement" would be good examples of terms that need clarification. Also, adding more clarity around how the requirement types pip handles map back to PEP 440 concepts would make it easier to reason about pip's standards compliance.

Alternative Solutions Continue to document such details in the code, and/or infer behaviour from the code.

Additional context When building the new resolver, the interface will necessarily be based around passing a set of requirements from pip to the resolver. At the moment, a pip RequirementSet contains a lot of internal specifics, which would not make sense to include in an API that could be passed to a general resolver. Ideally the resolver API could be specified purely in terms of PEP 440 concepts (specifiers of the form NAME OP VERSION and direct references, NAME @ URL), but to do that we would need a clear mapping between pip's concepts and the PEP's.

This issue is intended as a place for discussion on how to model pip's idea of requirements in a standards-compliant way.

@pradyunsg and I came up with a concrete list of requirement types here, from #6607 (comment). That was later turned into a graph here (from #6607 (comment)).

IMO if the source code is aligned with that model then it will be essentially self-documenting and internal documentation would be redundant (and could easily go out of date, being separate from the source code).

Initial "brain dump" of concepts, mapping what you can supply to pip to PEP 440.

Basic requirement (NAME OP VERSION) - maps directly.
Local filename (wheel or sdist) - maps to NAME @ URL. May require filename<->URL translation, so will likely need a decision on what is the canonical form (URLs for Windows pathnames are a bit funky). Extracting NAME from a wheel or sdist filename is well-defined (for a wheel, by spec, for a sdist, by convention) but we need to take account of when the internal metadata and the filename don't match (an error, but we should report it in a user friendly manner).
URLs - similar to local filenames, will translate to direct references. We may need to require that the NAME @ part is included (we can convert the #egg= syntax automatically) as URLs can't necessarily be assumed to have a conventional filename structure (although why not? they should point to a sdist or wheel, surely?)
Local filename (archive of a source tree) - similar to wheel or sdist, but no metadata and no convention on filename. This is likely where unnamed requirements hit us - we can't even determine the name without building.
Local directory name - similar to local filename, with the additional complication that converting a directory name to a URL will probably introduce more problems than it's worth.

Requirements may have additional properties, like "editable", or "ignore binaries/sources". Logically these are independent of requirement types, although some combinations don't make sense ("ignore sources" with a local directory for example).

@pradyunsg and I came up with a concrete list of requirement types

Nice! I'll browse that in more detail. But what I'd like to add to this (in terms of what I'm thinking about) is a mapping between pip's types and PEP 440. Specifically, for the resolver, if we can work on a PEP 440 basis:

NAME OP VERSION is the basic form for any resolver, so should be easy to manage.
NAME @ URL is basically "force install exactly this file for NAME, and deal with the consequences", which should be a fairly straightforward operation to ask a resolver to handle (modulo the usual problem that we need to build it to work out the version we end up with).

Just having two fundamental "types of thing" for a resolver to handle would be a huge simplification, and I think it's something we should at least try to aim for, at least at a high level.

IMO if the source code is aligned with that model then it will be essentially self-documenting

Well, we should include the graph in our docs somewhere, at least - if only to help out people like me who can't hold the full object graph in our heads 😉

Are we trying to simplify the parsing of the requirements or what the resolver has to deal with (or something else :smile:)?

From our previous investigation, as long as the abstraction that we presented was the same the resolver could be agnostic to how we decided to represent or split up the requirements. For example here are implementations of the different types from the previous graph, but here is the only interface that anything outside the projects package would need to see.

The net effect is that the work is spread out and isolated between the individual requirement classes. Any change can be targeted and it's easy to understand the outcome.

I'm trying to get my head around what the new resolver would "feel like", starting from the (string) input the user provides. So parsing, I guess, although less about how to actually parse stuff (which is just details) and more about what we should parse it to.

I was taking the view that we really only want two "types" of input to the resolver (mapping to the 2 types in PEP 440) - actual version specifiers that can match multiple possibilities, and "install this thing I gave you" specifiers (direct URLs, in effect) where there's nothing to resolve as such, but the resolver needs to know that it's getting that result and it can't do anything about it. Essentially the resolver is logically taking a mix of version specs and direct URLs, and returning a set that only contains direct URLs (in an abstract sense).

I was thinking of the installable objects that pip works with as downstream from the resolver, because they are way more complex (we need to distinguish stuff like VCS URLs, editable installs, etc). And build-stage objects have to have an exact version by that point, so they seem fundamentally different than version constraints.

Mostly, I'm trying to get back up to speed after all the recent refactorings that I've not been following, and getting to grips with how the resolver fits in (which is an area I looked at briefly when doing PEP 517, but mostly ignored if I could). And being me, I'm trying to see how much of the accumulated complexity is really needed, and how much we can get rid of 🙂

I'll follow up on the pointers you've given and see how that affects what I'm thinking about.

Hmm, checking up on ResolveLib, which was on my TODO list of things to read up on, it looks like it fleshes out some of these abstractions already.

I was walked through the various requirement states by @pradyunsg last week (using the same graph mentioned by @chrahunt), and we reached similar conclusions. pip allows the user to input some strange (from the resolver’s perspective) requirement specification formats, but they can all be normalised into one of the two PEP 440 variants.

In terms of backtracking or PubGrub resolvers (it seems we are moving into that direction, at least for our initial efforts), the two variants can both be further processed into a set of targets (ResolveLib and Zazo call it candidates) to choose from (for direct URLs the set would contain exactly one element).

One particular case of interest here is that when the resolver sees a package that’s specified by both a version range and a URL. IIRC the original idea from the PEP 440 authors (I may be wrong) is to treat it as a hard conflict, but real world users seem to want more heuristic around it. This remains an area we need to research more about.

re: resolver abstractions
- current resolver: don't look at it at all. It's... workable but that's about it.
- resolvelib has the abstractions that'll help think about the resolver's needs and "facilities". Do take a look at the "provider" in there, since we'd basically be implementing a "pip-specific provider".
- mixology extends that to include some more points for interaction with the resolution algorithm.
re: "unnamed requirement" -- local directories, random archives with no clear structure in their name, directly given sdists (because we don't look too closely at their filenames).
re: "direct requirement" -- specified "directly", i.e. a URL or a path. Usually unnamed, except for PEP 508's name @ URL syntax.
re: combinations of various options -- yea, they probably don't make sense but do note that pip would still allow them in some senses and doesn't spew out an error. I think it's likely we'll notice the cases like that when working on this and we should definitely plug those holes up IMO.

WARNING: The following wall of text took a not-reasonable-amount-of-time to write and I ended up sleeping at 2am because I was writing this.

So... I think I'm just going to use this comment as a dumping ground for my understanding/thoughts on this entire topic, because why not. It'll come in handy for laughing/crying while looking at later, and is likely also gonna be a good starting point for some architecture documentation work that I'm supposed to do later. More importantly, hopefully it'll be a useful starting point for further discussion.

The things that happen before dependency resolution are:

CLI, configuration et al.
- user inputs are processed.
Initial "requirement set" is created.

The dependency resolution algorithm would see only 2 things: a "requirement" or a "candidate". If it's a candidate, skip to just before (2) in the next paragraph. When it sees a requirement, it'll ask for all the potential candidates that are available, to satisfy the requirement (1). Then, it'll choose the candidate and get it's dependencies (2). Each dependency is either a requirement or a candidate. It'd then recurse further into the dependencies (3). If it sees a requirement for which it has already chosen a candidate, it'll check to see that the choice is compatible. If it isn't, BACKTRACK a choice (4). Repetitively do this, until we have a solution and we're good. If it takes too many tries, we'll give up and ask the user to try with a smaller set of packages or help the resolver (5). :)

After dependency resolution:

things are built into wheels if needed.
everything is installed.

Okay, I'll be honest, from here on, everything is named horribly. Where it's not clear, assume I'm using Terminology from resolvelib. Try to follow along and if what is written doesn't make sense, it's probably because I'm writing this after 1am. Please feel free to correct me where I'm wrong or ask for clarifications.

There's a lot to unpack here. Prior to the actual dependency resolution, we'd need to determine:

What is the requirement set that we care about in "proper resolution"?
- Surely we include already installed packages, right. That's sort of the whole point of this exercise.
What are these represented as? How do we choose for these requirements?
- There's more discussion in (1), so snipped from here.

With that out of the way, let's elaborate on each of those numbered points we have there:

(1)

This is basically the step which involves all the index / "explore" interactions. In pip's codebase, all of this is handled by PackageFinder (yay refactoring by @cjerdonek and @uranusjr).

A "direct requirement" would be a reqirement based on a URL/file -- pip install https://example.com/pip-19.3.1.tar.gz -- it'll have exactly 1 candidate. An "unnamed requirement" is when there's no unambiguous indicator of what the name of the package is -- pip install . is an example -- these are also direct requirements (but not all direct requirements are unnamed). "Direct requirements" will 100% need to be special cased, especially because of how we treat direct URLs when they're dependencies (we don't support them when they're originating from PyPI). We don't have a good way of representing these in PEP 440's model (we want <unnamed> @ URL or something basically).

The main function to care about here is PackageFinder.find_best_candidate. It returns a BestCandidateResult that has an applicable_candidates attribute which is a list of potential "candidates" that we'd want to use in our dependency resolution. The nice thing here is that it's ordered as per preference, i.e. higher versions come before lower versions¹ which means the resolution would just pick from one end of the list. In terms of pip's "provider" abstraction to the resolver, we'd basically want to wrap PackageFinder.find_best_candidate().applicable_candidates with objects that we'd pass into the resolution algorithm.

Another quirk here is that we have "upgrade strategies" in pip -- which aren't well named :) -- which means that the order of preference we get from PackageFinder isn't enough. We'd want to special case how installed versions/variants get represented and ordered. The main thing: if an installed version satisfies the requirement and the strategy allows for that, we should avoid hitting the network.

As an aside, if you're finding some of the names here confusing, it's not your fault. It's a bit of a mess (we discussed a potential way to solve a while back -- in https://github.com/pypa/pip/issues/7031) but no one has come around to cleaning it yet. Oh, and the names get completely non-intuitive after this. :)

(2)

This part, is a... to put it mildly... a mess. :)

Basically, all of this stage is handled by RequirementPreparer. The "prepare" here stands for "download" + "unpack" + "get metadata". [see detour 2]

The task to do, is to get the dependencies of a candidate. That information is in the metadata for the candidate, which means we need to download the "file" (distribution) and get information from it. Luckily, RequirementPreparer actually does a decent job of this -- see end of https://github.com/pypa/pip/issues/7317#issuecomment-552064205 -- which would give us an pip._internal.distributions.base.AbstractDistribution subclass. At this point, the "download" + "unpack" + "generate metadata" steps are done. Then, it's a matter of calling the methods of the AbstractDistribution and we're done here.

This step basically boils down to wrapping for the resolver (like the previous step) the following snippet:

# All these calls need arguments, skipping them for brevity
abstract_dist = requirement_preparer.prepare(candidate.underlying_install_requirement)
pkg_resources_distribution = abstract_dist.get_pkg_resources_distribution()
dependencies = pkg_resources_distribution.requires()

(LOL names)

That's about it, in terms of what's needed for the resolver. Time for detours now and why I call it all a mess.

[detour 1] Going a bit into the weeds, the exact steps for metadata generation internally + quirks are as follows:

The candidate is a wheel somewhere
- Download the wheel, load the metadata into memory² and use that.
- quirk: In theory, if it's on PyPI, we could use PyPI's JSON API to directly get the metadata; skipping the wheel download which can be a big benefit.
- It doesn't expose data all the time: only if the first file for release was a wheel and only metadata from that wheel.
  - Poorly done uploads / wheel builds are a problem here.
The candidate is an sdist somewhere
- The first steps are to download and unpack the sdist. Easy enough.
- Then, we go into the rabbit hole of how pip's metadata generation works.
- Let's gloss over choose "legacy" vs "pep 517" code path processes here -- things happen there, they do what they need to.
  - We're not touching that.
  - I highly recommend not thinking about it too hard.
- If we're using the "legacy" code path, then:
- invoke setup.py egg_info with specific arguments to the setup.py
  - quirk: people depend on this being called, for things like conditionally exhibiting dependencies to pip, detecting environment packages to determine what's going to be packaged and, in some cases the version of the package itself (yay setuptools_scm).
  - Basically, all kinds of weird things happen here that we don't want to touch. If it breaks, people throw the sharp pieces at us in frustration.
- pip-egg-info directory now contains the required metadata -- fetch it and use it.
- If we're using the "pep 517" code path, then:
- Setup environment, fetch dependencies, install them.
  - quirk: recursive build dependencies.
- Check if the prepare_metadata_for_build_wheel attribute exists on the backend. If yes, run it, use that metadata.
- If not, build the wheel using build_wheel and get the metadata from that wheel.
  - quirk: Do we cache this wheel? I don't know. 🤷‍♂

Also, the reason there's no code examples/references here is because there aren't clear ones in this area. It's all a bit messy and complicated. sigh At least, the entire "generate metadata" step has the code decoupled from the rest of the things. I know @chrahunt has done some really nice work around the "download" + "unpack" steps, which I'm not even gone into the details of here. That's a different detour for another day.

[detour 2] The whole point of #6607 and related refactoring was initially motivated by wanting to separate out the "get metadata" bit from the rest of prepare. While the idea cooked in my brain, it ended up evolving into the beast of "let's just rework 1/3rd of pip's code, what could go wrong" (luckily nothing has as yet). Basically, the whole point of #6607 is to start by making this less of a problem, by decoupling and clarifying what happens where around this area.

(3)

This isn't "nice" either.

Given that pip's current approach to resolution (see #988's description for a quick summary), my understanding is that we'd want to explore the user given requirements first, before proceeding into the dependencies of those candidates.

This makes the resolution algorithm's starting step a little more... quirkly to figure out. The main reason I think we should do this is to ensure that we'd trim the space with the user provided information early, which gives the user some amount of control over the resolution algorithm (there's more detail on this in (5)).

I haven't come around to figuring out how exactly we'd do this basically. zazo got stuck when I understood this problem and, well, I haven't come up with anything workable to solve this in the time I've spent thinking about this since.

(4)

This is easy from pip's abstraction's PoV, it doesn't care. Backtracking and all is the resolver's job. For the resolution algorithm however, this step is critical and is what would differentiate between different algorithms basically -- What we do when we have a conflict basically determines how good the resolver is at producing a result and error messages.

From an overall perspective, we'd want to gather enough information here to remember to not make the same mistake again, and to provide useful information to the user in case there's no solution found. What exactly that looks like, we'd have to figure out. We'd also need to figure out what exactly we mean by "good error message" here -- basically, any and all conflicts are gonna be ridiculously difficult to deal with.

(5)

We'd want to limit how many times we end up making "choices" in the resolver, to make sure that the resolution doesn't end up taking forever for most users. In cases where we do (or we exhaust the search space), we'd want to print usable error messages to the user (see (4)).

The main way that I can think of doing this is that, the resolver should allow the user to feed information to the resolver to help trim the "resolution space" that it's trying to choose a package set from. Ideally, it'd be something we can generate after a failed resolution run, and suggest how the user can feed that in as an input to allow the next run to build upon that information.

I've said this before -- bad metadata is gonna be the most painful thing to deal with here for our end users. The above mechanism would also help users trim out "invalid" choices that the resolver doesn't understand due to bad metadata.

This sort-of works around the problem of finding the "wrong minimas" -- SAT solvers have restarts for this -- which we would be particularly bad for us, because the resolvelib/zazo abstractions aren't designed with restarts in mind, and there's very little we could do about this practically. I don't know if we'd hit that in our use case so I don't know how much effort should be put into the addressing this.

Footnotes:

¹ Or it's the other way around, I forget now and this is a minor detail anyway. ² We don't unpack wheels for getting metadata anymore: #7538 (yay @chrahunt)

Okay, it's 2 am now. I should sleep. I know there's definite holes in the story above, but we can start with this and fill it up as we go. :)

I'm going to continue taking things off on a tangent here, because I'm not aware of anywhere better at the moment for "general thoughts around resolver/build/requirements", and having loads of general discussion issues seems counter-productive (unless we're super-good at having clear issue titles, and I'm not 🙁)

I was wondering whether it would be worth adding instrumentation into the build process - tracking how many distribution files we download, how many calls we make to get metadata, how many full builds we do, etc. And how long they take. This would be a lot easier to integrate into the state-based project model, but should be doable whatever model we end up with. I'm thinking that one of the big bottlenecks for pip's use case is the need to generate dependency metadata, and the fact that it's a costly operation. But we don't really know how many extra dependency generation steps the new resolver might need (as part of backtracking or whatever), and trying to understand/optimise performance would be way easier if we had data to work with.

Thoughts? Would this be useful to do, or are there enough higher-priority tasks that it's something we should defer for later, until we know if we'll need the information? (Personally, I'm of the view that it's always worth collecting information like this, so it's really just a matter of whether there are more useful things for me to spend time on at this point).

If it seems like a worthwhile idea to people, I'll split it into its own issue.

As part of the effort to understand where time is spent during install I made a log parser that maps our verbose logs into higher level events. That or something like that should be fine, I think. I can try to upload it later today.

On Thu, Jan 16, 2020, 04:52 Paul Moore notifications@github.com wrote:

I'm going to continue taking things off on a tangent here, because I'm not aware of anywhere better at the moment for "general thoughts around resolver/build/requirements", and having loads of general discussion issues seems counter-productive (unless we're super-good at having clear issue titles, and I'm not 🙁)

I was wondering whether it would be worth adding instrumentation into the build process - tracking how many distribution files we download, how many calls we make to get metadata, how many full builds we do, etc. And how long they take. This would be a lot easier to integrate into the state-based project https://github.com/pypa/pip/issues/7259#issuecomment-546269862 model, but should be doable whatever model we end up with. I'm thinking that one of the big bottlenecks for pip's use case is the need to generate dependency metadata, and the fact that it's a costly operation. But we don't really know how many extra dependency generation steps the new resolver might need (as part of backtracking or whatever), and trying to understand/optimise performance would be way easier if we had data to work with.

Thoughts? Would this be useful to do, or are there enough higher-priority tasks that it's something we should defer for later, until we know if we'll need the information? (Personally, I'm of the view that it's always worth collecting information like this, so it's really just a matter of whether there are more useful things for me to spend time on at this point).

If it seems like a worthwhile idea to people, I'll split it into its own issue.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pypa/pip/issues/7589?email_source=notifications&email_token=AARUQU56YVVHXG33FDYHPGDQ6AU5NA5CNFSM4KGAJPX2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJDOQ7Q#issuecomment-575072382, or unsubscribe https://github.com/notifications/unsubscribe-auth/AARUQU7KIANG3HJH4VVTXOLQ6AU5NANCNFSM4KGAJPXQ .

I made a log parser that maps our verbose logs into higher level events. That or something like that should be fine, I think.

That sounds interesting, I'd like to see that. But I still wonder if having a structured object with the stats which we build as we go along and then dump to a file independently of the log might be useful for analysis. The log is already pretty cluttered, and I suspect it has the usual problem of all logs, it says when things start, but not when they end, making extracting something like the duration of a build step difficult.

But if you have an example, of what you get, that would be useful.

pypa / pip

Docs: Specify the types of requirement pip supports (internal details) #7589