Splitting out PackageFinder logic

uranusjr commented 6 years ago

What's the problem this feature will solve?

This is a continuation of the discussion on distutils-sig, Donald suggested I should open an issue here to outline the intention. I’ll try to summarise previous discussions, and provide some extra context why this is desired.

The idea came out of Pipenv’s need to build a resolver independent from pip. Pipenv tries to generate a platform-independent “lock” (dependency graph) for the user-specified requirements. This means that it cannot use pip’s high-level CLI to find dependencies directly (because it filters out packages not matching markers). On the other hand, pip provides a variety of ways to “find” packages, including index servers, find-links, dependency links, VCS, etc. Since Pipenv is expected to be highly compatible to pip, it is likely a good idea for it to have an implementation matching pip’s.

Describe the solution you'd like

The minimal requirement would be to document (standardise) pip’s current behaviour, and create an implementation (maybe largely copy from pip) based on that behaviour. The behaviour would be described as a PEP (or a part of one), and implemented in a standalone project. Whether pip will adopt the implementation to replace its current one is not relevant; it can freely decide either way.

Conceptually, I am splitting the current PackageFinder into two layers:

A finder that only considers Python version compatibility.
A finder (subclass of 1?) that also considers PEP 425 tags and filter out incompatible binaries.

I plan to first focus on the first part, and work on the second after the first part is settled.

The finder would contain three public functions:

Given a requirement specification, find all possible “installation candidates”.
Given a requirement specification or installation candidate, find a list of artifacts suitable for download.

For the requirement specification, packaging.requirement.Requirement would be used to interface. An additional class would be introduced for the installation candidate, modelled after pip._internal.models.candidate.InstallationCandidate.

Behaviourally, the finder would contain all (or almost all) functionalities in PackageFinder. One particular exception would be add_dependency_links. There is a particular FIXME that could be cleaned up, but would it be worth it? #4187 mentions it is still planned to be removed in pip 19, and maybe it should be omitted from the standardised behaviour altogether.

I am most definitely there are a lot of caveats needing address (“standardised”). Please raise any relevant topics.

A list of resolutions, updated so I (and others reading) don’t need to go through the whole thread:

The implementation would be sans-IO
Per-source filtering is not needed
Accept packaging’s Requirement as input
The final return value of the fetcher would be a flat list of links, not partitioned by version
Do not need to consider dependency links support

dstufft commented 6 years ago

Heh, you went right for a meaty one huh? 😜

So conceptually, PackageFinder has a lot of stuff inside of it, just to make sure we're all on the same page and to help me with my thoughts, generally PackageFinder does:

Accepts an InstallRequirement object, of which it uses the name attribute to get a list of all of the files that exist for the target, filtered by what files are compatible with the current system (and any control on the format that we have in place).
Further filters down that list of files by InstallRequirement.specifier .
Looks at InstallRequirement.satisified_by.version to determine if something matching this InstallRequirement.name is already installed, and if so what version.
Also accepts a boolean that represents if we're supposed to upgrade or not, and using that and the information described above, either returns None to indicate that the current version is already the version that should be installed (either it's the latest version, or we're not upgrading), raises an error if it cannot find any version that satisfies, including the already installed version, OR returns the Link object that represents the file that we want to install.

So there's a lot to unfold here, and I think that this class does entirely too much stuff within a single interface, so my suggestion for the first bite of code here, would be to extract out the bit of code that actually fetches a list of files from PyPI. Effectively focusing on step 1 in my control graph up there, before going any further.

It just so happens, I started on this awhile back and started some proof of concept pull requests, which are https://github.com/pypa/packaging/pull/87, https://github.com/pypa/packaging/pull/88, and https://github.com/dstufft/packaging/pull/1.

I would suggest reading through the threads and implementation there, particularly https://github.com/pypa/packaging/pull/87.

When I was writing that, I had a few basic ideas in mind:

The library should not do any I/O itself, and should not hard bake in any particular HTTP client, that should be passed into it from the caller.
Likewise, we should not require synchronous I/O in this API, but it should support it.
- All three of the PRs share this right now, however these PRs predate trio, and we'd want to support using this hypothetical API with trio as well. It's not clear to me if the Deferred or Effectfully API can be used with trio. It would be good to figure that out, because that would push me towards the "Sans i/O" implementation IMO.
We need to support multiple repositories, some of which will have more information available to us than others.
We might want to support filtering the returned list for things like what format we accept, although I'm not entirely sure that we do.
- One argument for supporting filtering is that people may want to support different filters per repository, for instance they may want to only trust sdists from PyPI, but will use wheels from their own private server, or they may want to limit the names of things they are willing to get from a particular repository (e.g., I know that foo is my internal project, so I want to blacklist it from fetching from PyPI, incase someone registers that name there).
- The primary argument against supporting filtering in this API, is that it makes more advanced things harder. For instance you couldn't provide hints that a library stopped producing wheels and thus you're on a very old version, if the calling library never actually sees anything but wheels because the filter was filtering out all of the sdists.
The Deferred version of that PR depends on twisted, not for any actual I/O handling, but simply for the Deferred class, which can be used without needing an event loop. If we go with it, we'll want to get Deferred split out of Twisted.

I want to stress that the APIs I wrote in those PRs are not set in stone in any way, they were some rough proof of concepts when I started looking at doing exactly this previously. I would start by looking over those PRs and figuring out if you feel strongly about any of the possibly approaches I outlined, or if there is another approach that you think is better. I would go through all of the discussion and try to start answering any open questions that were still left open, as well as getting a PR setup to pip that replaces the parts of PackageFinder that would be replaced by that class, you might find that the API ends up awkward once you start actually trying to integrate it.

That doesn't get you all of the way to replacing PackageFinder, but it does replace a significant chunk of it, and PEP 503 already exists so there's no need to write a PEP for that bit of code. It's also code that is definitely going to be required, even if a real resolver interface is added (part of steps 2, 3, 4 are likely things that would be part of a hypothetical resolver interface, but due to pip's design are baked into this) so it represents something that we know for sure we're going to need into the future, and isn't something we need just as a side effect of pip's implementation details.

Of course, if there's another section that you're more interested in, it's possible that you could start somewhere else here, but a PEP 503 interface seems like the best first step, since the PEP is already written, there's some proof of concept code available already, and we know it's something we're going to be needing regardless of other internal changes inside of pip.

Thoughts?

uranusjr commented 6 years ago

Wow that is a lot :p I’ll probably reply each piece separately (it’s late here), but some fragmented thoughts:

The sans-I/O idea is interesting, but come to think of it, this is probably the way to go. I am not sure if I like the interface design, but I don’t have enough experience designing this kind of things to judge, so I guess I’ll need to experiement and find out.
Regarding the HTML parser, I’m wondering what is the absolute minimal requirement in that regard. I know PEP 503 says the page is HTML5, so it’s most instinctive to use html5lib to parse, but I wonder if html.parser.HTMLParser is enough. This is a very minor problem though.
I am not very enthusiastic about per-repo filters. I feel most of the use cases can be worked around by instantiating extra repository instances for non-default use cases. While not be the most efficient (performance-wise), it is doable, and not general enough to complicate the overall design.
The broken-out implementation probably can’t expect to take InstallRequirement. If dealing only strictly the first two steps, would packaging’s requirement object be enough? I’d guess it is since only the (canonical) name and the specifier are needed.
I don’t really feel it is necessary to replace the whole PackageFinder class. The end result only needs to deal with the problem of finding a list of installable things, and decide the preference between files. Whether an upgrade should happen should be decided by the code using it instead. This seems to be how the PRs are designed to do as well, but I am not sure whether it’s because they are WIPs.
I wonder if it would be beneficial to design the repository to return a list of versions, each version containing a list of flies, instead of returning a flat list of them. It “makes sense” to do this, intuitively, but does it really in the context of Python packaging?

dstufft commented 6 years ago

Wow that is a lot :p

Nobody ever accused me of being terse!

The sans-I/O idea is interesting, but come to think of it, this is probably the way to go. I am not sure if I like the interface design, but I don’t have enough experience designing this kind of things to judge, so I guess I’ll need to experiement and find out.

I'm not married to any of the specific interfaces in any of the PRs, they were proof of concepts, so if you feel like picking one of them up and running with it, you should feel free to adjust the API. I was mostly trying to figure out an idea of what the shape of the API could possibly be, not a final actual API.

Regarding the HTML parser, I’m wondering what is the absolute minimal requirement in that regard. I know PEP 503 says the page is HTML5, so it’s most instinctive to use html5lib to parse, but I wonder if html.parser.HTMLParser is enough. This is a very minor problem though.

I dunno, the html5lib folks have always been pretty good at working with us, and it's only.. 0.5MB of space so I don't think it's super worth trying to optimize it.

The broken-out implementation probably can’t expect to take InstallRequirement. If dealing only strictly the first two steps, would packaging’s requirement object be enough? I’d guess it is since only the (canonical) name and the specifier are needed.

Yea I think so. My implementation doesn't even handle step 2, because in my mind the resolver would handle that so it only took the packages name as a str. However for handling step 2, I think that the Requirement object from packaging is good enough.

I don’t really feel it is necessary to replace the whole PackageFinder class. The end result only needs to deal with the problem of finding a list of installable things, and decide the preference between files. Whether an upgrade should happen should be decided by the code using it instead. This seems to be how the PRs are designed to do as well, but I am not sure whether it’s because they are WIPs.

Yea, the PackageFinder class is a beast of a class that does way too much stuff, so reducing it into distinct pieces that some glue code can wire together would be great.

I wonder if it would be beneficial to design the repository to return a list of versions, each version containing a list of flies, instead of returning a flat list of them. It “makes sense” to do this, intuitively, but does it really in the context of Python packaging?

One of my comments in the PR (or maybe it was a code comment) was asking basically the same thing :)

cjerdonek commented 6 years ago

The finder would contain three public functions:

@uranusjr You only listed two things?

uranusjr commented 6 years ago

@cjerdonek Oops :p The third one is add_dependency_links. I moved the description to its own paragraph because it probably shouldn’t be copied, but forgot to note this in the associating text.

pradyunsg commented 6 years ago

We should skip dependency links support here: I expect them to removed pretty soon and it's not standard backed anyways.

Other than that the two functions as described in the OP sound reasonable to me - it's the exact same change that I wanted to make within pip itself eventually.

uranusjr commented 6 years ago

After some thought, I think a flat list of packages is the better design. It is easier to filter, and actually easier to deal with, since there is no guarentee any given version has a compatible link for a given environment. It would be cumbersome if the user needs to deal with “empty” versions.

Also, is there a concrete (formal?) definition of how a find-link target (HTML page or files in a directory) should work? pip seems to contain only a very minimal description, and I don’t have much experience to know whether there are “features” unknown to me. Maybe I should write one first if none exists.

pradyunsg commented 6 years ago

It's basically as simple as it sounds there -- it either looks for links in HTML file passed or checks the directory passed for the archives. These get treated the same way as items that would come from an index.

Internally, a combined list is created of the "file locations" and "url locations" from the index + find-link, before actually looking at contents of those.

uranusjr commented 6 years ago

What I want to clarify is how much assumptions I can make from it. For an HTML file, can I assume the link’s text is the filename? Would the target be named the same, or would I need to potentially rename it after download? What does “local” mean, do Windows UNC paths count? There are a few hidden possibilities that could need clarifying if the reader is not already well-versed in Python packaging.

pradyunsg commented 6 years ago

For an HTML file, can I assume the link’s text is the filename? Would the target be named the same, or would I need to potentially rename it after download?

My understanding is that the files should be compliant with PEP 503's format.

What does “local” mean, do Windows UNC paths count?

I'm not sure about that.

uranusjr commented 6 years ago

I tried to write up some concrete rules.

Edit 2018-09-27: pip seems to recognise find-link values that point to a local archive. I have added a section to describe this behaviour.

A “find-link source” is a value passed to pip’s --find-links option. It may be one of the following text values:

A filesystem path or a file: URL to a directory.
A filesystem path or a URL to an HTML file. [1]

[1]: pip seems to use requests, and supports http:, https:, and file:. Is this correct? See https://github.com/pypa/pip/issues/5800#issuecomment-424011426

The following sections define how each type of a find-link source should be.

Local directory

If a find-link source points to a local directory, the directory may contain Python packages. Each package should be a valid Python distribution file with appropriate file names. Files with invalid names are ignored. Distribution files from multiple projects may be put in the same directory.

Archive file

If a file specified by a find-link source has an extension that looks like a Python package, i.e. .zip, .whl, .tar.gz, .tgz, or .tar, the file is treated as a Python package. Such a file should be a valid Python distribution file with appropriate file names. Files with invalid names are ignored.

HTML file

An HTML file specified by a find-link source must be of valid HTML 5, following the same rules of individual project pages outlined in the Simple Repository API specification, PEP 503. The anchor tags are parsed with the same rules.

The HTML file do not need to be served under a PEP 503-compatible URL. Distribution files from multiple projects may be listed in the same HTML file. The text of the anchor tag is used to identify which project a distribution file belongs to.

cjerdonek commented 6 years ago

FIXME— What kind of URLs are acceptable. file:, http:, https: are obvious. Any others?

What does pip's code do -- or is this FIXME from pip's code base?

uranusjr commented 6 years ago

I added the FIXME to remind myself to check. I did some tracing, and it seems pip passes it directly to requests (except file:), so I think it supports whatever requests supports[1], plus file:.

[1]: pip extends requests’s schemes with VcsSupport, but HTMLPage has a check to specifically exclude VCS links.

uranusjr commented 6 years ago

Further digging into pip’s implementation, I found that if pip gets a directory in --find-links, it seems to look for an index.html inside it, but this doesn’t seem to be documented?

pradyunsg commented 6 years ago

I don't think that's called from the branch handling find-links. They merely get processed through a Link to _package_versions().

uranusjr commented 6 years ago

I’ve started a new repo with some POC implementation: https://github.com/uranusjr/packaging-repositories

There are still a lot of things to be sorted out, but I am very interested in any feedbacks on the (intentionally very minimal) API, especially whether this would really work as sans-I/O. I really lack experience dealing with this kind of design.

pradyunsg commented 6 years ago

My implementation doesn't even handle step 2, because in my mind the resolver would handle that so it only took the packages name as a str.

I feel this is preferable. Having the resolver be able to keep track of which package versions it trimmed due to version specifiers is potentially helpful information. In case we're going to end up filtering the candidates based on just the version specifiers at the end, it doesn't have any extra cost functionally while keeping the option for using the extra information in the resolver open in the future.

uranusjr commented 6 years ago

I am working on extracting things from PackageFinder and HTMLPage right now, and happen to find a bug :p

https://github.com/pypa/pip/blob/dcc3c16b5461b45c1c4e26a3f8342560260af204/src/pip/_internal/index.py#L680-L706

This yields an incorrect result if search_name is None:

>>> egg_info_matches('pip-18.0', None)
'-18.0'
>>> egg_info_matches('pip-18.0', 'pip')
'18.0'

The bug is never hit because pip never passes None. The function is used only (AFAICT) when parsing links in the HTML page, but when pip does this, it always knows what it is looking for, and just passes that as the name to search.

Nothing meaningful, I just want to make a note about this discovery. There seems to be quite some dead code in index.py.

cjerdonek commented 6 years ago

Thanks, @uranusjr! Can you file an issue each time you notice something like that? For the general last thing you mentioned, it can even be called something like “index.py contains dead code” with a description of what you noticed. Do you have any interest in fixing issues you find? Lastly, when you say “extract,” will you be doing that within the pip code base, like I suggested in my last email to the distutils SIG? That would ensure that your final version is a reference implementation empirically compatible / working with pip. Then getting pip to use a library would simply be a matter of copying that code to the independent library.

uranusjr commented 6 years ago

Yes, I intend to do this on the pip code base. My plan is something like

Extract bits in index.py into smaller functions (still in index.py) without changing any code paths.
Try to swap out parts in index.py with the implementation in packaging_repositories, constant running unit tests to make sure things still work.
The ideal end result would keep the PackageFinder API intact, but with internals switched.
Either make packaging_repositories part of packaging, or keep it as a standalone library, so pip can vendor it.

I’m still in a very early stage of step 1, taking notes and trying to decide how things should be splitted. Once I have an idea about how to split things, I’ll open a WIP pull request so the work would be more public and easier to view. I would likely include cleanups to those dead code, but will probably wait until I open the WIP PR before opening issues for them, to avoid myself being distracted by them. But I am keeping notes about them, and they will be identified.

cjerdonek commented 6 years ago

One suggestion I have on your proposal is that rather than aiming for / thinking of it as a single WIP pull request, I think it would be a lot more productive to approach it as a series of smaller stand-alone pull requests, each of which makes an improvement. You can file PR's as you come across things rather than having to wait. That way it will be easier for people to review, more incremental, etc. And the code base will became easier to understand as you go. Also, I imagine many of the changes you have in mind would be good in any case, even if the end result isn't yet finalized in your mind.

pfmoore commented 6 years ago

@cjerdonek Agreed. Also, smaller PRs stand less chance of clashing with other work. I'm particularly conscious of this as at the moment the PEP 517 work is a large-ish PR that is likely to cause (and suffer) merge conflicts - it's quite distracting having the additional pressure of this over the general need to get the work completed - so small, self-contained PRs is definitely my preference and recommendation.

uranusjr commented 6 years ago

I see. As mentioned, I intend to keep the interface of PackageFinder intact, so my changes would be pretty much self-contained. I will keep this in mind and take extra care to make sure things don’t leak, and try to make things into smaller pieces if I could. Thanks for the advice :)

di commented 5 years ago

The pip-tools project may also benefit from splitting out PackageFinder, as they are currently using the internal API: https://github.com/jazzband/pip-tools/issues/853

cjerdonek commented 5 years ago

I have been working a lot on improving the structure of that module, making it more testable, adding new features (yanked files, preferring hash matches, etc), decoupling things, etc, FYI.

pradyunsg commented 4 years ago

I think we're basically done w.r.t. decoupling / refactoring this module.

pypa / pip