uranusjr opened 6 years ago
Heh, you went right for a meaty one huh? 😜
So conceptually, `PackageFinder` has a lot of stuff inside of it. Just to make sure we're all on the same page and to help me with my thoughts, generally `PackageFinder`:

1. Takes an `InstallRequirement` object, of which it uses the `name` attribute to get a list of all of the files that exist for the target, filtered by what files are compatible with the current system (and any control on the format that we have in place).
2. Filters that list by `InstallRequirement.specifier`.
3. Uses `InstallRequirement.satisfied_by.version` to determine if something matching this `InstallRequirement.name` is already installed, and if so what version.
4. Returns `None` to indicate that the current version is already the version that should be installed (either it's the latest version, or we're not upgrading), raises an error if it cannot find any version that satisfies, including the already installed version, OR returns the `Link` object that represents the file that we want to install.

So there's a lot to unfold here, and I think that this class does entirely too much stuff within a single interface, so my suggestion for the first bite of code here would be to extract the bit of code that actually fetches a list of files from PyPI. Effectively focusing on step 1 in my control graph up there, before going any further.
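Step 1 — fetching the list of file links for a name from a PEP 503 project page — can be sketched sans-I/O with just the stdlib. This is a rough illustration only; `Candidate`, `parse_project_page`, and the field names are made-up, not pip's or packaging's actual API:

```python
# Hypothetical sketch: given the HTML of a PEP 503 project page (fetched by
# whatever I/O layer the caller prefers), return the candidate file links.
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional


@dataclass
class Candidate:
    filename: str                    # anchor text, e.g. "pip-18.0.tar.gz"
    url: str                         # href, possibly with a #sha256=... fragment
    requires_python: Optional[str]   # PEP 503 data-requires-python attribute


class _SimplePageParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.candidates: List[Candidate] = []
        self._current: Optional[Candidate] = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            self._current = Candidate(
                filename="",
                url=attrs.get("href", ""),
                requires_python=attrs.get("data-requires-python"),
            )

    def handle_data(self, data):
        if self._current is not None:
            self._current.filename += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.candidates.append(self._current)
            self._current = None


def parse_project_page(html: str) -> List[Candidate]:
    parser = _SimplePageParser()
    parser.feed(html)
    parser.close()
    return parser.candidates
```

Because the function takes a string and returns plain data, it stays agnostic about how the page was fetched, which is the sans-I/O property under discussion.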
It just so happens, I started on this awhile back and started some proof of concept pull requests, which are https://github.com/pypa/packaging/pull/87, https://github.com/pypa/packaging/pull/88, and https://github.com/dstufft/packaging/pull/1.
I would suggest reading through the threads and implementation there, particularly https://github.com/pypa/packaging/pull/87.
When I was writing that, I had a few basic ideas in mind:

- Control over where a particular project's files may come from (e.g. `foo` is my internal project, so I want to blacklist it from fetching from PyPI, in case someone registers that name there).
- A sans-I/O design, possibly using something like Twisted's `Deferred` class, which can be used without needing an event loop. If we go with it, we'll want to get `Deferred` split out of Twisted.

I want to stress that the APIs I wrote in those PRs are not set in stone in any way; they were rough proof of concepts from when I started looking at doing exactly this previously. I would start by looking over those PRs and figuring out if you feel strongly about any of the possible approaches I outlined, or if there is another approach that you think is better. I would go through the discussion and try to answer any questions that were still left open, as well as getting a PR set up to pip that replaces the parts of `PackageFinder` that would be replaced by that class; you might find that the API ends up awkward once you start actually trying to integrate it.
That doesn't get you all of the way to replacing `PackageFinder`, but it does replace a significant chunk of it, and PEP 503 already exists so there's no need to write a PEP for that bit of code. It's also code that is definitely going to be required even if a real resolver interface is added (parts of steps 2, 3, and 4 are likely things that would be part of a hypothetical resolver interface, but due to pip's design are baked into this), so it represents something that we know for sure we're going to need into the future, and isn't something we need just as a side effect of pip's implementation details.
Of course, if there's another section that you're more interested in, it's possible that you could start somewhere else here, but a PEP 503 interface seems like the best first step, since the PEP is already written, there's some proof of concept code available already, and we know it's something we're going to be needing regardless of other internal changes inside of pip.
Thoughts?
Wow that is a lot :p I’ll probably reply to each piece separately (it’s late here), but some fragmented thoughts:

- I know PEP 503 says the page is HTML5, so it’s most instinctive to use html5lib to parse, but I wonder if `html.parser.HTMLParser` is enough. This is a very minor problem though.

> Wow that is a lot :p
Nobody ever accused me of being terse!
> The sans-I/O idea is interesting, but come to think of it, this is probably the way to go. I am not sure if I like the interface design, but I don’t have enough experience designing this kind of thing to judge, so I guess I’ll need to experiment and find out.
I'm not married to any of the specific interfaces in any of the PRs, they were proof of concepts, so if you feel like picking one of them up and running with it, you should feel free to adjust the API. I was mostly trying to figure out an idea of what the shape of the API could possibly be, not a final actual API.
> Regarding the HTML parser, I’m wondering what the absolute minimal requirement is in that regard. I know PEP 503 says the page is HTML5, so it’s most instinctive to use html5lib to parse, but I wonder if `html.parser.HTMLParser` is enough. This is a very minor problem though.
I dunno, the html5lib folks have always been pretty good at working with us, and it's only.. 0.5MB of space so I don't think it's super worth trying to optimize it.
> The broken-out implementation probably can’t expect to take `InstallRequirement`. If dealing strictly with only the first two steps, would packaging’s requirement object be enough? I’d guess it is, since only the (canonical) name and the specifier are needed.
Yea I think so. My implementation doesn't even handle step 2, because in my mind the resolver would handle that, so it only took the package's name as a `str`. However, for handling step 2, I think that the `Requirement` object from packaging is good enough.
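As an illustration of that division of labor, packaging's `Requirement` already carries the two pieces the first two steps need: a name to fetch with, and a specifier to filter by. (The fetched version list below is made up for the example; everything else is packaging's real public API.)

```python
from packaging.requirements import Requirement
from packaging.utils import canonicalize_name
from packaging.version import Version

req = Requirement("Pip>=18.0")

# Step 1 would fetch files for this name from the repository:
name = canonicalize_name(req.name)  # "pip"

# Step 2 filters the fetched versions by the specifier
# (made-up version list for illustration):
fetched = [Version("9.0.3"), Version("18.0"), Version("18.1")]
compatible = [v for v in fetched if v in req.specifier]
```

Keeping step 2 out of the fetcher, as suggested above, just means the resolver runs the last line itself.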
> I don’t really feel it is necessary to replace the whole PackageFinder class. The end result only needs to deal with the problem of finding a list of installable things, and deciding the preference between files. Whether an upgrade should happen should be decided by the code using it instead. This seems to be how the PRs are designed as well, but I am not sure whether it’s because they are WIPs.
Yea, the `PackageFinder` class is a beast of a class that does way too much stuff, so reducing it into distinct pieces that some glue code can wire together would be great.
> I wonder if it would be beneficial to design the repository to return a list of versions, each version containing a list of files, instead of returning a flat list of them. It “makes sense” to do this, intuitively, but does it really in the context of Python packaging?
One of my comments in the PR (or maybe it was a code comment) was asking basically the same thing :)
> The finder would contain three public functions:
@uranusjr You only listed two things?
@cjerdonek Oops :p The third one is `add_dependency_links`. I moved the description to its own paragraph because it probably shouldn’t be copied, but forgot to note this in the associated text.
We should skip dependency links support here: I expect them to be removed pretty soon, and it's not standards-backed anyway.
Other than that the two functions as described in the OP sound reasonable to me - it's the exact same change that I wanted to make within pip itself eventually.
After some thought, I think a flat list of packages is the better design. It is easier to filter, and actually easier to deal with, since there is no guarantee any given version has a compatible link for a given environment. It would be cumbersome if the user needs to deal with “empty” versions.
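The trade-off reads roughly like this in code: with a flat list, compatibility filtering is one pass, while a version-grouped shape can leave behind the "empty" versions mentioned above. (Illustrative sketch with made-up data; neither shape is an actual API.)

```python
from itertools import groupby

# (version, filename, compatible_with_this_environment) -- made-up data
candidates = [
    ("18.0", "pkg-18.0-py3-none-any.whl", True),
    ("18.0", "pkg-18.0-cp27-win32.whl", False),
    ("18.1", "pkg-18.1-cp27-win32.whl", False),  # no compatible file at all
]

# Flat list: one filter, no empty leftovers.
flat = [c for c in candidates if c[2]]

# Grouped by version: version "18.1" becomes an empty group that the
# caller must remember to discard.
grouped = {
    version: [c for c in items if c[2]]
    for version, items in groupby(candidates, key=lambda c: c[0])
}
```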
Also, is there a concrete (formal?) definition of how a find-link target (HTML page or files in a directory) should work? pip seems to contain only a very minimal description, and I don’t have much experience to know whether there are “features” unknown to me. Maybe I should write one first if none exists.
It's basically as simple as it sounds there -- it either looks for links in the HTML file passed, or checks the directory passed for archives. These get treated the same way as items that would come from an index.
Internally, a combined list is created of the "file locations" and "url locations" from the index + find-link, before actually looking at contents of those.
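The splitting step described above might look roughly like this. This is a hedged sketch of the idea, not pip's actual `_sort_locations` code; the function name and rules here are assumptions:

```python
# Classify configured locations into local file paths vs. URLs before any
# fetching happens, mirroring the "file locations" / "url locations" split.
import os
from urllib.parse import urlparse


def sort_locations(locations):
    files, urls = [], []
    for location in locations:
        parsed = urlparse(location)
        if parsed.scheme == "file":
            # a file: URL to a directory or archive is treated as local
            files.append(parsed.path)
        elif parsed.scheme in ("http", "https"):
            urls.append(location)
        elif os.path.exists(location):
            # a bare path that exists on disk is a file location
            files.append(location)
        else:
            urls.append(location)
    return files, urls
```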
What I want to clarify is how much assumptions I can make from it. For an HTML file, can I assume the link’s text is the filename? Would the target be named the same, or would I need to potentially rename it after download? What does “local” mean, do Windows UNC paths count? There are a few hidden possibilities that could need clarifying if the reader is not already well-versed in Python packaging.
> For an HTML file, can I assume the link’s text is the filename? Would the target be named the same, or would I need to potentially rename it after download?
My understanding is that the files should be compliant with PEP 503's format.
> What does “local” mean, do Windows UNC paths count?
I'm not sure about that.
I tried to write up some concrete rules.
A “find-link source” is a value passed to pip’s `--find-links` option. It may be one of the following text values:

- A path to a local directory or file.
- A `file:` URL to a directory, or a URL[1] to an HTML page or distribution file.

[1]: pip seems to use requests, and supports `http:`, `https:`, and `file:`. Is this correct? See https://github.com/pypa/pip/issues/5800#issuecomment-424011426
The following sections define how each type of find-link source should be handled.
If a find-link source points to a local directory, the directory may contain Python packages. Each package should be a valid Python distribution file with an appropriate file name; files with invalid names are ignored. Distribution files from multiple projects may be put in the same directory.
If a file specified by a find-link source has an extension that looks like a Python package, i.e. .zip, .whl, .tar.gz, .tgz, or .tar, the file is treated as a Python package. Such a file should be a valid Python distribution file with an appropriate file name; files with invalid names are ignored.
An HTML file specified by a find-link source must be of valid HTML 5, following the same rules of individual project pages outlined in the Simple Repository API specification, PEP 503. The anchor tags are parsed with the same rules.
The HTML file does not need to be served under a PEP 503-compatible URL. Distribution files from multiple projects may be listed in the same HTML file. The text of the anchor tag is used to identify which project a distribution file belongs to.
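The rules above can be sketched concretely: classify an entry by extension, and match an anchor's text to a project via PEP 503-style name normalization. The helper names (`looks_like_distribution`, `belongs_to_project`) are made up for illustration; the normalization regex is the one PEP 503 specifies:

```python
import re

# Extension list mirrors the one in the text above.
ARCHIVE_EXTENSIONS = (".whl", ".tar.gz", ".tgz", ".tar", ".zip")


def looks_like_distribution(filename: str) -> bool:
    return filename.lower().endswith(ARCHIVE_EXTENSIONS)


def canonicalize(name: str) -> str:
    # PEP 503 name normalization: runs of -, _, . become a single "-".
    return re.sub(r"[-_.]+", "-", name).lower()


def belongs_to_project(anchor_text: str, project: str) -> bool:
    # Wheel and sdist filenames both begin with a "name-version" segment,
    # so a normalized "<name>-" prefix identifies the project.
    prefix = canonicalize(project) + "-"
    return canonicalize(anchor_text).startswith(prefix)
```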
FIXME— What kind of URLs are acceptable. file:, http:, https: are obvious. Any others?
What does pip's code do -- or is this FIXME from pip's code base?
I added the FIXME to remind myself to check. I did some tracing, and it seems pip passes it directly to requests (except `file:`), so I think it supports whatever requests supports[1], plus `file:`.
[1]: pip extends requests’ schemes with `VcsSupport`, but `HTMLPage` has a check to specifically exclude VCS links.
Further digging into pip’s implementation, I found that if pip gets a directory in `--find-links`, it seems to look for an `index.html` inside it, but this doesn’t seem to be documented?
I don't think that's called from the branch handling find-links. They merely get processed through a `Link` to `_package_versions()`.
I’ve started a new repo with some POC implementation: https://github.com/uranusjr/packaging-repositories
There are still a lot of things to be sorted out, but I am very interested in any feedback on the (intentionally very minimal) API, especially whether this would really work as sans-I/O. I really lack experience dealing with this kind of design.
> My implementation doesn't even handle step 2, because in my mind the resolver would handle that so it only took the package's name as a `str`.
I feel this is preferable. Having the resolver be able to keep track of which package versions it trimmed due to version specifiers is potentially helpful information. In case we're going to end up filtering the candidates based on just the version specifiers at the end, it doesn't have any extra cost functionally while keeping the option for using the extra information in the resolver open in the future.
I am working on extracting things from PackageFinder and HTMLPage right now, and happen to find a bug :p
This yields an incorrect result if `search_name` is `None`:
```python
>>> egg_info_matches('pip-18.0', None)
'-18.0'
>>> egg_info_matches('pip-18.0', 'pip')
'18.0'
```
The bug is never hit because pip never passes `None`. The function is used only (AFAICT) when parsing links in the HTML page, but when pip does this, it always knows what it is looking for, and just passes that as the name to search.
Nothing meaningful, I just want to make a note about this discovery. There seems to be quite some dead code in `index.py`.
Thanks, @uranusjr! Can you file an issue each time you notice something like that? For the general last thing you mentioned, it can even be called something like “index.py contains dead code” with a description of what you noticed. Do you have any interest in fixing issues you find? Lastly, when you say “extract,” will you be doing that within the pip code base, like I suggested in my last email to the distutils SIG? That would ensure that your final version is a reference implementation empirically compatible / working with pip. Then getting pip to use a library would simply be a matter of copying that code to the independent library.
Yes, I intend to do this on the pip code base. My plan is something like
I’m still at a very early stage of step 1, taking notes and trying to decide how things should be split. Once I have an idea about how to split things, I’ll open a WIP pull request so the work will be more public and easier to view. I would likely include cleanups to that dead code, but will probably wait until I open the WIP PR before opening issues for them, to avoid being distracted by them. But I am keeping notes about them, and they will be identified.
One suggestion I have on your proposal is that rather than aiming for / thinking of it as a single WIP pull request, I think it would be a lot more productive to approach it as a series of smaller stand-alone pull requests, each of which makes an improvement. You can file PRs as you come across things rather than having to wait. That way it will be easier for people to review, more incremental, etc. And the code base will become easier to understand as you go. Also, I imagine many of the changes you have in mind would be good in any case, even if the end result isn't yet finalized in your mind.
@cjerdonek Agreed. Also, smaller PRs stand less chance of clashing with other work. I'm particularly conscious of this as at the moment the PEP 517 work is a large-ish PR that is likely to cause (and suffer) merge conflicts - it's quite distracting having the additional pressure of this over the general need to get the work completed - so small, self-contained PRs is definitely my preference and recommendation.
I see. As mentioned, I intend to keep the interface of PackageFinder intact, so my changes would be pretty much self-contained. I will keep this in mind and take extra care to make sure things don’t leak, and try to make things into smaller pieces if I could. Thanks for the advice :)
The `pip-tools` project may also benefit from splitting out `PackageFinder`, as they are currently using the internal API: https://github.com/jazzband/pip-tools/issues/853
I have been working a lot on improving the structure of that module, making it more testable, adding new features (yanked files, preferring hash matches, etc), decoupling things, etc, FYI.
I think we're basically done w.r.t. decoupling / refactoring this module.
What's the problem this feature will solve?
This is a continuation of the discussion on distutils-sig; Donald suggested I should open an issue here to outline the intention. I’ll try to summarise previous discussions, and provide some extra context on why this is desired.
The idea came out of Pipenv’s need to build a resolver independent from pip. Pipenv tries to generate a platform-independent “lock” (dependency graph) for the user-specified requirements. This means that it cannot use pip’s high-level CLI to find dependencies directly (because it filters out packages not matching markers). On the other hand, pip provides a variety of ways to “find” packages, including index servers, find-links, dependency links, VCS, etc. Since Pipenv is expected to be highly compatible with pip, it is likely a good idea for it to have an implementation matching pip’s.
Describe the solution you'd like
The minimal requirement would be to document (standardise) pip’s current behaviour, and create an implementation (maybe largely copy from pip) based on that behaviour. The behaviour would be described as a PEP (or a part of one), and implemented in a standalone project. Whether pip will adopt the implementation to replace its current one is not relevant; it can freely decide either way.
Conceptually, I am splitting the current PackageFinder into two layers:
I plan to first focus on the first part, and work on the second after the first part is settled.
The finder would contain three public functions:

For the requirement specification, `packaging.requirements.Requirement` would be used to interface. An additional class would be introduced for the installation candidate, modelled after `pip._internal.models.candidate.InstallationCandidate`.

Behaviourally, the finder would contain all (or almost all) functionalities in PackageFinder. One particular exception would be `add_dependency_links`. There is a particular FIXME that could be cleaned up, but would it be worth it? #4187 mentions it is still planned to be removed in pip 19, and maybe it should be omitted from the standardised behaviour altogether.

I am most definitely sure there are a lot of caveats that need addressing (hence the scare quotes around “standardised”). Please raise any relevant topics.
A list of resolutions, updated so I (and others reading) don’t need to go through the whole thread: