
A `pip resolve` command to convert to transitive == requirements very fast by scanning wheels for static dependency info (WORKING PROTOTYPE!) #7819

Closed: cosmicexplorer closed this issue 1 year ago

cosmicexplorer commented 4 years ago

Please let me know if it would be more convenient to provide this issue in another form such as a google doc or something!

What's the problem this feature will solve?

At Twitter, we are trying to enable the creation of self-bootstrapping "ipex" files, executable zip files of Python code which can resolve 3rdparty requirements when first run. This approach greatly reduces the time to build, upload, and deploy compared to a typical PEX file, which contains all of its dependencies in a single monolithic zip archive created at pex build time. The implementation of "ipex" in pantsbuild/pants#8793 (more background at that link) will invoke pex at runtime, which will itself invoke a pip subprocess (since pex version 2) to resolve these 3rdparty dependencies. #7729 is a separate performance fix to enable this runtime resolve approach.

Because ipex files do not contain their 3rdparty requirements at build time, it's not necessary to run the entirety of pip download or pip install. Instead, in pantsbuild/pants#8793, pants will take all of the requirements provided by the user (which may include requirements with inequalities, or no version constraints at all), then convert to a list of transitive == requirements. This ensures that the ipex file will resolve the same requirements at build time and run time, even if the index changes in between.

Describe the solution you'd like

A pip resolve command with similar syntax to pip download, which instead writes a list of == requirement strings, each with a single download URL, to stdout, corresponding to the transitive dependencies of the input requirements. These download URLs correspond to every file that would have been downloaded by pip download.
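
To make the proposed interface concrete, here is a purely hypothetical transcript (the package versions, URLs, and exact output format are invented for illustration and would be up for discussion):

$ pip resolve requests
requests==2.23.0 https://files.pythonhosted.org/packages/.../requests-2.23.0-py2.py3-none-any.whl
chardet==3.0.4 https://files.pythonhosted.org/packages/.../chardet-3.0.4-py2.py3-none-any.whl
idna==2.9 https://files.pythonhosted.org/packages/.../idna-2.9-py2.py3-none-any.whl
urllib3==1.25.9 https://files.pythonhosted.org/packages/.../urllib3-1.25.9-py2.py3-none-any.whl
certifi==2020.4.5.1 https://files.pythonhosted.org/packages/.../certifi-2020.4.5.1-py2.py3-none-any.whl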

pants would be able to invoke pip resolve as a distinct phase of generating an ipex file. pex would likely not be needed to intermediate this resolve command -- we could just execute pip resolve directly as a subprocess from within pants. The pants v2 engine makes process executions individually cacheable, and transparently executable on a remote cluster via the Bazel Remote Execution API, so pants users would then be able to generate these "dehydrated" ipex files at extremely low latency if the pip resolve command can be made performant enough.

Alternative Solutions / Prototype Implementation

As described above, pantsbuild/pants#8793 is able to create ipex files already, by simply using pip download via pex to extract the transitive == requirements. The utility of a separate pip resolve command, if any, would lie in whether it can achieve the same end goal of extracting transitive == requirements, but with significantly greater performance.

In a pip branch I have implemented a prototype pip resolve command which achieves an immediate ~2x speedup vs pip download on the first run, then levels out to roughly 800ms on every run afterwards.

This performance is achieved with two techniques:

  1. Extracting the contents of the METADATA file from a url for a wheel without actually downloading the wheel at all.
    • _hacky_extract_sub_reqs() (see https://github.com/cosmicexplorer/pip/blob/a60a3977e929cfaed6d64b0c9e3713d7c502e51e/src/pip/_internal/resolution/legacy/resolver.py#L550-L552) will:
      a. send a HEAD request to get the length of the zip file,
      b. perform several successive GET requests to extract the relative location of the METADATA file,
      c. extract the DEFLATE-compressed METADATA file and INFLATE it, and
      d. parse all Requires-Dist lines in METADATA for requirement strings.
      (A condensed sketch of this technique appears right after this list.)
    • This is surprisingly reliable, and extremely fast! This makes pip resolve tensorflow==1.14 take 15 seconds, compared to 24 seconds for pip download tensorflow==1.14.
    • A URL to a non-wheel file is processed the normal way -- by downloading the file, then preparing it into a dist.
  2. Caching the result of each self._resolve_one() call in a persistent json file.
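
To make technique 1 concrete, here is a condensed, hedged sketch of the idea -- not the prototype's actual _hacky_extract_sub_reqs() -- assuming the requests library and an index server that honors HTTP Range requests, and trading efficiency for brevity by issuing one range request per read:

import io
import zipfile

import requests


class HTTPRangeFile(io.RawIOBase):
    """A read-only, seekable file backed by HTTP Range requests."""

    def __init__(self, url, session=None):
        self._url = url
        self._session = session or requests.Session()
        # A HEAD request tells us the total size of the remote archive
        # (assumes the server reports Content-Length on HEAD).
        head = self._session.head(url, allow_redirects=True)
        head.raise_for_status()
        self._length = int(head.headers["Content-Length"])
        self._pos = 0

    def seekable(self):
        return True

    def readable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._length + offset
        return self._pos

    def read(self, size=-1):
        if size < 0:
            size = self._length - self._pos
        if size <= 0 or self._pos >= self._length:
            return b""
        end = min(self._pos + size, self._length) - 1
        # Fetch only the byte window zipfile asked for (range is inclusive).
        resp = self._session.get(
            self._url, headers={"Range": "bytes=%d-%d" % (self._pos, end)})
        resp.raise_for_status()
        if resp.status_code != 206:
            raise OSError("server ignored the Range request")
        self._pos = end + 1
        return resp.content


def remote_wheel_metadata(url):
    """Read a remote wheel's METADATA without downloading the whole wheel."""
    # zipfile seeks straight to the central directory at the end of the
    # archive, so only the byte ranges it actually touches get transferred.
    with zipfile.ZipFile(HTTPRangeFile(url)) as zf:
        name = next(n for n in zf.namelist()
                    if n.endswith(".dist-info/METADATA"))
        return zf.read(name).decode("utf-8")

Since METADATA uses RFC 822 syntax, the Requires-Dist lines in the returned text can then be pulled out with the standard library's email.parser module.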

Additional context

This pip resolve command as described above (with the resolve cache) would possibly be able to resolve this long-standing TODO about separating dependency resolution from preparation, without requiring any separate infrastructure changes on PyPI's part: https://github.com/pypa/pip/blob/f2fbc7da81d3c8a2732d9072219029379ba3bad5/src/pip/_internal/resolution/legacy/resolver.py#L158-L160
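
The resolve cache mentioned above (technique 2) amounts to a persistent memo table. A minimal sketch, with a made-up file location and function names (the prototype's real cache is keyed more carefully than by URL alone):

import json
import os

CACHE_PATH = os.path.expanduser("~/.cache/pip/resolve-deps.json")  # hypothetical location

def load_cache():
    try:
        with open(CACHE_PATH) as f:
            return json.load(f)
    except (OSError, ValueError):
        return {}

def save_cache(cache):
    os.makedirs(os.path.dirname(CACHE_PATH), exist_ok=True)
    with open(CACHE_PATH, "w") as f:
        json.dump(cache, f)

def sub_requirements(url, cache, compute):
    # `compute` stands in for the expensive step (e.g. fetching METADATA);
    # its result is memoized across pip invocations.
    if url not in cache:
        cache[url] = compute(url)
    return cache[url]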

I have only discussed the single "ipex" motivating use case here, but I want to make it clear that I am making this issue because I believe a pip resolve command would be generally useful to all pip users. I didn't implement it in the prototype above, but I believe that after the pip resolve command stabilizes and any inconsistencies between it and pip download are worked out, it would likely be possible to make pip download consume the output of pip resolve directly. That would allow removing the if self.quickly_parse_sub_requirements conditionals added to resolver.py, and would (probably) also improve pip download performance by waiting to download every wheel file in parallel after resolving URLs for them with pip resolve!

For that reason, I think a pip resolve command which can quickly resolve URLs for requirements before downloading them is likely to be a useful feature for all pip users.

I am extremely open to designing/implementing whatever changes pip contributors might desire in order for this change to go in, and I would also fully understand if this use case is something pip isn't able to support right now.

pfmoore commented 4 years ago

I haven't read this proposal in detail yet, but are you aware that there's ongoing work to implement a new resolver in pip - #6536?

We should certainly look at the ideas here and consider how they integrate with the new resolver work (I am personally particularly interested in the idea of getting wheel metadata without downloading the whole wheel).

Ping @pradyunsg @uranusjr - we should pick this up tomorrow in our meeting and feed back to @cosmicexplorer from that.

cosmicexplorer commented 4 years ago

Thank you so much for the prompt reply!! I was not aware of that (or had forgotten) and will read up on #6536 now! I will see if I can refactor this issue into a comment on #6536!

(I am personally particularly interested in the idea of getting wheel metadata without downloading the whole wheel)

:D :D I didn't expect that to work out at all, it was super surprising! I would definitely love to help clarify how this part works and will look to see how to fit it into the context of #6536!

pradyunsg commented 4 years ago

@cosmicexplorer this is cool stuff! I'm definitely interested in taking a look at this and piggy-backing off of this work to possibly integrate it with the new resolver! @pfmoore discussing this in the syncup sounds like a great idea to me!

Also, kudos @cosmicexplorer (and anyone else who worked on this) for actually working on this, with the not-so-great abstractions that we had in pip's internals back when y'all started this work. Maybe we can discuss a bit about how it was to deal with them if we get time to chat. :)

cosmicexplorer commented 4 years ago

Also, kudos @cosmicexplorer (and anyone else who worked on this) for actually working on this, with the not-so-great abstractions that we had in pip's internals back when y'all started this work. Maybe we can discuss a bit about how it was to deal with them if we get time to chat. :)

Just me for now! Would love to! The thing I had to iterate several times on (off and on since last August) was just hammering out a way to make some requirements get processed without calling .prepare_linked_requirement() beforehand. I chose to mutate e.g. the new .force_eager_download field of InstallRequirement instances in this prototype to avoid having to modify too much of the control flow in legacy/resolver.py. This usage of mutation then forced me to create an "immutable" RequirementConcreteUrl cache key class so that it could be persisted, separate from any mutable InstallRequirement instance within a single pip invocation.

I feel like having download occur as part of the resolve process by itself made it specifically more difficult to modify the implementation for performance without changing the behavior of the resolution process in subtle ways. I had experimented with adding multithreading earlier, but had to comment out a very large amount of code in many separate files to avoid failing assertions (e.g. only editable requirements can have an existing source_dir, or something) before even getting it to run once on a small example.

So I think splitting out a pip resolve command, which would produce some programmatic output that pip download can then consume directly, even without any of the performance enhancements described above, might be an interesting way to structure the new resolver API alone. And having something like RequirementConcreteUrl used to represent "an == requirement from a specific URL" with a specific type, instead of having that concept only expressed indirectly via mutable state in InstallRequirement instances, seems like it might also contribute to more effectively reasoning about pip resolves in the abstract, which feels like fertile ground for further performance enhancements down the road.

EDIT: Hmm, in reading #5051 I'm realizing that the RequirementConcreteUrl I've used in this prototype should probably be something like InstallationCandidate instead.
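
For illustration, the kind of dedicated immutable type being discussed might look like the sketch below; the name and fields are invented here, and both the prototype's RequirementConcreteUrl and pip's InstallationCandidate differ in detail:

from dataclasses import dataclass

@dataclass(frozen=True)
class PinnedRequirementUrl:
    """An == requirement pinned to a concrete download URL.

    Because instances are frozen (hashable and immutable), they can serve
    as persistent cache keys, unlike a mutable InstallRequirement.
    """
    project_name: str
    version: str
    url: str

    def as_requirement_string(self):
        return "%s==%s" % (self.project_name, self.version)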

uranusjr commented 4 years ago

@cosmicexplorer Would you be interested in taking a look at #7799, which is trying to add a new resolver to pip? I’m hitting problems trying to “prepare” multiple versions of a package. The approach I’m taking now is to “clone” the InstallRequirement and prepare the cloned instance. It seems like we’re doing a lot of the same things, and I’d be super grateful if we could exchange some insights in this area.

pradyunsg commented 4 years ago

I've spent some time understanding the work done here, and here are my notes / brain-dump of my understanding. @cosmicexplorer Definitely feel free to correct me in case I understood something incorrectly here. :)

(coming back up here, to flag that my tone in the rest of the comment may not come across as 100% positive but I really am super glad that someone has actually worked on this stuff!)


The pip resolve command proposed here is trying to solve the same problem as one of the oldest open issues in pip -- show what version would be installed by a pip install run -- https://github.com/pypa/pip/issues/53. There have been similar requests in the past too, but no one actually showed up with an implementation -- e.g. https://github.com/pypa/pip/issues/6340, https://github.com/pypa/pip/issues/6430 (and more that I really don't want to spend the time searching for). This specific feature has been on my personal radar for a long time as a "do after the new resolver rollout" task.

@cosmicexplorer a question regarding the implementation strategy taken -- any reason you didn't add a flag to pip install and add a "do a print and exit" after https://github.com/pypa/pip/blob/60d640276a96547d8ff78725edbd36b96f60ace9/src/pip/_internal/commands/install.py#L380-L382

Did you create a new command, to avoid "reading" from the dependency cache in pip install and friends?

As far as I can tell, this work is strictly unrelated to bringing in a new resolver and changing the resolution behavior/result (which is the work I've been chipping away at since 2017 and which, recently, @pfmoore, @uranusjr and I have been working on implementing as part of the donor-funded work).


Beyond that, there has been a whole bunch of work done around speeding up the "get the metadata" step -- the most "expensive" step in our dependency resolution process -- through strategies that we have discussed in the past, like (1) dependency information caching and (2) metadata from partial downloads. The fact that this stuff has actually been implemented and is working is great, since I expect we can reuse/adopt these pretty easily -- which is awesome! We should have dedicated discussions for both of them separately, and I'm 100% on board for "bringing them in".

The most recent comments mentioning "we should do X" for these:


There are other, wider improvements affecting tooling beyond pip that haven't been tried as part of @cosmicexplorer's work (which is fair, but I want an excuse to list them in one place, so...): exposing reliable dependency metadata directly from sdists, directly from package indexes, and without requiring complete PEP 517 builds / wheel metadata preparation[1].

[1]: we'd need to add a new optional "build-system" hook -- to directly get dependency information from a source tree -- it was present in initial drafts of PEP 517, but subsequently removed as YAGNI, despite my protests that it's needed by a proper resolver. 😞


Overall, everything implemented here is useful, though we'd need to think a bit about the rollout and implications of these changes (like we would for any "broad" changes to pip) before merging them in. We should also definitely consider whether we want to bundle parts of this work (especially the metadata speedups) with the new resolver rollout.

pradyunsg commented 4 years ago

Relabeled, since this isn't about changing dependency resolution results, but exposing the results more directly in the CLI and bringing in speedups to "get metadata".

dholth commented 4 years ago

I'm delighted that you are taking advantage of this wheel feature.

Did you know you can do a range request for the last n bytes of a file, without knowing its length?

I wrote this in 2012. It is a seekable file backed by http and a sparse on-disk file. You can pass it to ZipFile and download exactly the files you want to download plus the zip index. It will require some updating and removing of debug features. https://github.com/dholth/httpfile/blob/master/httpfile.py

Wheel also suggests that the metadata (the .dist-info directory) be at the end of the .zip archive. I'm not sure if everyone follows that suggestion.
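
A tiny illustration of the suffix-range trick mentioned above (the URL is a placeholder, and the server must support Range requests):

import requests

url = "https://example.com/path/to/some-wheel.whl"  # placeholder
# "bytes=-8000" asks for the final 8000 bytes; no prior HEAD request or
# Content-Length lookup is needed.
resp = requests.get(url, headers={"Range": "bytes=-8000"})
assert resp.status_code == 206  # 206 Partial Content: the range was honored
tail = resp.content  # the last (up to) 8000 bytes of the file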

cosmicexplorer commented 4 years ago

Did you know you can do a range request for the last n bytes of a file, without knowing its length?

I did not! Thank you!!!

I wrote this in 2012. It is a seekable file backed by http and a sparse on-disk file. You can pass it to ZipFile and download exactly the files you want to download plus the zip index. It will require some updating and removing of debug features. https://github.com/dholth/httpfile/blob/master/httpfile.py

It would be really really nice to be able to use this instead of the _hackily_extract_sub_requirements() method in the prototype branch at https://github.com/pypa/pip/compare/master...cosmicexplorer:requirement-dependencies-cache?expand=1!! I would really like to help make this process less hacky and your work here seems really relevant!! I'll check it out!!!

Wheel also suggests that the metadata (the .dist-info directory) be at the end of the .zip archive. I'm not sure if everyone follows that suggestion

Currently, this approach will fetch the last 2000 (arbitrary) bytes of a zip file, then read the central directory header from them to locate the METADATA file. So this technique relies on the central directory being at the end of the file (required by the zip format, I believe) to find where METADATA is, then it makes a range request for just those bytes. So it doesn't require that METADATA itself be at the end of the archive.

EDIT: See section 4.3 of https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT regarding the central directory header!
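
As a hedged sketch of what "read the central directory header" involves (field offsets per APPNOTE.TXT section 4.3.16; production code would also need to handle zip64 archives):

import struct

EOCD_SIGNATURE = b"PK\x05\x06"

def central_directory_location(tail):
    """Given the trailing bytes of a zip, return (cd_size, cd_offset)."""
    # The end-of-central-directory record sits at the very end of the
    # archive, followed only by a variable-length comment, so scan
    # backwards for its signature.
    idx = tail.rfind(EOCD_SIGNATURE)
    if idx == -1:
        raise ValueError("EOCD not in this tail; fetch more bytes")
    # Bytes 12-15 after the signature's start: central directory size;
    # bytes 16-19: its offset from the start of the archive.
    cd_size, cd_offset = struct.unpack_from("<II", tail, idx + 12)
    return cd_size, cd_offset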

dholth commented 4 years ago

Try fetching the last 8k of the file to start. It will get you the central directory and METADATA more than half the time, based on looking at some of the most popular wheels.

#!/usr/bin/env python
# Survey locally downloaded wheels: for each one, report how many bytes
# from the end of the file you would need to fetch to capture both the
# METADATA member and the central directory.
import glob
import os.path
import zipfile

for wheel in sorted(glob.glob('*.whl'), key=lambda x: os.path.getsize(x)):
    wf = zipfile.ZipFile(wheel)
    # METADATA conventionally sits near the end, so search backwards.
    metadata = next(info for info in reversed(wf.infolist())
                    if info.filename.endswith('/METADATA'))
    offset = metadata.header_offset
    size = os.path.getsize(wheel)
    # Columns: bytes from EOF to METADATA, filename, METADATA offset, and
    # bytes from EOF to the start of the central directory (wf.start_dir
    # is the central directory's offset within the archive).
    print("%6d %s offset: %d cd: %d" % (size - offset, wheel, offset, size - wf.start_dir))

cosmicexplorer commented 4 years ago

That’s a fantastic optimization that I also hadn’t considered ^_^!!!! Hey — would you accept it if I added this functionality (adapting your code above) as a PR against your httpfile repo? It would be a first draft — I’m very good at iterating on changes in response to feedback. I could try to make a separate PR with some attempt at the updates you mentioned in your first message as well.

If I have major issues for some reason with httpfile, I’ll try to open an issue. I think it sounds like a great idea to use your existing work here, though.

dholth commented 4 years ago

I'd be glad to accept pull requests against httpfile to develop it as a useful package instead of just a proof of concept

pradyunsg commented 4 years ago

@pfmoore @uranusjr I think it'd be handy to add a card to our project board to explore this if we can. This can be a significant speedup due to less time spent hitting the network.

pradyunsg commented 4 years ago

#53 just got significant housekeeping done to it. That's now the appropriate tracking issue for this functionality.

cosmicexplorer commented 4 years ago

That's now the appropriate tracking issue for this functionality.

Awesome!! :D super glad this is being picked up!!! Can this issue be closed, then? Or are there remaining elements that might be useful to split out?

cosmicexplorer commented 4 years ago

(I'd also be interested in productionizing any of the above, if nobody else has planned to work on all of it yet!) I guess I can probably just look at the github project for this!

pradyunsg commented 4 years ago

(I'd also be interested in productionizing any of the above, if nobody else has planned to work on all of it yet!)

Thanks for offering!

I think that these metadata-access-related optimizations (partial wheel download for metadata and a dependency cache) would be super useful and really beneficial, especially in a post-proper-resolver world where pip would actually need the metadata of a package more often than it'd try to install it (right now, it's a 1:1 ratio -- if we download it, we'll install it -- see #988).

It would be great if you could explore implementing these changes in the "domain" of the new resolver, since the old one is going to go away pretty soon. If you're able to spend some time to explore how such optimizations might work with the new resolver, that would be pretty great! I promise the new resolver's model will fit the human brain more easily than the older one's. :)

To set expectations early, I do believe that these optimizations are unrelated to actually rolling out that new resolver, and we should not complicate that rollout by gaining these capabilities prior to it -- the newer, stricter resolver would likely be disruptive enough on its own, and adding more changes like this would make the rollout more... complicated. We'd likely be merging such enhancements to pip master only after the new resolver is rolled out, later this year. (see https://wiki.python.org/psf/Pip2020DonorFundedRoadmap for more information about the deadlines/plan for that rollout)

ofek commented 4 years ago

Hello everyone! Soon I will once again go back to developing hatch, which was blocked by the resolver (see https://github.com/ofek/hatch/issues/47#issuecomment-633145063)

However, as I've been out of the loop for a while and seeing now that the resolver work does not include a "resolve" command, I have a question:

How do we expect the resolver to be actually used? Of course we won't get erroneous version conflict errors anymore, which is great for the average end user!

Though, what about production scenarios? In such cases (at pretty much every company) you will need to check in the full enumeration of resolved dependencies as a requirement, for a few reasons:

  1. "reproducible" builds - haven't checked the implementation yet, but this may be ameliorated by adding a flag to resolve to the lowest possible version of transitive dependencies (still not a great solution)
  2. security audits

Here at Datadog, for example, we would not benefit from the new resolver in our apps or the Agent, and will still require pip-tools' compile functionality.

sbidoul commented 4 years ago

@ofek I personally think pip install --dry-run is the most versatile way to expose the feature (#53). See also a comment by @pradyunsg above https://github.com/pypa/pip/issues/7819#issuecomment-595807861.

cosmicexplorer commented 4 years ago

@pradyunsg:

It would be great if you could explore implementing these changes in the "domain" of the new resolver, since the old one is going to go away pretty soon.

Absolutely!!! I happened to have already rebased the resolvelib-based prototype I had made for https://github.com/pypa/pip/issues/7406#issuecomment-595080810 onto the branch I had made for this issue -- I made several incredibly simplifying assumptions in that attempt, so I am very glad to just be able to take advantage of the v2 resolver now!

Would it be more appropriate to post further in this issue, or to create a pull request with any updated versions of these ideas for the new resolver? The changes I see would be:

  1. [ ] Memoizing the results of individual steps of the resolution process.
  2. [ ] Adding a feature flag to download dependency metadata for wheel files.
  3. [ ] Adding download URLs to a --dry-run or resolve command as described in https://github.com/pypa/pip/issues/53#issuecomment-633338176.

I think these can be created as separate changes.

I promise the new resolver's model will fit the human brain more easily than the older one's. :)

Yes!! I agree so far!!

To set expectations early, I do believe that these optimizations are unrelated to actually rolling out that new resolver and we should not complicate that rollout by gaining these capabilities prior to the rollout

That sounds just fine!!! We should be able to experiment on our end inside Twitter and hopefully have the ability to test these ideas at some scale before then, which should hopefully improve the quality of the eventual pip contribution.

@ofek:

Hello everyone! Soon I will once again go back to developing hatch which was blocked by the resolver (see ofek/hatch#47 (comment))

I was not familiar with hatch before -- awesome project! I'm not sure, but I think your message would be specifically useful for the discussion in #53, and I don't want it to get lost in this thread!! :D :D

pfmoore commented 4 years ago

Would it be more appropriate to post further in this issue, or to create a pull request with any updated versions of these ideas for the new resolver? The changes I see would be:

  • Memoizing the results of individual steps of the resolution process.
  • Adding a feature flag to download dependency metadata for wheel files.
  • Adding download URLs to a --dry-run or resolve command as described in #53 (comment).

I think these can be created as separate changes.

Personally, I think this sounds like a great idea. However, I do want to reiterate the caution @pradyunsg noted above. The new resolver is currently being actively developed, and we won't be able to focus on your changes until at least a beta version is available in the next version of pip (so at the earliest, mid-July). Your changes are, from the sound of it, fairly well-contained, but I'd still anticipate that you may need to rebase quite frequently to track the development work, and you may not get a lot of feedback initially. But please don't be discouraged by this - it's just an unfortunate consequence of the timing. Do feel free to ping me (or any of the other devs) if you have anything you would like specific feedback on.

I'm really enthusiastic about this work, so I'm very grateful to you for your efforts here.

pradyunsg commented 4 years ago

Would it be more appropriate to post further in this issue, or to create a pull request with any updated versions of these ideas for the new resolver? The changes I see would be:

I suggest making separate tracking issues for each of these changes, and closing this one in favor of those + #53. :)

cosmicexplorer commented 4 years ago

That all sounds great!! I've been able to use the weekend to get about half of the changes rebased on top of the new resolver, and it's already a lot less code and should be easier to review. A couple more iterations of that should make it ready for review after the new resolver has been out in the wild a bit!

Thanks so much everyone!

I suggest making separate tracking issues for each of these changes, and closing this one in favor of those + #53. :)

Will do!

ofek commented 4 years ago

@cosmicexplorer Thank you very much for doing this 😄

cosmicexplorer commented 4 years ago

I'm super excited about all the new resolver changes, and I already love love love how responsive the pip maintainers have been to the issues we (Twitter) have brought up and been able to fix (e.g. #7729)! The response to this kind of meandering issue in this thread was super positive and I really appreciate how proactive everyone has been about pushing this work into a direction that aligns with the new resolver!

For anyone else watching, I strongly recommend making an issue like this if you have any cool proposed changes to pip (or any internal workarounds you'd like to open-source) -- the pip folks are fantastic to work with and very upfront about timelines.

McSinyx commented 4 years ago

Per my (private) discussion with @pradyunsg, it was agreed that it might be a good idea to try partial wheel downloads for resolution regardless of the operation, i.e. https://github.com/pypa/pip/issues/7406#issuecomment-582696120. The goal is to speed up the resolution process, and thus installation as a whole, which is the aim of my GSoC project. At this point, we figured that we can

  1. Download the wheels partially for resolution
  2. Finish the downloads when resolution is done to proceed on installation

@cosmicexplorer, I think my experiments in the next few days will overlap with your branch print-download-urls a bit. I'm not sure why I'm saying it, maybe because I might need some review from you later on.

ofek commented 4 years ago

@McSinyx That sounds excellent!

ofek commented 4 years ago

@McSinyx Any progress?

McSinyx commented 4 years ago

@ofek, thanks for the reminder. I can't get it to work any more, although I'm pretty sure I got it working last weekend:

Simply appending Range: bytes=-8000 to the request header in pip._internal.network.download makes the resolution process lightning fast. Of course this breaks the installation but I am confident that it is not difficult to implement this optimization cleanly.

IIRC the broken part was when a dependency is pinned by hash (it happened to one of the 3D libraries I use for testing that day), but I can no longer reproduce it.

My confidence :smile: led me to refactor the networking code with the change (McSinyx/pip#1), just to realize that pip fails even before dependencies are fetched. I'll spend some time figuring out whether the hack stopped working with some new commits to master, or whether my mind was malfunctioning that day.

What I'm experimenting with ATM is to make the hack work directly for pip {install,download,wheel} (any sub-command that needs to resolve deps), and out of curiosity I wonder if that would also benefit hatch as much as a dry run would.

ofek commented 4 years ago

@McSinyx I started re-writing Hatch 2 weeks ago, hopefully I'll have a beta + docs site by end of July.

Yes, my plan is to never write a custom resolver and simply use what pip provides.

I'm confident you can get the range-header approach working, because that is what I experimented with 2 years ago. I never committed my branch because I knew pip would eventually get its own resolver, and someone such as yourself would implement partial downloads for better performance 😉

dholth commented 4 years ago

Let me tell you about sparse files. file.truncate(length) can be used to make a file bigger: if you start with an empty file, it will appear to have length bytes, all zero, but it will not take up space on disk until you write actual data. Then you can write your -8000 bytes in the correct place and fill in the rest later.
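
A small illustration of the truncate() approach (the path and sizes are made up):

tail = b"\x00" * 8000  # stand-in for the range-downloaded final bytes

with open("example.whl", "wb") as f:  # hypothetical destination
    # Extend to the full remote size without allocating blocks on disk;
    # on most filesystems this creates a sparse file of zeros.
    f.truncate(8_000_000)
    # Drop the suffix we actually fetched into its correct position.
    f.seek(-len(tail), 2)  # whence=2 seeks relative to the end
    f.write(tail)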

ofek commented 4 years ago

I was doing:

def create_null_file(path, size=1):
    # Seek just past where the file ends and write a single byte; on most
    # filesystems this yields a sparse file of `size` bytes without
    # allocating the intervening blocks.
    with open(path, 'wb') as f:
        f.seek(size - 1)
        f.write(b'\x00')

cosmicexplorer commented 4 years ago

@dholth I created a library based on your httpfile repo at https://github.com/cosmicexplorer/httpfile, see https://github.com/pypa/pip/pull/8442#issuecomment-644646873 for context. As @McSinyx suggests in https://github.com/pypa/pip/pull/8442#issuecomment-644595156, it would likely make sense to move this functionality into pip._internal.networking. Since I already took the time to create very quick unit tests, I should be able to whip up a branch or draft PR which has the "shallowly download wheel metadata" code unit tested and post it in this issue.

cosmicexplorer commented 4 years ago

I've created a branch at https://github.com/pypa/pip/compare/master...cosmicexplorer:add-httpfile-shallow-wheel-download-utils?expand=1 with utilities to shallowly download wheel metadata. There is also testing added. I would create a draft PR, but it's not quite clear to me how this will hook up to the existing code yet.

I will be looking into making an AbstractDistribution subclass which represents a distribution that was shallowly resolved from a remote wheel's METADATA file without downloading the rest of the wheel. I'll see if that will allow me to plug this into the v2 resolver instead of downloading and then separately preparing a wheel.

McSinyx commented 4 years ago

@cosmicexplorer, @dholth, the magical file-like object is not as magical as I hoped :smile: ZipFile.filelist contains broken files, and the zipfile interface doesn't seem to provide/guarantee any way to check each file. Presumably if we follow the specs we can locate the interval containing each child file, but I'm thinking about simply downloading from the bottom up and getting another chunk each time the routine using the wheel fails. Edit: I might be wrong, I made some mistakes in my code. Edit: I was wrong, and I'm happy that I was wrong.

nlhkabu commented 4 years ago

@brainwane @ei8fdb - flagging this as important from a UX perspective. See https://github.com/pypa/pip/issues/53#issuecomment-645425852 and following responses for context.

jsar3004 commented 4 years ago

Hello, I am a beginner in open source. I want to learn and contribute to pip but I can't understand the complex code. Please guide me on how to get started and what to learn if any skills are required.

pradyunsg commented 4 years ago

Let me know if people aren't ready for this yet

FWIW, I hadn't communicated w/ everyone to figure out how we would be picking up the various parts of this task and implementing things.


@cosmicexplorer #8448 is a fairly large PR and I have a very strong bias toward keeping PRs smaller and following a change-a-single-thing per PR/commit style, to make it easier to review. IIUC, there are 2 functional changes in this PR that we should break off into separate PRs:

  • partial wheel metadata download support
  • parallelization of wheel downloads ("hydration" here)

Notably, there's also a logical/semantic change in this PR -- we're no longer guaranteeing that the requirement_set returned after Resolver.resolve can be used immediately for installation (technically, we don't do that today, but that's because of sdists, not wheels).

I suggest we break that up into smaller chunks, that we tackle one-by-one:

  1. partial wheel downloads for metadata (keeping "download entire file" as part of prepare)
  2. move out "download entire file" out of prepare (and hence, Resolver.resolve)
  3. parallelize download of multiple files

Related context: https://github.com/pypa/pip/pull/8532#issuecomment-658865542

Right now, @McSinyx would be updating #8532.

I think we should probably have a couple of follow-up PRs to (a) refactor/move the logic for "download entire file" and then (b) implement the "new feature" of parallelizing those downloads (i.e. considering user-facing behavior, output, etc.). After #8532 is finalized, the main blocker on that front for #53/#7819 would be moving the "download entire file" logic out of the resolver's scope. For @McSinyx's GSoC project, the parallelization of the downloads (and the corresponding UI/UX work) would be the next big-fish task for them to work on.

Here's how I suggest we move forward on the overlap of this issue and @McSinyx's GSoC project:

Does that sound like a reasonable plan going forward @cosmicexplorer @McSinyx @pypa/pip-committers? I don't want folks stepping on each other's toes. :)

cosmicexplorer commented 4 years ago

This planning effort above is super super welcome! Thank you!

IIUC, there are 2 functional changes in this PR that we should break off into separate PRs:

  • partial wheel metadata download support
  • parallelization of wheel downloads ("hydration" here)

Correct!

Notably, there's also a logical/semantic change in this PR -- we're no longer guaranteeing that the requirement_set returned after Resolver.resolve can be used immediately for installation

Yes! You've correctly identified the main source of complexity, imho.

I suggest we break that up into smaller chunks, that we tackle one-by-one:

  1. partial wheel downloads for metadata (keeping "download entire file" as part of prepare)
  2. move out "download entire file" out of prepare (and hence, Resolver.resolve)
  3. parallelize download of multiple files

I like this plan a lot!

  • @cosmicexplorer (or @McSinyx, if @cosmicexplorer isn't available) refactors the logic for "download entire file" out of the resolver's scope.
    • once this is done, we can move on to other parts of #7819, like the dependency cache or #53 even.

:D the dependency cache and #53 would be all the remaining parts needed for a massive performance improvement to pantsbuild/pants#8793, so this would be super exciting. I will try to dive into this refactoring step you've described this week to unblock the rest.

works on figuring out how download parallelization

Definitely agree this should be broken out! It's quite possible that the implementation here is much, much slower than it could be by manually creating threads, in addition to stomping all over download progress output and making it unreadable.
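
For the parallelization step, a minimal sketch under the assumption that plain requests plus a thread pool suffice (pip's real download machinery, progress output, and hash checking are all omitted):

from concurrent.futures import ThreadPoolExecutor

import requests

def fetch(url):
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.content

def hydrate(urls, max_workers=8):
    # Downloads are network-bound, so threads are enough here; the GIL is
    # released while each worker waits on its socket.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))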

nlhkabu commented 4 years ago

Hello, I am a beginner in open source. I want to learn and contribute to pip but I can't understand the complex code. Please guide me on how to get started and what to learn if any skills are required.

@jsar3004 - If you're new to open source and want to contribute to Python packaging, I recommend starting with the Warehouse project. This is the codebase that powers pypi.org. There are tickets that are marked as good for first time contributors - start there :)

jku commented 3 years ago

See also the Warehouse issue "Expose the METADATA file of wheels in the simple API": https://github.com/pypa/warehouse/issues/8254 -- I think this would solve similar issues with fewer client-side tricks? Of course the Warehouse feature is not implemented yet...

Also, I've started working on issue #8585 "Secure PyPI downloads with signed repository metadata": the comment I have on partial downloads is that I think those essentially cannot be verified -- there is no way to know if you are being served the metadata that was originally created. In fact there's no way to know if the metadata would even match the file hash included in the index file: a distribution mirror could serve whatever metadata it wants and it would get processed... Not sure if there's a practical attack here but the possibility seems real.

cosmicexplorer commented 3 years ago

Yes! And that is exactly the solution that people at Twitter, including @kwlzn, have proposed we use to solve this. My interest in the client-side approach is that it solves the problem for other people using tensorflow at large corporations who don’t pull from PyPI. We host an Artifactory instance, and I haven’t yet delved into how easy it would be to make the modifications to support the METADATA files as in the warehouse PR.

It seems to me that both of these approaches, when shipped to production, would likely have similar performance characteristics and produce the same result. I expect the PyPI change might end up being faster in the end, but I don’t know if, for example, some file contents get cached by the web server, and until most people are using the METADATA approach, it might end up being faster to pull tensorflow’s METADATA directly from the zip for that reason.

If this becomes outclassed by the working PyPI solution, I believe it still might not be replaceable for people who for whatever reason don’t have control of where they download their wheels from (and therefore can’t get a resolve using the metadata info). I don’t know how many of these people there are.

the comment I have on partial downloads is that I think those essentially cannot be verified -- there is no way to know if you are being served the metadata that was originally created. In fact there's no way to know if the metadata would even match the file hash included in the index file: a distribution mirror could serve whatever metadata it wants and it would get processed... Not sure if there's a practical attack here but the possibility seems real.

So this is an extremely reasonable concern, and my first thought is that if we’re thinking about adding METADATA files to PyPI, those would probably have checksummed URLs too? So the more canonical warehouse approach seems like it would be beneficial for security, and that would be a great reason to retire this in favor of it once it gets going.

Separately, however, I’m not entirely sure how, if you have known checksums for wheels, you could possibly avoid eventually checking those checksums during a pip resolve. The prototype I’ve implemented (which I’ve been meaning to spend more time on recently) will use the metadata information to pull down URLs to download everything from along the way, then download all the wheels in parallel at the end, presumably checking checksums, although I need to verify that.

I am working on another approach (it works too) that modifies pip to write resolved URLs to stdout instead of actually downloading anything, and then downloads them all in one go, in parallel, when the application is first started. By not downloading the wheels and checking the checksums in the same pip invocation that gets the URLs, I can definitely see a potential avenue for exploitation. However, we still have to pull down wheels in the end, and pex just uses pip to resolve now, so it should be checking checksums in the same places where pip does.

I’m vaguely familiar with where checksum validation happens in pip, but not enough to answer confidently. I think security should generally be a huge concern when proposing massive changes to pip resolves, and it needs a little more research on my part before I can say confidently that this won’t introduce a huge issue.

EDIT: One last possible twist on this is that along with the zipfile-searching part of this PR, it also adds a cache of dependencies, keyed by the requirement download URL, serialized in a json file, and stored across pip runs. If we wanted to methodically address the checksumming issue, it's possible we could store checksums from previous downloads there. That code is hairy and needs to be replaced anyway, though, and I'm not sure how big that json file would get over time, especially if we started adding longer strings to it. It would be at best a workaround for the problem -- I believe the known attack vector of pulling a newly released version of a dependency from PyPI would reliably avoid the json cache, so it's not a solution here.
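
For reference, the verification step under discussion reduces to something like this sketch (the function name is illustrative; the expected digest would come from wherever pip obtained its #sha256=... URL fragment):

import hashlib

def verify_sha256(data, expected_hex):
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_hex:
        raise ValueError("hash mismatch: got %s, expected %s"
                         % (digest, expected_hex))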

ofek commented 2 years ago

What is the status of this?

uranusjr commented 2 years ago

I think the main thing left is a format to (optionally) output the resolution result into (PEP 665, see #10636). So let’s wait for that PEP to land first.

cosmicexplorer commented 1 year ago

This work has largely been merged, although #12186 tracks in-flight changes to fix up metadata resolves and #12184 covers how to upstream the remaining caching of metadata lookups done in the prototype described here.