McSinyx commented 4 years ago

I am truly sorry for not clarifying this earlier in a discussion point and ended up causing a decent amount of confusion :sweat_smile:

_Originally posted by @McSinyx in https://github.com/pypa/pip/pull/8638#discussion_r464979377_

Regarding my GSoC project on download parallelization, now we have:

Multi-threading utilities working
A list wheels that can be downloaded together under the experimental feature fast-deps (thank you @chrahunt for the clean up at GH-8685)

In the current code base, when fast-deps is used, RequirementPreparer.prepare_linked_requirement may skip the actual preparation (including some checks, downloading the whole wheel and moving it around) and save it for right before the new resolver returning the requirement set. Essentially the download preparation is what we want to execute in parallel, however to obtain progress data for display, it is not a good idea to map_multithreading the method directly.

With that in mind, we may want to invoke the download before the final preparation calls (the following snippet).

https://github.com/pypa/pip/blob/ee4371c38697098b45df7503e52635a952036abc/src/pip/_internal/resolution/resolvelib/resolver.py#L162-L165

@pradyunsg and I came to a conclusion that it might be a good idea to have the path to download files stored in the InstallRequirement to be able to cleanly skip the call to unpack_url.

This comes out as the output of my discussion with @pradyunsg after a few unsuccessful attempts of mine to solve the issue. He suggested that a issue should be filed to gather opinions from other maintainers (as well interested contributors) to construct the most complete view on the matter. Please feel free to share what's in your mind!

chrahunt commented 4 years ago

Some of these may be better split off into their own topics, but here's what I have:

Handling parallelization

There are several things along the path from the place in the code where we know we need to download something to actually getting it:

requests, since that's what we use for HTTP
our custom PipSession and its related plugins/helpers:
1. MultiDomainBasicAuth
  1. keyring
  2. any keyring plugins installed by users
2. our HTTP cache (link)
TempDirectory

It would be good to know that those are all safe to use concurrently however we're planning to do it (whether threading, multiprocessing, or async). It would also be good to know how we plan to mitigate:

anything else sneaking into that list
changes to the things on the list making them not safe

Using existing files where possible

Depending on the command, the file may already exist:

pip install - wheel cache or HTTP cache
pip wheel - wheel download folder, wheel cache, or HTTP cache
pip download - download folder, wheel cache, or HTTP cache

In all three cases, I think the wheel cache piece may be handled earlier during resolving, so we wouldn't inadvertently use a lazy wheel for it.

In all three cases, it may be the case that the file we want is already in the HTTP cache. I don't know if range requests bypass the HTTP cache or it does what we want (which would be to return the requested bytes from the file on disk). Maybe it is worth it to see if the request is cached first, and just skip the lazy wheel?

In pip wheel and pip download, the user may have already downloaded some of the files, and we would want to be able to use those rather than lazily downloading the metadata. Note that the logic refactored in #8685 does not account for this.

Desired outcome of any download process

Given that fast-deps only applies to wheels, the following needs to be true:

if running pip download, then the wheel file is in the download directory before pip exits
if running pip wheel, then the wheel file is in the download directory before pip exits
if running pip install, then the local_file_path property on InstallRequirement needs to be pointing to the downloaded wheel file before we begin installation, and the wheel file can't be removed until at least after we're done with installation

If we also want to match what we currently do, then:

if the URL contained a hash, then check that against the downloaded file before using it
if any error occurs during processing, and the file was not downloaded into a user-provided download directory, then we clean up the already-downloaded files before exiting
if everything succeeds, and the file was not downloaded into a user-provided download directory, then we should remove the file when pip terminates

TempDirectory(globally_managed=True) can help with some of the cleanup, but may not be currently safe depending on what kind of concurrency is used, so an intermediate temporary directory may be needed.

McSinyx commented 4 years ago

Thank you for the write up, I'll try to address each of these as time goes by. I just want to let you know that I have seen your comment.

chrahunt commented 4 years ago

One thing that could help in any situation is an integration test that reproduces the kind of situation that we're trying to improve. Like installing some big requirements file and being able to tell how much of that was due to download, getting metadata, or some other part of processing. Once we have something basic we can decide what other effects to try and account for, or what would look the most like a typical user's situation.

For the network parallelization piece, one approach that may sidestep a lot of the concerns mentioned above would be to:

for each file to be downloaded
1. initiate the request and read the response headers
2. delegate reading the response body to a separate thread
wait for all response body downloads to complete

I don't know how this will play with the HTTP cache, but in this way the "parallel" part of the code will not impact code out of our control (keyring plugins) and there are no possible edge cases for authentication prompting. Regarding performance it may be reasonable to do since a quick look shows shutil.copyfileobj uses os.sendfile which may release the GIL for the actual copying. I really wouldn't want to guess more about how useful that would be without a concrete test that shows what we're optimizing for.

pypa / pip

[fast-deps] How should we isolate the networking code for late download in parallel? #8697

Handling parallelization

Using existing files where possible

Desired outcome of any download process