Open McSinyx opened 4 years ago
Some of these may be better split off into their own topics, but here's what I have:
There are several things along the path from the place in the code where we know we need to download something to actually getting it:
PipSession
and its related plugins/helpers:
MultiDomainBasicAuth
TempDirectory
It would be good to know that those are all safe to use concurrently however we're planning to do it (whether threading, multiprocessing, or async). It would also be good to know how we plan to mitigate:
Depending on the command, the file may already exist:
pip install
- wheel cache or HTTP cachepip wheel
- wheel download folder, wheel cache, or HTTP cachepip download
- download folder, wheel cache, or HTTP cacheIn all three cases, I think the wheel cache piece may be handled earlier during resolving, so we wouldn't inadvertently use a lazy wheel for it.
In all three cases, it may be the case that the file we want is already in the HTTP cache. I don't know if range requests bypass the HTTP cache or it does what we want (which would be to return the requested bytes from the file on disk). Maybe it is worth it to see if the request is cached first, and just skip the lazy wheel?
In pip wheel
and pip download
, the user may have already downloaded some of the files, and we would want to be able to use those rather than lazily downloading the metadata. Note that the logic refactored in #8685 does not account for this.
Given that fast-deps only applies to wheels, the following needs to be true:
pip download
, then the wheel file is in the download directory before pip exitspip wheel
, then the wheel file is in the download directory before pip exitspip install
, then the local_file_path
property on InstallRequirement needs to be pointing to the downloaded wheel file before we begin installation, and the wheel file can't be removed until at least after we're done with installationIf we also want to match what we currently do, then:
TempDirectory(globally_managed=True)
can help with some of the cleanup, but may not be currently safe depending on what kind of concurrency is used, so an intermediate temporary directory may be needed.
Thank you for the write up, I'll try to address each of these as time goes by. I just want to let you know that I have seen your comment.
One thing that could help in any situation is an integration test that reproduces the kind of situation that we're trying to improve. Like installing some big requirements file and being able to tell how much of that was due to download, getting metadata, or some other part of processing. Once we have something basic we can decide what other effects to try and account for, or what would look the most like a typical user's situation.
For the network parallelization piece, one approach that may sidestep a lot of the concerns mentioned above would be to:
I don't know how this will play with the HTTP cache, but in this way the "parallel" part of the code will not impact code out of our control (keyring plugins) and there are no possible edge cases for authentication prompting. Regarding performance it may be reasonable to do since a quick look shows shutil.copyfileobj
uses os.sendfile
which may release the GIL for the actual copying. I really wouldn't want to guess more about how useful that would be without a concrete test that shows what we're optimizing for.
_Originally posted by @McSinyx in https://github.com/pypa/pip/pull/8638#discussion_r464979377_
Regarding my GSoC project on download parallelization, now we have:
fast-deps
(thank you @chrahunt for the clean up at GH-8685)In the current code base, when
fast-deps
is used, RequirementPreparer.prepare_linked_requirement may skip the actual preparation (including some checks, downloading the whole wheel and moving it around) and save it for right before the new resolver returning the requirement set. Essentially the download preparation is what we want to execute in parallel, however to obtain progress data for display, it is not a good idea tomap_multithreading
the method directly.With that in mind, we may want to invoke the download before the final preparation calls (the following snippet).
https://github.com/pypa/pip/blob/ee4371c38697098b45df7503e52635a952036abc/src/pip/_internal/resolution/resolvelib/resolver.py#L162-L165
@pradyunsg and I came to a conclusion that it might be a good idea to have the path to download files stored in the InstallRequirement to be able to cleanly skip the call to unpack_url.
This comes out as the output of my discussion with @pradyunsg after a few unsuccessful attempts of mine to solve the issue. He suggested that a issue should be filed to gather opinions from other maintainers (as well interested contributors) to construct the most complete view on the matter. Please feel free to share what's in your mind!