Open axbycc-mark opened 3 months ago
I have retested against pip 24.0, and the round-tripping issue is fixed: they separated the header from the rest of the body, so the body is only loaded once. However, the file-copying issue persists. The .whl files are still copied to a TMP_DIR instead of the pip-tools cache.
That's likely pip's doing. When using package indexes with the metadata API, it's more efficient. Also, when the web server supports serving partial byte ranges, pip's resolver can download only the portion of a wheel that contains its metadata. pip-tools doesn't handle caching by itself.
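A minimal sketch of why partial downloads can work at all (this is an illustration of the zip layout, not pip's actual implementation): a wheel is a zip archive, and a zip keeps its index (the central directory) at the end of the file, so a client that can request byte ranges only needs the tail of the archive to locate and read the small METADATA member, never the large payload. The package name `pkg` below is made up for the demonstration.

```python
import io
import zipfile

# Build a stand-in "wheel" in memory: a zip with a large payload member
# plus a small dist-info METADATA member (real wheels are laid out this way).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("pkg/big_module.py", "x = 1\n" * 100_000)  # large payload
    zf.writestr(
        "pkg-1.0.dist-info/METADATA",
        "Metadata-Version: 2.1\nName: pkg\nVersion: 1.0\n",
    )

# The central directory sits at the end of the archive, so reading METADATA
# touches only the index plus that one small member -- the payload is never
# decompressed. A range-capable HTTP client can exploit the same layout.
with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("pkg-1.0.dist-info/METADATA")
    metadata = zf.read(info).decode()

print(metadata.splitlines()[1])  # -> Name: pkg
```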
I figured. But I decided to post here because I thought pip's expectation might be that dependency resolution never occurs without a corresponding install command. In that case, it would be pip-compile improperly calling down to pip.
It could be pip-compile's responsibility to reach into the resolver.factory._link_candidate_cache to move all the files from the TMP_DIR into the pip-compile cache directory so that subsequent runs of pip-compile would find them.
Edit: after experimenting, I'm seeing that pip does not automatically put installed .whl files into the cache directory like I thought. So pip's intended behavior is to make a new copy of the PyTorch .whl in the temp directory every time you install it from the cache.
Environment Versions
Steps to replicate
I'm running pip-compile as part of a Bazel project to build my requirements lock file. Bazel does not install the libraries into the standard location or directory structure, so the caching mechanism doesn't quite work. I believe anyone who uses pip-compile without a corresponding pip install will hit this issue.
Consider the following requirements.in file, which contains only one dependency.
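The file contents were not included in the report; given the roughly 2 GB PyTorch download described below, it presumably contains just:

```
torch
```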
In a new Python virtual environment, run the following commands
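The exact commands were omitted; a typical sequence for reproducing this in a fresh virtual environment would be (assuming pip-tools is installed into the venv):

```shell
python -m venv .venv
source .venv/bin/activate
pip install pip-tools
pip-compile requirements.in
```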
This downloads the PyTorch library, which is about 2 GB. Now run the same command again.
The command seems to hang after the following output, but it is actually still running and resumes after around 30 seconds.
Here is what is happening during the second run:

1. Pip looks for the cached wheel at SOME_CACHE_DIR/wheels/..., which fails because the wheel is not present.
2. Pip then checks the HTTP cache at SOME_CACHE_DIR/http/..., which succeeds because it was populated by the previous run of pip-compile.
3. Pip rebuilds the wheel from the cached response and writes it to TMP_DIR/pip-unpack-xxxxx/torch-xxxxx.whl.

This ends up eating 2 GB of disk space every time I run pip-compile, by writing a duplicate .whl file into TMP_DIR. It also saves no time compared to actually downloading the .whl again, because of the many round trips to disk and the time spent messagepack-serializing and deserializing the same file. I think the issue would be fixed if the .whl file were instead written to SOME_CACHE_DIR so that the caching mechanism could find it. Maybe this could be hacked in on the pip-tools side by copying any .whl files into SOME_CACHE_DIR.
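A rough sketch of that workaround, with stand-in directories and a dummy wheel file. The paths and the `copy_wheels` helper are hypothetical; pip's real wheel cache uses hashed subdirectories rather than a flat layout, so a real fix would need to match that structure.

```python
import pathlib
import shutil
import tempfile

def copy_wheels(tmp_dir: pathlib.Path, cache_dir: pathlib.Path) -> list:
    """Copy any .whl files pip unpacked into TMP_DIR over to a cache
    directory, so a later pip-compile run could find them instead of
    rebuilding them from the HTTP cache. Illustrative sketch only."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for wheel in tmp_dir.glob("pip-unpack-*/*.whl"):
        target = cache_dir / wheel.name
        if not target.exists():
            shutil.copy2(wheel, target)  # preserves timestamps
            copied.append(wheel.name)
    return copied

# Demo with made-up directories and a dummy wheel file.
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "pip-unpack-abc123").mkdir()
(tmp / "pip-unpack-abc123" / "torch-2.0.0-cp311-none-linux_x86_64.whl").write_bytes(b"dummy")
cache = pathlib.Path(tempfile.mkdtemp()) / "wheels"

result = copy_wheels(tmp, cache)
print(result)  # -> ['torch-2.0.0-cp311-none-linux_x86_64.whl']
```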
Expected result
Running pip-compile twice in a row with no requirements.in changes should be fast, and should use the cached .whl files from the first run.
Actual result
Running pip-compile twice in a row with no requirements.in changes is slow, and does not directly reuse the .whl files from the first run. Instead, it recreates them from the HTTP cache in a roundabout way and places the results into TMP_DIR, where they are not discovered by subsequent runs.