Open axbycc-mark opened 3 months ago
I have retested against pip 24.0, and the round-tripping issue is fixed: they separated the header from the rest of the body, so the body is only loaded once. However, the file-copying issue persists. The .whl files are still copied to a TMP_DIR instead of the pip-tools cache.
That's likely pip's doing. When using package indexes with the metadata API, it's more efficient. Also, when the web server supports serving partial byte ranges, pip's resolver can download only the portion of a wheel that contains its metadata. pip-tools doesn't handle caching by itself.
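A minimal sketch of why partial downloads can work at all (this is an illustration of the zip layout, not pip's actual implementation): a wheel is a zip archive, and a zip keeps its index (the central directory) at the end of the file, so a client that can request byte ranges only needs the tail of the archive to locate and read the small METADATA member, never the large payload. The package name `pkg` below is made up for the demonstration.

```python
import io
import zipfile

# Build a stand-in "wheel" in memory: a zip with a large payload member
# plus a small dist-info METADATA member (real wheels are laid out this way).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("pkg/big_module.py", "x = 1\n" * 100_000)  # large payload
    zf.writestr(
        "pkg-1.0.dist-info/METADATA",
        "Metadata-Version: 2.1\nName: pkg\nVersion: 1.0\n",
    )

# The central directory sits at the end of the archive, so reading METADATA
# touches only the index plus that one small member -- the payload is never
# decompressed. A range-capable HTTP client can exploit the same layout.
with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("pkg-1.0.dist-info/METADATA")
    metadata = zf.read(info).decode()

print(metadata.splitlines()[1])  # -> Name: pkg
```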
I figured. But I decided to post here because I thought pip's expectation might be that dependency resolution never occurs without a corresponding install command. In that case, it would be pip-compile improperly calling down to pip.
It could be pip-compile's responsibility to reach into the resolver.factory._link_candidate_cache to move all the files from the TMP_DIR into the pip-compile cache directory so that subsequent runs of pip-compile would find them.
Edit: after experimenting, I'm seeing that pip does not automatically put installed .whl files into the cache directory like I thought. So pip's intended behavior is to make a new copy of the PyTorch .whl in the temp directory every time you install it from the cache.
Environment Versions
Steps to replicate
I'm running pip-compile as part of a Bazel project to build my requirements lock file. Bazel does not install the libraries into the standard location or directory structure, so the caching mechanism doesn't quite work. I believe anyone who uses pip-compile without a corresponding pip install will hit this issue.
Consider the following requirements.in file, which contains only one dependency.
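The file contents were not included in the report; given the roughly 2 GB PyTorch download described below, it presumably contains just:

```
torch
```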
In a new Python virtual environment, run the following commands
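The exact commands were omitted; a typical sequence for reproducing this in a fresh virtual environment would be (assuming pip-tools is installed into the venv):

```shell
python -m venv .venv
source .venv/bin/activate
pip install pip-tools
pip-compile requirements.in
```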
This downloads the PyTorch library, which is about 2 GB. Now run the same command again.
The command seems to hang after the following output, but it is actually still running and resumes after around 30 seconds.
Here is what is happening during the second run:

1. Pip looks for the cached wheel at SOME_CACHE_DIR/wheels/..., which fails because the wheel is not present.
2. Pip then checks the HTTP cache at SOME_CACHE_DIR/http/..., which succeeds because it was populated by the previous run of pip-compile.
3. Pip rebuilds the wheel from the cached response and writes it to TMP_DIR/pip-unpack-xxxxx/torch-xxxxx.whl.

This ends up eating 2 GB of disk space every time I run pip-compile, by writing a duplicate .whl file into TMP_DIR. It also saves no time compared to actually downloading the .whl again, because of the many round trips to disk and the time spent messagepack-serializing and deserializing the same file. I think the issue would be fixed if the .whl file were instead written to SOME_CACHE_DIR so that the caching mechanism could find it. Maybe this could be hacked in on the pip-tools side by copying any .whl files into SOME_CACHE_DIR.
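A rough sketch of that workaround, with stand-in directories and a dummy wheel file. The paths and the `copy_wheels` helper are hypothetical; pip's real wheel cache uses hashed subdirectories rather than a flat layout, so a real fix would need to match that structure.

```python
import pathlib
import shutil
import tempfile

def copy_wheels(tmp_dir: pathlib.Path, cache_dir: pathlib.Path) -> list:
    """Copy any .whl files pip unpacked into TMP_DIR over to a cache
    directory, so a later pip-compile run could find them instead of
    rebuilding them from the HTTP cache. Illustrative sketch only."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for wheel in tmp_dir.glob("pip-unpack-*/*.whl"):
        target = cache_dir / wheel.name
        if not target.exists():
            shutil.copy2(wheel, target)  # preserves timestamps
            copied.append(wheel.name)
    return copied

# Demo with made-up directories and a dummy wheel file.
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "pip-unpack-abc123").mkdir()
(tmp / "pip-unpack-abc123" / "torch-2.0.0-cp311-none-linux_x86_64.whl").write_bytes(b"dummy")
cache = pathlib.Path(tempfile.mkdtemp()) / "wheels"

result = copy_wheels(tmp, cache)
print(result)  # -> ['torch-2.0.0-cp311-none-linux_x86_64.whl']
```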
Expected result
Running pip-compile twice in a row with no requirements.in changes should be fast, and should use the cached .whl files from the first run.
Actual result
Running pip-compile twice in a row with no requirements.in changes is slow, and does not directly reuse the .whl files from the first run. Instead, it recreates them from the HTTP cache in a roundabout way and places the results into TMP_DIR, where they are not discovered by subsequent runs.