Collecting from cache is surprisingly slow

bluenote10 commented 3 years ago

What's the problem this feature will solve?

Workflows that require a frequent re-creation of virtual envs from scratch are very slow, and a large amount of time is spend collecting wheels despite having all the artefacts in the local cache already.

To demonstrate, let's consider installing from this (pip-compiled) requirements.txt:

requirements.txt

``` # # This file is autogenerated by pip-compile with python 3.8 # To update, run: # # pip-compile --no-emit-index-url requirements.in # ansible==4.5.0 # via -r requirements.in ansible-core==2.11.4 # via ansible asgiref==3.4.1 # via django attrs==21.2.0 # via pytest backcall==0.2.0 # via ipython black==21.8b0 # via -r requirements.in bokeh==2.3.3 # via -r requirements.in brotli==1.0.9 # via flask-compress cashier==1.3 # via -r requirements.in certifi==2021.5.30 # via requests cffi==1.14.6 # via cryptography charset-normalizer==2.0.4 # via requests click==8.0.1 # via # -r requirements.in # black # flask cloudpickle==2.0.0 # via dask cryptography==3.4.8 # via ansible-core cycler==0.10.0 # via matplotlib dash==2.0.0 # via -r requirements.in dash-core-components==2.0.0 # via dash dash-html-components==2.0.0 # via dash dash-table==5.0.0 # via dash dask==2021.9.0 # via -r requirements.in decorator==5.1.0 # via ipython django==3.2.7 # via -r requirements.in docker==5.0.2 # via -r requirements.in flake8==3.9.2 # via -r requirements.in flask==2.0.1 # via # -r requirements.in # dash # flask-compress flask-compress==1.10.1 # via dash fsspec==2021.8.1 # via dask idna==3.2 # via requests importlib-resources==5.2.2 # via yachalk iniconfig==1.1.1 # via pytest ipython==7.27.0 # via -r requirements.in itsdangerous==2.0.1 # via flask jedi==0.18.0 # via ipython jinja2==3.0.1 # via # -r requirements.in # ansible-core # bokeh # flask joblib==1.0.1 # via scikit-learn kiwisolver==1.3.2 # via matplotlib locket==0.2.1 # via partd markupsafe==2.0.1 # via jinja2 matplotlib==3.4.3 # via -r requirements.in matplotlib-inline==0.1.3 # via ipython mccabe==0.6.1 # via flake8 mypy==0.910 # via -r requirements.in mypy-extensions==0.4.3 # via # black # mypy numpy==1.21.2 # via # -r requirements.in # bokeh # matplotlib # pandas # pyarrow # scikit-learn # scipy packaging==21.0 # via # ansible-core # bokeh # dask # pytest pandas==1.3.3 # via -r requirements.in parso==0.8.2 # via jedi partd==1.2.0 # via dask pathspec==0.9.0 # via black pexpect==4.8.0 # via ipython pickleshare==0.7.5 # via ipython pillow==8.3.2 # via # bokeh # matplotlib platformdirs==2.3.0 # via black plotly==5.3.1 # via dash pluggy==1.0.0 # via pytest prompt-toolkit==3.0.20 # via ipython ptyprocess==0.7.0 # via pexpect py==1.10.0 # via pytest pyarrow==5.0.0 # via -r requirements.in pycodestyle==2.7.0 # via flake8 pycparser==2.20 # via cffi pyflakes==2.3.1 # via flake8 pygments==2.10.0 # via ipython pyparsing==2.4.7 # via # matplotlib # packaging pytest==6.2.5 # via -r requirements.in python-dateutil==2.8.2 # via # bokeh # matplotlib # pandas pytz==2021.1 # via # django # pandas pyyaml==5.4.1 # via # -r requirements.in # ansible-core # bokeh # dask regex==2021.8.28 # via black requests==2.26.0 # via # -r requirements.in # docker resolvelib==0.5.5 # via ansible-core scikit-learn==0.24.2 # via sklearn scipy==1.7.1 # via # -r requirements.in # scikit-learn six==1.16.0 # via # cycler # plotly # python-dateutil sklearn==0.0 # via -r requirements.in sqlitedict==1.7.0 # via -r requirements.in sqlparse==0.4.2 # via django tenacity==8.0.1 # via plotly threadpoolctl==2.2.0 # via scikit-learn toml==0.10.2 # via # mypy # pytest tomli==1.2.1 # via black toolz==0.11.1 # via # dask # partd tornado==6.1 # via bokeh traitlets==5.1.0 # via # ipython # matplotlib-inline typing-extensions==3.10.0.2 # via # black # bokeh # mypy urllib3==1.26.6 # via requests wcwidth==0.2.5 # via prompt-toolkit websocket-client==1.2.1 # via docker werkzeug==2.0.1 # via flask yachalk==0.1.4 # via -r requirements.in zipp==3.5.0 # via importlib-resources ```

Benchmark script for reproduction:

# cleanup tmp venv if it exists
rm -rf ./tmp_venv

# re-create tmp venv
virtualenv ./tmp_venv

# activate tmp venv
. ./tmp_venv/bin/activate

# install requiremts
time pip install -r requirements.txt --no-deps -i https://pypi.python.org/simple

Starting with the second execution of the script, pip can fully rely on its local cache. But even with 100% cache hits, the time it takes to run all the Collecting ... Using cached ... operations takes ~42 seconds. The total run time is ~92 seconds, so 45% of the time is spend just collecting from the cache. This time seems excessive. The disk is a fast SSD and the total sum of data to be collected should be < 1GB. So in terms of I/O it should be possible to collect the artefacts much faster from an SSD based cache.

In terms of raw I/O loading the wheels from an SSD based cache should be in the order of few seconds. Thus, bringing down the collection time could speed up venv creation in many cases by almost a factor of 2. This could e.g. significantly speed up CI pipelines that require creation of multiple similar venvs (in fact, venv creation is becoming an increasing bottleneck in complex CI pipelines for us).

Used pip version: 21.2.4

Describe the solution you'd like

Perhaps it is possible to revisit why collecting artefacts from the cache is so slow.

Alternative Solutions

Alternative solutions probably doesn't apply in case of performance improvements.

Additional context

All information given above?

Code of Conduct

[X] I agree to follow the PSF Code of Conduct.

uranusjr commented 3 years ago

so 45% of the time is spend just collecting from the cache.

That sounds about right to me TBH. If you do --no-dep, the installation process is basically just loading cache to memory, and write files from memory to destination, so 45% cache reading, 45% writing files, plus 10% misc overhead is not unreasonable. The percentage does not really mean anything here, only absolute time does.

bluenote10 commented 3 years ago

That sounds about right to me TBH. If you do --no-dep, the installation process is basically just loading cache to memory, and write files from memory to destination

I was hoping that --no-deps is that lightweight. But looking at the absolute numbers, there seems to be a significant overhead:

I can cp -ar the entire virtual env directory in ~4 seconds (for 1.5 GB).
I can cp -ar the entire pip cache directory in ~6 seconds (for 3.4 GB).
The pip install takes ~92 seconds, which feels > 10 times slower than expected judging based on the necessary I/O.

I'm wondering if there are some low hanging fruits to reduce that overhead? Perhaps my feature request should have been rather: Introduce optimized/fast mode for fully pip-compiled requirement lists. Does that make more sense?

jbylund commented 3 years ago

@bluenote10 thanks for the clear reproduction instructions, makes it so much easier to look at something like this.

I took a profile on my machine and found that it was spending 25s on 92 calls of _get_html_response. I'm guessing this is necessary to verify that there hasn't been a newer version of a given package published & that a previous version hasn't been yanked.

On the one hand this is a good opportunity to thread the http calls, because they're all independent. But... right now criteria resolution happens on a sort of incremental basis and that will need to be moved a little.

I applied approximately this diff and tried again. Total time for commands.run down from ~95s to ~65s and time for resolver.resolve down from ~38s to ~9s. Now I'm thinking the easiest approach might be to "pre-warm" an http cache in parallel and then not change the structure here?

--- a/src/pip/_vendor/resolvelib/resolvers.py
+++ b/src/pip/_vendor/resolvelib/resolvers.py
@@ -1,5 +1,6 @@
 import collections
 import operator
+import multiprocessing.dummy

 from .providers import AbstractResolver
 from .structs import DirectedGraph, IteratorMapping, build_iter_view
@@ -336,11 +337,16 @@ class Resolution(object):

         # Initialize the root state.
         self._states = [State(mapping=collections.OrderedDict(), criteria={})]
-        for r in requirements:
+
+        def inner(r):
             try:
                 self._add_to_criteria(self.state.criteria, r, parent=None)
             except RequirementsConflicted as e:
                 raise ResolutionImpossible(e.criterion.information)
+            return r
+        with multiprocessing.dummy.Pool() as tp:
+            for r_out in tp.imap_unordered(inner, requirements):
+                pass

         # The root state is saved as a sentinel so the first ever pin can have

jbylund commented 3 years ago

* I can `cp -ar` the entire pip cache directory in ~6 seconds (for 3.4 GB).

* The pip install takes ~92 seconds, which feels > 10 times slower than expected judging based on the necessary I/O.

So in the process of taking a profile I think I have a bit of an idea of where that time is going (and it happens that timings on my machine are close enough to yours to be comparable, ~90s in total on main, ~5-6s to copy files...).

Of that 90s, ~40s were spent fetching pages from pypi, that can be parallelized which decreases the wall clock time, but it still takes ~10s. ~50s are spent on "install_wheel", of which only ~5-6s is spent in shutil.copyfileobj. A huge fraction of the remainder under install_wheel is spent in py_compile.

pradyunsg commented 3 years ago

Please share the requirements.txt that you're experimenting with to reproduce this.

jbylund commented 3 years ago

Please share the requirements.txt that you're experimenting with to reproduce this.

Should be in the first comment in this issue.

bluenote10 commented 3 years ago

Great findings, thanks a lot for looking into it!

Of that 90s, ~40s were spent fetching pages from pypi, that can be parallelized which decreases the wall clock time, but it still takes ~10s.

That sounds indeed like a low hanging fruit, with a very welcome speedup :+1:

I'm wondering if it would be feasible to store the http request results themselves in the cache as well for an extra speed-up. For dependency==x.y.z it would be possible to look in the cached http file for a possible match. If none is found, the http cache is invalidated, running a real http request. However, unversioned dependencies or <, <=,>=, > operators probably must never use the cache. Also there can be subtle issues, e.g. if version x.y.z is first published as a source tarball, and only later as a wheel. If in this case the http request caches the state where only the tarball is available, it would continue to install version x.y.z from source instead of the preferred wheel. Not sure if there is a good solution to this problem, may cause too much headache.

~50s are spent on "install_wheel", of which only ~5-6s is spent in shutil.copyfileobj. A huge fraction of the remainder under install_wheel is spent in py_compile.

Does that refer to the compilation of .py files to .pyc? So perhaps it would make sense to have an --dont-compile-pyc option? This particular venv contains 31048 .pyc files totalling a size of 280 MB. In fact this would also help with another issue of virtual envs becoming very large, which can be an issue when trying to create small-ish docker images. If it is indeed the .pyc creation and it could be skipped, it would mean a massive speed-up with a ~20% size reduction in this example. EDIT: Stupid me, this option already exists, just under the name --no-compile (always thought it refers to e.g. Cython), and it indeed gives the expected speed-up.

If I do the math correctly, the parallel requests combined with skipping .pyc creation would bring the time down to ~15 sec, which would be awesome.

yurj commented 1 year ago

using importlib.metadata in Python 3.11 (pip 22.1 and above) has worsened the situation:

Case: pip install Plone

On Python 3.10: first time 1m26, second time 0m12 On Python 3.11: first time 2m18, second time 3m42

@mauritsvanrees found that:

bin/pip install -U 'pip<22.1' and a rerun on 3.11 takes only 9 seconds. The changelog for 22.1 says (https://pip.pypa.io/en/stable/news/#v22-1): "Enable the importlib.metadata metadata implementation by default on Python 3.11 (or later). The environment variable _PIP_USE_IMPORTLIB_METADATA can still be used to enable the implementation on 3.10 and earlier, or disable it on 3.11 (by setting it to 0 or false)."

export _PIP_USE_IMPORTLIB_METADATA=0 and retry on 3.11 with latest pip/setuptools/wheel, and it takes 12 seconds. On 3.10 with export _PIP_USE_IMPORTLIB_METADATA=1, and it slows down a lot: a retry takes 5 minutes

jbylund commented 1 year ago

I haven't profiled what you describe yet, but it sounds like a separate issue with a similar symptom. If others agree I'd propose re-naming this issue to be about parallelizing network requests and we could make a separate issue for what sounds like metadata parsing related?

yurj commented 1 year ago

I haven't profiled what you describe yet, but it sounds like a separate issue with a similar symptom. If others agree I'd propose re-naming this issue to be about parallelizing network requests and we could make a separate issue for what sounds like metadata parsing related?

I agree.

mauritsvanrees commented 1 year ago

Separate issue created here: https://github.com/pypa/pip/issues/12079

On the off chance that this affects the current issue as well, you could try my small PR: https://github.com/pypa/pip/pull/12080 It probably won't fix the current. But maybe it points in the direction of a solution.

pypa / pip