New resolver takes 1-2 hours to install a large requirements file

bdraco commented 1 year ago

Description

This is a followup to #10788

Attached is a callgrind pip_callgrind.zip

The bulk of the time is spent creating Version objects

Expected behavior

No response

pip version

pip 23.2.1 from /root/ztest/lib/python3.11/site-packages/pip (python 3.11)

Python version

3.11

OS

alpine 3.18 linux

How to Reproduce

Must use alpine 3.18 linux

python3 -m cProfile -o all.pstats -m pip install --no-cache-dir --only-binary=:all: --index-url https://wheels.home-assistant.io/musllinux-index/ -r /usr/src/homeassistant/requirements_all.txt

https://github.com/home-assistant/core/blob/dev/requirements_all.txt

or

Dockerfile https://github.com/home-assistant/core/blob/dev/Dockerfile

Output

No response

Code of Conduct

[X] I agree to follow the PSF Code of Conduct.

notatallshaw commented 1 year ago

Hey, I'll take a look at this in more detail tonight but just quickly:

The bulk of the time is spent creating Version objects

If I'm reading your graph correctly 9% of the time is spent on version objects? So, if I understand correctly, that's not the same as creating and not really the bulk of the time. But feel to correct me.

Edit: I see you already have a PR that proves your point, exciting! Please ignore everything I said 🙂

bdraco commented 1 year ago

Hey, I'll take a look at this in more detail tonight but just quickly:

The bulk of the time is spent creating Version objects

If I'm reading your graph correctly 9% of the time is spent on version objects? So, if I understand correctly, that's not the same as creating and not really the bulk of the time. But feel to correct me.

Edit: I see you already have a PR that proves your point, exciting! Please ignore everything I said 🙂

I was doing profiling on my laptop which means the ssl read / network latency overhead is much higher than what actually happens in production. The top of the profile calls to read is outsized compared to what actually happens on github actions when Home Assistant is building images. Since thats not within the control of pip and will vary greatly based on the network, I did not factor it in it for the purposes of comparing profiles.

notatallshaw commented 1 year ago

Well I got home and tried to reproduce this (had a lot of fun learning how to install Python 3.11 into an Alpine WSL 2 distro) but I could not produce a significant performance improvement with your branch as you suggest.

Some notes:

I used "install --dry-run" to remove wheel install from timing
I let one install run first to build up cache to reduce any network variance
I compared pypa-pip/main with bdraco-pip/version_cache

So first install pip/main and build cache

python -m pip install git+https://github.com/pypa/pip.git@main
python -m pip install --dry-run --only-binary=:all: --index-url https://wheels.home-assistant.io/musllinux-index/ -r requirements_all.txt

And then collect timing:

time python -m pip install --dry-run --only-binary=:all: --index-url https://wheels.home-assistant.io/musllinux-index/ -r requirements_all.txt

Which produced:

real    7m 51.29s
user    0m 45.65s
sys     0m 2.49s

And then installing your branch and collect timing:

python -m pip install git+https://github.com/bdraco/pip.git@version_cache
time python -m pip install --dry-run --only-binary=:all: --index-url https://wheels.home-assistant.io/musllinux-index/ -r requirements_all.txt

Which produced:

real    7m 44.08s
user    0m 45.00s
sys     0m 2.73s

Which potentially is a real performance gain, but it seems less than 2%.

Can I suggest you test again on your side using pip/main instead of pip/23.2.1, perhaps the performance difference is something that already landed? Or maybe a problem in the way alpine has packaged/vendored pip?

bdraco commented 1 year ago

Thanks for testing. I think the major difference is going to be that we are running out image builds for arm under qemu which is much more performance sensitive.

notatallshaw commented 1 year ago

Thanks for testing. I think the major difference is going to be that we are running out image builds for arm under qemu which is much more performance sensitive.

But why just the version object creation? Wouldn't being on a worse performane machine cause everything to be slower? Have you tried running vanilla pip/main with no profiling enabled?

As a side note if performance is so critical to you here you might want to try https://github.com/prefix-dev/rip once it is released. I tried testing now but getting it to work under Alpine defeated me.

bdraco commented 1 year ago

It would probably be helpful to see an actual production run : https://github.com/home-assistant/core/actions/runs/6408628446/job/17398078969

But why just the version object creation?

It was the thing that made the most difference in testing

Wouldn't being on a worse performane machine cause everything to be slower?

Yes, everything will be worse

Have you tried running vanilla pip/main with no profiling enabled?

Yes

As a side note if performance is so critical to you here you might want to try prefix-dev/rip once it is released. I tried testing now but getting it to work under Alpine defeated me.

Thanks for that.

bdraco commented 1 year ago

I'm going to blow away my test env, and try again with your testing steps above.

bdraco commented 1 year ago

I'm going to try the following.

Create a fresh venv, and install the packages

python -m pip install git+https://github.com/pypa/pip.git@main
python -m pip install --no-cache-dir --only-binary=:all: --index-url https://wheels.home-assistant.io/musllinux-index/ -r /usr/src/homeassistant/requirements.txt  -r /usr/src/homeassistant/requirements_all.txt

Once the packages are in the venv, just do the resolution with:

python -m pip install git+https://github.com/pypa/pip.git@main
time python -m pip install --dry-run --no-cache-dir --only-binary=:all: --index-url https://wheels.home-assistant.io/musllinux-index/ -r /usr/src/homeassistant/requirements.txt  -r /usr/src/homeassistant/requirements_all.txt

python -m pip install git+https://github.com/bdraco/pip.git@version_cache
time python -m pip install --dry-run --no-cache-dir --only-binary=:all: --index-url https://wheels.home-assistant.io/musllinux-index/ -r /usr/src/homeassistant/requirements.txt  -r /usr/src/homeassistant/requirements_all.txt

bdraco commented 1 year ago

One thing that stands out is the image builds run with --no-cache-dir

bdraco commented 1 year ago

With the above testing version_cache is slower 🤦 ... will make sure its actually being used next

main once all packages installed:

real    0m59.382s
user    0m58.529s
sys 0m0.688s

version_cache once all packages installed:

real    1m0.171s
user    0m58.899s
sys 0m0.739s

bdraco commented 1 year ago

So it looks like the pip install isn't actually working ./lib/python3.11/site-packages/pip/_internal/resolution/resolvelib/candidates.py is not the version_cache file

bdraco commented 1 year ago

well I forgot to change the path.

it should be python -m pip install git+https://github.com/bdraco/pip.git@version_cache

bdraco commented 1 year ago

Even with that the file isn't getting updated?

notatallshaw commented 1 year ago

Weird, when I next get a chance I will confirm the correct version is installed in my tests, try doing "--force" on the install if you haven't already.

bdraco commented 1 year ago

I manually checked it out and installed it

with version_cache

real    0m20.139s
user    0m19.281s
sys 0m0.674s

notatallshaw commented 1 year ago

Well that's very definitive, I will rerun my tests as soon as I can, sorry if this was a rabbit hole I led you down.

notatallshaw commented 1 year ago

Retested this and made sure I was testing the right version this time:

$ python -m pip install git+https://github.com/bdraco/pip.git@version_cache --force
$ python -m pip -V
pip 23.3.dev0 from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pip (python 3.11)
$ grep "parse_version" /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pip/_internal/resolution/resolvelib/candidates.py
from pip._vendor.packaging.version import parse_version
                wheel_version = parse_version(wheel.version)
        self._version = parse_version(".".join(str(c) for c in version_info))

Still get no performance improvement, I also tried installing directly from source, when I get a chance I will try running profiling on the two versions.

Otherwise I guess there is something funadmentally slow about this version parsing in this "arm under qemu" environment vs. an x86 environment 😕

notatallshaw commented 1 year ago

I can reproduce the performance difference! I had to take network out of the equation:

python -m pip download -d {package_directory} --only-binary=:all: --index-url https://wheels.home-assistant.io/musllinux-index/ -r requirements_all.txt
time python -m pip install --dry-run --no-index --find-links file://{absolute_package_directory}  -r requirements_all.txt

Implementing this methodology I get ~3m 40s on Pip main and ~2m 30s on your branch. Further this is the call graph I get on Pip main:

output

Looking at this it seems that the work for me is split into two categories, finding candidates and dealing with specifiers. And Version construction looks to be the majority of the latter, for some reason in your environment I guess finding candidates is not slow.

I am invesigating if a caching layer can be added at the Pip level instead of the lower vendor level like your PR proposed, also I'm looking at both finding candidates and specifiers. But this is way outside my expertise so if someone else would like to submit a PR instead that would be great.

bdraco commented 1 year ago

Excellent. I don't know this codebase well enough to tackle a solution at the pip level that would ultimately avoid all the creation of the same Version objects.

Hopefully someone is interested in taking on the challenge as I'm likely to strike out on another attempt at a solution given my lack of familiarity with this code and the design choices.

notatallshaw commented 1 year ago

Can you give this branch a try and let me know if you see performance improvement or not? https://github.com/notatallshaw/pip/tree/version_cache

It basically tries to implement caching at all the entry points into the packaging library (and also path to url functions): https://github.com/pypa/pip/compare/main...notatallshaw:pip:version_cache

I actually think the only two significant performance improvements it lands is the path to url function and caching specifier contains result. But if you do see performance improvements I will break out each cache layer addition, test them, and submit a PR if appropriate.

I have also made this issue on the packaging repo as there doesn't seem to be a way to cache one of the biggest offenders: https://github.com/pypa/packaging/issues/729

bdraco commented 1 year ago

Will give it a shot tomorrow after some sleep. 👍

notatallshaw commented 1 year ago

This conversation has got quite long and it is difficult to follow how to reproduce.

I am going to make issue/PR pairs that highlight very specific performance issues I have found and with detailed instructions on how to reproduce (without requiring Alpine). I am going to start with the simplest to solve and wait for feedback from Pip maintainers before continuing to more complex ones.

My first issue/PR pair is: https://github.com/pypa/pip/issues/12320 / https://github.com/pypa/pip/pull/12322

bdraco commented 1 year ago

Can you give this branch a try and let me know if you see performance improvement or not? notatallshaw/pip@version_cache

real    0m38.585s
user    0m37.348s
sys 0m0.738s

Its better than main for sure.

notatallshaw commented 1 year ago

Its better than main for sure.

Okay great, well at least I'm not completely on the wrong track.

However based on the feedback from the first PR I opened it's probably going to be less trivial than adding caches everywhere it's possible.

I'm going to investigate a bit more to see if I can come up with a more elegant solution to solving that first bottleneck I've identified.

But if anyone else wants to jump in with PRs I would be highly suppportive!

pfmoore commented 1 year ago

I'm very confused here. This issue talks about creating Version objects, which was reported to packaging but they rejected the proposed fix. But @notatallshaw is talking about the path_to_url function, which is a different component of the performance issue. And the reported test here that suggests that fixing path_to_url helps is reporting 38-second runtimes, which is nothing like the 1-2 hours this report is about. And the reproduction @notatallshaw managed here is around 3 minutes, again a far cry from 1-2 hours.

Is there a usable analysis anywhere (sorry, I don't know how to read callgrind data) that pinpoints what is going on here, explains why @notatallshaw is getting vastly faster results, and confirms that a significant portion of the reported 1-2 hours runtime is being taken by pip (as opposed to by network traffic, for example) and what parts of pip are at fault?

The screenshot says that 9.37% of something is taken up by an init call linked to version, and it's called 9 million times. That seems a lot, but a lot of what? Can we pinpoint what the 9 million "things" are whose versions are being calculated? Are we recalculating versions multiple times? Are we scanning candidates that we could avoid by tighter pinning of requirements? The reproduction @notatallshaw quotes is using fully-pinned requirements - is the same true of the original issue?

I don't personally have the time or resources to reproduce a 1-2 hour install that needs me to set up an Alpine Linux environment. And I'm uncomfortable extrapolating from a reproduction that takes 2-3 minutes - even if the latter demonstrates places we could improve performance, I don't know how we establish whether we're addressing the problem reported here.

I'm happy to support the general exercise of improving performance in pip where there's ways of doing so. But I think we need a better way of measuring performance improvements in that case (for example, I don't want to improve performance of 1000-dependency installs at the cost of hurting performance in the more common case of something like pip install requests which hits the local cache[^1]).

[^1]: More common for me, at least 🙂 I typically do that pip install requests multiple times a day, but I've never once done an install with more than 100 dependencies, much less 1000!

notatallshaw commented 1 year ago

I'm very confused here. This issue talks about creating Version objects, which was reported to packaging but they rejected the proposed fix. But @notatallshaw is talking about the path_to_url function, which is a different component of the performance issue. And the reported test here that suggests that fixing path_to_url helps is reporting 38-second runtimes, which is nothing like the 1-2 hours this report is about. And the reproduction @notatallshaw managed here is around 3 minutes, again a far cry from 1-2 hours.

That's one of the reasons I created a seperate issue, I am not sure fixing the performance bottle neck I see with path_to_url will help OP here at all. What I am able to see though is a relative performance improvement from OPs branch, that caches Version construction, with the reproducible steps I give in https://github.com/pypa/pip/issues/12320 (which I will note do not require Alpine or some special environment).

I can only go off what I can reproduce, and with a scenario where I am able to identify Version construction as taking a signiciant amount of the runtime, I also see other bottlenneckes, and my first attempt was to fix one of those was path_to_url as it looked simpler and easier to fix to me. With that fixed the relative performance impact of Version construction becomes more pronouced.

Long story short, as I already said in https://github.com/pypa/pip/issues/12314#issuecomment-1751504734, my plan was to fix performance bottlenecks that I can identify and reproduce.

I'm happy to support the general exercise of improving performance in pip where there's ways of doing so. But I think we need a better way of measuring performance improvements in that case (for example, I don't want to improve performance of 1000-dependency installs at the cost of hurting performance in the more common case of something like pip install requests

I agree, in my future PR work I plan to show the performance impact across 4 scenarios, which will probably be:

requests
pandas[performance, excel, html, parquet]
apache-airflow[all]
ha-core (i.e this example)

I am also looking to see if I can show memory impact as well as relative time performance improvement.

which hits the local cache1).

This was going to be my next scenario I tested, profiling with cache I see a lot of opportunities for improvement. However even with cache fully populated there are a lot of network calls involved, I planned to create an environment that used simpleindex to store all the relevant projects thus reducing the amount of time random network fluctions affected the relative performance. I have not yet constructed this environment to do this testing.

bdraco commented 1 year ago

I'm very confused here. This issue talks about creating Version objects, which was reported to packaging but they rejected the proposed fix. But @notatallshaw is talking about the path_to_url function, which is a different component of the performance issue. And the reported test here that suggests that fixing path_to_url helps is reporting 38-second runtimes, which is nothing like the 1-2 hours this report is about. And the reproduction @notatallshaw managed here is around 3 minutes, again a far cry from 1-2 hours.

Its confusing because there are multiple bottlenecks at scale. Some of the newer testing is using --find-links, we used to use that for Home Assistant installs but the performance was nearly O(n^2) because it had to match up each package in the remote http dir to each package being considered. When we used this approach HA builds took 5-6 hours instead of 1-2 hours. To get past that issue we now have https://github.com/bdraco/index-503 building https://wheels.home-assistant.io/musllinux-index/. While thats faster, 1-2 hours is still a long time so thats how we got here.

Is there a usable analysis anywhere (sorry, I don't know how to read callgrind data) that pinpoints what is going on here, explains why @notatallshaw is getting vastly faster results, and confirms that a significant portion of the reported 1-2 hours runtime is being taken by pip (as opposed to by network traffic, for example) and what parts of pip are at fault? Its under

In the use case here, everything is running through qemu so the run time of pip can be 2-3 orders of magnitude worse. With only avoiding --find-links and using the index above we saw almost a ~68% decrease in run time.

The screenshot says that 9.37% of something is taken up by an init call linked to version, and it's called 9 million times. That seems a lot, but a lot of what? Can we pinpoint what the 9 million "things" are whose versions are being calculated? Are we recalculating versions multiple times? Are we scanning candidates that we could avoid by tighter pinning of requirements? The reproduction @notatallshaw quotes is using fully-pinned requirements - is the same true of the original issue?

The construction of 9 million version objects was from re-processing the same versions of the same packages over and over for the purposes of comparing them. Since the version is constructed each time it needs to be compared, the object/memory is a significant bottleneck.

I don't personally have the time or resources to reproduce a 1-2 hour install that needs me to set up an Alpine Linux environment. And I'm uncomfortable extrapolating from a reproduction that takes 2-3 minutes - even if the latter demonstrates places we could improve performance, I don't know how we establish whether we're addressing the problem reported here.

That's completely understandable, its not surprising that pip isn't tested to scale at 1000s of packages as that test cycle is painful. The goal was generate a test case that was not painful to reproduce that encapsulated a portion of the run time problem. Testing is happening on systems that are 2-3 orders of magnitude faster than the actual use case under qemu so any improvement in seconds can quickly become an improvement in minutes in actual production.

notatallshaw commented 11 months ago

Took a look at this again today, I think there are two issues:

There are O(n²) issues in the way Pip can collect packages, I have an open PR (https://github.com/pypa/pip/pull/12327) but I think each way Pip collects packages needs to be looked at and fixed
There is an O(n²) issue in resolvelib, I have created an issue on resolvelib side here: https://github.com/sarugaku/resolvelib/issues/147

From what OP has explained I beleive that 2. is the issue OP is facing, but there is no hard data in this issue to proove it.

As it happens @pradyunsg has amazingly created a new benchmark tool https://github.com/pradyunsg/pip-resolver-benchmarks, and it should be possible to use this to directly show performance improvements and issues across a range of scenarios. When I have some time I will try and turn this use case into a benchmark scenario.

notatallshaw commented 11 months ago

I've created a draft PR on what I think might be a solution to 2: https://github.com/sarugaku/resolvelib/pull/148. But it is making an assumption about resolution convergence which I'm not sure is true, so I need to do a bunch of testing and carefully pick through failing test cases in resolvelib.

I've created an expirmental Pip branch here based on that PR: https://github.com/notatallshaw/pip/tree/optimize-resolve-steps. I've tested installing home assistant's full requirements, and I see the number of times the Verson object is intialized drops from ~10 million times to ~33 thousand times.

@bdraco it would be helpful, if you get a chance, to test against that branch and post back if you get any unexpected errors.

notatallshaw commented 6 months ago

The Verson parsing is now cached (https://github.com/pypa/pip/pull/12453), I would expect OPs performance issue to be much less drastic.

I ran an updated (Python 3.12) scenario based on the steps to reproduce in https://github.com/pypa/pip/issues/12320, as it largely takes out IO and sdist building as a factor. Here were my results (run on my very fast PC):

Pip Version	Time in seconds
Pip 23.3	266
Pip 24.0	116
Pip main	48

I think it's still worth investigating bottlenecks, but performance should be a lot better now, and since this thread was opened uv has been launched which aims for super fast performance with a pip-like interface, which took less than 1 second for this scenario on my computer.

pradyunsg commented 6 months ago

@notatallshaw Are you comparing against uv with a warm cache?

notatallshaw commented 6 months ago

@notatallshaw Are you comparing against uv with a warm cache?

No, this compares pip against uv for resolving a large number of pre-downloaded wheels using find-links, no cache, no installing.

Performance compared to pip main is about 100x faster, at least on my machine.

notatallshaw commented 6 months ago

FWIW that same home assistant scenario still using no cache but doing a regular install from PyPI on my machine uv complete in ~2 mins and pip main in ~16 mins.

A lot of the time is spent building sdists and downloading, both of which uv does in parallel, so it's hard to determine in that scenario if uv is actually being more efficent than pip or just faster due to the actions being done in parallel.

pradyunsg commented 6 months ago

It almost certainly is, since it's using partial downloads as well during the resolve phase (basically pip's fast-deps behaviour).

notatallshaw commented 3 months ago

There hasn't been any feedback from OPs since there have been 2 pip releases, both of which should have seen significant performance improvement, the latest of which directly tackling the complaint about construction of the Version object too many times.

And as no one was able to reproduce OPs issue exactly this thread has been left open. However given the work done, I feel it is worth closing at this point, and tackling any existing performance problems in a new issue.

And to be clear, there are definitely O(n²⁺) issues in pip, that rear their head under certain circumstances that are not always obvious to the pip maintainers. So please do raise new issues, most helpful with clear steps to reproduce.

pypa / pip