Closed bdraco closed 3 months ago
Hey, I'll take a look at this in more detail tonight but just quickly:
The bulk of the time is spent creating Version objects
If I'm reading your graph correctly, 9% of the time is spent on Version objects? So, if I understand correctly, that's not the same as creating them, and not really the bulk of the time. But feel free to correct me.
Edit: I see you already have a PR that proves your point, exciting! Please ignore everything I said 🙂
I was doing the profiling on my laptop, which means the SSL read / network latency overhead is much higher than what actually happens in production. The time at the top of the profile spent in calls to read is outsized compared to what actually happens on GitHub Actions when Home Assistant is building images. Since that's not within the control of pip and will vary greatly based on the network, I did not factor it in for the purposes of comparing profiles.
Well, I got home and tried to reproduce this (had a lot of fun learning how to install Python 3.11 into an Alpine WSL 2 distro), but I could not reproduce the significant performance improvement your branch suggests.
Some notes:
So first install pip/main and build cache
python -m pip install git+https://github.com/pypa/pip.git@main
python -m pip install --dry-run --only-binary=:all: --index-url https://wheels.home-assistant.io/musllinux-index/ -r requirements_all.txt
And then collect timing:
time python -m pip install --dry-run --only-binary=:all: --index-url https://wheels.home-assistant.io/musllinux-index/ -r requirements_all.txt
Which produced:
real 7m 51.29s
user 0m 45.65s
sys 0m 2.49s
And then installing your branch and collect timing:
python -m pip install git+https://github.com/bdraco/pip.git@version_cache
time python -m pip install --dry-run --only-binary=:all: --index-url https://wheels.home-assistant.io/musllinux-index/ -r requirements_all.txt
Which produced:
real 7m 44.08s
user 0m 45.00s
sys 0m 2.73s
Which potentially is a real performance gain, but it seems less than 2%.
Can I suggest you test again on your side using pip/main instead of pip/23.2.1? Perhaps the performance difference is something that already landed, or maybe it's a problem with the way Alpine has packaged/vendored pip?
Thanks for testing. I think the major difference is going to be that we are running our image builds for ARM under QEMU, which is much more performance sensitive.
But why just the version object creation? Wouldn't being on a worse-performance machine cause everything to be slower? Have you tried running vanilla pip/main with no profiling enabled?
As a side note if performance is so critical to you here you might want to try https://github.com/prefix-dev/rip once it is released. I tried testing now but getting it to work under Alpine defeated me.
It would probably be helpful to see an actual production run : https://github.com/home-assistant/core/actions/runs/6408628446/job/17398078969
But why just the version object creation?
It was the thing that made the most difference in testing
Wouldn't being on a worse-performance machine cause everything to be slower?
Yes, everything will be worse
Have you tried running vanilla pip/main with no profiling enabled?
Yes
As a side note if performance is so critical to you here you might want to try prefix-dev/rip once it is released. I tried testing now but getting it to work under Alpine defeated me.
Thanks for that.
I'm going to blow away my test env, and try again with your testing steps above.
I'm going to try the following.
Create a fresh venv, and install the packages
python -m pip install git+https://github.com/pypa/pip.git@main
python -m pip install --no-cache-dir --only-binary=:all: --index-url https://wheels.home-assistant.io/musllinux-index/ -r /usr/src/homeassistant/requirements.txt -r /usr/src/homeassistant/requirements_all.txt
Once the packages are in the venv, just do the resolution with:
python -m pip install git+https://github.com/pypa/pip.git@main
time python -m pip install --dry-run --no-cache-dir --only-binary=:all: --index-url https://wheels.home-assistant.io/musllinux-index/ -r /usr/src/homeassistant/requirements.txt -r /usr/src/homeassistant/requirements_all.txt
python -m pip install git+https://github.com/bdraco/pip.git@version_cache
time python -m pip install --dry-run --no-cache-dir --only-binary=:all: --index-url https://wheels.home-assistant.io/musllinux-index/ -r /usr/src/homeassistant/requirements.txt -r /usr/src/homeassistant/requirements_all.txt
One thing that stands out is the image builds run with --no-cache-dir
With the above testing, version_cache is slower 🤦 ... will make sure it's actually being used next
main once all packages installed:
real 0m59.382s
user 0m58.529s
sys 0m0.688s
version_cache once all packages installed:
real 1m0.171s
user 0m58.899s
sys 0m0.739s
So it looks like the pip install isn't actually working: ./lib/python3.11/site-packages/pip/_internal/resolution/resolvelib/candidates.py is not the version_cache file.
Well, I forgot to change the path.
it should be python -m pip install git+https://github.com/bdraco/pip.git@version_cache
Even with that the file isn't getting updated?
Weird, when I next get a chance I will confirm the correct version is installed in my tests, try doing "--force" on the install if you haven't already.
I manually checked it out and installed it
with version_cache
real 0m20.139s
user 0m19.281s
sys 0m0.674s
Well that's very definitive, I will rerun my tests as soon as I can, sorry if this was a rabbit hole I led you down.
Retested this and made sure I was testing the right version this time:
$ python -m pip install git+https://github.com/bdraco/pip.git@version_cache --force
$ python -m pip -V
pip 23.3.dev0 from /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pip (python 3.11)
$ grep "parse_version" /root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pip/_internal/resolution/resolvelib/candidates.py
from pip._vendor.packaging.version import parse_version
wheel_version = parse_version(wheel.version)
self._version = parse_version(".".join(str(c) for c in version_info))
Still get no performance improvement, I also tried installing directly from source, when I get a chance I will try running profiling on the two versions.
Otherwise I guess there is something fundamentally slow about this version parsing in this "arm under qemu" environment vs. an x86 environment 😕
I can reproduce the performance difference! I had to take network out of the equation:
python -m pip download -d {package_directory} --only-binary=:all: --index-url https://wheels.home-assistant.io/musllinux-index/ -r requirements_all.txt
time python -m pip install --dry-run --no-index --find-links file://{absolute_package_directory} -r requirements_all.txt
Implementing this methodology I get ~3m 40s on Pip main and ~2m 30s on your branch. Further this is the call graph I get on Pip main:
Looking at this it seems that the work for me is split into two categories, finding candidates and dealing with specifiers. And Version construction looks to be the majority of the latter, for some reason in your environment I guess finding candidates is not slow.
I am investigating whether a caching layer can be added at the pip level instead of the lower vendored level like your PR proposed, and I'm also looking at both finding candidates and specifiers. But this is way outside my expertise, so if someone else would like to submit a PR instead that would be great.
Excellent. I don't know this codebase well enough to tackle a solution at the pip level that would ultimately avoid all the creation of the same Version objects.
Hopefully someone is interested in taking on the challenge as I'm likely to strike out on another attempt at a solution given my lack of familiarity with this code and the design choices.
Can you give this branch a try and let me know if you see performance improvement or not? https://github.com/notatallshaw/pip/tree/version_cache
It basically tries to implement caching at all the entry points into the packaging library (and also path to url functions): https://github.com/pypa/pip/compare/main...notatallshaw:pip:version_cache
I actually think the only two significant performance improvements it lands is the path to url function and caching specifier contains result. But if you do see performance improvements I will break out each cache layer addition, test them, and submit a PR if appropriate.
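For illustration, caching a specifier-containment check can be sketched with functools.lru_cache. This is a simplified stand-in of my own (the helper name and the ">="-only parsing are not pip's actual code) just to show why memoizing the (specifier, version) pair pays off when the resolver asks the same question millions of times:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def contains(specifier: str, version: str) -> bool:
    # Stand-in for SpecifierSet(specifier).contains(version); it only
    # handles ">=" so the example stays dependency-free.  The real check
    # is far more expensive, which is what makes caching worthwhile.
    assert specifier.startswith(">=")
    bound = tuple(int(p) for p in specifier[2:].split("."))
    candidate = tuple(int(p) for p in version.split("."))
    return candidate >= bound

print(contains(">=2.0", "2.31.0"))  # True
print(contains(">=2.0", "1.9"))     # False
```

Because both arguments are hashable strings, repeated calls with the same pair return the memoized result without re-parsing anything.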
I have also made this issue on the packaging repo as there doesn't seem to be a way to cache one of the biggest offenders: https://github.com/pypa/packaging/issues/729
Will give it a shot tomorrow after some sleep. 👍
This conversation has got quite long and it is difficult to follow how to reproduce.
I am going to make issue/PR pairs that highlight very specific performance issues I have found and with detailed instructions on how to reproduce (without requiring Alpine). I am going to start with the simplest to solve and wait for feedback from Pip maintainers before continuing to more complex ones.
My first issue/PR pair is: https://github.com/pypa/pip/issues/12320 / https://github.com/pypa/pip/pull/12322
Can you give this branch a try and let me know if you see performance improvement or not? notatallshaw/pip@version_cache
real 0m38.585s
user 0m37.348s
sys 0m0.738s
It's better than main, for sure.
Okay great, well at least I'm not completely on the wrong track.
However based on the feedback from the first PR I opened it's probably going to be less trivial than adding caches everywhere it's possible.
I'm going to investigate a bit more to see if I can come up with a more elegant solution to solving that first bottleneck I've identified.
But if anyone else wants to jump in with PRs I would be highly supportive!
I'm very confused here. This issue talks about creating Version objects, which was reported to packaging but they rejected the proposed fix. But @notatallshaw is talking about the path_to_url function, which is a different component of the performance issue. And the reported test here that suggests that fixing path_to_url helps is reporting 38-second runtimes, which is nothing like the 1-2 hours this report is about. And the reproduction @notatallshaw managed here is around 3 minutes, again a far cry from 1-2 hours.
Is there a usable analysis anywhere (sorry, I don't know how to read callgrind data) that pinpoints what is going on here, explains why @notatallshaw is getting vastly faster results, and confirms that a significant portion of the reported 1-2 hours runtime is being taken by pip (as opposed to by network traffic, for example) and what parts of pip are at fault?
The screenshot says that 9.37% of something is taken up by an init call linked to version, and it's called 9 million times. That seems a lot, but a lot of what? Can we pinpoint what the 9 million "things" are whose versions are being calculated? Are we recalculating versions multiple times? Are we scanning candidates that we could avoid by tighter pinning of requirements? The reproduction @notatallshaw quotes is using fully-pinned requirements - is the same true of the original issue?
I don't personally have the time or resources to reproduce a 1-2 hour install that needs me to set up an Alpine Linux environment. And I'm uncomfortable extrapolating from a reproduction that takes 2-3 minutes - even if the latter demonstrates places we could improve performance, I don't know how we establish whether we're addressing the problem reported here.
I'm happy to support the general exercise of improving performance in pip where there's ways of doing so. But I think we need a better way of measuring performance improvements in that case (for example, I don't want to improve performance of 1000-dependency installs at the cost of hurting performance in the more common case of something like pip install requests which hits the local cache[^1]).

[^1]: More common for me, at least 🙂 I typically do that pip install requests multiple times a day, but I've never once done an install with more than 100 dependencies, much less 1000!
I'm very confused here. This issue talks about creating Version objects, which was reported to packaging but they rejected the proposed fix. But @notatallshaw is talking about the path_to_url function, which is a different component of the performance issue. And the reported test here that suggests that fixing path_to_url helps is reporting 38-second runtimes, which is nothing like the 1-2 hours this report is about. And the reproduction @notatallshaw managed here is around 3 minutes, again a far cry from 1-2 hours.
That's one of the reasons I created a separate issue; I am not sure that fixing the performance bottleneck I see with path_to_url will help OP here at all. What I am able to see, though, is a relative performance improvement from OP's branch, which caches Version construction, using the reproducible steps I give in https://github.com/pypa/pip/issues/12320 (which I will note do not require Alpine or some special environment).

I can only go off what I can reproduce, and I have a scenario where I am able to identify Version construction as taking a significant amount of the runtime. I also see other bottlenecks, and my first attempt was to fix one of those, path_to_url, as it looked simpler and easier to fix. With that fixed, the relative performance impact of Version construction becomes more pronounced.
Long story short, as I already said in https://github.com/pypa/pip/issues/12314#issuecomment-1751504734, my plan was to fix performance bottlenecks that I can identify and reproduce.
I'm happy to support the general exercise of improving performance in pip where there's ways of doing so. But I think we need a better way of measuring performance improvements in that case (for example, I don't want to improve performance of 1000-dependency installs at the cost of hurting performance in the more common case of something like pip install requests
I agree, in my future PR work I plan to show the performance impact across 4 scenarios, which will probably be:
I am also looking to see if I can show memory impact as well as relative time performance improvement.
which hits the local cache[^1]).
This was going to be the next scenario I tested; profiling with the cache I see a lot of opportunities for improvement. However, even with the cache fully populated there are a lot of network calls involved, so I planned to create an environment that used simpleindex to store all the relevant projects, thus reducing how much random network fluctuations affected the relative performance. I have not yet constructed the environment to do this testing.
I'm very confused here. This issue talks about creating Version objects, which was reported to packaging but they rejected the proposed fix. But @notatallshaw is talking about the path_to_url function, which is a different component of the performance issue. And the reported test here that suggests that fixing path_to_url helps is reporting 38-second runtimes, which is nothing like the 1-2 hours this report is about. And the reproduction @notatallshaw managed here is around 3 minutes, again a far cry from 1-2 hours.
It's confusing because there are multiple bottlenecks at scale. Some of the newer testing is using --find-links; we used to use that for Home Assistant installs, but the performance was nearly O(n^2) because it had to match up each package in the remote HTTP directory with each package being considered. When we used this approach, HA builds took 5-6 hours instead of 1-2 hours. To get past that issue we now have https://github.com/bdraco/index-503 building https://wheels.home-assistant.io/musllinux-index/. While that's faster, 1-2 hours is still a long time, so that's how we got here.
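The quadratic matching can be illustrated with a toy contrast (the filenames and counts here are made up for illustration): a flat --find-links directory forces a scan of every filename for every requirement, whereas a PEP 503-style index keyed by project name gives constant-time lookups:

```python
# Toy contrast between --find-links-style matching (scan every file for
# every requirement -> O(n*m)) and an index keyed by project name
# (one dict build, then O(1) lookups).
wheels = [f"pkg{i}-1.0-py3-none-any.whl" for i in range(1000)]
requirements = [f"pkg{i}" for i in range(1000)]

# Flat directory: every requirement scans every filename.
scan_hits = [w for r in requirements for w in wheels if w.startswith(r + "-")]

# PEP 503-style index: build once, then look up each project directly.
index = {w.split("-")[0]: w for w in wheels}
index_hits = [index[r] for r in requirements]

assert sorted(scan_hits) == sorted(index_hits)
```

Both approaches find the same wheels, but the scan does a million filename comparisons where the index does a thousand dict lookups.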
Is there a usable analysis anywhere (sorry, I don't know how to read callgrind data) that pinpoints what is going on here, explains why @notatallshaw is getting vastly faster results, and confirms that a significant portion of the reported 1-2 hours runtime is being taken by pip (as opposed to by network traffic, for example) and what parts of pip are at fault?
In the use case here, everything is running through QEMU, so the run time of pip can be 2-3 orders of magnitude worse. Just by avoiding --find-links and using the index above, we saw almost a ~68% decrease in run time.
The screenshot says that 9.37% of something is taken up by an init call linked to version, and it's called 9 million times. That seems a lot, but a lot of what? Can we pinpoint what the 9 million "things" are whose versions are being calculated? Are we recalculating versions multiple times? Are we scanning candidates that we could avoid by tighter pinning of requirements? The reproduction @notatallshaw quotes is using fully-pinned requirements - is the same true of the original issue?
The construction of 9 million version objects came from re-processing the same versions of the same packages over and over for the purposes of comparing them. Since a Version is constructed each time it needs to be compared, the object construction and memory churn are a significant bottleneck.
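As a rough sketch of why caching the parse helps (the names here are illustrative, not pip's actual internals): memoizing means each distinct version string is turned into a comparable object once, instead of once per comparison:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def version_key(version_string: str) -> tuple:
    # Toy stand-in for packaging.version.Version: parse "X.Y.Z" into a
    # comparable tuple.  With the cache, repeated comparisons of the same
    # candidate reuse the parsed object instead of re-parsing the string.
    return tuple(int(part) for part in version_string.split("."))

assert version_key("2.31.0") > version_key("2.4.1")
# The second call is a cache hit and returns the identical object:
assert version_key("1.0.0") is version_key("1.0.0")
```

The real Version parse involves a regex and normalization, so the per-call saving is much larger than this toy suggests.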
I don't personally have the time or resources to reproduce a 1-2 hour install that needs me to set up an Alpine Linux environment. And I'm uncomfortable extrapolating from a reproduction that takes 2-3 minutes - even if the latter demonstrates places we could improve performance, I don't know how we establish whether we're addressing the problem reported here.
That's completely understandable; it's not surprising that pip isn't tested at the scale of 1000s of packages, as that test cycle is painful. The goal was to generate a test case that was not painful to reproduce but encapsulated a portion of the run time problem. Testing is happening on systems that are 2-3 orders of magnitude faster than the actual use case under QEMU, so any improvement in seconds can quickly become an improvement in minutes in actual production.
Took a look at this again today, I think there are two issues:
From what OP has explained, I believe that 2. is the issue OP is facing, but there is no hard data in this issue to prove it.
As it happens @pradyunsg has amazingly created a new benchmark tool https://github.com/pradyunsg/pip-resolver-benchmarks, and it should be possible to use this to directly show performance improvements and issues across a range of scenarios. When I have some time I will try and turn this use case into a benchmark scenario.
I've created a draft PR on what I think might be a solution to 2: https://github.com/sarugaku/resolvelib/pull/148. But it is making an assumption about resolution convergence which I'm not sure is true, so I need to do a bunch of testing and carefully pick through failing test cases in resolvelib.
I've created an experimental pip branch here based on that PR: https://github.com/notatallshaw/pip/tree/optimize-resolve-steps. I've tested installing Home Assistant's full requirements, and I see the number of times the Version object is initialized drop from ~10 million to ~33 thousand.
@bdraco it would be helpful, if you get a chance, to test against that branch and post back if you get any unexpected errors.
The Version parsing is now cached (https://github.com/pypa/pip/pull/12453); I would expect OP's performance issue to be much less drastic.
I ran an updated (Python 3.12) scenario based on the steps to reproduce in https://github.com/pypa/pip/issues/12320, as it largely takes out IO and sdist building as a factor. Here were my results (run on my very fast PC):
| Pip Version | Time in seconds |
| --- | --- |
| Pip 23.3 | 266 |
| Pip 24.0 | 116 |
| Pip main | 48 |
I think it's still worth investigating bottlenecks, but performance should be a lot better now, and since this thread was opened uv has been launched which aims for super fast performance with a pip-like interface, which took less than 1 second for this scenario on my computer.
@notatallshaw Are you comparing against uv with a warm cache?
@notatallshaw Are you comparing against uv with a warm cache?
No, this compares pip against uv for resolving a large number of pre-downloaded wheels using find-links, no cache, no installing.
Performance compared to pip main is about 100x faster, at least on my machine.
FWIW, in that same Home Assistant scenario, still using no cache but doing a regular install from PyPI on my machine, uv completes in ~2 mins and pip main in ~16 mins.
A lot of the time is spent building sdists and downloading, both of which uv does in parallel, so it's hard to determine in that scenario whether uv is actually being more efficient than pip or just faster because the work is done in parallel.
It almost certainly is, since it's using partial downloads as well during the resolve phase (basically pip's fast-deps behaviour).
There hasn't been any feedback from OP, and since then there have been 2 pip releases, both of which should have seen significant performance improvements, the latest of which directly tackles the complaint about construction of the Version object too many times.
And as no one was able to reproduce OP's issue exactly, this thread has been left open. However, given the work done, I feel it is worth closing at this point and tackling any remaining performance problems in new issues.
And to be clear, there are definitely O(n^2) (and worse) issues in pip that rear their head under certain circumstances that are not always obvious to the pip maintainers. So please do raise new issues, most helpfully with clear steps to reproduce.
Description
This is a followup to #10788
Attached is a callgrind pip_callgrind.zip
The bulk of the time is spent creating Version objects

Expected behavior
No response
pip version
pip 23.2.1 from /root/ztest/lib/python3.11/site-packages/pip (python 3.11)
Python version
3.11
OS
alpine 3.18 linux
How to Reproduce
Must use alpine 3.18 linux
python3 -m cProfile -o all.pstats -m pip install --no-cache-dir --only-binary=:all: --index-url https://wheels.home-assistant.io/musllinux-index/ -r /usr/src/homeassistant/requirements_all.txt
https://github.com/home-assistant/core/blob/dev/requirements_all.txt
or
Dockerfile https://github.com/home-assistant/core/blob/dev/Dockerfile
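To turn the .pstats file produced by the cProfile command above into a readable report, something like the following works. The snippet is self-contained here by profiling a trivial stand-in workload; for the real case, load the saved file with pstats.Stats("all.pstats") instead:

```python
import cProfile
import io
import pstats

# Profile a trivial stand-in workload; in the real case the stats come
# from the all.pstats file written by the cProfile invocation above.
profiler = cProfile.Profile()
profiler.enable()
sum(i * i for i in range(100_000))
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())  # top 5 entries by cumulative time
```

Sorting by "cumulative" is what surfaces hot call trees like the Version `__init__` discussed in this thread.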
Output
No response
Code of Conduct