pypa / pip

The Python package installer
https://pip.pypa.io/
MIT License

Improve UX and Performance of Install step #12712

Open notatallshaw opened 5 months ago

notatallshaw commented 5 months ago

What's the problem this feature will solve?

At the moment, when the final install step starts, pip gives no output about what it is doing. In some real-world cases (e.g. large pytorch installations or airflow installs) this step can take over 30 seconds on fast machines, and so minutes on slow machines. The user is left wondering if anything is happening.

Describe the solution you'd like

I would like to see the following improvements:

  1. Log a message that pip is starting to install packages
  2. Present a progress bar that tracks the number of packages installed out of the total packages to be installed
  3. Improve any obvious performance bottlenecks (see follow up post with profile)
  4. Run installs in parallel (made separate issue https://github.com/pypa/pip/issues/12742)

Alternative Solutions

I think at a bare minimum there should be a log message that lets the user know what's happening.

Additional context

uv runs installs in parallel, and judging by their issue tracker this does not appear to be problematic. To do this in pip, a CLI option to control the maximum number of parallel installs would need to be added, the same way the PR for parallel downloads adds one.
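
As a rough illustration of what a capped parallel install step could look like, here is a minimal sketch using the standard library's thread pool. The names `install_all`, `install_one`, `fake_install` and the `max_workers` default are hypothetical and are not pip's code or an existing pip option; the real install work is replaced by a stand-in.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def install_all(
    wheels: Iterable[str],
    install_one: Callable[[str], str],
    max_workers: int = 4,
) -> list[str]:
    """Run install_one over wheels with at most max_workers running at once."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order; an exception raised by a call
        # is re-raised here when its result is reached.
        return list(pool.map(install_one, wheels))


def fake_install(wheel: str) -> str:
    # Hypothetical stand-in for real work (unpacking files, writing metadata).
    return f"installed {wheel}"
```

Calling `install_all(["a.whl", "b.whl"], fake_install, max_workers=2)` returns the results in the same order as the input, which would keep the final "installed" summary stable regardless of which worker finished first.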


notatallshaw commented 5 months ago

This scenario is artificially constructed to best profile the installer code by removing the need to download, build sdists, or resolve:

  1. python3.12 -m venv .venv
  2. source .venv/bin/activate
  3. <install latest/dev pip>
  4. wget https://raw.githubusercontent.com/apache/airflow/constraints-2.9.1/constraints-3.12.txt
  5. python -m pip download -d downloads -r constraints-3.12.txt
  6. cd downloads
  7. for file in $(ls *.tar.gz); do pip wheel --no-deps "$file" && mv "$file" "$file".built ; done
  8. for file in $(ls *.zip); do pip wheel --no-deps "$file" && mv "$file" "$file".built ; done
  9. cd -
  10. python -m pip install --only-binary=:all: --no-index --ignore-installed --no-deps --find-links file://${PWD}/downloads -r constraints-3.12.txt

I ran with and without --dry-run to see the timing difference: the dry run took 32s and the regular install took 144s.
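
For anyone who wants to repeat the comparison, the two timings can be collected with a small script. This is only a sketch: `time_command` and `pip_install_cmd` are hypothetical helper names, and the flags are the ones from step 10 above plus `--dry-run`.

```python
import subprocess
import sys
import time


def time_command(cmd: list[str]) -> float:
    """Return wall-clock seconds taken by cmd, raising if it exits non-zero."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def pip_install_cmd(links_dir: str, requirements: str, dry_run: bool) -> list[str]:
    """Build the offline install command from step 10, optionally as a dry run."""
    cmd = [
        sys.executable, "-m", "pip", "install",
        "--only-binary=:all:", "--no-index", "--ignore-installed", "--no-deps",
        "--find-links", links_dir,
        "-r", requirements,
    ]
    if dry_run:
        cmd.append("--dry-run")
    return cmd
```

Running `time_command(pip_install_cmd("downloads", "constraints-3.12.txt", dry_run=True))` and then the same call with `dry_run=False` gives the two numbers; the difference between them approximates the time spent in the install step itself.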

I profiled with and without --dry-run to see the profile difference:

Dry Run Profile ![airflow-no-deps-dry-run-install](https://github.com/pypa/pip/assets/8070352/82c03956-ca71-4b02-8876-dff0d76e571d)
Regular Install Profile ![airflow-no-deps-install](https://github.com/pypa/pip/assets/8070352/bf9d4c25-75d7-438d-ae2f-ea69acfa68b0)

There are some clear hotspots here. If no one else does, I will take a look when I have time to see whether there are easy ways to reduce them.

ichard26 commented 5 months ago

The get_dist_name() hot spot should be vastly improved by https://github.com/pypa/pip/pull/12656 FWIW. I scheduled the PR for 24.2 as it feels a bit risky to ship in 24.1 final. Please say something if anyone feels differently.

pfmoore commented 5 months ago

I see no issues with the UI proposal, but I'd want parallel installs to be a separate feature. I can imagine pathological cases where things could break when installing in parallel, and while the experience of uv is encouraging (as is the fact that normal cases are clearly safe) my instinct is that every pathological case is being exercised by some user of pip, somewhere. So we should isolate the risk here by making it a separate feature.

notatallshaw commented 5 months ago

> The get_dist_name() hot spot should be vastly improved by #12656 FWIW. I scheduled the PR for 24.2 as it feels a bit risky to ship in 24.1 final. Please say something if anyone feels differently.

Great, I'll reprofile with this PR. I personally wasn't imagining any of these ideas would land for 24.1.

> I'd want parallel installs to be a separate feature

Agree, I'll make a separate issue for that.

Honestly, for the others I feel like I could make PRs that safely improve pip. I'm unsure about parallel installs: at a minimum, someone would need to look carefully at what tests currently exist for multiple installs, and potentially expand them to cover a good matrix of different possibilities.

notatallshaw commented 4 months ago

> Log a message that pip is starting to install packages

Btw, I was looking at this recently because I noticed pip does tell you it's installing packages. The specific scenario I was seeing was the following:

  1. You install a large number of packages
  2. You then install a large number of semi-overlapping packages

On step two this produces the following behavior:

  1. Packages are resolved and pip tells you what packages it is going to install
  2. Pip then quickly uninstalls old packages, filling up the screen
  3. There is a long wait with no update on the screen while pip is installing
  4. Pip then lists all packages it installed

The real-world situation where this happens is installing large machine learning packages, particularly because you install a bunch of packages from the pytorch index, and then install a bunch of packages from pypi.

I think there are a couple of possible solutions:

  1. Re-order or add additional messages, e.g. move or add an "install" message after the uninstalls have completed
  2. Add progress bars to both uninstalling packages and installing packages, so it's clear pip is doing things

I will take a look at PRs when I have a chance.

ichard26 commented 4 months ago

Caching the result of utils.compatibility_tags.get_supported() in the resolver factory should be another easy win[^1] (~3% or 4s in the example above)

https://github.com/pypa/pip/blob/86b8b23c1eaaac5c420a28519daf91de830f1182/src/pip/_internal/resolution/resolvelib/factory.py#L608-L612

I'll submit a PR when I get the chance.
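
The general shape of that kind of caching, sketched with `functools.lru_cache`. The function below is a hypothetical stand-in, not pip's `get_supported()`, and its arguments and return value are made up for illustration; the sleep only simulates the 1-5ms cost quoted in this comment.

```python
import functools
import time


@functools.lru_cache(maxsize=None)
def supported_tags(version: tuple[int, int], platform: str) -> tuple[str, ...]:
    """Hypothetical stand-in for an expensive tag computation."""
    time.sleep(0.002)  # simulate a slow computation
    major, minor = version
    # Newest minor version first, down to minor 0.
    return tuple(f"cp{major}{m}-{platform}" for m in range(minor, -1, -1))
```

The first call with a given set of arguments pays the full cost and later calls return the stored tuple. This only works because every argument is hashable, which is why the sketch takes a tuple rather than a list.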

[^1]: I strongly suspect that get_supported() is only "slow" (as in, 1-5ms) on Linux due to the large amount of supported tags per system.

ichard26 commented 3 months ago

While taking a look at https://github.com/pypa/pip/pull/12601, I was curious how easy it would be to add an installation progress bar. The progress bar was pretty trivial to add by extending the pre-existing progress logic... However, it did not play nicely with the logging stack, so any intervening logs would break the progress bar. To fix this, I had to redo how rich was initialized in the logging stack, which took a bit :slightly_smiling_face:

Anyway, here's a demo:

Screencast from 2024-07-15 22-31-35.webm

What do you think @notatallshaw?

[^1]: Ideally, the presentation logic would simply disable the progress bar outright when writing to a non-TTY, but that's a future thing to think about.

ichard26 commented 3 months ago

Hmm, it would definitely look less rough if I left-justified the package name. Here's another demo, but the package name is justified to the longest name length seen so far (as doing it properly feels like going against the API contract of pip's progress bars).

Screencast from 2024-07-15 23-06-07.webm

It does kinda look weird. Perhaps after the bar?

Screencast from 2024-07-15 23-11-40.webm

I think this looks the best out of all of them :)

notatallshaw commented 3 months ago

> I'll note that your mental model for how pip installs packages is wrong. The uninstalls occur "on-demand" right before its replacement package is about to be installed (i.e. the uninstalls/installs are interwoven), so an uninstallation progress bar doesn't really make sense.

Ah, I see. My confusion comes from how pip's current logging displays what is happening: it logs all uninstalls, and then logs which packages it has installed. Sometimes there is a significant gap between the last uninstall message and the install message, which gives this impression.

> I also chose to include the package currently being installed in the progress bar. Yes, in most situations, the per-package installation time is so low that most packages are never shown to the user (like in the demo), but there are exceptions. If we're installing some massive package, it'd be nice to let the user know we're stuck on $package. I don't feel strongly about this though so I'm fine dropping it.

I agree. If you want to try installing large packages, where individual packages will be noticeable, you can run: pip install torch torchvision torchaudio

> I think this looks the best out of all of them :)

Yes, I think anything left of the progress bar should be fixed width, and ideally not updating at all. At least in left to right English having the left hand side update feels like I need to keep rereading the whole line, but the right hand side updating just feels like I need to look at the right hand side to check updates.

Once you have a PR I'm happy to throw some difficult scenarios against it.
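
To make the layout preference above concrete, here is a minimal plain-text sketch: everything left of the package name has a fixed width, and only the right-hand side changes. `render_progress` is a hypothetical helper written for this comment; it does not use rich or pip's actual progress bar code.

```python
def render_progress(done: int, total: int, current: str, width: int = 20) -> str:
    """Format one progress line: bar first, then counter, then package name."""
    if total <= 0:
        raise ValueError("total must be positive")
    done = max(0, min(done, total))
    filled = width * done // total
    bar = "#" * filled + "-" * (width - filled)
    # Pad the counter to the width of the total so the prefix never shifts.
    digits = len(str(total))
    return f"Installing [{bar}] {done:>{digits}}/{total} {current}"
```

For example, `render_progress(3, 12, "torch", width=8)` gives `Installing [##------]  3/12 torch`, and the package name starts at the same column for every value of `done`.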

notatallshaw commented 2 months ago

Okay, since I opened this issue there's been a lot of improvement to install performance of a lot of wheels, here is my synthetic test:

  1. python3.12 -m venv .venv
  2. source .venv/bin/activate
  3. <install latest/dev pip>
  4. wget https://raw.githubusercontent.com/apache/airflow/constraints-2.9.1/constraints-3.12.txt
  5. python -m pip wheel -w wheels -r constraints-3.12.txt
  6. time python -m pip install --only-binary=:all: --no-index --ignore-installed --no-deps --find-links file://${PWD}/wheels -r constraints-3.12.txt

On pip 24.1.2:

real 2m23.338s user 2m8.488s sys 0m12.523s

On pip main (effectively 24.2 right now):

real 1m23.565s user 1m11.482s sys 0m9.681s

Here is the new call graph: ![airflow-dry-install-main](https://github.com/user-attachments/assets/258aeeb9-ea8e-483a-b825-67e6833fdc21)

In this synthetic example ~50% of the time is now spent on O(n²+) issues in resolution and ~50% of the time is spent doing wheel-specific stuff. It feels like both have algorithmic or caching opportunities. When I get a chance I will take a look.

notatallshaw commented 2 months ago

In my synthetic test I notice ~30% of the time is spent on compile_file. I notice that the standard library compile_dir will create a process pool when it can to speed things up, I wonder if it makes sense to use compile_dir on each root package directory installed, and then verify the pyc files are created as expected?
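
A minimal sketch of that idea, assuming one call per installed package directory. `compile_and_verify` is a hypothetical helper name, not something pip has; `compileall.compile_dir` and `importlib.util.cache_from_source` are the real standard library functions.

```python
import compileall
import importlib.util
from pathlib import Path


def compile_and_verify(package_dir: str, workers: int = 0) -> list[Path]:
    """Byte-compile package_dir and return the sources whose .pyc is missing."""
    # workers=0 lets compileall choose a worker count from the CPU count;
    # it falls back to sequential compilation if a process pool is unavailable.
    compileall.compile_dir(package_dir, quiet=1, workers=workers)
    missing = []
    for source in Path(package_dir).rglob("*.py"):
        # cache_from_source gives the expected __pycache__ path for a source file.
        pyc = Path(importlib.util.cache_from_source(str(source)))
        if not pyc.exists():
            missing.append(source)
    return missing
```

Any path in the returned list is a file that failed to compile (for example because of a syntax error), which pip could then report or retry individually.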

notatallshaw commented 2 months ago

FYI, I believe most other installers "optimize" this step by not compiling by default.

notatallshaw commented 2 months ago

As discussed (https://github.com/pypa/pip/issues/12920), one helpful UX improvement would be to make it clear in pip's output that Python is compiling bytecode when compiling is enabled, e.g. "Installing and Compiling".

tyler-suard-parker commented 1 month ago

> Hmm, it would definitely look less rough if I left-justified the package name. Here's another demo, but the package name is justified to the longest name length seen so far (as doing it properly feels like going against the API contract of pip's progress bars).
>
> Screencast.from.2024-07-15.23-06-07.webm
>
> It does kinda look weird. Perhaps after the bar?
>
> Screencast.from.2024-07-15.23-11-40.webm
>
> I think this looks the best out of all of them :)

@ichard26 This looks pretty great. Did you ever create a PR? I would like to help if I can.