pypa / pip

The Python package installer
https://pip.pypa.io/
MIT License
9.49k stars 3.01k forks source link

Strange backtracking #12592

Open kotopesutility opened 6 months ago

kotopesutility commented 6 months ago

Description

pip chooses the freshest pre-release version of package as a dependency, but if it is not compatible with the freshest version of the next package dependency it chooses another version of the first package instead of the second one.

Expected behavior

According to backtracking algorithm description and this page pip should try to choose another version of incompatible dependency (if it is possible) instead of choosing other versions of already chosen dependencies

pip version

24.0

Python version

Python 3.12.2

OS

ALT Linux

How to Reproduce

pip3 install -I --dry-run --pre triad==0.9.5

Output

pip3 install -I --dry-run --verbose --pre triad==0.9.5 Would install appdirs-1.4.4 fs-2.4.16 fsspec-2024.3.1 numpy-1.26.4 pandas-2.2.1 pyarrow-15.0.2 python-dateutil-2.9.0.post0 pytz-2024.1 setuptools-69.2.0 six-1.16.0 triad-0.9.5 tzdata-2024.1

But if I run pip3 install --dry-run --verbose --ignore-installed --pre triad==0.9.5 numpy==2.0.0b1 Would install appdirs-1.4.4 fs-2.4.16 fsspec-2024.3.1 numpy-2.0.0b1 pandas-2.2.0rc0 pyarrow-14.0.2 python-dateutil-2.9.0.post0 pytz-2024.1 setuptools-69.2.0 six-1.16.0 triad-0.9.5 tzdata-2024.1

Both time pip first select numpy-2.0.0b1 and then takes pandas, but for the first command it drops numpy-2.0.0b1. Maybe because it is not compatible with pandas-2.2.1, but if so, then it should take pandas-2.2.0rc0

Code of Conduct

notatallshaw commented 6 months ago

Let's say you have three packages, a, b, and c and that your requirement is c==1, let's say that c depends on a and b and they both have two versions (1 and 2), a==2 is not compatible with b==2, but a==1 and b==2 are compatible, and b==2 and a==1 are compatible.

What is the correct solution to your requirements? It can be either c==1, a==1, b==2 or c==1, a==2, b==1 and there is no objective better solution between these two. In general pip does not guarantee a specific ordering, only that the final solution is valid, it is up to the user to provide additional constraints if they want certain characteristics out of the solution.

I'm not actually sure which part of the docs you are referring to from your link, but the example given there is meant to be just an illustrative example, not a specific guide to how the resolver works.

kotopesutility commented 6 months ago

What is the correct solution to your requirements? It can be either c==1, a==1, b==2 or c==1, a==2, b==1 and there is no objective better solution between these two.

Yes, that's clear. But what if user (developer, what ever) always wants the latest a and does not care about b version? But this is not the key question that interests me in this issue.

In general pip does not guarantee a specific ordering, only that the final solution is valid, it is up to the user to provide additional constraints if they want certain characteristics out of the solution.

I'm not actually sure which part of the docs you are referring to from your link, but the example given there is meant to be just an illustrative example, not a specific guide to how the resolver works.

I'm just trying to understand why pip varies from numpy but not from pandas?) numpy was chosen first, so its version should be locked in until there are other variants for other dependencies, right?

Btw, did I understand correctly, numpy was selected first according to the fourth element of the list?

notatallshaw commented 6 months ago

Yes, that's clear. But what if user (developer, what ever) always wants the latest a and does not care about b version? But this is not the key question that interests me in this issue.

There is no Python package specification for "latest" version, so there is no way to guarantee this. In this example Pip , failing all else, depends on user order, so if the user includes a and b in their dependencies then a should be a newer version than b, but there are lots of other steps during the backtracking where a different order might be chosen for a different reason.

I'm just trying to understand why pip varies from numpy but not from pandas?) numpy was chosen first, so its version should be locked in until there are other variants for other dependencies, right?

The short answer is I don't know.

The long answer is I would need to check step by step what pip chose, probably by walking through the code step by step, I can see from running pip install -I --dry-run --pre triad==0.9 -v that pip considered numpy-2.0.0b1 but then backtracked to numpy-1.26.4 .

I might be able to spend some time figuring this out if I spend a bit of time on it, but I'd want to understand what specific problem is this causing?

Btw, did I understand correctly, numpy was selected first according to the fourth element of the list?

There are many lists on this page, I don't know which one you mean. If you have a specific question about a specific line of the documention, please quote that line of the documention.

kotopesutility commented 6 months ago

There are many lists on this page, I don't know which one you mean. If you have a specific question about a specific line of the documention, please quote that line of the documention.

Oh, my fault, the second list from this section.

notatallshaw commented 6 months ago

Oh, my fault, the second list from this section.

Hmm, one thing that list from documentation is missing is whether the dependency is considered a "cause" of the conflict.

It would be between pinned and depth and the implementation can be seen here: https://github.com/pypa/pip/blob/24.0/src/pip/_internal/resolution/resolvelib/provider.py#L187

Unforutunately there is not a useful logging and it's very tricky to statically analyze the resolution from reading the code to determine why numpy got preferred to backtrack on. When I've investigated this sort of issue in the past I would tend to add a breakpoint near the the point where the preference is called in within resolvelib code, which is here: https://github.com/pypa/pip/blob/24.0/src/pip/_vendor/resolvelib/resolvers.py#L426, and I would write some snippets to investigate what is in unsatisfied_names and what value each name got from get_preference.

alryaz commented 6 months ago

N.B. All references made to PyPi history section are valid at the time of writing.

Okay, so this issue arose from a midnight-bound discussion between me and @kotopesutility , and I managed to reproduce it (Python 3.10.12, pip==24.0), however here's my hot take at explaining this (for anyone stumbling upon this issue):

pip3 install --pre triad==0.9.5

  1. setup.py of triad==0.9.5 mentions numpy without any versioning constraints.
  2. As pip iterates these requirements, it fetches metadata for the latest version of numpy available at PyPi - numpy==2.0.0b1.
  3. Next dependency to be fetched is pandas>=1.3.5, whose latest version available at PyPi - pandas==2.2.1.
  4. Other dependency metadata downloads follow suit until backtracking commences.
  5. pip encounters a hard pin of numpy<2 within pandas==2.2.1 package.
  6. Thus, the previously assumed numpy==2.0.0b1 will not suffice due to numpy<2 constraint discovered after downloading pandas==2.2.1.
  7. pip then fetches metadata for the latest version of numpy>=1.22.4,<2 available at PyPi - numpy==1.26.4.
  8. pip then continues to tree-iterate metadatum of the dependencies until everything is consistent.
  9. pip commenses download of *.whl files.

pip3 install --pre triad==0.9.5 numpy==2.0.0b1

  1. setup.py of triad==0.9.5 mentions numpy without any versioning constraints. Versioning constraint, however, is provided manually via CLI: numpy==2.0.0b1.
  2. As pip iterates these requirements, it fetches metadata for the latest version of numpy==2.0.0b1 available at PyPi - numpy==2.0.0b1.
  3. Next dependency to be fetched is pandas>=1.3.5, whose latest version available at PyPi - pandas==2.2.1.
  4. Other dependency metadata downloads follow suit until backtracking commences.
  5. pip encounters a hard pin of numpy<2 within pandas==2.2.1 package.
  6. Thus, the previously assumed pandas==2.2.1 will not suffice due to imperative numpy==2.0.0b1 constraint, and therefore another version of pandas>=1.3.5 must be discovered.
  7. pip then iterates downwards versions of pandas<2.2.1 available at PyPi:
    1. It fetches pandas==2.2.0, whose requirements incur a hard pin of numpy<2, therefore it will not suffice.
    2. Then it fetches pandas==2.2.0rc0, whose requirements allow any version of numpy>=1.26.4 (pinning numpy<2 occurred at PR https://github.com/pandas-dev/pandas/pull/56812 and is available on pandas>=2.2.0), therefore it will suffice.
  8. pip then continues to tree-iterate metadatum of the dependencies until everything is consistent.
  9. pip commenses download of *.whl files.

pip3 --version; python3 --version

pip 24.0 from /home/alryaz/.virtualenvs/test/lib/python3.10/site-packages/pip (python 3.10)
Python 3.10.12
notatallshaw commented 6 months ago

That looks about correct, and btw you can use pypi-timemachine if you need to reproduce a resolution from a given date.

What I still don't understand is was there some issues or undesirable behavior to this pip backtracking? I'm working on a couple of changes to the resolver, that may or may not ever land, but I don't know if they would help your situation or not, as I don't understand what was the issue.

alryaz commented 6 months ago

I assume the issue is mostly philosophical and/or documentational.

Upon encountering the numpy<2 constraint there are two approaches which may guarantee valid installation result:

N.B. Both options relate to the pip3 install --pre triad==0.9.5 that doesn't impose additional constraints over numpy.

I might have missed the documentation point that determines which of these approaches is implemented, and while it isn't an objectively crucial point, eventual consistency may be achieved using two strategies, with two different resulting sets of packages.

<...> pip does not guarantee a specific ordering, only that the final solution is valid, it is up to the user to provide additional constraints if they want certain characteristics out of the solution.

I suppose it heavily depends on the definition of a user. If the user are developer, they are in control over how to define requirements for installation. If the user are an end user, they are required to be knowledgeable about possible dependency resolution paths to infer what kind of installation-time pinning they are required to do manually.

Edit: O(...) of both approaches may be equal (didn't do any calculations). Edit 2: I used the terms breadth-/depth-first frivolously; however their approaches seem to be ideologically closest to whatever pip's dependency resolver is doing

notatallshaw commented 6 months ago

It's an implementation detail but the pip resolver (resolvelib) does a breadth first search of the users requirements followed by a depth first search of further dependencies.

FYI, in general finding an dependency specify an upper bound on a package the resolver has already pinned and that pinned version is higher than the upper bound is problematic for most resolution algorithms. And often produces either pathological performance or versions that are valid but the user considers "bad" in some sense. I have been working on an idea to fix this but it involves a significant extension to the resolution algorithm and may never be accepted because it is a previously untested idea.

That said, I still don't understand what you find bad about the solution pip found. Can you please be more explicit in what issue you have with the solution found.

Also, I would be interested to see if uv gives the same or different versions to these same requirements. I will try when I am near a computer.

notatallshaw commented 6 months ago

Btw this is what uv installs, is it more or less to your liking? Or the same? And why?

$ uv pip install --reinstall --dry-run --pre triad==0.9.5
Resolved 12 packages in 566ms
Would download 11 packages
Would uninstall 1 package
Would install 12 packages
 + appdirs==1.4.4
 + fs==2.4.16
 + fsspec==2024.3.1
 + numpy==2.0.0b1
 + pandas==2.2.0rc0
 + pyarrow==14.0.2
 + python-dateutil==2.9.0.post0
 + pytz==2024.1
 - setuptools==65.5.0
 + setuptools==69.2.0
 + six==1.16.0
 + triad==0.9.5
 + tzdata==2024.1
notatallshaw commented 3 months ago

Following up on this, I'm still not understanding what you think about pip's choices resulting in a bad state, what you consider a good state (actual package versions) and why the good state is good and the bad state is bad.

Pip’s resolver, in general, is looking for “any” solution, it is trying from newer packages and going to older packages, but there is no guarantee about which package will be new or old when there are conflicts.

It is perhaps something that could be improved in the resolver code when there is a clear way of identifying a better state, but there would need to be some general objective way of measuring better.