pypa / pip

The Python package installer
https://pip.pypa.io/
MIT License

Add a resolver option to use the specified minimum version for a dependency #8085

Open dhellmann opened 4 years ago

dhellmann commented 4 years ago

What's the problem this feature will solve?

I would like to be able to install my project using its "lower bounds" requirements and run the test suite to ensure that (a) I have those lower bounds specified properly and (b) the tests pass.

Describe the solution you'd like

A new command line option --prefer-minimum-versions would change the resolver behavior to choose the earliest version supported by a requirement specification. For example, if versions 1.0 and 2.0 of package foo are available and the specification is foo>=1.0, then version 1.0 would be installed when the flag is used and version 2.0 when it is not.
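To make the intended behaviour concrete, usage might look something like this (the flag does not exist yet; the name and behaviour are only what is proposed above):

```
# proposed, not yet implemented
pip install --prefer-minimum-versions "foo>=1.0"
# with the flag:    foo 1.0 is selected (lowest version satisfying the specifier)
# without the flag: foo 2.0 is selected (current behaviour, highest version)
```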

Large applications such as OpenStack have a lot of dependencies and verifying the accuracy of the complete set is complex. Providing a way to install the earliest set of packages expected to work would make this easier.

Alternative Solutions

The existing constraints file option does help, but building a valid constraints file is complicated.
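To illustrate what that workaround looks like today (the package names and pins are made up for the example):

```
# lower-constraints.txt -- every dependency pinned to its declared minimum
#   foo==1.0
#   bar==2.3.1
pip install -c lower-constraints.txt -e .
pytest
```

Keeping such a file in sync with the actual lower bounds declared in the package metadata is the hard part.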

Additional context

There is a recent discussion about this need on the openstack-discuss mailing list.

I will work on the implementation.

uranusjr commented 2 years ago

I’m not sure I understand your “package author” problem. Is it that (say) package A’s dependency B declares a transitive dependency without a lower bound (say C>10), so pip install A in minimal-version-selection mode would fail? If that’s the case, package A can always amend that by adding the lower bound on C themselves to alleviate the issue if package B doesn’t resolve it.

But I feel arguably package A doesn’t need to worry about this. Judging from the messages above, this option is not particularly useful for people running pip install directly, but mostly (always?) for people looking for a way to validate the dependency set containing package A. And for that purpose there are already many established ways to add the additional lower bound (e.g. constraints).


Somewhat off-topic: If we introduce this functionality and push lower-bound checking to become best practice for both package and application authors, we might also have a good chance to introduce constraints as standard package metadata (“if this package is installed, make sure it fits the range; if it’s not, don’t bother installing it”) and solve the package author problem.

thomasf commented 2 years ago

Realistically, what version selection algorithms besides minimum or maximum are even viable? Sounds to me that a --strategy=... might be over-preparing for something that likely won't be needed?

Asday commented 2 years ago

Only minimum is inarguably viable; maximum has shown itself to be completely untenable with any reasonably deep dependency tree. The problem is that maximum, and some weird hybrid of maximum and "whatever's currently installed I don't care", is what's been used in the industry with Python for decades now, so pulling off the band-aid and pouring alcohol on the wound by going straight to minimum with no soft landing is going to be painful.

Painful and expensive, such that companies will likely not do it for incumbent projects. Hence the discussion.

henryiii commented 2 years ago

Minimum is only useful for testing, you wouldn't use it in production. In production, maximum + a lock file works fine and is the only viable solution. Newer versions are more likely to work than older versions; while there's a chance newer versions might break you, older versions are nearly guaranteed to break you - that's why people release updates, to fix things! Newer versions of Python, newer versions of OSs (like macOS 11), and newer architectures (like Apple Silicon) are all much more likely to work with the latest version of packages and not with older versions. Older package versions also will not have wheels for newer Pythons and architectures; very few projects go back and produce wheels of old versions after new Pythons drop. The way to avoid breakage is to use a lock file, and that's what is done. This is true in other languages too, by the way.

weird hybrid of maximum and "whatever's currently installed I don't care"

AFAIK, most users use virtualenvs and reinstall, which gives you the latest. Or they use a locking package manager (Poetry, PDM, Pipenv) and use update, which again gets the latest versions, just like npm or bundler. The only "hybrid" users are those who don't know better and just pip install -U occasionally, mostly students and new users.

Realistically, what version selection algorithms besides minimum or maximum is even viable

Personally, I'd like a "maximum without backsolving if an upper cap is used" algorithm. Currently, if I install a (which depends on b<2) together with b>=2, the solver starts looking back at a's history to see if there's a version that doesn't have a cap on b. And an older version of a is even less likely to support b>=2! Even though that's technically a mathematical "solution" to the solve. The same thing happens with the Python version.
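As a made-up illustration of that backtracking (a and b are hypothetical packages, where every release of a declares b<2):

```
# user asks for both; the constraint set is unsatisfiable with the latest a
pip install "a" "b>=2"
# instead of failing fast, the resolver walks back through a 1.4, 1.3, ...
# hunting for a release without the cap, even though older releases are even
# less likely to work with b 2.x
```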

If there was some way to standardize lock files and distribute them in wheels, then maybe there could be a "prefer locked" solve, which would be ideal for application installs, like pipx.

thomasf commented 2 years ago

Please stop declaring what the "only viable option" is. There is always more than one viable option, even if your own particular usage patterns do not require them.

The aim of this discussion should be to figure out how to satisfy the needs of different kinds of users, not just yourself.

Asday commented 2 years ago

Minimum is only useful for testing, you wouldn't use it in production

Please do not assume. Minimum is exactly what I would use in production, because change control and auditing are an incredibly important part of what I do for compliance reasons (and also because I care about my users).

It is my job to keep up to date with upgrades that need doing for security reasons. I am able to do that on a per-package level and upgrade only what's needed, and cut down the amount of third party code that needs to be audited. I do not need some break-everything maximum satisfier algorithm to sextuple my workload.

If your workload is eased by a maximum solver then bully for you, pip already works this way and you don't need to have any place in the discussion. Note the issue title is "add a resolver option".

uranusjr commented 2 years ago

Let me categorise it in another way. Minimum is only useful for “testing”, in the sense that users of this option need to check the installation result to make sure it is usable. The point is that you shouldn’t blindly use some random dependency specifications with open version ranges to install packages for production; you need to perform some kind of “making sure what those dependency specifications actually do” step first (I hope we can agree on that).

For maximum version selection, the process should be:

  1. Declare open version ranges
  2. Convert those open version ranges to deterministic versions (i.e. locking)
  3. Use the result in production

While with minimum version selection, it’s

  1. Declare open version ranges
  2. Audit the ranges so each dependency selects a usable minimum version
  3. Use the result in production

And since minimum version selection means no future release can change the installation result, the second step is effectively the same as locking, without an explicit lock file! So in theoretical terms, specify-lock-install is indeed the only viable way to deploy to production; the only difference is that the locking can be done in various ways.
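To make the comparison concrete, the maximum-selection flow is what tools like pip-tools already automate (a sketch only; pip-compile is just one example of step 2):

```
# requirements.in contains the open ranges (step 1), e.g. "foo>=1.0"
pip-compile requirements.in        # step 2: lock to exact, newest-compatible pins in requirements.txt
pip install -r requirements.txt    # step 3: deterministic install for production
```

With minimum selection, step 2 is instead the audit of the lower bounds themselves, and no separate pinned file is needed.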

Asday commented 2 years ago

They are exactly not the same at all.

With minimum version selection, I declare a minimum version of my dependent packages. It solves (trivially), and I audit the dependency tree. Should any transitive packages need security updates, I can submit a PR to their parents in the tree. At any point in time, the same input will produce the same output. I can spin up a new server or dev environment, install the same dependencies, and be guaranteed the same results.

With maximum version selection, I declare a minimum AND a maximum version of my dependent packages, it solves (non-decidably), I audit the dependency tree, and then I must output a listing of exactly every version of every package decided upon. When I want to update the version of one of my dependent packages, I must walk its tree to see which packages have upper bounds higher than what's in my lockfile and have new releases, and audit all of those, or leave them pinned in my lockfile and have the build potentially fail.

Worse than that, I even have to deal with the case where I depend on A<2 which depends on B which releases a new version with which A is incompatible, now forcing me to specify a maximum for B in my direct dependencies, even though I never use it anywhere. With minimal version selection, you have the guarantee that the original author of your dependency tested with their minimum dependency.

Please let's not stymie the great work people have already done in this thread on a minimal solver by claiming - incorrectly - that it's the same as maximal with extra steps.

uranusjr commented 2 years ago

I don’t think you understand what I was trying to say (where did you get the idea I was saying minimum version selection needs more steps?), nor do you seem to recognise that I have already been heavily involved in providing the solution (if you care to re-read the thread, I have been one of the people actually trying to help). So I’ll try not to respond to your comments from now on, in the spirit of not stymieing things, and hope you will do the same with mine.

henryiii commented 2 years ago

I'm strongly in favor of providing options and supporting a wide range of usage patterns. I'm strongly in favor of a minimum solver because I'd like to be able to provide working minimums for my package via testing; currently it is quite tricky to verify minimums, and most packages either don't provide minimums, or have to manually keep up a constraints file to test minimums. Most packages do not provide usable minimums today.

I am not in favor of statements like "Only minimum is inarguably viable, maximum has shown itself to be completely untenable", and was responding with a similarly strong statement for my position (normal solver + lock files) - I wasn't really seriously saying everyone had to do it (I pointed out that at least one group, inexperienced users, do not). My "position" is a common mechanism used in industry and in different ecosystems. Minimum is not inarguably viable, because it's impossible to maintain software that is pinned to a minimum. Even if we have a minimum solver and I could put useful minimums on all my package requirements, I can't verify that the minimums don't include security loopholes, missing wheels, OS/arch incompatibilities, etc. Minimums are intended to help users auto-upgrade incompatible packages, not to be used as replacement for a lock file.

Quick example: Say I support NumPy 1.13.3+. I test this on Python 3.6, and it's true. From my standpoint, I truly do support 1.13.3+ - nothing in my code requires something newer. However, if you try to use the minimum solver in production, you might use Python 3.7 - which requires NumPy 1.14.5+ (each newer Python has a newer minimum NumPy). You will try to build NumPy 1.13.3 from source, and it might or might not work (probably the latter, but it's system dependent). Or maybe you are on AIX, which requires NumPy 1.16.0+. Or you are on Linux ARM, which requires NumPy 1.19.2 (technically wheels were released before that, but it had serious bugs until this version). Apple Silicon requires 1.21.0+. Or you are really running PyPy 3.7, which needs NumPy 1.19.0+.
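Just to show what fully spelling that out would take, the floors above would have to become per-environment markers, something like this (illustrative only, using only the versions mentioned in this thread):

```
# PEP 508 environment markers, partial sketch
numpy>=1.13.3; python_version < "3.7"
numpy>=1.14.5; python_version >= "3.7"
numpy>=1.19.0; implementation_name == "pypy"
```

And that snippet still says nothing about OS or architecture.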

With a "normal" solver, you get 1.23.1 today, which works on all these platforms.

As a library author, it is not my place to work out all these things. I should be able to specify what my library requires based on my usage. A normal user on CPython 3.7 will not have anything older than NumPy 1.14.5 - how could they, it came out when Python 3.7 did, and currently you always start with the latest version. So specifying a minimum will not break them because it doesn't affect them. If, however, they were on Python 3.5 and had an old environment with NumPy 1.12, then having the correct minimum will cause pip to upgrade NumPy, fixing a potential bug. In this scenario, having a way to test with minimums is a clear win for the ecosystem.

If, however, as a production user, you try to install based on minimums, suddenly you are installing versions of software that were never tested on your Python version or your OS and maybe arch. You are getting NumPy 1.13.3 on these systems it very much did not support, because they didn't exist.

It is my job to keep up to date with upgrades that need doing for security reasons. I am able to do that on a per-package level and upgrade only what's needed,

Then you have the equivalent of a lock file (either a real file or a single environment that you are maintaining) that you manually update (and therefore are not using the "minimums" ;) ). That's exactly what I'm describing. The only difference is you want to start with the minimums for your lock file, rather than the maximums. I think that's likely not tenable in practice, since it forces you to perform immediate updates (some of the minimums won't work due to the problems listed above) and more updates (since you don't get the benefit of a support window, everything immediately starts aging out).

Edit: I will also avoid further comments, since this is not helpful to the task at hand. I'd already written it or I would not have even posted this.

Asday commented 2 years ago

Minimum is not inarguably viable, because it's impossible to maintain software that is pinned to a minimum.

Minimal satisfaction is not the same as minimum pinned. I declare I want Django 2 and django-cms, and django-cms' dependency on at least 2.2 means I get Django 2.2.

It is easier to maintain software that's minimally satisfied than maximally satisfied, because there are fewer changes. Instead of reading a mountain of diffs every Tuesday, you read one release-notes page per dependency and then sometimes read diffs if a security issue is found.

Even if we have a minimum solver and I could put useful minimums on all my package requirements, I can't verify that the minimums don't include security loopholes, missing wheels, OS/arch incompatibilities, etc.

I certainly can - it's what I'm paid for. Do you trust a maximal solver to not introduce security loopholes?

Minimums are intended to help users auto-upgrade incompatible packages, not to be used as replacement for a lock file.

The lock file is a hack that's grown out of a need to patch over the unmaintainability of ever-changing package versions caused by the maximal solver. In a perfect world there shouldn't be a "replacement" for a lock file; there should never have needed to be such a concept in the first place. We're very much stuck in the days before Henry Ford, when all people could imagine wanting was "faster horses".


Quick example: Say I support NumPy 1.13.3+. I test this on Python 3.6, and it's true. From my standpoint, I truly do support 1.13.3+ - nothing in my code requires something newer. However, if you try to use the minimum solver in production, you might use Python 3.7 - which requires NumPy 1.14.5+ (each newer Python has a newer minimum NumPy).

Perfect so far.

You will try to build NumPy 1.13.3 from source

No I won't; I believe you've misunderstood the concept of a minimal solver. It takes the minimum named version of a package in the dependency tree and installs that. In your example here, you have said you support 1.13.3 (with minimal version satisfaction the + is implicit). You have also quite correctly determined that specific version by noting that nothing in your code needs anything newer.

However, I have named your package in my dependencies, and named python_requires=">=3.7", thus forcing the minimal named version of numpy to become 1.14.5.

With a "normal" solver, you get 1.23.1 today, which works on all these platforms.

"Today". That is the issue. I need the same input to produce the same output no matter what. (Barring stuff being DMCA'd form PyPI of course). With a maximal solver, tomorrow I could get 1.23.2. The day after I could get 1.24.0. Next it's 2.0 and everything breaks.

With a minimal solver, in production, I get to specify what my minimum is, and then it will only change when I change the dependencies or specify a new minimum. I am never blindsided by innumerable dependent libraries updating themselves because today is upgrade day, and I'm never forced to specify an upper bound because nothing's going to wander upwards.

If, however, as a production user, you try to install based on minimums, suddenly you are installing versions of software that were never tested on your Python version or your OS and maybe arch

I have no idea how you've come to this conclusion. I am a production user. I place great emphasis on dev/prod parity, as does anyone I know personally in the industry. There is not a single line of code on production that hasn't been tested on identical (if downsized) architecture before deployment. If your argument is that some people will be developing on ARM Linux machines and then deploying directly to Intel Windows servers and they'll have problems, I don't see what that has to do with minimal version satisfaction. If your argument is something else, I don't see it.

Then you have the equivalent of a lock file

I do not. I have a constraints file which produces reproducible output. Not all flowers are roses. A lockfile is a constraints file that produces reproducible output by way of hard definitions of every output version. This is unmaintainable when it comes to partial upgrades, and full upgrades are also unmaintainable.

I recommend this as reading material.

adamjstewart commented 2 years ago

Realistically, what version selection algorithms besides minimum or maximum are even viable? Sounds to me that a --strategy=... might be over-preparing for something that likely won't be needed?

From this thread, we can already see that there are at least 3 commonly requested strategies:

  1. Maximum version for all dependencies
  2. Minimum version for all dependencies
  3. Minimum version for direct dependencies, maximum version for indirect dependencies

Between 2 and 3, I actually prefer 3 personally.

In addition to these, I can think of many more possible strategies:

  4. Update all installed dependencies to the latest version
  5. Reuse what is already installed without updating, only install new dependencies
  6. Randomly install any valid version, great for detecting bugs with random releases
  7. Minimize number of packages installed
  8. Maximize number of packages installed (via extras_require)

Of course, you may or may not want to support all of these. Just pointing out that it isn't valid to assume that maximal and minimal are the only two possibilities.

P.S. I'm a developer for the Spack package manager, which builds from source or binary and supports significantly more customizability (choosing compiler, BLAS/LAPACK, extras, etc.) and we've been working on this same issue. We have a PR to add support for minimal version installs (https://github.com/spack/spack/pull/28470) and have discussed many more possible strategies (https://github.com/spack/spack/pull/28468#issuecomment-1015035473). Might be a good reference for how other package managers are doing things.

pradyunsg commented 2 years ago

Hi folks, I see that spirits are high here.

I suggest we let this thread cool off for a day or so, before we post further responses.

thomasf commented 2 years ago

From this thread, we can already see that there are at least 3 commonly requested strategies:

  1. Maximum version for all dependencies
  2. Minimum version for all dependencies
  3. Minimum version for direct dependencies, maximum version for indirect dependencies

Between 2 and 3, I actually prefer 3 personally.

I suppose someone might want (3), mostly for testing?

For actually managing project dependencies, (3) has the same basic problem as (1): you do not get a predictable set of dependencies on repeated installations.

In addition to these, I can think of many more possible strategies:

  4. Update all installed dependencies to the latest version
  5. Reuse what is already installed without updating, only install new dependencies
  6. Randomly install any valid version, great for detecting bugs with random releases
  7. Minimize number of packages installed
  8. Maximize number of packages installed (via extras_require)

(4), (7) and (8) are about package selection, so they are orthogonal to version selection and should probably not share the same command-line flag.

(6) is the only one that strictly is about version selection.

uranusjr commented 2 years ago

I think the third strategy mentioned in this thread is the other way around? i.e. maximum for direct dependencies and minimum for transitive ones.

davegaeddert commented 2 years ago

I made a couple more changes to my branch and opened a PR here: #11336

Regardless of the details of a possible third strategy, it sounds like there's potential for more, so I left it as a string option. Those could be discussed down the road if someone wants to flesh them out, and it won't make for an annoying rename / backwards-compatibility issue if they do. I did rename it to --version-selection, though, because I do agree that some of these other "strategies" are veering (orthogonally) outside of "version selection", and I think some more precise terminology would be helpful (whether that's the term you want to go with, I don't know).

I also removed the interactive prompt when there's a lower bound missing — it will throw an exception now, so you'll have to do something to address it. I can see how constraints could apply to both the application and package use cases, and it's simpler to leave the prompt out for now anyway.

davegaeddert commented 2 years ago

FWIW, I did try using this in CI roughly how I'd plan to use it for the packaging use case. Found some dependency ranges that I needed to modify... The workflow for figuring out the minimum versions was a little rough (for a first time anyway) but so long as pip could do the version selection, I'd guess that you could probably make some tooling around/outside of pip to help identify minimum requirements (for some I read changelogs, for others I did a git bisect-style approach of bumping versions + running tests).

Here's my GitHub workflow example: https://github.com/dropseed/combine/pull/89/files#diff-faff1af3d8ff408964a57b2e475f69a6b7c7b71c9978cccc8f471798caac2c88

thomasf commented 2 years ago

I moved a few smaller (non public) projects over from pipenv or pure requirements.txt on experimental branches. It's going pretty well so far.

It should probably be allowed to use --version-selection in a requirements.txt file.
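Something like this, if it were allowed (the --version-selection name is just the one from the PR above, and whether pip would accept it as a per-file option is exactly what's being suggested):

```
# requirements.txt -- hypothetical
--version-selection minimum
foo>=1.0
bar>=2.3
```

Requirements files already carry other global options (e.g. --index-url), so it would fit the existing format.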

Asday commented 1 year ago

It should probably be allowed to use --version-selection in a requirements.txt file.

I'd love to back up my :+1: on this by pointing out the insane things PHP developers have had to do to deal with the fact that their applications run differently based on their hosting server's PHP interpreter configuration. Obviously it's slightly different here, since normally whoever writes the requirements file is also the one running pip, and this is declarative configuration rather than code, but the analogy refused to leave my brain.

jayaddison commented 1 year ago

Realistically, what version selection algorithms besides minimum or maximum are even viable? Sounds to me that a --strategy=... might be over-preparing for something that likely won't be needed?

@thomasf I've arrived on this issue thread after encountering a use case where a different dependency installation version selection mechanism could have been useful.

Here's an attempt to explain:

Scenario

After checking out an old Python project's release tag, I attempted to pip install dependencies into a local venv and then to run its unit tests. However: some of the dependencies have undergone breaking changes, and so the resulting installation set didn't produce a working result.

I don't want to install minimum versions of everything, because I expect that some/many of the components involved have had useful improvements (performance, security, bugfix, ...) since the requirements/constraints were defined.

Desire

What I'd like to handle that use case would be a combination of: install the maximum compatible version within an upper-bounding timestamp (which I might select as, for example, the time the release tag was created - or a few months later). That should install a set of dependencies that worked (and was considered freshest) at that moment-in-time.

pfmoore commented 1 year ago

What I'd like to handle that use case would be a combination of: install the maximum compatible version within an upper-bounding timestamp

Time-based version bounds seem like a relatively common and understandable requirement, but it's not really one that pip (or indeed the Python package versioning scheme) is designed to handle particularly well as things stand.

One thing that could work for this sort of scenario would be a tool that took a date, read the PyPI metadata for a series of packages, and wrote out a constraints file based on the upload times of the files, constraining each package to only versions released before the given date. That should be a relatively easy tool to write, and would give the effect of upper-bounded timestamp constraints without needing any changes to existing infrastructure.

I know that people are typically resistant to building solutions from co-operating tools like this, preferring a "one tool does everything" approach, but if anyone is interested in creating a short-term solution (and possibly a generally useful tool) then maybe it would be worth looking into this possibility.

pradyunsg commented 1 year ago

https://github.com/astrofrog/pypi-timemachine exists!

henryiii commented 1 year ago

It's a bit more work than the proposed tool (it creates a PyPI proxy), but https://pypi.org/project/pypi-timemachine/ does this.
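Roughly, you start the proxy for a cut-off date and then point pip at the local index it serves (the port below is a placeholder; see the pypi-timemachine README for the exact invocation):

```
# --index-url is a standard pip option; the URL is whatever the proxy reports
pip install --index-url http://127.0.0.1:8080/ -r requirements.txt
```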

pradyunsg commented 1 year ago

It's coupled to PyPI, but someone could take that forward and extend it to support Artifactory and custom variants similar to how https://github.com/uranusjr/simpleindex allows you to have custom routing strategies.

notatallshaw commented 7 months ago

Small update for anyone who needs this feature: uv has a pip-like install interface and offers the option --resolution with possible values "highest", "lowest", and "lowest-direct".
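For example, going by those options:

```
# minimums for everything (direct and transitive dependencies):
uv pip install --resolution lowest -r requirements.txt

# minimums for direct dependencies only, latest for everything else:
uv pip install --resolution lowest-direct -r requirements.txt
```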

pohlt commented 7 months ago

PDM also offers this functionality.