Stepping back from the specific request of overriding build dependencies, the problem presented in the top post can be avoided by adding additional logic to how build dependencies are chosen. When a package specifies `numpy` (for example) as a build dependency, pip is free to choose any version of numpy; right now it chooses the latest simply because that is the default logic. But we could instead condition the logic to prefer matching the run-time environment where possible, which would keep the spirit of build isolation while also solving the build/run-time ABI mismatch problem.
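A minimal sketch of that "prefer the run-time version" idea (not pip's actual logic; `prefer_installed` is a hypothetical helper and only handles bare project names, no specifiers or markers):

```python
from importlib.metadata import PackageNotFoundError, version

def prefer_installed(build_requirement: str) -> str:
    """Pin a bare build requirement to the version already installed, if any."""
    try:
        installed = version(build_requirement)  # e.g. "1.21.4" if numpy 1.21.4 is installed
    except PackageNotFoundError:
        return build_requirement                # nothing installed: leave the requirement open
    return f"{build_requirement}=={installed}"  # e.g. "numpy==1.21.4"

print(prefer_installed("numpy"))
```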
+1 this is a healthy idea in general, and I don't see serious downsides.
Note that for `numpy` specifically, we try to teach people good habits, and there's a package `oldest-supported-numpy` that people can depend on in `pyproject.toml`. But many people new to shipping a package on PyPI won't be aware of that.
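For reference, this is roughly what the `oldest-supported-numpy` approach looks like in a project's `pyproject.toml` (the backend shown is just an example):

```toml
[build-system]
# Build against the oldest ABI-compatible numpy so the resulting wheel
# also works with newer numpy versions at run time.
requires = ["setuptools", "wheel", "oldest-supported-numpy"]
build-backend = "setuptools.build_meta"
```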
One of the situations discussed here has happened today -- setuptools has started rejecting invalid metadata, and users affected by this have no easy workarounds.
@jaraco posted #10669, with the following design for a solution.
I imagine a solution in which pip offers options to extend and constrain build dependencies at install time. Something like:
--build-requires=<dependencies or file:requirements> --build-constraints=<constraints or file:constraints>
These additional requirements would apply to all builds during the installation. It should also allow limiting the specifications to a particular project:
--build-requires=<project>:<dependencies or file:requirements> --build-constraints=<project>:<constraints or file:constraints>
For a concrete example, consider a build where `setuptools<59` is needed for `django-hijack`, `setuptools_hacks.distutils_workaround` is needed for all projects, and the deps in `scipy-deps.txt` are required for `mynumpy-proj`:

`pip install --use-pep517 --build-constraints "django-hijack:setuptools<59" --build-requires "setuptools_hacks.distutils_workaround" --build-requires "mynumpy-proj:file:scipy-deps.txt"`
The same specification should be able to be supplied through environment variables.
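None of these options exist in pip today, but if they did, pip's existing convention of mapping `--option-name` to a `PIP_OPTION_NAME` environment variable suggests a spelling along these lines (names purely hypothetical):

```sh
# Hypothetical equivalents of the proposed flags, following pip's existing
# convention of mapping command-line options to PIP_* environment variables.
export PIP_BUILD_CONSTRAINTS="django-hijack:setuptools<59"
export PIP_BUILD_REQUIRES="mynumpy-proj:file:scipy-deps.txt"
pip install --use-pep517 django-hijack mynumpy-proj
```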
> Stepping back from the specific request of overriding build dependencies, the problem presented in the top post can be avoided by adding additional logic to how build dependencies are chosen. When a package specifies `numpy` (for example) as a build dependency, pip is free to choose any version of numpy; right now it chooses the latest simply because that is the default logic. But we could instead condition the logic to prefer matching the run-time environment where possible, which would keep the spirit of build isolation while also solving the build/run-time ABI mismatch problem.
Some more thoughts I’ve had during the past year on this idea. Choosing a build dependency matching the runtime one is the easy part; the difficult part is that the runtime dependency version may change during resolution, i.e. when backtracking happens. And when that happens, pip will need to also change the build dependency, because there’s no guarantee the newly chosen runtime dependency has ABI compatibility with the old one. And here’s where the fun part begins. By changing the build dependency, pip will need to rebuild that source distribution, and since there’s no guarantee the rebuild will have the same metadata as the previous build, the resolver must treat the two builds as different candidates. This creates a weird these-are-the-same-except-not-really problem that’s much worse than PEP 508 direct URLs, since those builds likely have the same name, version (these two are easy), source URL (!) and wheel tags (!!). It’s theoretically all possible to implement, but the logic would need a ton of work.
> I imagine a solution in which pip offers options to extend and constrain build dependencies at install time.
And to come back to the “change the build dependency” thing. There are fundamentally two cases where an sdist’s build dependencies need to be overridden:

1. The declared build dependencies are fine per se, but the versions chosen for them need to match the run-time environment (the ABI mismatch case discussed above).
2. The declared build dependencies themselves are wrong, and the real fix is to edit `pyproject.toml`, re-package, and seamlessly tell pip to use that new sdist. pip likely still needs to provide some mechanism to enable the last “seamlessly tell pip” part, but the rest of the workflow does not belong in pip IMO, but in a separate tool. (It would be a pip plugin if pip had a plugin architecture, but it does not.)

> And here’s where the fun part begins. By changing the build dependency, pip will need to rebuild that source distribution, and since there’s no guarantee the rebuild will have the same metadata as the previous build, the resolver must treat the two builds as different candidates.
I'm not sure I agree with that. Yes, it's technically true that things could now break - but it's a corner case related to the ABI problem, and in general there's no guarantee that `pip install pkg_being_built` produces exactly the same result even on the same machine today. pip does not take versions of build dependencies into account at all in its current caching strategy.

A few thoughts I've had on this recently:
- Build dependencies should get an upper bound (sometimes `<=last_version_on_pypi`, sometimes `<=next_major_version`, sometimes `<=2_years_into_the_future`). This discussion shows why. It is unlikely to cause problems (because of build isolation), and guaranteed to avoid problems (sooner or later a new `setuptools` will break your packages' build, for example).
- For numpy specifically there is `oldest-supported-numpy`. Also, we have detailed docs for depending on NumPy.
- Often the only remedy a package author has is a `.postX` release to fix a broken `x.y.z` package version, and doing a release can be an extremely time-consuming operation.

I agree it should mostly work without the rebuilding part, but things already mostly work right now, so there is only value in doing anything for this use case if we can go beyond "mostly" and make things fully work. If a solution can't cover that last mile, we should not pursue it in the first place, because it wouldn't really improve the situation meaningfully.
I listed, later in the previous comment, the two scenarios in which people generally want to override metadata. The former case is what “mostly works” right now, and IMO we should either not do anything about it (because what we already have is good enough), or pursue the fix to its logical destination and fix the problem entirely (which requires the resolver implementation I mentioned).
For the latter scenario, unlike the former, we don’t currently have a solution that even “mostly” works, so there’s something to be done - but I’m also arguing that that something should not be built entirely into pip.
Looking at this issue and the similar one reported in #10731, are we looking at this from the wrong angle?
Fundamentally, the issue we have is that we don't really support the possibility of two wheels, with identical platform tags, for the same project and version of that project, having different dependency metadata. It's not explicitly covered in the standards, but there are a lot of assumptions made that wheels are uniquely identified by name, version and platform tag (or more explicitly, by the wheel filename).
Having scipy wheels depend on a specific numpy version that's determined at build time violates this assumption, and there's going to be a lot of things that break as a result (the pip cache has already been mentioned, as has portability of the generated wheels, but I'm sure there will be others). I gather there's an `oldest-supported-numpy` package these days, which I assume encodes "the right version of numpy to build against". That seems to me to be a useful workaround for this issue, but the root cause here is that Python metadata really only captures a subset of the stuff that packages can depend on (manylinux hit this in a different context). IMO, allowing users to override build requirements will provide another workaround[^1] in this context, but it won't fix the real problem (and honestly, expecting the end user to know how to specify the right overrides is probably optimistic).
If we want to properly address this issue, we probably need an extension to the metadata standards. And that's going to be a pretty big, complicated discussion (general dependency management for binaries is way beyond the current scope of Python packaging).
Sorry, no answers here, just more questions 🙁
[^1]: Disabling build isolation is another one, with its own set of problems.
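For concreteness, the build-isolation workaround from the footnote looks something like this (package names and pins are examples only; with `--no-build-isolation` every build dependency has to be pre-installed by hand):

```sh
# Pre-install the numpy you want to build against, plus the other build deps
# the package needs (setuptools, wheel, Cython, ... -- whatever applies).
pip install "numpy==1.21.4" setuptools wheel
# Then build from source in the current environment instead of an isolated one.
pip install scipy --no-binary scipy --no-build-isolation
```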
I think being able to provide users with a way to say "I want all my builds to happen with setuptools == 56.0.1" is worthwhile; even if we don't end up tackling the binary compatibility story. That's useful for bug-for-bug compatibility, ensuring that you have deterministic builds and more.
I think the "fix" for the binary compatibility problem is complete rethink of how we handle binary compatibility (which is a lot of deeply technical work) which needs to pass through our standardisation process (which is a mix of technical and social work). And I'm not sure there's either appetite or interest in doing all of that right now. Or if it would justify the churn budget costs.
If there is interest and we think the value is sufficient, I'm afraid I'm still not quite sure how tractable the problem even is and where we'd want to draw the line of what we want to bother with.
I'm sure @rgommers, @njs, @tgamblin and many other folks will have thoughts on this as well. They're a lot more familiar with this stuff than I am.
As for the pip caching issue, I wonder if there's some sort of cache busting that can be done with build tags in the wheel filename (generated by the package). It won't work for PyPI wheels, but it should be feasible to encode build-related information in the build tag for packages that people build themselves locally. This might even be the right mechanism to try, using existing semantics, toward solving some of the issues.
Regardless, I do think that's related but somewhat independent of this issue.
To be clear, build tags are a thing in the existing wheel file format: https://www.python.org/dev/peps/pep-0427/#file-name-convention
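For reference, the filename convention from that PEP, with a made-up build tag showing how build-time information could be encoded for cache-busting (build tags must start with a digit; `1numpy1214` is purely illustrative):

```text
{distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl

scipy-1.7.3-1numpy1214-cp39-cp39-linux_x86_64.whl
```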
@pfmoore those are valid questions/observations I think - and a lot broader than just this build reqs issue. We'd love to have metadata that's understood for SIMD extensions, GPU support, etc. - encoding everything in filenames only is very limiting.
> (and honestly, expecting the end user to know how to specify the right overrides is probably optimistic).
This is true, but it's also true for runtime dependencies - most users won't know how that works or if/when to override them. I see no real reason to treat build and runtime dependencies in such an asymmetric way as is done now.
> If we want to properly address this issue, we probably need an extension to the metadata standards. And that's going to be a pretty big, complicated discussion (general dependency management for binaries is way beyond the current scope of Python packaging).
Agreed. It's not about dependency management of binaries though. There are, I think, 3 main functions of PyPI:

1. hosting source releases (sdists) of packages,
2. hosting pre-built binaries (wheels) of packages,
3. serving as the default "build from source on the end user's machine" mechanism whenever no compatible wheel is available.

This mix of binaries and from-source builds is the problem, and in particular - also for this issue - (3) is what causes most problems. It's naive to expect that from-source builds of packages with complicated dependencies will work for end users. This is obviously never going to work reliably when builds are complex and have non-Python dependencies. An extension of metadata alone is definitely not enough to solve this problem. And I can't think of anything that will really solve it, because even much more advanced "package manager + associated package repos" setups, where complete metadata is enforced, don't do both binary and from-source installs in a mixed fashion.
> And I'm not sure there's either appetite or interest in doing all of that right now. Or if it would justify the churn budget costs.
I have an interest, and some budget, for thoroughly documenting all the key problems that we see for scientific & data-science/ML/AI packages in the first half of next year, so that we can at least be on the same page about what the problems are, and can discuss which ones may be solvable and which ones are going to be out of scope.
> Regardless, I do think that's related but somewhat independent of this issue.

Agreed.
I agree that being able to override build dependencies is worthwhile, I just don't think it'll necessarily address all of the problems in this space (e.g., I expect we'll still get a certain level of support questions from people about this, and "you can override the build dependencies" won't be seen as an ideal solution - see https://github.com/pypa/pip/issues/10731#issuecomment-995544692 for an example of the sort of reaction I mean).
> To be clear, build tags are a thing in the existing wheel file format
Hmm, yes, we might be able to use them somehow. Good thought.
> And I'm not sure there's either appetite or interest in doing all of that right now. Or if it would justify the churn budget costs.
I think it's a significant issue for some of our users, who would consider it justified. The problem for the pip project is how we spend our limited resources - even if the packaging community[^1] develops such a standard, should pip spend time implementing it, or should we work on something like lockfiles, or should we focus on critically-needed UI/UX rationalisation and improvement - or something else entirely?
> I see no real reason to treat build and runtime dependencies in such an asymmetric way as is done now.
Agreed. This is something I alluded to in my comment above about "UI/UX rationalisation". I think that pip really needs to take a breather from implementing new functionality at this point, and tidy up the UI. And one of the things I'd include in that would be looking at how we do or don't share options between the install process and the isolated build environment setup. Sharing requirement overrides between build and install might just naturally fall out of something like that.
But 🤷, any of this needs someone who can put in the work, and that's the key bottleneck at the moment.
[^1]: And the same problem applies for the packaging community, in that we only have a certain amount of bandwidth for the PEP process, and we don't have a process for judging how universal the benefit of a given PEP is. Maybe that's something the packaging manager would cover, but there's been little sign of interaction with the PyPA from them yet, so it's hard to be sure.
/cc @s-mm since her ongoing work has been brought up in this thread!
@rgommers:
> We'd love to have metadata that's understood for SIMD extensions, GPU support, etc.
I think this is relevant as we (well, mostly @alalazo and @becker33) wrote a library and factored it out of Spack -- initially for CPU micro-architectures (and their features/extensions), but we're hoping GPU ISA's (compute capabilities, whatever) can also be encoded.
The library is `archspec`. You can already pip install it. It does a few things that might be interesting for package management and binary distribution. It's basically designed for labeling binaries with uarch ISA information and deciding whether you can build or run that binary. Specifically it:
- answers compatibility questions like "is a `zen2` binary compatible with `cascadelake`?", or "will an `x86_64_v4` binary run on `haswell`?" (we support generic x86_64 levels, which are also very helpful for binary distribution);
- detects the host microarchitecture and its features (e.g., does this machine support `avx512`?).

We have gotten some vendor contributions to `archspec` (e.g., from AMD and some others), but if it were adopted by pip, I think we'd get more, so maybe a win-win? It would be awesome to expand the project b/c I think we are trying to solve the same problem, at least in this domain (ISA compatibility).
More here if you want the gory details: archspec paper
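A quick sketch of archspec's Python API, based on its docs (treat the exact attributes and operator semantics as approximate rather than authoritative):

```python
import archspec.cpu

host = archspec.cpu.host()              # microarchitecture of this machine, e.g. zen2
print(host.name)
print("avx512f" in host.features)       # feature query, using cpuinfo-style flag names
# "will an x86_64_v3 binary run here?" -- comparisons follow the uarch ancestry graph
print(host >= archspec.cpu.TARGETS["x86_64_v3"])
```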
@pradyunsg:
> I think being able to provide users with a way to say "I want all my builds to happen with setuptools == 56.0.1" is worthwhile; even if we don't end up tackling the binary compatibility story.
Happy to talk about how we've implemented "solving around" already-installed stuff and how that might translate to the pip solver. The gist of that is in the PackagingCon talk -- we're working on a paper on that stuff as well and I could send it along when it's a little more done if you think it would help.
I think fixing a particular package version isn't actually all that hard -- I suspect you could implement that feature mostly with what you've got. The place where things get nasty for us are binary compatibility constraints -- at the moment, we model the following on nodes and can enforce requirements between them:
The big thing we are working on right now w.r.t. compatibility is compiler runtime libraries for mixed-compiler (or mixed compiler version) builds (e.g., making sure libstdc++, openmp libraries, etc. are compatible). We don't currently model compilers or their implicit libs as proper dependencies and that's something we're finally getting to. I am a little embarrassed that I gave this talk on compiler dependencies in 2018 and it took a whole new solver and too many years to handle it.
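As a rough illustration (not from this thread) of how such constraints are attached to nodes in a Spack spec -- compiler, target microarchitecture, and dependency versions are all expressed on one command line:

```sh
# %gcc@12.2.0   -> compiler constraint on the root node
# target=zen2   -> microarchitecture constraint
# ^py-numpy@... -> version constraints on dependency nodes
spack install py-scipy %gcc@12.2.0 target=zen2 ^py-numpy@1.21.4 ^openblas@0.3.21
```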
The other thing we are trying to model is actual symbols in binaries -- we have a research project on the side right now to look at verifying the compatibility of entry/exit calls and types between libraries (a la libabigail or other binary analysis tools). We want to integrate that kind of checking into the solve. I consider this part pretty far off, at least in production settings, but it might help to inform discussions on binary metadata for pip.
Anyway, yes we've thought about a lot of aspects of binary compatibility, versioning, and what's needed as far as metadata quite a bit. Happy to talk about how we could work together/help/etc.
> The library is `archspec`. You can already pip install it. ... More here if you want the gory details: archspec paper
Thanks @tgamblin. I finally read the whole paper - looks like amazing work. I'll take any questions/ideas elsewhere to not derail this issue; it certainly seems interesting for us though, and I would like to explore if/how we can make use of it for binaries of NumPy et al.
After pyproject.toml: if scipy uses `requires = ["numpy"]`, then you get a forced upgrade of numpy and all the other issues described above, but it does work. Not so great.
FTR one workaround that hasn't been mentioned in the thread is supplying a constraints file set via the `PIP_CONSTRAINT` environment variable. This does work for pinning the build deps and is probably the only way to influence the build env for the end user, as of today.
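For example (file name and pins are illustrative only):

```sh
# The constraints file is applied inside pip's isolated build environments too,
# which is what makes it usable for pinning build dependencies.
printf 'setuptools<59\nnumpy==1.21.4\n' > build-constraints.txt
PIP_CONSTRAINT="$PWD/build-constraints.txt" pip install mypackage --no-binary mypackage
```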
If the target computer already has a satisfactory version of numpy, then the build system should use that version. Only if the version is not already installed should pip use an isolated environment.
Related: scipy/scipy#7309