pypa / setuptools


Stop vendoring packages #2825

Open · jaraco opened this issue 2 years ago

jaraco commented 2 years ago

The current implementation of Setuptools has its dependencies (separately for setuptools and pkg_resources) vendored into the package. The project implemented this approach because of the bootstrapping problem.

Bootstrapping Problem

Setuptools extended distutils, which was designed to sit at the root of a packaging system and was presumed to be present and free of dependencies. Over time, Setuptools superseded distutils but inherited these constraints. In particular, when a system integrator wishes to build a system from sources, that process requires a directed acyclic graph (DAG) of both build-time and runtime dependencies. As a result, Setuptools cannot depend on any package that uses Setuptools to build (or whose dependencies require Setuptools to build). This bootstrapping problem extends to Setuptools itself, which requires itself to build.

Vendoring

As the ecosystem grew more complex, with standards-based implementations (such as packaging) appearing as third-party packages rather than in the standard library, Setuptools found itself needing dependencies; because of the bootstrapping problem, it adopted a vendoring strategy (copying the packages directly into the project) as a means of obtaining that functionality.

However, this approach creates a number of constraints and complications for the project, enumerated in the list quoted in the replies below:

jaraco commented 2 years ago

Solution

Recent advancements in specifications and tooling around packaging promise to help address some of the issues above. The solution proposed herein is that Setuptools should:

(a) Declare its own dependencies normally, pinning versions only for known incompatibilities.
(b) Provide a fallback bundle of libraries to be used only in the rare bootstrapping scenario.
(c) Detect when the fallback is needed and add those packages to the import path to satisfy that scenario (see the sketch after this list).
(d) Rely primarily on PEP 517/518 tooling to ensure dependencies are available for building.
(e) Discourage installation of Setuptools except when needed to build a package.
(f) Exclude build-time dependencies from consideration in the DAG (require integrators to follow the pip model for isolated builds).
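A minimal sketch of what (b) and (c) might look like, purely for illustration; the _vendor directory name, the packaging probe, and the function itself are assumptions, not the actual implementation:

```python
import os
import sys


def ensure_dependencies():
    """Make Setuptools' dependencies importable, using the bundled
    fallback only in the bootstrapping scenario."""
    try:
        # Probe for any one of the declared dependencies; if it imports,
        # dependencies were installed normally and nothing else is needed.
        import packaging  # noqa: F401
    except ImportError:
        # Bootstrapping: nothing has installed our dependencies yet, so
        # satisfy imports from the fallback bundle shipped in the package.
        vendor_dir = os.path.join(os.path.dirname(__file__), '_vendor')
        if vendor_dir not in sys.path:
            sys.path.append(vendor_dir)
```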

This last piece (f) is perhaps the most controversial. Consider a concrete example: installing appdirs in a world where setuptools declares its dependencies, including one on appdirs. Building appdirs from source requires setuptools, which would in turn require appdirs, so a strict combined DAG of build-time and runtime dependencies cannot be satisfied; the cycle resolves only if build-time dependencies are provisioned separately, as pip does with isolated builds.

pradyunsg commented 2 years ago

I don't think we can do this yet.

Not until pip removes the fallbacks to direct setup.py invocation (https://pip.pypa.io/en/stable/reference/build-system/setup-py/ -- in progress) and the separation of pkg_resources is complete. Without those, I think this would exceed our available churn budget. :)

(e) Discourage installation of Setuptools except when needed to build a package.

setuptools is installed alongside pip by all supported mechanisms for installing pip -- https://github.com/pypa/pip/issues/10530#issuecomment-932937829. IMO this should be considered a pre-condition for doing this, and it's worthwhile spending a few months or so publicising this change before we actually make it.


FWIW, pip's solution for this is something I maintain separately: https://github.com/pradyunsg/vendoring (I'll consistently use vendoring to refer to this project in the rest of this post). I'm happy to accommodate setuptools in that. That should help alleviate many of the pain points with vendoring dependencies.

If you're curious about how the tool works, my suggestion is to clone pip, run tox -e vendoring (or nox -s vendoring), and look at the vendor.txt, tools/vendoring/patches/*, and pyproject.toml files in pip's source tree.

  • When vendoring a package, it often must be rewritten to accommodate vendoring. Any absolute imports must be replaced by relative imports.

  • Because vendoring is a second-class approach to dependency management (and unsustainable in the general case), it often requires specialized tooling to manage the dependencies, and this management can come into conflict with the first-class tools.

  • When vendoring dependencies, it's the responsibility of the hosting package to rewrite imports to point to the vendored copies, creating non-standard usage with sometimes unclear semantics.

  • Due to the constraints above, adding a new dependency can be an onerous process, requiring extra care and testing, and may break in downstream environments whose workflows aren't proven.

  • Because vendored dependencies are de facto satisfied, the project cannot and should not declare those dependencies as other projects do. Therefore, it's not possible to inspect the dependencies readily as one would with standard declarations.

I think these constraints are non-existent / workable, if you adopt vendoring. :)
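To make the first quoted constraint concrete, here is a toy version of the import rewriting that such tooling automates (the setuptools._vendor namespace and the single regex are illustrative assumptions; real tools such as vendoring handle many more import forms):

```python
import re

# Top-level packages that have been copied into the vendor directory.
VENDORED = ("packaging", "ordered_set")


def rewrite_imports(source: str, namespace: str = "setuptools._vendor") -> str:
    """Rewrite absolute imports of vendored packages to pull from the
    vendor namespace instead of the global environment."""
    pattern = re.compile(
        r"^(\s*)import ({})\b".format("|".join(VENDORED)),
        re.MULTILINE,
    )
    return pattern.sub(r"\1from {} import \2".format(namespace), source)


print(rewrite_imports("import packaging\nimport os\n"))
# from setuptools._vendor import packaging
# import os
```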

Because vendored dependencies are effectively pinned, they create impedance and require manual intervention or extra, non-standard tooling to manage the evolution of those dependencies, defaulting to a practice of erosion.

This is somewhat true -- you still have the caveat of needing to manage evolution, but you can use standard tooling (like dependabot) to manage the upgrading if you go down the vendoring route. It's also possible to have separate unpinned/pinned dependency declaration sets (à la pip-compile's workflow). You do need to ensure that the entire dependency tree is included and pinned when using vendoring.

Basically, the moment you start vendoring stuff, you need to start thinking of the project as being managed like an application with pinned dependencies -- all the corresponding dependency management constraints apply (except you run vendoring sync . instead of pip install -r [blah blah] to "install" the dependencies).

Because they're pinned, it's more difficult to discern if a particular dependency is pinned for a known good reason or simply because of the vendoring.

If you adopt vendoring, this is not true.

It is possible to include comments in the input file to the tool (it's basically a requirements.txt file that's consumed by pip), which can be useful for describing this nuance, for example:
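Here is what such a hypothetical input file could look like, with the reasoning carried next to each pin (package names and versions are illustrative only):

```
# vendor.txt (hypothetical example)
packaging==21.0      # pinned only because vendoring requires exact pins
ordered-set==4.0.2   # pinned deliberately; document the actual reason here
```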

Refactoring functionality out of the library is difficult if not impossible due to the constraints above. In particular, this project would like to move pkg_resources into a separate package, but even though pkg_resources has no dependency on setuptools to run, it still must vendor its own dependencies (is this true?).

Yep, and... it shouldn't be too difficult if you use vendoring to bring that package into setuptools.

Not all packages can be vendored.

True. Anything that's non-pure-Python or provides an importable API for plugins is non-vendorable.

If the other package has vendored dependencies, those may not work in a vendored context.

Generally not true, based on my experience with vendoring. It is possible to rewrite their imports seamlessly.

Some packages have global state or modify global state or have interfaces that are reliant on the package layout (incl. pkg_resources, importlib_metadata, packaging), leading to unexpected failures when loaded in a global context.

I'm not quite sure what you mean here -- setuptools.extern.packaging.version.Version is always going to compare not equal to pip._vendor.packaging.version.Version because they're different packages (and possibly different versions of the packaging too!). If that's what you're referring to, yea... but I don't see this as being a big problem.

The only case this can be an issue is if you expect this difference to exist/not exist in some code -- since that'd be fragile. It's often really straightforward to avoid that though.
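A small demonstration of the distinction being described, assuming an environment where both pip and a setuptools release that still exposes setuptools.extern are installed:

```python
from pip._vendor.packaging.version import Version as PipVersion
from setuptools.extern.packaging.version import Version as SetuptoolsVersion

a = PipVersion("1.0")
b = SetuptoolsVersion("1.0")

# False: the two classes don't recognize each other, despite identical data.
print(a == b)
# False: they are distinct class objects from distinct module copies.
print(isinstance(a, SetuptoolsVersion))
# True: only the types differ, not the underlying value.
print(str(a) == str(b))
```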

jaraco commented 2 years ago

The readme for vendoring talks only about how the project is used for pip and provides no guidance on how to use it.

I guess you're proposing that I reverse-engineer the usage from pip's usage and then try to apply that to this project. I guess I can give that a try.

jaraco commented 2 years ago

In pradyunsg/vendoring#37, I learned that the vendoring project is uninterested in supporting the complexity of Setuptools, so if Setuptools were to adopt vendoring, it would need to at the very least provide its own orchestration for invoking vendoring across multiple configurations, which is barely preferable to, and in some ways more duplicative than, the current vendoring technique. Moreover, other issues like pradyunsg/vendoring#38 suggest that vendoring requires supplying default values even for basic operation, so that would bring extra cruft that would need to be copied across multiple configs. At some point, I might consider forking vendoring to satisfy some of these needs, but for now, as long as vendoring is needed, Setuptools can continue to use its own technique.

mgorny commented 2 years ago

I would really love to see unbundling made easier without making bootstrap very hard. That said, as a distro packager, I don't need it to be "perfect" from a pure pip standpoint.

One possibility that I think might be worth exploring is making the subset of setuptools needed for bootstrap work with a smaller number of vendored dependencies, possibly by gracefully handling missing packages and running with limited functionality without them. I think we could at least avoid more-itertools and ordered-set this way.
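A sketch of what that graceful handling could look like for ordered-set; the fallback shown is an assumption for illustration, not existing setuptools code:

```python
try:
    from ordered_set import OrderedSet
except ImportError:
    # Bootstrap fallback with limited functionality: a dict preserves
    # insertion order, which covers the ordered de-duplication that an
    # OrderedSet is typically used for.
    def OrderedSet(iterable=()):
        return list(dict.fromkeys(iterable))


print(OrderedSet(["b", "a", "b"]))  # ['b', 'a'] with the fallback
```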

mgorny commented 2 years ago

Honestly, I don't see this happening anymore, given that every subsequent release of setuptools adds more vendoring, and it literally takes hours to make setuptools work unvendored. On top of this, patched pyproject_validator basically made including new versions of setuptools in Gentoo impossible.

mgorny commented 2 years ago

I'm sorry, nevermind, I can't even read package names right.

abravalheri commented 2 years ago

Hi @mgorny, I am sorry the latest vendored packages affect Gentoo.

Regarding validate_pyproject specifically, it works as a "code generator" for the validator, much like protocol buffers do, instead of vendoring an entire JSON Schema library.

Is this approach still problematic? Would it help if I moved validate_pyproject out of the _vendor directory to make the distinction clearer?

What approach do you use when dealing with projects that use generators?

mgorny commented 2 years ago

The directory doesn't matter much for us, we just mangle all the references to use the external package.

As for generators, we prefer to rebuild everything. I don't know whether this is the case here but normally we assume that e.g. updated dependencies (like cython) could result in different output, possibly fixing some bugs that might have been present in the original generated version.

abravalheri commented 2 years ago

Thank you very much for the information @mgorny. Let's see how we can solve this.

Currently there is a command for regenerating the files from validate-pyproject, alongside the commands that install the vendored dependencies.

I could also create a separate script for generating the files; would that help? (This script, however, is going to require Python dependencies, so I guess in the end it would create a dependency cycle... Having the files directly in the repository is a way to break this cycle.)

Do you have any other suggestions?

mgorny commented 2 years ago

If you don't mind bearing with me for a while more, could we please take a step back and verify whether I'm understanding things correctly? IIUC:

  1. _validate_pyproject is a package generated by validate_pyproject with the help of fastjsonschema. The exact contents of _validate_pyproject depend on these two packages and can change if they are updated.
  2. _validate_pyproject doesn't have any setuptools-specific content, i.e. if another project used validate_pyproject, it would get the same result. Therefore, it is possible to share the code between multiple packages.

If both points are correct, then I think the best long-term solution would be to actually split validate_pyproject into two packages installable via PyPI: one with the generator, and the other with the generated content. Then setuptools could use the regular kind of vendoring that we're prepared for. It won't be ideal, but we can look into the problem of regenerating it separately. Bonus points if you could use flit as the build system for it, since it's the only build system that doesn't come with a dozen cyclic deps.

If that's a bit much, I think moving _validate_pyproject out of _vendor would also help us after all. Our unbundling logic is pretty much two regexps, and while I suppose we could hack it around to exclude _validate_pyproject, it would be much easier if we didn't have to.

I'm sorry about my attitude yesterday.

abravalheri commented 2 years ago

Thank you very much for the input @mgorny.

_validate_pyproject is a package generated by validate_pyproject with the help of fastjsonschema. The exact contents of _validate_pyproject depend on these two packages and can change if they are updated.

Precisely. The way fastjsonschema works is to compile JSON schema files into Python code[^1]. validate-pyproject adds some structure to this compilation for pyproject.toml, and it hosts general schemas that cover PEP 518/621.
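For illustration, fastjsonschema's public API supports both the in-memory route and the ahead-of-time generation described here (the schema below is a toy, not the actual pyproject.toml schemas):

```python
import fastjsonschema

schema = {"type": "object", "properties": {"tool": {"type": "object"}}}

# In-memory compilation -- the "eval-ing the generated string" route
# mentioned in the footnote below:
validate = fastjsonschema.compile(schema)
validate({"tool": {}})  # raises a JsonSchemaException on invalid input

# Ahead-of-time generation -- produce standalone Python source that can
# be committed to the repository, as _validate_pyproject is:
code = fastjsonschema.compile_to_code(schema)
print(len(code), "characters of generated validator code")
```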

_validate_pyproject doesn't have any setuptools-specific content, i.e. if another project used validate_pyproject, it would get the same result.

This is more or less what happens. The idea for validate-pyproject is to be general and host only the models for PEP 518/621. However, it is also designed to allow third-party plugins to define models for the [tool.<...>] tables. Therefore you can share the logic between multiple packages, but it is also extensible.

Currently, validate-pyproject hosts some JSON Schema files for setuptools, but my original idea is to move them to the setuptools repository in the mid-term, so contributors/maintainers can shape [tool.setuptools] the way they want [^2].

Regarding the proposed solutions, I think both could work. My preference, however, would be the second approach (because of the mid-term vision of moving the tool-specific specs to setuptools). If that is fine with you, I can go ahead and start working on this plan (I will just ask for a bit of patience, because I am still working on some issues regarding PEP 621 support).

[^1]: You can still run things directly without ever having to save the Python file to the disk, but you are basically eval-ing the generated string.

[^2]: While I was working on bringing support for PEP 621 into setuptools, I just decided to leave the schemas in validate-pyproject for simplicity. But my goal is not to have validate-pyproject be the gatekeeper of how setuptools wants to evolve its configuration (it is too much overhead for everyone).

mgorny commented 2 years ago

Thanks for the explanation. If the end goal is for setuptools to host the schemas, then let's focus on the second option indeed. It'd still be nice to be able to easily regenerate the resulting data, but I don't think that's a priority, i.e. something to put on the "far TODO".

jaraco commented 2 days ago

In https://github.com/pypa/setuptools/issues/4455#issuecomment-2203461914, I've been thinking maybe there's a better way to vendor dependencies, one that's far less intrusive and doesn't require rewriting the vendored packages. I'm exploring that now.

mgorny commented 2 days ago

For the record, one risk I see is that if the user has an incompatible version of the dependency installed, setuptools would use it rather than the "vendored" fallback — but perhaps there's a smart way of avoiding that.

jaraco commented 2 days ago

For the record, one risk I see is that if the user has an incompatible version of the dependency installed, setuptools would use it rather than the "vendored" fallback — but perhaps there's a smart way of avoiding that.

Good point. On one hand, I was thinking that concern should be addressed the same way it is for any other application or library - that is, declare the best estimate of what dependencies are compatible and put the onus on the user to supply compatible packages (when things break). There are problems with that way of thinking, though.

One way to mitigate this concern could be to (a) require that dependencies follow semver, (b) proactively pin against breaking changes, and (c) lean on the dependencies to yank any unexpected breakages that violate semver.

I would like to pursue a world where vendored dependencies are used rarely and there's little to no dependence on their stability. In other words, I'd like for Setuptools to "work at head" of the Python ecosystem (unless explicitly indicated otherwise).

jaraco commented 2 days ago

In #4457, I'm pleased to say I have an initial proof of concept that applies the concept and is largely working. There are some mypy tests failing, but I'm confident those can be fixed or ignored. Thus far, I've only partially applied the changes. I plan to continue pursuing the approach, replace the pkg_resources dependencies as well, do some cleanup, and address any emergent issues. I won't be rushing out any releases with this change, but I'm optimistic that a future without heavy reliance on vendored packages is possible.

mgorny commented 2 days ago

Oh, one more thing just occurred to me. Some of the setuptools dependencies already use flit_core as their build backend. This means that they can be built and installed without setuptools and therefore do not pose a bootstrap problem.

hroncok commented 2 days ago

Note that if you were to install them with pip, you would still need setuptools first.