psf / fundable-packaging-improvements

Packaging improvements that could be funded

Create a generic wheel-building service #19

Open xmunoz opened 3 years ago

xmunoz commented 3 years ago

Background: https://github.com/pypa/packaging-problems/issues/25

Create a generic wheel-building service to make releases faster and more robust.

agronholm commented 2 years ago

It's not clear to me what this is about. Can you do a short write-up here?

di commented 2 years ago

I'll take a stab at it:

The idea would be for PyPI or the PSF to provide a public-good service which allows users to upload a source distribution and have wheels automatically built for specific platforms and architectures. This would allow users to easily target a wide range of built distributions, and would also let us provide a 'canonical' way to build wheels, e.g. with the latest standards and best practices, without having to configure complicated CI pipelines for every individual project. The service could automatically publish what it builds to PyPI on behalf of the maintainers.

This would be a fairly complex project because, on the surface, it's extremely similar to a CI service like GitHub Actions. It would require:

A 'fundable' version of this would probably be an MVP. It likely wouldn't include more esoteric architectures, and might cover only one platform (i.e. the most popular one). It also might support little or no configuration of the build, so it wouldn't initially support projects with more specific needs in their build environments. I think, at a minimum, it would be able to turn a source distribution into a pure-Python wheel and publish it.
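As a rough illustration of that minimal "sdist in, pure-Python wheel out" step, here is a sketch that assumes the service simply shells out to `pip wheel`. The function names are hypothetical, nothing here is an existing PyPI API, and it assumes the output directory starts out without the wheel in question. The purity check relies on the wheel filename convention, where the last two dash-separated fields are the ABI tag and the platform tag.

```python
import subprocess
import sys
from pathlib import Path


def wheel_build_command(sdist: Path, out_dir: Path) -> list[str]:
    """Return the pip invocation that builds one wheel from one sdist."""
    return [
        sys.executable, "-m", "pip", "wheel",
        "--no-deps",                  # build only this project, not its dependencies
        "--wheel-dir", str(out_dir),  # where pip writes the resulting .whl
        str(sdist),
    ]


def is_pure_python_wheel(filename: str) -> bool:
    """True if the wheel's ABI tag is 'none' and its platform tag is 'any'."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    # Layout: name-version[-build]-python_tag-abi_tag-platform_tag.whl
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    abi_tag, platform_tag = parts[-2], parts[-1]
    return abi_tag == "none" and platform_tag == "any"


def build_pure_wheel(sdist: Path, out_dir: Path) -> Path:
    """Build a wheel from an sdist and refuse anything that is not pure Python."""
    out_dir.mkdir(parents=True, exist_ok=True)
    before = set(out_dir.glob("*.whl"))
    subprocess.run(wheel_build_command(sdist, out_dir), check=True)
    new_wheels = set(out_dir.glob("*.whl")) - before
    if len(new_wheels) != 1:
        raise RuntimeError(f"expected exactly one new wheel, got {len(new_wheels)}")
    wheel = new_wheels.pop()
    if not is_pure_python_wheel(wheel.name):
        raise RuntimeError(f"{wheel.name} is platform-specific; out of scope for this sketch")
    return wheel
```

A real service would also need sandboxing, resource limits, and a publishing step, none of which is shown here.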

tiran commented 2 years ago

Excellent idea, @di! I have been proposing a similar service for a while. :)

Instead of allowing users to upload a source distro, I would rather start a step earlier. My idea for an MVP looks like this:

The steps assume that the project uses a standard layout with a well-configured pyproject.toml in its root directory.

This MVP would simplify most build steps for pure-Python projects and would not require many resources on our side. Binary wheels are typically more complicated to build because most projects need additional dependencies and custom build environments.

agronholm commented 2 years ago

Honest question: does this actually require a separate service? Wouldn't documenting the procedure with existing CI services suffice? Case in point: https://github.com/agronholm/cbor2/blob/master/.github/workflows/publish.yml Related documentation: https://cibuildwheel.readthedocs.io/en/stable/

di commented 2 years ago

Instead of allowing users to upload a source distro, I would rather start a step earlier.

IMO, extending this to build source distributions as well should absolutely be a goal here, but having to integrate against source repositories doesn't sound like an MVP to me 🙂 (also at that point it's not really a 'generic wheel-building service')

I also think the 'build a wheel from a source distribution uploaded to PyPI' is going to be an important flow for projects that may not have upstream repos that we support, or may not have upstream repos at all.

Honest question: does this actually require a separate service? Wouldn't documenting the procedure with existing CI services suffice?

It certainly doesn't require it, and I expect that if this existed there would still be strong reasons to use existing CI services, but running a separate service that the PSF/PyPI owns and operates has some advantages:

I think it's likely that such a service would leverage some or all of cibuildwheel to execute the actual build.

agronholm commented 2 years ago

it wouldn't require users to configure one or more CI services to target multiple platforms/architectures

It would still require CI changes to leverage this service, yes?

users wouldn't even need to be fully aware of what platforms/architectures they can build built distributions for

I have a hard time believing this one. The number of possible platform/architecture combinations is pretty mind blowing, and building every possible combination for every release is going to be a massive resource drain.

it can be kept up to date with the latest tools/services. CI configs might get old or go stale, e.g. building against new versions of Python

cibuildwheel already handles this, yes?

we can make it free for PyPI users and provide some guarantees on cost there. a CI service might not be free (or always be free)

I'm confused – are you thinking of closed source projects? Because both GitHub and GitLab are free for F/OSS projects. Closed source projects, on the other hand, are not likely to send their source code to third parties.

we can offer features that are important to us that might not be available uniformly across all CI providers (such as build provenance, isolated builds & attestations)

Aren't all builds isolated with cibuildwheel? I don't know about the other two features as I have no clue what they are.

brettcannon commented 2 years ago

Another possibility is that PyPI provides a service that communicates with services run by various platform providers, which perform the actual builds. For example, if Red Hat wanted to provide CentOS builds, PyPI would use some API with a service run by Red Hat that took the sdist, did the wheel build, and sent it all back to PyPI to use and display to the user. That way PyPI acts as the integration point while the various platform providers handle the actual building.

The drawback is: who does this for the platforms that choose not to participate? Who does the manylinux builds? What about platforms that lack the funding to pull this off (e.g. does the FreeBSD Foundation have what it would take to participate if we opened up FreeBSD wheels)? But it could allow more platforms to participate when they do have the funds for this sort of thing. It also means we don't have to be the experts in the building.

Now, having said all of that, I still like Dustin's idea more. 🙂 We could get the platform vendors involved to donate what is necessary to use their services to do the build, but we would still ultimately control the service and code. It would also help share the knowledge required to make this sort of thing work and increase transparency, which is critical for the wheels to be trusted.

I will also say that Dustin's idea leads to reproducibility by providing all the details used to build the wheels. Once again, it's a security thing.

Lastly, it can act as a 🥕 for projects to use modern packaging practices as that's going to be the easiest way to make sure there isn't a ton of custom code for every project's special way of being built.

di commented 2 years ago

Hey @agronholm, sort of sensing that you're feeling strongly opposed to this idea but I don't really have a good sense of why based on your comments. If it's for a reason other than "it would take a large investment of time/money" (which I think is acknowledged by the fact that we're considering this to be a project we'd need to seek funding for) then I'd love to hear what your hesitation is.

It would still require CI changes to leverage this service, yes?

I think ideally this would be configured entirely through PyPI or via the service itself, depending on their relationship. Either it would consume the source distribution from PyPI when it gets published there, or it would integrate directly with the upstream repo like @tiran is describing.

I have a hard time believing this one. The number of possible platform/architecture combinations is pretty mind blowing, and building every possible combination for every release is going to be a massive resource drain.

You're absolutely right, which means that such a service would likely be limited to the combinations that are most important or that people care the most about -- I don't think it would ever achieve complete coverage, and that shouldn't be a goal. That means the average user can probably depend on the service to build for all combinations the service supports without really having to think about what those are. Users who do need to build for esoteric combinations that the service doesn't support would need to do some portion of their build elsewhere anyway, but they could still use the service for the combinations it supports.
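To make the scale concrete, here is a small sketch that counts build jobs for one release. The interpreter, platform, and architecture lists are illustrative assumptions, not a proposed support matrix.

```python
# Illustrative values only; a real service would choose its own supported set.
PYTHON_TAGS = ["cp39", "cp310", "cp311", "cp312"]
PLATFORMS = {
    "manylinux": ["x86_64", "aarch64"],
    "musllinux": ["x86_64", "aarch64"],
    "macosx": ["x86_64", "arm64"],
    "win": ["amd64"],
}


def build_matrix(python_tags, platforms):
    """Enumerate one (python, platform, arch) build job per combination."""
    return [
        (py, plat, arch)
        for py in python_tags
        for plat, archs in platforms.items()
        for arch in archs
    ]


full = build_matrix(PYTHON_TAGS, PLATFORMS)
mvp = build_matrix(PYTHON_TAGS, {"manylinux": ["x86_64"]})
print(len(full))  # 4 interpreters * 7 platform/arch pairs = 28 jobs per release
print(len(mvp))   # 4 jobs per release when limited to a single platform/arch
```

Every interpreter, platform, or architecture added multiplies the job count, which is why restricting the supported set matters so much for cost.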

cibuildwheel already handles this, yes?

This assumes end-users are blindly upgrading to the latest cibuildwheel every time a new version is published without doing version pinning or hash pinning, which IMO is not a best practice in terms of build integrity.
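For readers unfamiliar with hash pinning: the idea is to record the expected digest of a build tool or artifact ahead of time and to refuse anything that does not match. pip offers this through its hash-checking mode (`--require-hashes`). The sketch below shows only the underlying check, with hypothetical function names.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_pinned(data: bytes, pinned_sha256: str) -> None:
    """Raise ValueError unless the data matches the pinned SHA-256 digest."""
    actual = sha256_hex(data)
    # compare_digest performs a constant-time comparison of the two strings.
    if not hmac.compare_digest(actual, pinned_sha256.lower()):
        raise ValueError(f"hash mismatch: expected {pinned_sha256}, got {actual}")
```

With a pin in place, picking up a new tool version becomes a deliberate change to the pinned digest instead of something that happens silently.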

I'm confused – are you thinking of closed source projects? Because both GitHub and GitLab are free for F/OSS projects. Closed source projects, on the other hand, are not likely to send their source code to third parties.

I'm talking about an admittedly hypothetical situation where GitHub/GitLab become non-free or ineffective to use at the free tier, à la Travis CI. At the end of the day, these are services offered by for-profit companies, and their ultimate goals are not necessarily aligned with offering free compute to OSS developers forever, whereas this is very much in line with the mission and goals of the PSF and PyPI.

Aren't all builds isolated with cibuildwheel? I don't know about the other two features as I have no clue what they are.

They are isolated in the sense that the underlying CI job is isolated, but not in the sense that each individual build is guaranteed to be isolated -- a user might invoke cibuildwheel multiple times in a single CI job, e.g. in the case where a given source repo ships multiple distinct projects to PyPI. Not super common but it does exist.

By 'build provenance' I'm referring to the build environment providing a non-forgeable and cryptographically signed accounting of exactly what went into the build and what commands were run as part of it. It's fair to say, though, that some of the environment providers that users currently use are also working on providing this right now.

By 'attestations' I'm talking about something like https://in-toto.io/, which, again, some (but not all) of the build environments are working to provide by default, and also users can manually run themselves.
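A toy sketch of what such a provenance record might contain is shown below. It is not the in-toto or SLSA format, and the field names are made up for illustration. It also signs with an HMAC over a shared secret purely to stay within the standard library; that is a stand-in, since real provenance would use asymmetric signatures so that anyone can verify a record without being able to forge one.

```python
import hashlib
import hmac
import json


def make_provenance(sdist: bytes, wheel: bytes, commands: list[str]) -> dict:
    """Record what went into the build and what came out (hypothetical schema)."""
    return {
        "inputs": {"sdist_sha256": hashlib.sha256(sdist).hexdigest()},
        "outputs": {"wheel_sha256": hashlib.sha256(wheel).hexdigest()},
        "commands": list(commands),
    }


def sign(record: dict, key: bytes) -> str:
    """Sign a canonical JSON encoding of the record (HMAC is a stand-in only)."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(record: dict, signature: str, key: bytes) -> bool:
    """True if the signature matches the record under the given key."""
    return hmac.compare_digest(sign(record, key), signature)
```

The property to notice is that changing any recorded input, output, or command after the fact invalidates the signature.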

di commented 2 years ago

Also, for anyone not familiar with this repo: this issue is about updating our list of projects which could use funding with an entry about the idea in question, not about actually implementing it.

So all we're trying to decide here is whether this is a) something we want, b) something sufficiently large that it requires funding, c) something sufficiently small that it's actually reasonable to fund, and d) something that potential funders would be interested in, all with the assumption that much more design work would go into this at the point where someone has expressed interest in funding it.

agronholm commented 2 years ago

Hey @agronholm, sort of sensing that you're feeling strongly opposed to this idea but I don't really have a good sense of why based on your comments. If it's for a reason other than "it would take a large investment of time/money" (which I think is acknowledged by the fact that we're considering this to be a project we'd need to seek funding for) then I'd love to hear what your hesitation is.

I'm just bringing up these questions to determine if this project is feasible, and what its actual goals are. I've had this same idea for a long time but then I figured it would never pan out due to the massive computing resource requirements and scalability challenges, so I'm curious to learn what the plan would be.

di commented 2 years ago

Cool, thanks for kicking off the discussion and raising the points, they're totally valid.

wesm commented 2 years ago

You all might look into the experience of the conda-forge community around this, which has automated builds of thousands of Python packages and their C/C++ binary dependencies (including whole compiler toolchains) for multiple platforms and architectures, relying on public CI services for the automation:

https://conda-forge.org/

di commented 2 years ago

Thanks @wesm! I'm aware of conda-forge, but not sure if other folks here are. I think one big difference between it and what's being proposed here is the reliance on public CI services, but otherwise we probably have something to learn from them if we go down this path.

joshuagl commented 2 years ago

I think a generic wheel-building service is a great idea that could help address the often-unintentional differences between source code repositories and the artefacts on PyPI. Furthermore, this would be an excellent basis for increased supply chain security in the Python ecosystem, i.e. providing additional/different TUF integration points than those proposed in PEP 458 and PEP 480.

SantiagoTorres commented 2 years ago

+1 on this idea! I also think it would be relatively easy to add all the bells and whistles (at their respective time):

  1. Could sign TUF metadata for those who do not want to manage keys
  2. Could submit SLSA/in-toto attestations for build provenance
  3. Could perhaps even use workload identities from fulcio/sigstore for each run or so :)

trishankatdatadog commented 2 years ago

Great idea, @di, long time coming, and excited to see traction building on this! Agree with both @joshuagl and @SantiagoTorres that the auto wheel-builder service will be an excellent security "chokepoint" for adding TUF/in-toto/SLSA/sigstore/etc metadata.

brettcannon commented 2 years ago

Another benefit to this is moving the entire community to a new Python release very quickly compared to having to wait for wheels to percolate up your dependency tree from leaf nodes to your project. This also benefits Python itself as it ups the chances people will test e.g. Python betas to help find issues sooner and lead to a more stable Python release.
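The "percolating up from leaf nodes" effect can be sketched with a topological sort. The dependency graph below is invented for illustration, and the wave model assumes a project can only publish wheels for a new Python once all of its dependencies have done so.

```python
from graphlib import TopologicalSorter

# Hypothetical graph: each project maps to the set of projects it depends on.
DEPENDS_ON = {
    "app": {"web-framework", "orm"},
    "web-framework": {"http-parser"},
    "orm": {"db-driver"},
    "http-parser": set(),
    "db-driver": set(),
}


def release_waves(graph):
    """Group projects into waves; each wave needs only the earlier waves."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(release_waves(DEPENDS_ON), start=1):
    print(number, wave)
```

When every maintainer builds by hand, each wave waits on human turnaround. A central service could in principle work through the waves as fast as the builds themselves complete.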

westurner commented 1 year ago

Signed, reproducible builds from source on a trusted build farm are possible with conda-forge, emscripten-forge, Ubuntu PPAs, Fedora COPR, and the openSUSE Open Build Service (OBS).

From https://news.ycombinator.com/item?id=36045057 :

What command(s) do I pass to pip/twine/build_pyproject.toml to build, upload, and install a package with a key/cert that users should trust for e.g. psf/requests?

Conda-Forge:

SLSA (TUF && Sigstore && Build Attestation)

Cost Estimates: