`python_distribution` should generate PEP 517 and PEP 621 `pyproject.toml` files

Eric-Arellano commented 3 years ago

The Python packaging ecosystem has been working hard to bring more standardization, such that you don't "have" to use pip and setuptools.

PEPs 517 and 621 are particularly relevant for metadata for packages. Pants should generate a pyproject.toml rather than setup.py. This allows Pants to be a better community member.

At first, we could have an option that allows users to still generate setup.py like before, and then possibly deprecate that such that we always use PEPs 517 and 621.

bryanwweber commented 2 years ago

I'm interested in working on this support. Has anyone started on it yet? I'd hate to duplicate work here ☺️

benjyw commented 2 years ago

Hi! Thanks for looking into this! No one has started work on this. I can give some guidance once you've familiarized yourself with the code.

Eric-Arellano commented 2 years ago

@benjyw thoughts on the user API for this? Should it be an option whether we generate setup.py vs pyproject.toml, or always do pyproject.toml?

At a minimum, we should generate pyproject.toml that says the build backend is setup.py. That's the path of least resistance: we don't need to change any of our setup.py code, including the plugin hook for users to add custom values. We only need to additionally generate pyproject.toml

bryanwweber commented 2 years ago

🎉 Thanks for your comments!

In terms of the user API, this is a little bit tricky I think. I think the build backend should at least be configurable in some way; the project for which I want to use this feature has flit as the backend. Whether this is done on the user side by pre-configured selection (pdm, flit, setuptools, hatch, poetry all spring to mind as options), by users pasting their desired build-system table into the BUILD file, or whether this is even generated at all (see the next paragraph) is an important question to resolve, I think.

Aside from the user API in BUILD files, I think there are some unresolved questions for the implementation. Projects are quite likely to have existing pyproject.toml files that have information beyond what used to be stored in setup.py and setup.cfg. In that case, they'll only want some information added by pants. For example, black and pytest configuration could be pre-existing in the pyproject.toml files. So I think it's likely that fully overwriting the pyproject.toml in the isolated build environment is not a great idea.

In addition, to the extent that PEP 621 project.dependencies and project.optional-dependencies are contained in the pyproject.toml instead of in requirements.txt, I think for the most part users aren't going to want to have pants change that too much. The one exception (which happens to be my use case 😉 ) is a monorepo with multiple, interdependent, packages. In this case, it'd be super useful to have pants add the dependency dynamically, but that might be a down the road idea.

The last thing I can think of are the other tables in PEP 621, like entry points and scripts. Tools like flit require these to be in the project table in pyproject.toml, whereas setuptools (if I understand correctly) has either partial support or dual support for setup.py kwargs.

Anyhow, my first plan is to add PEP 621 support for python_requirements specifications, as discussed in https://github.com/pantsbuild/pants/issues/16289. I think that implementation can help inform some of these decisions as well.

benjyw commented 2 years ago

Ah, that is interesting. So before proceeding it would be good to understand the underlying need.

[I should mention (in case it wasn't clear) that Pants already supports flit and any other PEP-517 backend, with user-authored config. But it will only generate config for you in the form of setup.py, for the legacy setuptools backend.]

So am I right in assuming that the need has to do with mixing generated and handwritten config? You want to use flit, and maybe you already have handwritten config for it, but you don't want to check in and maintain the parts of the config that Pants already knows about and is capable of generating? (e.g., 3rdparty requirements, figuring out which firstparty sources go in which distribution, etc.)

I ask because if Pants is generating 100% of the config, then it seems, naively, less important to require one PEP-517 backend over another? Since in that case the user isn't interacting with the backend anyway?

bryanwweber commented 2 years ago

[I should mention (in case it wasn't clear) that Pants already supports flit and any other PEP-517 backend, with user-authored config. But it will only generate config for you in the form of setup.py, for the legacy setuptools backend.]

Yep, I've got it working with flit 😄 I had to disable the setup.py generation, since I didn't want that file, but that was fine.

So am I right in assuming that the need has to do with mixing generated and handwritten config? You want to use flit, and maybe you already have handwritten config for it, but you don't want to check in and maintain the parts of the config that Pants already knows about and is capable of generating? (e.g., 3rdparty requirements, figuring out which firstparty sources go in which distribution, etc.)

I ask because if Pants is generating 100% of the config, then it seems, naively, less important to require one PEP-517 backend over another? Since in that case the user isn't interacting with the backend anyway?

That's a good reframing of the question, thanks! I agree that if pants is generating 100% of the config, then the user can be agnostic to the specific choice of backend. Some users will want to control that anyways, but if pants can support mixing generated and handwritten config, that should be fine.

That said, I think that there is always going to be some mix of generated and handwritten config anyways, because if you specify [project.dependencies] to handle 3rdparty requirements, then ideally I think that would go in a pyproject.toml that's checked in to the repo (and read into pants via #16289). So, except in the very simplest of cases, I don't think it will be possible to avoid mixing generated and handwritten config.

benjyw commented 2 years ago

Yeah, that makes total sense. In practice people end up writing plugins to customize setup.py generation, and that's a bit heavyweight for my liking.

So I think what we actually maybe want is templating? We already use janky templating for generating setup.py: https://github.com/pantsbuild/pants/blob/f6b2832539e26ddffb19a7d665b9cef697abd0da/src/python/pants/backend/python/goals/setup_py.py#L623

So a good approach might be to generalize that.

benjyw commented 2 years ago

So we can provide some standard templates, but you can configure them if you want to modify any static config, and we can figure out a simplified plugin mechanism for adding dynamic ones.

benjyw commented 2 years ago

Then you could, e.g., use your existing config as starting points for templates

bryanwweber commented 2 years ago

Hi @benjyw, I was finally able to get some time to work on this. I have a working although very simple/dumb pyproject.toml generator that basically copies the existing setup.py generator as you suggested. Happy to push a branch/make a PR (to my fork or the upstream here) if it's helpful!

I do have a question about the existing setup.py support. Setuptools, at least as of 61.0.0, fully supports PEP 621 in pyproject.toml files, with beta support for other setuptools-specific metadata (like packages, zip_safe, etc.). That version was released in March of 2022.

So my question is, does it make sense at this point to drop support for generating the setup.py file at all and move everything into pyproject.toml? pants can use setuptools as the default backend for some consistency with previous behavior. I imagine there will be a long deprecation period for setup.py generation, as people will have written plugins to customize behavior as you noted.

This is also a scope question, because adding simple support for generating pyproject.toml files is rather easy (I did it in a day or two, most of which was spent trying to track down the rule graph for the package goal for my own understanding). On the other hand, thinking about how to consistently support packaging workflows seems like a much bigger question... or maybe I'm just overthinking it 😄

My second question is regarding a templating approach more generally. Presuming that people use pyproject.toml to store things like dependencies and more, how can I correctly merge information from pyproject.toml and what might be specified as kwargs to python_artifact? It seems like there's already a lot of code to correctly manage this for the existing setup.py support, so I don't want to break anything 😄

benjyw commented 2 years ago

Hmm, I don't think we'd want to deprecate setup.py generation. Or at least not until we had a lot of experience with the alternatives. And also, I think that what I'm suggesting re templating would obviate that question.

Here's roughly what I had in mind:

Right now, there is a lot of code that generates data for a setup.py, and then a relatively small amount of code that combines that data with some user-provided data on the target, and then actually jams that combined data into a setup.py, via a janky template.

What I'm proposing is that we keep all that code that generates data, but factor out the setup.py-generating part, and replace it with a layer that takes that data and knows how to generate a few different things from it, using "proper" templates.

Then, we'd provide a handful of useful templates (pyproject.toml, setup.py to start with), but users can override those. We can use a simple, standard template language like mustache.

So now users have two easy ways to affect the output: they can modify the template, and they can add data in the target. When a field is set on a target, it would override (or extend, where relevant) the generated value for that field.

Does that make sense?

bryanwweber commented 2 years ago

What I'm proposing is that we keep all that code that generates data, but factor out the setup.py-generating part, and replace it with a layer that takes that data and knows how to generate a few different things from it, using "proper" templates.

At a high level, this makes sense. At the level of implementation, less so 😄 Let me try and ~talk~ write through some of the details, so I can see if it makes sense.

Right now, generate_setup_py fills in the string SETUP_BOILERPLATE template using str.format based on the kwarg values calculated in determine_finalized_setup_kwargs (which itself calls a few other functions to find things like first and 3rd party dependencies, package sources, and such). Whether or not the setup.py file is generated or copied from the working directory is determined by the generate_setup kwarg to python_distribution().

What you're proposing is to add a pair of new fields (?) to the python_distribution() target (? or the python_artifact?), something like generate_build_config and build_config_template. The former kwarg, if True, fills in a template using kwargs from the python_artifact alias and the computed kwargs from determine_finalized_setup_kwargs. Then, build_config_template is a string which is either setup_py, pyproject, or the path to a user-provided template file (or a resource/file target?). If build_config_template is setup_py, the existing logic to generate setup.py runs. If build_config_template is pyproject, the kwarg names are normalized to their PEP 621 equivalents and filled in to a pyproject.toml file which is written to the chroot for packaging. In the final case, the user-provided template is parsed so that pants can update any relevant values (adding first-party dependencies comes to mind), but otherwise the file is written out as-is to the chroot.

Edit to add: I suppose that the default template should be from an existing pyproject.toml file, if pyproject is configured? That'd be what I'd expect to happen, anyways.

Then, generate_build_config and generate_setup are mutually exclusive, with the former defaulting to False, possibly changing to default True in the future.

Does that track what you're thinking?

benjyw commented 2 years ago

Yes, that is pretty much exactly what I had in mind!

Possibly we don't need generate_build_config at all: if build_config_template is set then we use it, if not, we don't.

Also, we wouldn't need to parse the user-provided template. I was thinking that it would be a standard mustache template, and so would the built-in templates. So all we'd do is take the "finalized setup kwargs" (under a more generic name) and let the template reference them via mustache {{...}} placeholders.

Or we can use jinja2 instead of mustache, if we want to get fancy. But we already use mustache in the pants repo, and it's very simple, so it's not a bad place to start.

I think we would always normalize the kwargs to their PEP-621 names, and just change how we do the setup.py case a bit, to use those names. So instead of the lazy str.format thing we do today, we'd properly slot each field into its appropriate setup.py place.

benjyw commented 2 years ago

Or, we can provide a special shortcut field whose value is "all the other fields, serialized as a python dict", and another that is "all the other fields, serialized as YAML", because those might be useful, rather than having to template this field by field in the simple case.

tgolsson commented 2 years ago

Started writing in the thread; but posting my message here instead. My assumptions are/were essentially:

Given a pyproject.toml in my repository
with PEP621 metadata for my project
PEP518 for any tools I might run outside of Pants

Then as a user I'd expect/want a generated package to be built approximately as follows:

Pants injects dependencies in PEP440 format into PEP621 block
Pants adds build-system block following PEP518 to select a PEP517 backend
Pants adds appropriate PEP518 tool table entry to configure the selected PEP517 backend
- This'd include setting data-includes, entrypoints, etc, not just code.

It'd maybe even make sense to put most PEP621 metadata in my BUILD file; and have that also generated. I don't think any tools should depend on PEP621 outside of building, and in my experience they don't. I also don't have enough experience with different PEP517 backends to note any major differences that I care about.

bryanwweber commented 2 years ago

Thanks @tgolsson! That's super helpful and more-or-less matches my user expectation as well. I have one question, with regard to this point:

Pants adds build-system block following PEP518 to select a PEP517 backend

Do you care what backend is selected? If you did, how would you want to tell pants about that? The concern I have is that each backend has a different way to be configured, and pants shouldn't (IMO) need to store all that data (plus what happens for a new backend, etc.).

What @benjyw is suggesting is that if you have a pyproject.toml file in your repository, it will be treated as a template. So the user would provide something like:

[project]
name = "{{name}}"
version = "{{version}}"
dependencies = ["requests"]

In that vein, you (the user) could add, if you wanted to select a backend:

[build-system]
requires = ["flit_core"]
build-backend = "flit_core.buildapi"

[tool.flit.module]
name = {{packages[0]}}  # because flit only allows one package

where pants fills in all the templated variables in double-moustache braces, including modifying the dependencies array as necessary to include detected first-party dependencies. Would that work? I think for historical reasons, the setuptools backend would be the one that pants would use, if the user didn't want to specify one.

tgolsson commented 2 years ago

Do you care what backend is selected? If you did, how would you want to tell pants about that? The concern I have is that each backend has a different way to be configured, and pants shouldn't (IMO) need to store all that data (plus what happens for a new backend, etc.).

I do not care, no. I always pick whatever is suggested by the tool either way. So far I haven't found any major difference in how different backends work - I'd expect the supported feature set to be quite close. It makes more sense for me to optimize for a specific one, and maybe generalize later if someone has a use-case that isn't covered.

I think an unfortunate potential side-effect of templating might be that it doesn't parse as valid toml, too? That'd make other tools cranky which read their config from pyproject.toml.

Edit: Another way of solving that though is to export pyproject.toml into virtualenvs along with tool blocks too. That sounds more like a stretch goal but awesome consistency.

benjyw commented 2 years ago

To clarify, I think we'd want to configure which template to use, but I suppose we can default to the "closest" (in a filesystem sense) pyproject.toml/setup.py/setup.cfg we find.

The "a template isn't valid yaml" issue is indeed a problem.

Maybe templating is overkill, now that I think of it. Since almost everything other than setuptools uses pyproject.toml, a mapping of "data field to toml field to write that data to" would suffice, and pants could cleanly augment an existing pyproject.toml without it having to contain placeholders. And we can deal with setup.py specially.

bryanwweber commented 2 years ago

I agree about the valid TOML problem as well... This would actually be a problem for pants itself, if one wanted to use the pyproject.toml file to store dependencies and read them in as python_requirements...

To clarify, I think we'd want to configure which template to use

IMO, pants should only support setup.py and pyproject.toml as "we will generate this for you" options, so there should be a minimum of configuration, regardless of template or not. I think generating pyproject.toml and setup.py are actually mutually exclusive, so the machinery in place to generate setup.py can mostly stay as it is, while new code is added to generate pyproject.toml that uses the existing code where it makes sense (reading setup kwargs, for example). (See the asides below for more discussion)

but I suppose we can default to the "closest" (in a filesystem sense) pyproject.toml/setup.py/setup.cfg we find.

Would it be possible/make sense to have the user specify a resource target if they want to add to an existing pyproject.toml file? Otherwise pants writes out a completely configured pyproject.toml? I suspect the most common case is that people will want to store at least dependencies in the relevant pyproject.toml file to read them into pants via #16932 (although maybe lots of people centralize their dependencies?), so they'll already have the file in their repo...

a mapping of "data field to toml field to write that data to" would suffice, and pants could cleanly augment an existing pyproject.toml

Famous last words, but since PEP 621 has named all the relevant fields, this kind of mapping shouldn't be too hard, and the fields in pyproject.toml should be fairly straightforward to augment from data in BUILD files. I think the harder part will be figuring out how to handle the PEP 517 backend configuration, if it's required (for example, to specify the import name if it's different from the distribution name).

I think the way to handle that case is:

If the user hasn't specified a build-system table in their pyproject.toml file, default to setuptools and tell pants how to configure the setuptools build backend, documented here: https://setuptools.pypa.io/en/latest/userguide/pyproject_config.html#setuptools-specific-configuration
If the user does specify a build-system table, and their import name is different from the distribution name, then they also must add a configuration option in their BUILD which is a string template that pants can interpolate with the appropriate package name(s).

That way, pyproject.toml remains valid TOML syntax, but pants doesn't have to learn how to configure every possible backend. We could even document/link to the documentation for several backends to demonstrate how to do 2.

Aside 1

One could argue that `setup.cfg` should also be included, but I think `pyproject.toml` is the standardized future, `setup.py` is the need-to-support-it legacy option, and `setup.cfg` is stuck in the middle... not standardized outside `setuptools`, and not supported by any other build tools AFAIK. So supporting it in pants seems like more work without much payoff... happy to be convinced otherwise 😄

Aside 2

There is only one other build backend I'm aware of that uses a different configuration file, which is `enscons` that uses a `SConstruct` file to actually run the build steps (in some ways analogous to `setup.py`). In that case though, I'd expect that users could turn `generate_setup` or whatever config option to False.

Aside 3

> Since almost everything other than setuptools uses pyproject.toml Since version 61.0.0 ([March 2022](https://pypi.org/project/setuptools/61.0.0/)) setuptools has supported PEP 621 configuration in `pyproject.toml` as well, and no longer requires a `setup.py` file to be present at all. The only thing that potentially requires `setup.py` still (as far as I understand) are editable installs, which pants doesn't do anyways (I don't think); and even that is changing with PEP 660. See: https://setuptools.pypa.io/en/latest/userguide/pyproject_config.html So, in principle, since setuptools supports pyproject.toml, I think the default in pants could change to writing out the `pyproject.toml` with PEP 621 fields and using `setuptools` as the build-backend. _Hopefully_ (assuming the setuptools backend is correct) that wouldn't break anything that was using the standard `python_artifact()` and `python_distribution(generate_setup=True)` configuration. For folks using plugins to generate setup.py arguments, I don't think that'll work seamlessly, so I think there's still a very good reason to continue to support generating the `setup.py` file. There are probably other edge cases here, like `setuptools-scm` that mean changing defaults has to be a longer term project (if ever) though 😄

benjyw commented 2 years ago

Yeah, this all sounds about right to me!

tgolsson commented 2 years ago

Random shower thought: Could a lot of the issues be sidestepped by making pants also be the backend, or at least wrap a backend? Then it's less of a text-management problem and more about building the wheel etc (which could be delegated).

benjyw commented 2 years ago

Random shower thought: Could a lot of the issues be sidestepped by making pants also be the backend, or at least wrap a backend? Then it's less of a text-management problem and more about building the wheel etc (which could be delegated).

I don't know if we'd want Pants to implement a backend, but we could treat the backend it choses to use as an internal implementation detail. But I think the challenge would remain of how to let users provide data, merge it with Pants-generated data, and feed that into a backend?

tgolsson commented 2 years ago

Yes, the problem is the same. It just takes control of more of the stack - the pants build backend could be opinionated enough to use Pants mechanisms for finding files, deps, etc, as opposed to reading it from the pyproject.toml. So all it has to do is ingest the [project] group, and require e.g. dynamic = ['dependencies'] and whatever other fields that Pants should own completely.

pantsbuild / pants

`python_distribution` should generate PEP 517 and PEP 621 `pyproject.toml` files #12487