pantsbuild / pants

The Pants Build System
https://www.pantsbuild.org
Apache License 2.0

Add generic support for parametrizing metadata like interpreter constraints and resolves #13882

Closed Eric-Arellano closed 2 years ago

Eric-Arellano commented 2 years ago

Status quo

It's useful to run the same target with slightly different config. For example, run the same python_test with both Python 2 and Python 3 to validate that it works with both. This is a major benefit of Tox, for example, so we know the feature has value.

In particular, it often makes sense to permute metadata like interpreter constraints and resolves.

To do that, you currently have to define two targets:

python_test(name="tests_py2", source="tests.py", interpreter_constraints=["==2.7.*"])
python_test(name="tests_py3", source="tests.py", interpreter_constraints=["==3.6.*"])

Then, you can tell Pants to run either :tests_py2 or :tests_py3, or use globs like a file spec to run over both.

A major downside of this approach is duplication. Another is that it creates ambiguity for dependency inference, because two targets export the same file.

First-class parametrization

Instead, @benjyw @stuhood and I discussed a syntax like this:

shunit2_test(name="tests", source="tests.sh", shells=["bash", "zsh"])

This would run the same tests once per shell, or, for interpreter constraints, once per "named" interpreter constraint (https://github.com/pantsbuild/pants/issues/12652).

This would require reworking the interaction between the Target API and the Rules API, which is currently heavily premised on one target -> one "build". For example, a FieldSet is uniquely identified by its target's Address, and our generic core/goals/test.py is not capable of creating two PytestFieldSets for the same Target.
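The one-target -> one-"build" coupling can be sketched as follows. This is a hypothetical, simplified model (Address, PytestFieldSet, and their fields are stand-ins, not the real Pants types) showing why keying a field set by address alone cannot represent two configurations of the same target:

```python
from dataclasses import dataclass

# Hypothetical sketch of the status-quo coupling: a field set is keyed by
# exactly one address, so one target can only yield one "build".
@dataclass(frozen=True)
class Address:
    spec_path: str
    target_name: str

@dataclass(frozen=True)
class PytestFieldSet:
    address: Address  # the unique key
    interpreter_constraints: tuple[str, ...]

target_address = Address("src/tests", "tests")
py2 = PytestFieldSet(target_address, ("==2.7.*",))
py3 = PytestFieldSet(target_address, ("==3.6.*",))

# Both field sets collide on the same key, so any generic code that maps
# Address -> FieldSet can keep only one of them.
assert py2.address == py3.address
assert py2 != py3
```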

"Configurations"?

Each target type may have multiple parametrizable fields, e.g. for java_test: compatible_resolves, scala_versions, java_versions. One idea is to have users create a single config that sets all of these values into a single addressable(?) entity. Then the target simply points to which config to use.

(I'm not sure what the benefit of doing this would be?)

benjyw commented 2 years ago

(I'm not sure what the benefit of doing this would be?)

The benefit would be that partitioning logic works without having to know about all the axes it has to partition on.

Alternatively, you'd have to partition according to the cross-product of all parameters, which can have a cardinality much greater than the actual set of used configurations. Especially since in many cases there may be correlations between the parameters (e.g., the interpreter constraint for Python2 is strongly correlated with the resolve for Python2).
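The cardinality argument above can be made concrete with a toy example (the parameter values and counts here are illustrative, not from Pants):

```python
from itertools import product

# Illustrative numbers only: 3 interpreter constraints x 4 resolves.
interpreters = ["==2.7.*", "==3.6.*", "==3.9.*"]
resolves = ["data-science", "web", "cli", "legacy"]
full_cross_product = list(product(interpreters, resolves))

# In practice the parameters are correlated, so only a few combinations are
# actually used; e.g. py2 only ever pairs with the legacy resolve.
used_configurations = [
    ("==2.7.*", "legacy"),
    ("==3.6.*", "web"),
    ("==3.9.*", "web"),
]

print(len(full_cross_product))   # 12 partitions if we cross everything
print(len(used_configurations))  # 3 partitions if configs are named up front
```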

stuhood commented 2 years ago

In the engine experiment, @jsirois had syntax for selecting configurations from dependees: address@key1=value1,key2=value2. That could also be used here to uniquely identify one configuration of a target for the purposes of running multiple configurations in parallel: i.e.:

✓ src/python/pants/backend/scala/goals/tailor_test.py:tests@interpreter=py2 succeeded.
✓ src/python/pants/backend/scala/goals/tailor_test.py:tests@interpreter=py3 succeeded.

Where the various configurations are actually defined is an open question for me though, and is very related to #13767.

kaos commented 2 years ago

Just throwing this out there....

Thinking what it could look like, taking inspiration from what pytest.mark.parametrize does for test functions.


shunit2_test(name="tests", source="tests.sh", shell=parametrize("bash", "zsh"))

# above generates the equiv of => 

shunit2_test(name="tests#shell-1", source="tests.sh", shell="bash")
shunit2_test(name="tests#shell-2", source="tests.sh", shell="zsh")

That is, a new parametrize construct that turns any target into a target generator for itself, with a certain field replaced in each generated instance.

And for the python_test case it could look like:

python_test(name="tests", source="tests.py", interpreter_constraints=parametrize(["==2.7.*"], ["==3.6.*"]))

# ==>

python_test(name="tests#interpreter_constraints-1", source="tests.py", interpreter_constraints=["==2.7.*"])
python_test(name="tests#interpreter_constraints-2", source="tests.py", interpreter_constraints=["==3.6.*"])

Not sure how to support parametrizing on multiple fields, or what happens if the target itself is already a target generator, etc.
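The single-field case of this expansion can be sketched in a few lines. This is only an illustration of the idea above (not the eventual Pants implementation); the expand helper and its behavior are assumptions:

```python
# Sketch only: a parametrize() marker turns one declared target into several
# generated ones, replacing the parametrized field in each copy.
class parametrize:
    def __init__(self, *values):
        self.values = values

def expand(**fields):
    """Yield (generated_name, fields) pairs, one per parametrized value."""
    marked = {k: v for k, v in fields.items() if isinstance(v, parametrize)}
    if not marked:
        yield fields["name"], fields
        return
    # For simplicity this sketch handles a single parametrized field.
    (field, marker), = marked.items()
    for i, value in enumerate(marker.values, start=1):
        generated = dict(fields, **{field: value})
        generated["name"] = f"{fields['name']}#{field}-{i}"
        yield generated["name"], generated

for name, _ in expand(name="tests", source="tests.sh",
                      shell=parametrize("bash", "zsh")):
    print(name)  # tests#shell-1, then tests#shell-2
```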

thejcannon commented 2 years ago

This matters less for monorepo-style internal company code, which usually isn't published. But most published libraries are tested over multiple versions of Python (including multiple versions of Python 3). At my last company we wanted to test across all major supported py3 versions, so this would help simplify that.

However, v2 macros do solve the duplication issue, and although macros are the Pants stepchild, they raise no further issues other than "yuck, macros" :wink: (and all that entails). Users also get to control the name, which is nice.


I like @stuhood 's proposal of a different naming scheme for this as well as @kaos 's for parameterization. Although just to throw out this wild syntax:

@parameterize("shell", ["bash", "zsh"])
shunit2_test(name="tests", source="tests.sh")

Additionally, throwing out there that the naming scheme could mirror pytest's (using square brackets): src/python/pants/backend/scala/goals/tailor_test.py:tests[interpreter=py2]
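That bracketed-id scheme amounts to simple string formatting over the parameters, which a small illustrative helper (not pytest's actual code) can show:

```python
# Sketch of pytest-style bracketed ids: base[key=value,...].
def bracketed_id(base: str, **params: str) -> str:
    inner = ",".join(f"{key}={value}" for key, value in params.items())
    return f"{base}[{inner}]"

print(bracketed_id(
    "src/python/pants/backend/scala/goals/tailor_test.py:tests",
    interpreter="py2",
))
# src/python/pants/backend/scala/goals/tailor_test.py:tests[interpreter=py2]
```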

tdyas commented 2 years ago

Would these parameters include implicit parameters like the execution and target platforms?

I've been playing around in my mind with the concept of a "build context" to make the platform explicit in build rules. An initial use case would be cross-compilation of Linux Go binaries from macOS for use in Docker images (which I saw on Slack a user had tried to do).

I'd love your feedback on how much a "build context" concept overlaps with the parameters idea in this issue.

A few thoughts:

  1. Should there be an explicit BuildContext type in the standard rules which is then a field in every rule type that does builds? BuildContext at the very least would have execution_platform and target_platform fields.
  2. Could the engine support scoped overrides of singletons so that a rule could just call for the current BuildContext as a parameter if it differed from some outer BuildContext? An example use case is a go_binary that has an explicit (yet to be created) target_platform field to set an explicit target platform to build for. (This could actually be a form of multi-type dispatch for Get selectors. E.g., await Get(BuiltPackage, PackageFieldSet, some_field_set, build_context=BuildContext(...)).)
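A minimal sketch of the proposed (not yet existing) BuildContext type, with only the two fields named above; the Platform enum values and everything else are assumptions for illustration:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical platform enum; the real set of platforms would differ.
class Platform(Enum):
    LINUX_X86_64 = "linux_x86_64"
    MACOS_X86_64 = "macos_x86_64"

# Sketch of the proposed BuildContext with the two fields described above.
@dataclass(frozen=True)
class BuildContext:
    execution_platform: Platform
    target_platform: Platform

# The cross-compilation use case (Linux Go binary built from macOS) would
# then be expressed as an execution/target platform mismatch:
ctx = BuildContext(
    execution_platform=Platform.MACOS_X86_64,
    target_platform=Platform.LINUX_X86_64,
)
print(ctx.execution_platform != ctx.target_platform)  # True: a cross-build
```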
tdyas commented 2 years ago

Is there a design doc for this issue?

kaos commented 2 years ago

I love @tdyas's idea of a BuildContext, to gather together relevant information about what to build for. Currently there are disconnected fields and options hinting at what is required in terms of interpreter constraints (for Python) and platforms.

I think it could make sense to be able to provide these details as a form of "build input", where you can say: I want to build for "IC pypy>=3.8 and platform x" and have that propagate to all considered targets, where applicable.

Akin to named resolves, this could be with named build contexts, so you can setup ahead of time the set of combinations you want to support, then selecting one or more of them to build for. Also, a target could specify compatible build contexts, so it doesn't build/use something that is known not to work.

Say, a docker_image could perhaps provide something like compatible_build_platform="linux-x86-64-cp-38-cp38" or compatible_named_build_context="build-context-3.8". Then, when building the image, if no build context is explicitly selected, it could select one that is compatible with what is going to be required for the upstream targets it depends on.

stuhood commented 2 years ago

2. Could the engine support scoped overrides of singletons so that a rule could just call for the current BuildContext as a parameter if it differed from some outer BuildContext?

If it will be overridden at some points in the graph, then it shouldn't be a singleton: rather, a Param specified at the root of the graph and overridden in subgraphs. But yea: it's effectively equivalent, and that is what has been proposed for the Platform: see the Multi-Platform Speculation design doc.

But to your first question: "no": I don't think that you need a BuildContext "$deity object" that holds all of the context fields. The engine allows for injection of as many or as few parameters as are actually needed in some scope (by design, precisely for this case), and so if you use one fewer param than some other subgraph (say: you don't consume the Platform), then you can have a smaller cache key and be platform independent, even though your callers weren't.

I think that the implementation sketch on Multi-Platform Speculation is still up to date, and I put a fair amount of thought into it at the time. Having multiple Params that make up your "build context" in your subgraph requires one (hopefully small) engine feature, which would reduce/remove the need for multi-field Request objects (or something like a BuildContext): #7490.


So two things to consider here:

  1. what the UX of this unification of config is for end users, and how it makes things easier for them (which is really the primary goal of this ticket I think). In that case, having a named context/"configuration" might make sense.
  2. how callers provide parameters to callees in the engine (which could be completely independent from the label syntax, by e.g., eagerly converting labels into typed Params which are then used or not used by callees). Global/named contexts probably won't make sense here.
tdyas commented 2 years ago

Another use for parameters: specifying which codegen backend to use: see (4) at https://github.com/pantsbuild/pants/issues/14041

stuhood commented 2 years ago

Prior art which is similar to any sort of local parameterized decorator or syntax: bazel's select macro.

But very relevant is that parameterization or select are local-only solutions (see "You can improve on this solution..." here, which motivates bazel's toolchains): I'm not sure how they could easily be applied via #13767.


"Toolchains" to define the bulk of the arguments to a target seem closer to what we want, possibly with a local-only "multiplex across all of these toolchains" syntax (á la parameterization) exclusive to the named/addressed toolchain. But toolchains are quite a bit like a "$deity object" (BuildContext, as @tdyas described it)... exploding the toolchain out into a bunch of component Fields while constructing the multiplexed targets could solve that...?

stuhood commented 2 years ago

Here is a draft design for this feature: https://docs.google.com/document/d/1mzZWnXiE6OkgMH_Dm7WeA3ck9veusQmr_c6GWPb_tsQ/edit ... use a pants-devel@ member email for comment access.

Feedback welcome!

stuhood commented 2 years ago

Some preliminary docs were added here: https://www.pantsbuild.org/v2.11/docs/targets#parametrizing-targets ... edits very welcome!

I'm out for a few days, but when I get back I'll add a list of open items here from #14408 that block closing this ticket.

kaos commented 2 years ago

Can't help but notice that my original idea would actually be valid now (cool 😁 )

shunit2_test(name="tests", source="tests.sh", shell=parametrize("bash", "zsh"))

stuhood commented 2 years ago

A few issues encountered in https://github.com/pantsbuild/example-jvm/pull/12, which probably block resolving this:

Followup work would be:

stuhood commented 2 years ago

Have updated the blockers in the above comment. We'll want to close this issue for 2.11.x, since it's important for multiple-resolves.

stuhood commented 2 years ago

#14521 is the last blocker for 2.11.x: should be able to start it tomorrow. Let's switch to tracking that one.

Thanks to everyone for the feedback and design assistance!