Resolves should automatically include transitive targets, and not restrict 1st party source code

jasondamour commented 3 months ago

Problem

Migrating a monorepo with libraries and services with conflicting version constraints to pants requires so much effort compared to the intuitive pattern in poetry/pip.

For example, imagine the following repository:

flowchart LR
    X(3rd-party-lib)
    subgraph Libraries/
        libA
        libB
    end
    subgraph Services/
        service1
        service2
    end
    libA -.-> libB
    libB -.->|>=1| X
    service1 --> libA
    service2 --> libB
    service2 -.->|==2| X

Sample repo: https://github.com/jasondamour/pants-vs-poetry

This repository is valid in poetry. We can run poetry lock for each service:

service1: has all dependencies from libA, libB, and 3rd-party-lib==3.0 (latest version of 3rd-party-lib)
service2: has all dependencies from libB and 3rd-party-lib==2.0

_All the same concepts should apply to pythonrequirements from requirements.txt and pep-621 compliant pyproject.toml files, I'm just using poetry as an example

For the equivalent in Pants, I need to:

In pants.toml, define a resolve per service and library
Parametrize targets to add ALL libraries to ALL resolves a. Pants struggles to infer dependencies when resolves are enabled, so it also requires a lot of manual dependencies

Solution

Resolves should (optionally) behave more like poetry per-module Virtual Environments.

A "resolve" should just represent version constraints for 3rd party dependencies (including transitive)
Don't impose restrictions on what 1st party code can run with a resolve
Don't require a lockfile for every resolve

For example:

Define resolves:

# pants.toml
[python.resolves]
service1 = "Services/service1/pants.lock"
service2 = "Services/service2/pants.lock"

Set Services/service1/BUILD: poetry_requirements(resolve="service1") to add direct targets of this directory to @resolve=service1 (and repeat for service2)
Declare a dependency (same as vanilla poetry) services/service1/pyproject.toml: libA = {path = "../../Libraries/libA", develop = true}
pants generate-lockfile --resolve=service1 should create a lockfile which pins libA, libB, and 3rd-party-lib
pants test Libraries/libA:: should run without a resolve or lockfile

Describe alternatives you've considered

We may just go back to directly using poetry for all dependencies, and exporting lockfiles for services from poetry for pants to consume in tests. I think that will give us exactly what I'm describing.

tdyas commented 3 months ago

How does Poetry behave if you use poetry lock to generate a poetry.lock lockfile and then try to use two different major versions of a dependency?

jasondamour commented 3 months ago

How does Poetry behave if you use poetry lock to generate a poetry.lock lockfile and then try to use two different major versions of a dependency?

Can you clarify the scenario you're asking about?

jasondamour commented 3 months ago

I've created a sample repo here: https://github.com/jasondamour/pants-vs-poetry

benjyw commented 3 months ago

So in the Poetry equivalent, which lockfile is used when you run tests on, say, LibA?

benjyw commented 3 months ago

When, say, packaging the services it's clear to me what you expect. I'm less sure about what the expected behavior is when you act on the libraries.

benjyw commented 3 months ago

For context, this is very helpful for https://github.com/pantsbuild/pants/discussions/20897

jasondamour commented 3 months ago

So in the Poetry equivalent, which lockfile is used when you run tests on, say, LibA?

It depends what project/directory you're acting from:

➜  poetry run --directory Libraries/libB python Libraries/libB/libb/main.py
flask version: 3.0.3

➜  poetry run --directory Services/service2/ python Libraries/libB/libb/main.py
flask version: 2.3.3

In pants-speak, I guess I would describe this as: using the resolve from the directly invoked target (top of the stacktrace), rather than the resolve of the transitive target?

benjyw commented 3 months ago

I see. And if there are multiple directly invoked targets?

$ pants test Libraries/::

for example.

benjyw commented 3 months ago

In the new python backend I plan to "partition by config", so if there were separate resolves it would be obvious which one pertains.

jasondamour commented 3 months ago

IMO the above statement should expand to something like this:

$ pants test :: # this "expands" to:
pants test Libraries/libA/liba/main_test.py # this test (and all transitive deps) run with @resolve=libA
pants test Libraries/libB/libb/main_test.py # this test (and all transitive deps) run with @resolve=libB
pants test Services/service1/service1/main_test.py # this test (and all transitive deps) run with @resolve=service1
pants test Services/service2/service2/main_test.py # this test (and all transitive deps) run with @resolve=service2

but that would also require all python_sources to be added to all resolves, or resolves shouldn't include sources

In poetry's case specifically, it also feels like a resolve should be 1:1 with source roots. Where all python_requirements within a given source root are added to the same resolve

benjyw commented 3 months ago

Yeah, that makes sense and is roughly along the lines I was thinking of for the new backend. I need to think about whether there is some way to emulate this in the existing backend, I'll play around in your repo a bit.

jasondamour commented 3 months ago

Sounds good, I really appreciate it.

benjyw commented 3 months ago

Hmm, I haven't come up with much. Maybe set enable_resolves = false when running tests, and letting Pants yolo it without a lockfile, and then use the service lockfiles when packaging? It's pretty lame, but you might get some mileage.

TBH this is not a use case that the original backend considered. It was focused more on keeping requirements coherent across a large "JVM-style" tightly-bound monorepo, rather than a collection of standalone projects that are loosely bound (e.g., consume each other as requirements at runtime) and happen to be in the same repo.

I am working on the new backend precisely to address this looser case, to make Pants less nannyish and let you do what you want rather than trying to keep you away from sharp edges.

Sorry I don't have better news except "wait for the new backend"... :(

Again, you may be able to do something by having multiple config files and switching between them depending on what you're doing. Lame, but might tide you over.

benjyw commented 3 months ago

Meanwhile your feedback/participation in https://github.com/pantsbuild/pants/discussions/20897 would be really useful. Linking to this issue and the repo from there would provide a real-world use case to kick the tires on.

jasondamour commented 3 months ago

A new python backend (or pants v3) would be awesome, but is there any option achievable with the current backend? If there was an option where resolves would not be applied to source targets (any source could be used by any target), then I think that would do the job for us. Then "lockfiles" would actually work how they do in every other dependency management tool.

I'm willing to look into contributing this, if you could give some pointers

jasondamour commented 3 months ago

@benjyw Based on this comment, it seems like there might be viable path forward. Pants could completely support/document the concept of None resolve, which is a special resolve (named None or something else like *), which does not generate a lockfile and can be used by any other resolve

I ran pants from source with the default resolve set to None directly in setup.py (to workaround the issues of declaring None in toml), and pants still treated None like a string in a lot of places. I'll keep playing with this

jasondamour commented 3 months ago

Hm, "nullable resolves" alone wouldn't address the issue of inferring a different import owner based on which target is directly invoked.

So to get something working in the current python backend, one approach could be:

A new [python-infer].ambiguity_resolution called by_resolve a. This looks for a dependency defined in the same resolve as the top-level target, then falls back onto by_source_root b. This might have very poor memoization performance, as the graph could be different for each top-level target
[python].default_resolve defaults to None a. [python].enable_resolves defaults to true (could even be deprecated+removed, since the new behavior of defaulting targets to the 'None' resolve is equivalent to running without resolves)

Here's what this would look like in my sample repo:

# pants.toml
[python-infer]
ambiguity_resolution = "by_resolve"

[python.resolves]
service1 = "Services/service1/pants.lock"
service2 = "Services/service2/pants.lock"

Then all targets in Libraries/libA and Libraries/libB default to resolve=None. In each Services/service1/BUILD and Services/service2/BUILD, set the respective resolve, i.e.:

_defaults__(all=dict(resolve="service1))

Then, for pants test Services:: we should get the resulting dependency graphs:

First, tests in Service1 where Flask is not defined in the service1 resolve

Services/service1/service1/main_test.py@resolve=service1 # This is the "top-level resolve" for this graph
-> Services/service1/service1/main.py@resolve=service1  
-> Libraries/libB/libb/main.py[@resolve=None] 
-> Libraries/libb:poetry#Flask

Then, tests in Service2 where Flask is defined directly in the service2 resolve

Services/service2/service2/main_test.py@resolve=service2 # this is the "top-level resolve"
-> Services/service2/service2/main.py@resolve=service2
-> Libraries/libB/libb/main.py[@resolve=None]
-> Services/service2:poetry#Flask@resolve=service2 # since flask exists in the top-level resolve

benjyw commented 3 months ago

A new ambiguity resolution method sounds pretty sensible for this, and if you can get that to work we'd welcome the PR. Let us know how it goes as you continue hacking on this.

jasondamour commented 3 months ago

I threw my POC into this PR, but I ran into some questions (like you mentioned) that I'm not able to answer:

If there are multiple directly invoked targets?
If a directly invokd target belongs to multiple resolves?

IMO this would require a new abstraction, like a "Pants Subproject" , where:

Files in a given directory always run with a certain resolve
Source Root patterns can default to [ '$PANTS_SUBPROJECT_NAME/', 'src/', 'test/' ] which is the poetry default (that would entirely remove the need for marker files for us)
Requirements directly in a subproject override transitive dependencies
- this may be dependent on pants fully parsing transitive dependencies
Pants.toml configuration can be overridden per-subproject
- ik theres discussion about enabling pants.toml configuration overrides at ANY directory level, but IMO thats potentially "too" powerful/flexible, and configuration per-subproject might offer a good compromise

pantsbuild / pants