pantsbuild / pants

The Pants Build System
https://www.pantsbuild.org
Apache License 2.0
3.32k stars 636 forks source link

`isort` behavior depends on which files are available in the sandbox #15069

Open kaos opened 2 years ago

kaos commented 2 years ago

Describe the bug

When running isort on a single source file, the result is not always consistent compared to when running on all sources.

Pants version 2.11.0rc1, 2.12.0.dev1

OS Mac

Additional info

Test case demonstrating this issue in #15002

It seems there are cases even in fmt that requires transitive deps to be present, not only for check. See related comment: https://github.com/pantsbuild/pants/pull/14186#issuecomment-1017816080

I've run across this a couple of times now at work, where I have a pre-commit hook that runs on a subset of files (exactly the subset of changed files only), and fails, where as when I run by hand I usually use globs like a hammer and includes a whole lot more, which exposes this issue, due to isort not being stable for the two ways it is being invoked.

stuhood commented 2 years ago

One thing to check is whether it needs transitive dependencies ("everything recursively") or just direct dependencies (things literally mentioned in the imports or explicitly in BUILD files): it seems like given that it is sorting the imports, there is a distinct possibility that it only requires direct dependencies.

But yea: we should definitely fix this. Accuracy is more important than performance.

kaos commented 2 years ago

Ah, yes. Direct dependencies should be enough.

thejcannon commented 2 years ago

Adding my 2 cents: we should instead seed isort's config from source_roots. I'd prefer if formatters don't require dependencies

kaos commented 2 years ago

That would be better. I've not found how to configure my way out of this, so any help with that is appreciated.

Not urgent, so whenever time permits works for me.

benjyw commented 2 years ago

I've dug into this. It's not that it needs dependencies, it actually may need the entire source root. The reason is that isort has subtle logic to deal with namespace packages, presumably so that if foo.bar and foo.qux are published from two separate repos, it can treat foo.qux as third-party in foo.bar's repo and vice versa.

So if we only have foo/bar/bar.py in the sandbox, and that file imports from foo.qux, isort will classify foo.qux as third-party, because its namespace package detection heuristic sees no .py files in foo, and assumes that foo is a namespace package.

But if some foo/baz.py happens to also be in the sandbox, then isort sees foo as a non-namespace package, and classifies it as first-party (which is what you want).

Note that foo/baz.py may not be a transitive dep of foo/bar/bar.py, so bringing in all deps does not help here.

So we can either:

A) Bring in the entire source root B) Special-case this, and bring into the sandbox a representative .py file from every parent dir, if there is one.

A seems like massive overkill. So, B I guess? I slightly fear unintended consequences, but it's probably fine since those are real files in the repo, that could have been brought in some other way anyway.

kaos commented 2 years ago

Thanks @benjyw!

B) Special-case this, and bring into the sandbox a representative .py file from every parent dir, if there is one.

+1 to go with option B. I guess the __init__.py could be that representative? And as such, in most cases should be very light weight as well.

thejcannon commented 2 years ago

@benjyw just curious, have you looked into the config route?

thejcannon commented 2 years ago

Nevermind just saw your comment about how the config is wrong for namespace packages

Eric-Arellano commented 2 years ago

Thanks for digging into this Benjy, and reporting on it Andreas. For context, this was known when isort 5 first came out and our docs call attention to it: https://www.pantsbuild.org/docs/python-linters-and-formatters#isort-possible-issues-with-its-import-classifier-algorithm

But our hand-wavey workaround of config files isn't robust, it sounds like. It's also a total gotcha.

benjyw commented 2 years ago

I suspect that that only worked because isort's heuristics respond to the presence of pyproject.toml, regardless of its content.

huonw commented 2 years ago

For isort specifically, would it be feasible to specify first party modules via the --project arg ("known first party"), rather than simulate the repo structure on disk? This would effectively be using pants' understanding of what's first party rather than relying on isort to guess it.

In our project we were having similar issues with import foo and import foo.bar being randomly split across different sections and found that we could acceptably work around it by setting default_section = "FIRSTPARTY" to just lump all first party and third party packages into a section together.

benjyw commented 2 years ago

@huonw Well, you can certainly set that in an isort config file, and Pants will use that config file. But it sounds like you're suggesting generating config?

huonw commented 2 years ago

Yeah, I'm wondering if, for instance, the following function could be expanded to include something like args.extend(f"-p={module}" for module in magically_list_all_python_modules()) (I definitely have no idea what's possible within Pants, and that may make too-long argument vectors).

https://github.com/pantsbuild/pants/blob/161bbbe0964539e506ab9015a49c783b862cab91/src/python/pants/backend/python/lint/isort/rules.py#L41-L61

kaos commented 2 years ago

This hit me again in another project.. ;)

And yea, the "known first party" config option seems like a reasonable way to resolve this. Putting it in a generated isort config rather than as args on the command line could help avoid issues with too long command lines as well.

Eric-Arellano commented 2 years ago

I thought we were saying the config approach is not viable for namespace packages? @kaos

If config is viable, then I strongly prefer we auto-setup config for you rather than including all the projects in the sandbox. The latter would be substantially slower, and I think not properly tracked with --changed-since; the output of fmt depends on what other files are in the sandbox.

kaos commented 2 years ago

I'm not sure if I tried with the "known first party" config option. I only tried the "src paths" one, from what I can read in my issue/PR.

The suggested approach by Benjy I think was to add empty __init__.py files (or similar) as placeholders to aid the isort heuristics into coming to correct conclusions, so wouldn't need all the projects in the sandbox, fwict.

benjyw commented 2 years ago

If I understand correctly, doing this automatically might break the foo.bar / foo.qux case I mentioned above (where foo.qux is third-party in foo.bar's repo and foo is a namespace package). We don't know what other repos are out there, so we can't assume that everything under foo is first-party.

benjyw commented 2 years ago

We may have to start by documenting this and offering the workaround of setting the firstparty package explicitly in isort config.

thejcannon commented 2 years ago

where foo.qux is third-party in foo.bar's repo and foo is a namespace package

To me, we're starting to get a little bit in the weeds on third-vs-first party, but point taken. Perhaps there's an implementation which uses the most-specific (i.e longest path) "superset" of modules?

benjyw commented 2 years ago

To me, we're starting to get a little bit in the weeds on third-vs-first party, but point taken. Perhaps there's an implementation which uses the most-specific (i.e longest path) "superset" of modules?

This is the crux of the issue, as I understand it. So I think it's the right amount of weeds...

thejcannon commented 2 years ago

Well specifically I mean if "foo.bar" is "third party" when we talk about "foo.baz". They share an ancestor, so presumably they're in the same party, albeit different packages and codebases

benjyw commented 2 years ago

True, but Pants doesn't care about whether the same people wrote it or own the IP or whatever. It cares about how code is consumed. The salient point here is that foo.bar is consumed as a requirement. Perhaps "external" and "internal" would be better terminology in general, but keep in mind that isort itself uses the firstparty/thirdparty terminology in its config etc, so in this case it is actually the right term to use.

thejcannon commented 2 years ago

Right, my point, outside of Pants, is "are packages outside of this one, but still in your namespace considered first party or third-party?" And then slap an isort lens on the question as well

thejcannon commented 1 year ago

I'm reading over this again, and I'm struggling to grasp the namespace package issue as described above.

@benjyw can you dump a reproduction of the isort issue with namespaces (not using Pants) so we can test a Pants change against this corner case?

thejcannon commented 1 year ago

OK, so good news and bad news:

Good news: This is easily fixable in a not-nasty way. Essentially specify every import we're going to provide as a firstparty module. It makes sense, and works around the namespace issue. This works purely on the argument level.

Bad news: It requires a bugfix to isort: https://github.com/PyCQA/isort/pull/2168

benjyw commented 1 year ago

Re specify every import we're going to provide as a firstparty module - we'd have to know it is in fact a firstparty module. I guess we do know this because dep inference knows this?

benjyw commented 1 year ago

Setting up a test case repo is a good idea though, I will try and do that asap

thejcannon commented 1 year ago

we'd have to know it is in fact a firstparty module. I guess we do know this because dep inference knows this?

Not dep inference, but Pants wholistically. Whether it's inferred or explicit, if we're providing a Python module via a file on disk, that must be firstparty. That's what I like about this. It's super simple.

benjyw commented 1 year ago

We literally already have this logic in dep inference though. Seems like we can consume that mapping and save ourselves some work.

thejcannon commented 1 year ago

We don't need the mapping though, we just need the dependencies were about to provide 🙂 (which is a product of the mapping)

benjyw commented 1 year ago

As I mentioned in https://github.com/pantsbuild/pants/issues/15069#issuecomment-1141599319 - looking at all deps (even transitively) is not sufficient. So can you explain what "specify every import we're going to provide as a firstparty module" means?

thejcannon commented 1 year ago

Basically, if an import is being provided by Pants, that's first party, everything else if third party

benjyw commented 1 year ago

ELI5? I don't know what "import id" means here.

thejcannon commented 1 year ago

Sorry fat fingered it on mobile: "import is".

Basically we just tell isort what is first party. And we know that because we know what first party (direct) dependencies we provide.

I think it's probably worth just demonstrating via a reproduction/test, especially since namespace packages are in the mix

benjyw commented 1 year ago

Yeah, I fear it's a little more subtle than that, but it's on me to make a repro of the problem.

thejcannon commented 1 year ago

OK, I wanted this off my stack, so I decided to jump in:

josh@cephandrius:~/work/splash$ tree .
.
├── bad
│   ├── foo
│   │   └── bar
│   │       ├── bar.py
│   │       └── __init__.py
│   └── run_isort.sh
└── good
    ├── foo
    │   ├── bar
    │   │   ├── bar.py
    │   │   └── __init__.py
    │   └── baz.py
    └── run_isort.sh

6 directories, 7 files

Files are identical across tree, with:

josh@cephandrius:~/work/splash$ cat bad/foo/bar/bar.py 
import foo.qux
import requests
josh@cephandrius:~/work/splash$ cat bad/run_isort.sh 
#!/usr/bin/env bash

cd "$(dirname "$0")"
pipx run isort foo/bar/bar.py "$@"

Then running each:

josh@cephandrius:~/work/splash$ ./bad/run_isort.sh 
josh@cephandrius:~/work/splash$ ./good/run_isort.sh 
Fixing /home/josh/work/splash/good/foo/bar/bar.py

and for contents:

josh@cephandrius:~/work/splash$ diff -u bad/foo/bar/bar.py good/foo/bar/bar.py 
--- bad/foo/bar/bar.py  2023-09-08 09:52:26.240101161 -0500
+++ good/foo/bar/bar.py 2023-09-08 09:53:32.650789329 -0500
@@ -1,2 +1,3 @@
-import foo.qux
 import requests
+
+import foo.qux

So @benjyw your theory holds.


Now, let's reset everything and try again with my proposed fix (which is "if Pants could provide the direct dependency as a file, we pass -p=<modname>")

josh@cephandrius:~/work/splash$ ./bad/run_isort.sh -p=foo.qux
Fixing /home/josh/work/splash/bad/foo/bar/bar.py
josh@cephandrius:~/work/splash$ ./good/run_isort.sh -p=foo.qux
Fixing /home/josh/work/splash/good/foo/bar/bar.py
josh@cephandrius:~/work/splash$ diff -u bad/foo/bar/bar.py good/foo/bar/bar.py 

So, as expected. We know what's firstparty because we're the all-seeing-eye of firstparty/thirdparty

huonw commented 1 year ago

Reproducer in script form, for easier reproduction on other machines:

cd $(mktemp -d)

# isort setup
python -m venv venv
. venv/bin/activate
pip install isort==5.12.0

# configure the directory structure
mkdir -p foo/bar/
touch foo/bar/__init__.py
cat > foo/bar/bar.py <<EOF
import foo.qux
import requests
EOF

echo '"BUG": foo.qux is treated as third-party'
python -m isort foo/bar/bar.py
cat foo/bar/bar.py

echo 'Proposed fix: explicit -p arg'
python -m isort foo/bar/bar.py -p=foo.qux
cat foo/bar/bar.py

echo 'Proposed fix matches behaviour when the file actually exists'
touch foo/qux.py
python -m isort foo/bar/bar.py
cat foo/bar/bar.py

Now, let's reset everything and try again with my proposed fix (which is "if Pants could provide the direct dependency as a file, we pass -p=")

(FWIW, this is a independent reinvention of a suggestion slightly earlier in the thread, e.g. https://github.com/pantsbuild/pants/issues/15069#issuecomment-1174460553)

thejcannon commented 1 year ago

Yup I saw that comment after having the same thought. Great minds think alike?

I have the code in a branch. I need to turn this reproduction into a test on main so that we can enshrine the quirk, and the fix, forever

benjyw commented 1 year ago

So I've made a repro of the subtle problem, and I think I've convinced myself that it's not subtle after all, and that pulling direct deps into the sandbox will be sufficient to solve the problem without mucking around with config:

https://github.com/benjyw/isort-issue

I got tangled up in the idea that a file outside your deps could change the classification, which is true. But it's also true that it can only change it in the same direction as a direct dep would. So...

chadrik commented 2 months ago

Hi, what's the status of this issue? Did https://github.com/pantsbuild/pants/pull/19796 fix this, as the description there implies?

This change also fixes https://github.com/pantsbuild/pants/issues/17403 (which really is a duplicate of https://github.com/pantsbuild/pants/issues/15069 manifest in new and exciting ways) by de-duplicating the fix-specific batching strategy. It didn't sit right to me that in the reproduction repo we did 1 batches of 2 files for fix, but 1 batch of 4 files for lint.

noahskelton commented 2 months ago

I solved this at least temporarily by increasing batch size to larger than the repo, to get consistent results. pants lint --lint-batch-size=100000 This is using ruff as a formatter and linter.

benjyw commented 2 months ago

Unfortunately isort doesn't play nicely with sandboxing, so the best fix is to manually add your top-level package(s) to the "known first party" isort config. That is more robust than attempting to finesse the batches.