Open kaos opened 2 years ago
One thing to check is whether it needs transitive dependencies ("everything recursively") or just direct dependencies (things literally mentioned in the imports or explicitly in BUILD files): it seems like given that it is sorting the imports, there is a distinct possibility that it only requires direct dependencies.
But yea: we should definitely fix this. Accuracy is more important than performance.
Ah, yes. Direct dependencies should be enough.
Adding my 2 cents: we should instead seed isort
's config from source_roots
. I'd prefer if formatters don't require dependencies
That would be better. I've not found how to configure my way out of this, so any help with that is appreciated.
Not urgent, so whenever time permits works for me.
I've dug into this. It's not that it needs dependencies, it actually may need the entire source root. The reason is that isort has subtle logic to deal with namespace packages, presumably so that if foo.bar
and foo.qux
are published from two separate repos, it can treat foo.qux
as third-party in foo.bar
's repo and vice versa.
So if we only have foo/bar/bar.py
in the sandbox, and that file imports from foo.qux
, isort will classify foo.qux
as third-party, because its namespace package detection heuristic sees no .py
files in foo
, and assumes that foo
is a namespace package.
But if some foo/baz.py
happens to also be in the sandbox, then isort sees foo
as a non-namespace package, and classifies it as first-party (which is what you want).
Note that foo/baz.py
may not be a transitive dep of foo/bar/bar.py
, so bringing in all deps does not help here.
So we can either:
A) Bring in the entire source root B) Special-case this, and bring into the sandbox a representative .py file from every parent dir, if there is one.
A seems like massive overkill. So, B I guess? I slightly fear unintended consequences, but it's probably fine since those are real files in the repo, that could have been brought in some other way anyway.
Thanks @benjyw!
B) Special-case this, and bring into the sandbox a representative .py file from every parent dir, if there is one.
+1 to go with option B. I guess the __init__.py
could be that representative? And as such, in most cases should be very light weight as well.
@benjyw just curious, have you looked into the config route?
Nevermind just saw your comment about how the config is wrong for namespace packages
Thanks for digging into this Benjy, and reporting on it Andreas. For context, this was known when isort 5 first came out and our docs call attention to it: https://www.pantsbuild.org/docs/python-linters-and-formatters#isort-possible-issues-with-its-import-classifier-algorithm
But our hand-wavey workaround of config files isn't robust, it sounds like. It's also a total gotcha.
I suspect that that only worked because isort's heuristics respond to the presence of pyproject.toml, regardless of its content.
For isort specifically, would it be feasible to specify first party modules via the --project
arg ("known first party"), rather than simulate the repo structure on disk? This would effectively be using pants' understanding of what's first party rather than relying on isort to guess it.
In our project we were having similar issues with import foo
and import foo.bar
being randomly split across different sections and found that we could acceptably work around it by setting default_section = "FIRSTPARTY"
to just lump all first party and third party packages into a section together.
@huonw Well, you can certainly set that in an isort config file, and Pants will use that config file. But it sounds like you're suggesting generating config?
Yeah, I'm wondering if, for instance, the following function could be expanded to include something like args.extend(f"-p={module}" for module in magically_list_all_python_modules())
(I definitely have no idea what's possible within Pants, and that may make too-long argument vectors).
This hit me again in another project.. ;)
And yea, the "known first party" config option seems like a reasonable way to resolve this. Putting it in a generated isort config rather than as args on the command line could help avoid issues with too long command lines as well.
I thought we were saying the config approach is not viable for namespace packages? @kaos
If config is viable, then I strongly prefer we auto-setup config for you rather than including all the projects in the sandbox. The latter would be substantially slower, and I think not properly tracked with --changed-since
; the output of fmt
depends on what other files are in the sandbox.
I'm not sure if I tried with the "known first party" config option. I only tried the "src paths" one, from what I can read in my issue/PR.
The suggested approach by Benjy I think was to add empty __init__.py
files (or similar) as placeholders to aid the isort heuristics into coming to correct conclusions, so wouldn't need all the projects in the sandbox, fwict.
If I understand correctly, doing this automatically might break the foo.bar
/ foo.qux
case I mentioned above (where foo.qux
is third-party in foo.bar
's repo and foo is a namespace package). We don't know what other repos are out there, so we can't assume that everything under foo
is first-party.
We may have to start by documenting this and offering the workaround of setting the firstparty package explicitly in isort config.
where foo.qux is third-party in foo.bar's repo and foo is a namespace package
To me, we're starting to get a little bit in the weeds on third-vs-first party, but point taken. Perhaps there's an implementation which uses the most-specific (i.e longest path) "superset" of modules?
To me, we're starting to get a little bit in the weeds on third-vs-first party, but point taken. Perhaps there's an implementation which uses the most-specific (i.e longest path) "superset" of modules?
This is the crux of the issue, as I understand it. So I think it's the right amount of weeds...
Well specifically I mean if "foo.bar" is "third party" when we talk about "foo.baz". They share an ancestor, so presumably they're in the same party, albeit different packages and codebases
True, but Pants doesn't care about whether the same people wrote it or own the IP or whatever. It cares about how code is consumed. The salient point here is that foo.bar
is consumed as a requirement. Perhaps "external" and "internal" would be better terminology in general, but keep in mind that isort itself uses the firstparty/thirdparty terminology in its config etc, so in this case it is actually the right term to use.
Right, my point, outside of Pants, is "are packages outside of this one, but still in your namespace considered first party or third-party?" And then slap an isort lens on the question as well
I'm reading over this again, and I'm struggling to grasp the namespace package issue as described above.
@benjyw can you dump a reproduction of the isort issue with namespaces (not using Pants) so we can test a Pants change against this corner case?
OK, so good news and bad news:
Good news: This is easily fixable in a not-nasty way. Essentially specify every import we're going to provide as a firstparty module. It makes sense, and works around the namespace issue. This works purely on the argument level.
Bad news: It requires a bugfix to isort
: https://github.com/PyCQA/isort/pull/2168
Re specify every import we're going to provide as a firstparty module
- we'd have to know it is in fact a firstparty module. I guess we do know this because dep inference knows this?
Setting up a test case repo is a good idea though, I will try and do that asap
we'd have to know it is in fact a firstparty module. I guess we do know this because dep inference knows this?
Not dep inference, but Pants wholistically. Whether it's inferred or explicit, if we're providing a Python module via a file on disk, that must be firstparty. That's what I like about this. It's super simple.
We literally already have this logic in dep inference though. Seems like we can consume that mapping and save ourselves some work.
We don't need the mapping though, we just need the dependencies were about to provide 🙂 (which is a product of the mapping)
As I mentioned in https://github.com/pantsbuild/pants/issues/15069#issuecomment-1141599319 - looking at all deps (even transitively) is not sufficient. So can you explain what "specify every import we're going to provide as a firstparty module" means?
Basically, if an import is being provided by Pants, that's first party, everything else if third party
ELI5? I don't know what "import id" means here.
Sorry fat fingered it on mobile: "import is".
Basically we just tell isort what is first party. And we know that because we know what first party (direct) dependencies we provide.
I think it's probably worth just demonstrating via a reproduction/test, especially since namespace packages are in the mix
Yeah, I fear it's a little more subtle than that, but it's on me to make a repro of the problem.
OK, I wanted this off my stack, so I decided to jump in:
josh@cephandrius:~/work/splash$ tree .
.
├── bad
│ ├── foo
│ │ └── bar
│ │ ├── bar.py
│ │ └── __init__.py
│ └── run_isort.sh
└── good
├── foo
│ ├── bar
│ │ ├── bar.py
│ │ └── __init__.py
│ └── baz.py
└── run_isort.sh
6 directories, 7 files
Files are identical across tree, with:
josh@cephandrius:~/work/splash$ cat bad/foo/bar/bar.py
import foo.qux
import requests
josh@cephandrius:~/work/splash$ cat bad/run_isort.sh
#!/usr/bin/env bash
cd "$(dirname "$0")"
pipx run isort foo/bar/bar.py "$@"
Then running each:
josh@cephandrius:~/work/splash$ ./bad/run_isort.sh
josh@cephandrius:~/work/splash$ ./good/run_isort.sh
Fixing /home/josh/work/splash/good/foo/bar/bar.py
and for contents:
josh@cephandrius:~/work/splash$ diff -u bad/foo/bar/bar.py good/foo/bar/bar.py
--- bad/foo/bar/bar.py 2023-09-08 09:52:26.240101161 -0500
+++ good/foo/bar/bar.py 2023-09-08 09:53:32.650789329 -0500
@@ -1,2 +1,3 @@
-import foo.qux
import requests
+
+import foo.qux
So @benjyw your theory holds.
Now, let's reset everything and try again with my proposed fix (which is "if Pants could provide the direct dependency as a file, we pass -p=<modname>
")
josh@cephandrius:~/work/splash$ ./bad/run_isort.sh -p=foo.qux
Fixing /home/josh/work/splash/bad/foo/bar/bar.py
josh@cephandrius:~/work/splash$ ./good/run_isort.sh -p=foo.qux
Fixing /home/josh/work/splash/good/foo/bar/bar.py
josh@cephandrius:~/work/splash$ diff -u bad/foo/bar/bar.py good/foo/bar/bar.py
So, as expected. We know what's firstparty because we're the all-seeing-eye of firstparty/thirdparty
Reproducer in script form, for easier reproduction on other machines:
cd $(mktemp -d)
# isort setup
python -m venv venv
. venv/bin/activate
pip install isort==5.12.0
# configure the directory structure
mkdir -p foo/bar/
touch foo/bar/__init__.py
cat > foo/bar/bar.py <<EOF
import foo.qux
import requests
EOF
echo '"BUG": foo.qux is treated as third-party'
python -m isort foo/bar/bar.py
cat foo/bar/bar.py
echo 'Proposed fix: explicit -p arg'
python -m isort foo/bar/bar.py -p=foo.qux
cat foo/bar/bar.py
echo 'Proposed fix matches behaviour when the file actually exists'
touch foo/qux.py
python -m isort foo/bar/bar.py
cat foo/bar/bar.py
Now, let's reset everything and try again with my proposed fix (which is "if Pants could provide the direct dependency as a file, we pass -p=
")
(FWIW, this is a independent reinvention of a suggestion slightly earlier in the thread, e.g. https://github.com/pantsbuild/pants/issues/15069#issuecomment-1174460553)
Yup I saw that comment after having the same thought. Great minds think alike?
I have the code in a branch. I need to turn this reproduction into a test on main
so that we can enshrine the quirk, and the fix, forever
So I've made a repro of the subtle problem, and I think I've convinced myself that it's not subtle after all, and that pulling direct deps into the sandbox will be sufficient to solve the problem without mucking around with config:
https://github.com/benjyw/isort-issue
I got tangled up in the idea that a file outside your deps could change the classification, which is true. But it's also true that it can only change it in the same direction as a direct dep would. So...
Hi, what's the status of this issue? Did https://github.com/pantsbuild/pants/pull/19796 fix this, as the description there implies?
This change also fixes https://github.com/pantsbuild/pants/issues/17403 (which really is a duplicate of https://github.com/pantsbuild/pants/issues/15069 manifest in new and exciting ways) by de-duplicating the fix-specific batching strategy. It didn't sit right to me that in the reproduction repo we did 1 batches of 2 files for fix, but 1 batch of 4 files for lint.
I solved this at least temporarily by increasing batch size to larger than the repo, to get consistent results.
pants lint --lint-batch-size=100000
This is using ruff as a formatter and linter.
Unfortunately isort doesn't play nicely with sandboxing, so the best fix is to manually add your top-level package(s) to the "known first party" isort config. That is more robust than attempting to finesse the batches.
Describe the bug
When running
isort
on a single source file, the result is not always consistent compared to when running on all sources.Pants version 2.11.0rc1, 2.12.0.dev1
OS Mac
Additional info
Test case demonstrating this issue in #15002
It seems there are cases even in
fmt
that requires transitive deps to be present, not only forcheck
. See related comment: https://github.com/pantsbuild/pants/pull/14186#issuecomment-1017816080I've run across this a couple of times now at work, where I have a pre-commit hook that runs on a subset of files (exactly the subset of changed files only), and fails, where as when I run by hand I usually use globs like a hammer and includes a whole lot more, which exposes this issue, due to
isort
not being stable for the two ways it is being invoked.