Open wallace11 opened 2 months ago
Holy smokes, that is weird and bad. Thanks for the report and the repro.
So this isn't quite so bad: this is in fact to do with partitioning. If you look at the output you can see that it's running on src/enums.py
separately from src/file.py
, but if you rename enums.py
then it runs them both together. But it is not doing the same work twice.
Of course why the names should affect the partitioning is a matter still to be looked into...
I tracked this down and the reason this happens is that the partitioning is based on a hash of the file name:
https://github.com/pantsbuild/pants/blob/main/src/python/pants/core/goals/lint.py#L433 https://github.com/pantsbuild/pants/blob/main/src/python/pants/util/collections.py#L102-L152
I converted the configuration in the OP into a reproducer shell script:
#!/bin/bash
set -e
cd "$(mktemp -d)"
echo "Working directory: $(pwd)"
# Create pants.toml
cat >pants.toml <<EOF
[GLOBAL]
pants_version = "2.21.0"
backend_packages = [
"pants.backend.python",
"pants.backend.project_info"
]
[regex-lint]
config = "@regex-lint.yaml"
EOF
# Create regex-lint.yaml
cat >regex-lint.yaml <<EOF
required_matches:
python_file:
- hello
path_patterns:
- name: python_file
pattern: \.py$
content_patterns:
- name: hello
pattern: ^hello
EOF
# Create project structure
mkdir -p src
cat >src/BUILD <<EOF
python_sources()
EOF
# Create empty Python files
touch src/enums.py
touch src/file.py
# Run pants lint
echo "Running 'pants lint ::'"
# This is to run Pants with some custom log messages that I added
PANTS_SOURCE=~/Projects/pants pants lint ::
echo "Reproducer completed."
If I run the reproducer script using the project configuration in the OP, I get the following output:
./reproducer.sh
Working directory: /var/folders/0w/zhpm9fs12zzcmt3m5799xb9c0000gn/T/tmp.CmgNoR7YDn
Running 'pants lint ::'
Pantsd has been turned off via Env.
23:02:18.89 [INFO] Emitting new batch of size 1
23:02:18.89 [INFO] Items in batch: ['src/enums.py']
23:02:18.89 [INFO] Emitting final batch of size 1
23:02:18.89 [INFO] Items in batch: ['src/file.py']
23:02:18.89 [INFO] Number of batches per request type: {'RegexLintRequest': 2}
23:02:18.89 [ERROR] Completed: Lint with regex patterns - regex-lint failed (exit code 1).
X src/file.py: Didn't match: hello
0 files matched all required patterns.
1 files failed to match at least one required pattern.
23:02:18.89 [ERROR] Completed: Lint with regex patterns - regex-lint failed (exit code 1).
X src/enums.py: Didn't match: hello
0 files matched all required patterns.
1 files failed to match at least one required pattern.
However, if I change the name of src/enums.py
to src/new_enums.py
, then I get:
./reproducer.sh
Working directory: /var/folders/0w/zhpm9fs12zzcmt3m5799xb9c0000gn/T/tmp.pPg07nnA5u
Running 'pants lint ::'
Pantsd has been turned off via Env.
23:04:13.82 [INFO] Emitting final batch of size 2
23:04:13.82 [INFO] Items in batch: ['src/file.py', 'src/new_enums.py']
23:04:13.82 [INFO] Number of batches per request type: {'RegexLintRequest': 1}
23:04:13.82 [ERROR] Completed: Lint with regex patterns - regex-lint failed (exit code 1).
X src/file.py: Didn't match: hello
X src/new_enums.py: Didn't match: hello
0 files matched all required patterns.
2 files failed to match at least one required pattern.
✕ regex-lint failed.
I don't think there is anything special about the name enums.py
, except that it happens to have hash characteristics that shunt it into a separate partition. I suspect there may be a better way to partition files (based on source roots, maybe?) that is more consistent and doesn't rely on the hash of the file name falling into a specific range.
Good detection @krishnan-chandra! Still unclear to me why we partition at all in this case.
Note that partitioning this way ensures balanced partitions, which has its own advantages. Also, for anyone poking around the code, I think this is to do with "batching" (which is to prevent passing too many files to a single process) rather than "partitioning" (which has to do with files that cannot be in the same process because they require different settings in the underlying tool).
So this is partly a UX issue, we are not making it clear to the user that batching is happening, so it looks like the same work is being done multiple times.
But also, why are we batching at all in the case of two files...
The batching code always runs, no matter how many files there are: https://github.com/pantsbuild/pants/blob/main/src/python/pants/core/goals/lint.py#L430-L448
I suppose we could set some threshold / lower bound (maybe 50-100 files?) below which we don't bother with the batching code at all.
If a single partition is used, I'd expect batching to be skipped.
How does the batching code determine the number of batches? I would have assumed it chooses the smallest number of batches so that each batch will have a max of N files for some N. But I guess not?
It doesn't seem that the batching code necessarily creates a specific number of batches - instead, when iterating over the list of items to batch there is a boundary condition for starting a new batch. The only guarantee made is that the batch size will not be larger than the provided size_max
.
Hrm, so this is probably fine for large numbers of input files, but leads to kinda silly behavior on small numbers of input files...
In any case, important to emphasize that work is not being duplicated. It's just that work is being spread over more processes than is necessary.
Yeah. I think there is no bug here, so we can close this issue if no objections @wallace11. While we could potentially optimize the batching behavior a bit further, I'm not sure that necessarily needs to be done here.
@krishnan-chandra The plugin I was working on while discovering this bug requires getting the entire file tree in one go, so while it might appear as a cosmetic bug (still a bug, though...), it's got some real functional implications.
Ah, you are free not to use batching in your own plugin. You can request all the files directly (without using LintFilesRequest
, which does make assumptions about batching).
Ah, you are free not to use batching in your own plugin. You can request all the files directly (without using
LintFilesRequest
, which does make assumptions about batching).
I believe that your suggestion is not possible.
Linting rule doesn't work if the request is not LintFilesRequest.Batch
.
The bare minimum partition rule that I could get working is:
@rule
async def partition_inputs(
request: MyLintFilesRequest.PartitionRequest,
subsystem: MyLintSubsystem,
) -> Partitions[str, Any]:
if subsystem.skip:
return Partitions()
return Partitions.single_partition(sorted(request.files))
My point is that LintFilesRequest
is just a convenience for linters, and one of the main conveniences it provides is batching. But since you don't want that at all, you can just use one of the underlying constructs.
For example, if you want python files you can do:
await Get(
PythonSourceFiles, PythonSourceFilesRequest(..list of targets.., include_files=True)
)
Or if you just want files in general, regardless of any BUILD file targets, you can use Globs
directly.
A big question is: can your linter run on a subset of the repo (based on the cli arguments) or must it always run on every file?
I now understand what you mean.
I guess it's indeed possible to ditch LintFilesRequest
and use something else, however as you already assumed, this would make it more difficult to use targets.
The plugin I developed is a "health check" that makes sure that certain files that have to be somewhere are indeed there. It can check the entire repository or just a subset of it.
Right now, my plugin is working just fine using LintFilesRequest
and it can work on any target provided to it because I receive that "for free".
Depending on where the cut-off happens, it might interfere with the plugin's behavior (because it's checking nearby files).
All in all, I don't think this is a critical bug, but I personally wouldn't dismiss it as "working as intended" because there's clearly a quirk in the system here.
Hopefully you'd consider an improvement to this mechanism in the future :)
Hmm, I see. So I think your use case is not really a linter. The LintFilesRequest
mechanism envisioned cases where all the information you need to lint a file is in that file (and maybe its transitive dependencies). This seems like something else.
Can you elaborate by example? That is, what are the inputs of your health checker and how do they map to outputs? I have a feeling that doing this with Globs
might be more correct for more than just this reason.
We're digressing a bit so I'll try to keep it concise.
I'm taking the list of files that the lint request provides and build a logical tree out of them. Each leaf directory is checked against a known files schema that's determined by one of its parents.
Thinking about it now, I can probably use something like Get(Paths, PathGlobs(globs=['**']))
, combine both files
and dirs
of the output and that should be identical to what the lint request gives me.
Yes, that is more or less what I was thinking. Let us know how it goes!
Describe the bug If a file has the name
enums.py
anywhere in the repository, thelint
goal for any linter that usesLintFilesRequest
will run twice.I was able to reproduce this in a clean environment with the following tree:
Using
regex-lint
as an example (since it usesLintFilesRequest
):pants.toml
regex-lint.yaml
BUILD
enums.py
&file.py
are empty.Output of
pants lint ::
:Pants version 2.21.0
OS Linux
Additional info At some point I was able to also get a file with the name
rules.py
in a custom plugin to cause this behavior but I'm unable to reliably reproduce this. It might be related to the order at which the files are collected or something along these lines but I haven't looked further.Edit:
regex-lint.yaml
(I previously used the version from my production environment)