`LintFilesRequest` gets confused with a file named `enums.py`

wallace11 commented 2 months ago

Describe the bug If a file has the name enums.py anywhere in the repository, the lint goal for any linter that uses LintFilesRequest will run twice.

I was able to reproduce this in a clean environment with the following tree:

├── pants.toml
├── regex-lint.yaml
└── src
    ├── BUILD
    ├── enums.py
    └── file.py

Using regex-lint as an example (since it uses LintFilesRequest):

pants.toml

[GLOBAL]
pants_version = "2.21.0"
backend_packages = [
    "pants.backend.python",
    "pants.backend.project_info"
]

[regex-lint]
config = "@regex-lint.yaml"

regex-lint.yaml

required_matches:
  python_file:
    - hello

path_patterns:
  - name: python_file
    pattern: \.py$

content_patterns:
  - name: hello
    pattern: ^hello

BUILD

python_sources()

enums.py & file.py are empty.

Output of pants lint :::

12:55:47.66 [ERROR] Completed: Lint with regex patterns - regex-lint failed (exit code 1).
X src/enums.py: Didn't match: hello

0 files matched all required patterns.
1 files failed to match at least one required pattern.

12:55:47.66 [ERROR] Completed: Lint with regex patterns - regex-lint failed (exit code 1).
X src/file.py: Didn't match: hello

0 files matched all required patterns.
1 files failed to match at least one required pattern.

✕ regex-lint failed.

Pants version 2.21.0

OS Linux

Additional info At some point I was able to also get a file with the name rules.py in a custom plugin to cause this behavior but I'm unable to reliably reproduce this. It might be related to the order at which the files are collected or something along these lines but I haven't looked further.

Edit:

Fixed a typo
Fixed output of lint to reflect the stripped-down regex-lint.yaml (I previously used the version from my production environment)

benjyw commented 2 months ago

Holy smokes, that is weird and bad. Thanks for the report and the repro.

benjyw commented 2 months ago

So this isn't quite so bad: this is in fact to do with partitioning. If you look at the output you can see that it's running on src/enums.py separately from src/file.py, but if you rename enums.py then it runs them both together. But it is not doing the same work twice.

Of course why the names should affect the partitioning is a matter still to be looked into...

krishnan-chandra commented 2 months ago

I tracked this down and the reason this happens is that the partitioning is based on a hash of the file name:

https://github.com/pantsbuild/pants/blob/main/src/python/pants/core/goals/lint.py#L433 https://github.com/pantsbuild/pants/blob/main/src/python/pants/util/collections.py#L102-L152

I converted the configuration in the OP into a reproducer shell script:

#!/bin/bash

set -e

cd "$(mktemp -d)"
echo "Working directory: $(pwd)"

# Create pants.toml
cat >pants.toml <<EOF
[GLOBAL]
pants_version = "2.21.0"
backend_packages = [
    "pants.backend.python",
    "pants.backend.project_info"
]

[regex-lint]
config = "@regex-lint.yaml"
EOF

# Create regex-lint.yaml
cat >regex-lint.yaml <<EOF
required_matches:
  python_file:
    - hello
path_patterns:
  - name: python_file
    pattern: \.py$
content_patterns:
  - name: hello
    pattern: ^hello
EOF

# Create project structure
mkdir -p src
cat >src/BUILD <<EOF
python_sources()
EOF

# Create empty Python files
touch src/enums.py
touch src/file.py

# Run pants lint
echo "Running 'pants lint ::'"
# This is to run Pants with some custom log messages that I added
PANTS_SOURCE=~/Projects/pants pants lint ::

echo "Reproducer completed."

If I run the reproducer script using the project configuration in the OP, I get the following output:

./reproducer.sh

Working directory: /var/folders/0w/zhpm9fs12zzcmt3m5799xb9c0000gn/T/tmp.CmgNoR7YDn
Running 'pants lint ::'
Pantsd has been turned off via Env.

23:02:18.89 [INFO] Emitting new batch of size 1
23:02:18.89 [INFO] Items in batch: ['src/enums.py']
23:02:18.89 [INFO] Emitting final batch of size 1
23:02:18.89 [INFO] Items in batch: ['src/file.py']
23:02:18.89 [INFO] Number of batches per request type: {'RegexLintRequest': 2}
23:02:18.89 [ERROR] Completed: Lint with regex patterns - regex-lint failed (exit code 1).
X src/file.py: Didn't match: hello

0 files matched all required patterns.
1 files failed to match at least one required pattern.

23:02:18.89 [ERROR] Completed: Lint with regex patterns - regex-lint failed (exit code 1).
X src/enums.py: Didn't match: hello

0 files matched all required patterns.
1 files failed to match at least one required pattern.

However, if I change the name of src/enums.py to src/new_enums.py, then I get:

./reproducer.sh
Working directory: /var/folders/0w/zhpm9fs12zzcmt3m5799xb9c0000gn/T/tmp.pPg07nnA5u
Running 'pants lint ::'
Pantsd has been turned off via Env.
23:04:13.82 [INFO] Emitting final batch of size 2
23:04:13.82 [INFO] Items in batch: ['src/file.py', 'src/new_enums.py']
23:04:13.82 [INFO] Number of batches per request type: {'RegexLintRequest': 1}
23:04:13.82 [ERROR] Completed: Lint with regex patterns - regex-lint failed (exit code 1).
X src/file.py: Didn't match: hello
X src/new_enums.py: Didn't match: hello

0 files matched all required patterns.
2 files failed to match at least one required pattern.

✕ regex-lint failed.

I don't think there is anything special about the name enums.py, except that it happens to have hash characteristics that shunt it into a separate partition. I suspect there may be a better way to partition files (based on source roots, maybe?) that is more consistent and doesn't rely on the hash of the file name falling into a specific range.

benjyw commented 2 months ago

Good detection @krishnan-chandra! Still unclear to me why we partition at all in this case.

Note that partitioning this way ensures balanced partitions, which has its own advantages. Also, for anyone poking around the code, I think this is to do with "batching" (which is to prevent passing too many files to a single process) rather than "partitioning" (which has to do with files that cannot be in the same process because they require different settings in the underlying tool).

benjyw commented 2 months ago

So this is partly a UX issue, we are not making it clear to the user that batching is happening, so it looks like the same work is being done multiple times.

But also, why are we batching at all in the case of two files...

krishnan-chandra commented 2 months ago

The batching code always runs, no matter how many files there are: https://github.com/pantsbuild/pants/blob/main/src/python/pants/core/goals/lint.py#L430-L448

I suppose we could set some threshold / lower bound (maybe 50-100 files?) below which we don't bother with the batching code at all.

wallace11 commented 1 month ago

If a single partition is used, I'd expect batching to be skipped.

benjyw commented 1 month ago

How does the batching code determine the number of batches? I would have assumed it chooses the smallest number of batches so that each batch will have a max of N files for some N. But I guess not?

krishnan-chandra commented 1 month ago

It doesn't seem that the batching code necessarily creates a specific number of batches - instead, when iterating over the list of items to batch there is a boundary condition for starting a new batch. The only guarantee made is that the batch size will not be larger than the provided size_max.

benjyw commented 1 month ago

Hrm, so this is probably fine for large numbers of input files, but leads to kinda silly behavior on small numbers of input files...

In any case, important to emphasize that work is not being duplicated. It's just that work is being spread over more processes than is necessary.

krishnan-chandra commented 1 month ago

Yeah. I think there is no bug here, so we can close this issue if no objections @wallace11. While we could potentially optimize the batching behavior a bit further, I'm not sure that necessarily needs to be done here.

wallace11 commented 1 month ago

@krishnan-chandra The plugin I was working on while discovering this bug requires getting the entire file tree in one go, so while it might appear as a cosmetic bug (still a bug, though...), it's got some real functional implications.

benjyw commented 1 month ago

Ah, you are free not to use batching in your own plugin. You can request all the files directly (without using LintFilesRequest, which does make assumptions about batching).

wallace11 commented 1 month ago

Ah, you are free not to use batching in your own plugin. You can request all the files directly (without using LintFilesRequest, which does make assumptions about batching).

I believe that your suggestion is not possible. Linting rule doesn't work if the request is not LintFilesRequest.Batch. The bare minimum partition rule that I could get working is:

@rule
async def partition_inputs(
    request: MyLintFilesRequest.PartitionRequest,
    subsystem: MyLintSubsystem,
) -> Partitions[str, Any]:
    if subsystem.skip:
        return Partitions()
    return Partitions.single_partition(sorted(request.files))

benjyw commented 1 month ago

My point is that LintFilesRequest is just a convenience for linters, and one of the main conveniences it provides is batching. But since you don't want that at all, you can just use one of the underlying constructs.

For example, if you want python files you can do:

await Get(
        PythonSourceFiles, PythonSourceFilesRequest(..list of targets.., include_files=True)
    )

Or if you just want files in general, regardless of any BUILD file targets, you can use Globs directly.

A big question is: can your linter run on a subset of the repo (based on the cli arguments) or must it always run on every file?

wallace11 commented 1 month ago

I now understand what you mean. I guess it's indeed possible to ditch LintFilesRequest and use something else, however as you already assumed, this would make it more difficult to use targets.

The plugin I developed is a "health check" that makes sure that certain files that have to be somewhere are indeed there. It can check the entire repository or just a subset of it. Right now, my plugin is working just fine using LintFilesRequest and it can work on any target provided to it because I receive that "for free". Depending on where the cut-off happens, it might interfere with the plugin's behavior (because it's checking nearby files).

All in all, I don't think this is a critical bug, but I personally wouldn't dismiss it as "working as intended" because there's clearly a quirk in the system here.

Hopefully you'd consider an improvement to this mechanism in the future :)

benjyw commented 1 month ago

Hmm, I see. So I think your use case is not really a linter. The LintFilesRequest mechanism envisioned cases where all the information you need to lint a file is in that file (and maybe its transitive dependencies). This seems like something else.

Can you elaborate by example? That is, what are the inputs of your health checker and how do they map to outputs? I have a feeling that doing this with Globs might be more correct for more than just this reason.

wallace11 commented 4 weeks ago

We're digressing a bit so I'll try to keep it concise.

I'm taking the list of files that the lint request provides and build a logical tree out of them. Each leaf directory is checked against a known files schema that's determined by one of its parents.

Thinking about it now, I can probably use something like Get(Paths, PathGlobs(globs=['**'])), combine both files and dirs of the output and that should be identical to what the lint request gives me.

benjyw commented 4 weeks ago

Yes, that is more or less what I was thinking. Let us know how it goes!

pantsbuild / pants

`LintFilesRequest` gets confused with a file named `enums.py` #21369