r-lib / lintr

Static Code Analysis for R
https://lintr.r-lib.org
Other
1.19k stars 184 forks source link

Better support for linting a general vector of files #2135

Open MichaelChirico opened 1 year ago

MichaelChirico commented 1 year ago

Sometimes, we want to lint a vector of files that's not neatly captured by lint_dir().

Currently, there's three ways to do so:

  1. Use lint_dir() with non-default patterns=. This is not very flexible -- patterns= is passed to dir(patterns=), which only operates on basename().
  2. manually loop over the vector with lint(). This sacrifices several things handled by lint_dir(), like a progress meter, possible parallelization (#485), flatten_lints() merging results, etc.
  3. Use a .lintr config to set exclusions. This works well for simple packages/projects, but is not fully general and very clunky for interactive use cases.

One use case I have in mind is linting the R sources. lint_dir() includes lots of files that we'd typically not want to lint, e.g. test scripts for the R parser that intentionally use bad R syntax, and the concatenated source all.R files. It's very hard to get lint_dir() to ignore those all.R files especially, which is very inefficient, so the best way to lint the R sources currently is to filter list.files() by hand, then lintr:::flatten_lints(lapply(filtered_files, lint, ...)).

I can see ~two~ three (3rd thanks to @JSchoenbachler) options for improving this situation:

  1. lint_dir() gets an argument specifying a vector of files, e.g. files=, user can specify either path= or files= and the rest, including lint_package() is unchanged.
  2. lint() accepts a vector of file names in filename=.
  3. A new lint_files() function.

I can see arguments for both; I'm currently leaning towards option (1) because the code change will be very small & will make it easier to plug in to parallelization in the future. The main advantage of option (2) is not growing the argument surface of any functions. (3) is also a new function, but we could have lint_dir() call it so that all of the logic around combining lints over multiple files is still in one place.

Filing issue here for feedback/discussion.

AshesITR commented 1 year ago

Slightly leaning to lint(). Vectorizing lint() should keep a stable output type IINM, so has no apparent downsides to me.

Progress handling and flattening should be refactored into lint() and lint_package() should probalby also be rewritten to use lint() directly.

While we're at it, we should split lint.R into bite sized chunks. It's become quite long.

New calls could be architected like this:

lint_{dir,package} -> lint() -> lint_impl()

Where lint_impl() is the unvectorized lint() and not exported.

lint_{dir,package}() would do file enumeration and delegate the rest to lint().

MichaelChirico commented 1 year ago

NB: This might also simplify the lint_changed_files workflow:

https://github.com/r-lib/lintr/blob/6e52b41d9ff5f8a4f5209151ec37a2e12b60ab84/.github/workflows/lint-changed-files.yaml#L34-L38

MichaelChirico commented 9 months ago

One thing I noticed in initial work on this -- why do we enable caching for ad-hoc lint workflows (lint(text = ...) or lint("code\n"))? That doesn't seem very practical... is there a good reason for this?

If it's just for testing, I'd refactor tests instead.

AshesITR commented 9 months ago

Not sure if any downstreams use this to lint an open file "in-memory", but such usage could benefit a lot from caching. Seeing that most integrations probably don't run in R, it seems unlikely, though.

MichaelChirico commented 9 months ago

I guess I'm still not really seeing the use case... the "in-memory" cache key is based on the entire file string, right?

https://github.com/r-lib/lintr/blob/847d789d5d398a20ce792d2a9478a615cd0d5dfa/R/cache.R#L32-L34

So to benefit it'd have to be re-running lints dozens/hundreds of times on the same un-changed in-memory string?

Or maybe the idea is running a "random-ish" set of linters many times, some number of which are "new" on each run, and those ones don't have to be re-run?

Probably I'm missing something.

AshesITR commented 9 months ago

We also cache at the linter x expr level (cache_lint)