proposal: x/tools/go/analysis: change singlechecker/multichecker to allow customization of package loading

hyangah commented 2 years ago

Background

The existing singlechecker.Main and multichecker.Main are simple: they take package patterns string args (e.g. ./...), loaded them with a hard-coded go/packages.Config, and applied analyzers on the loaded packages. They also hand flags common across many analyzers, and handle error parsing, etc.

They are convenient, but not flexible enough to support analyzers that need special setup, more involved go/packages.Config (e.g. excluding tests, loading modules, applying overlays). An example analyzer I want to write:

It needs to initialize its state based on the module information loadable with go/packages.NeedModules load mode. Current APIs don't use NeedModules nor expose the loaded packages.
The initialization needs to access the loaded packages - e.g. this loads vulnerability info from a remote DB, and the loaded go/packages.Package and their modules are needed to retrieve only relevant info.
Since singlechecker.Main and multichecker.Main handle flag parsing, attempting to parse flags or command arguments before them will clobber their flag parsing logic.

Potential workaround is either to load the packages twice or to write the analyzer in a way that the very first call of its Run triggers the remaining initialization work. , which isn't ideal.

I propose we introduce a new API that provides hooks and potential customization options. With the change users can

Use a different load mode, configuration, or even filter/fake the packages.
Inject an extra initialization logic that requires loaded packages before running analysis.

Option 1: A new `Run` that takes a `Config` parameter:

package singlechecker

type Config = checker.Config

func Run(cfg Config, analyzers ...*analysis.Analyzer) { ... }

func Main(analyzers ...*analysis.Analyzer) {
  Run(Config{}, analyzers...)
}

package checker // golang.org/x/tools/go/analysis/internal/checker

type Config struct {
   // Load is called once at the start of Run to load packages specified by patterns.
   Load func(mode packages.LoadMode, patterns []string) ([]*packages.Package, error)
}

Option 2: A package-level global variable for stubbing:

We can consider a package-level global Load or Config variable that affects the existing Main's functionality.

package multichecker

// Load is a load function used by Main when loading packages to analyze.
// This is exported so users can inject their custom package load logic.
var Load = func(mode packages.LoadMode, patterns []string) ([]*packages.Package, error) { /* default load */ }

// Main uses Loader to load packages. 
func Main(analyzers... *analysis.Analyzer) { ... }

Since singlechecker and multichecker are used as the 'main' of an analysis tool and already made use of many process-wide global variables, I expect the introduction of a global, exported variable is not too bad.

cc @golang/tools-team @adonovan @timothy-king

timothy-king commented 2 years ago

It needs to initialize its state based on the module information loadable with go/packages.NeedModules load mode.

We can forward information from packages.Modules to the Pass when a flag is set on the Analyzer (like RunDespiteErrors). This would still be per *Package, but it would be more information. Would that suffice? If this would help, that should probably be its own proposal. (I vaguely recall this being discussed before. I did not find it.)

expose the loaded packages

I don't think I know what this means. Can you be more specific about the API?

The initialization needs to access the loaded packages - e.g. this loads vulnerability info from a remote DB, and the loaded go/packages.Package and their modules are needed to retrieve only relevant info.

It is not yet clear to me whether this will fit into the {single,multichecker} framework yet. Config will need to make sense within unitchecker mode and split across processes. Not clear if this is compatible with your goals or not yet.

Also the checkers cache on *Pass, which are conceptually (*Analyzer, *Package) pairs. For a remote db, it will need to be possible to invalidate a *Pass based on an external source of data while in unitchecker mode.

We can potentially expose a different part of internal/analysis too. Examples include a different Main function in another package or a more direct Run function that does not support unitchecker mode (or uses it differently). I am not sure I understand the requirements well enough to know if this would be compatible with your goals.

Option 2: A package-level global variable for stubbing:

2cents: I am not seeing an advantage of doing Option 2 over Option 1 yet. One could just pass Load as parameter.

gopherbot commented 2 years ago

Change https://go.dev/cl/410254 mentions this issue: go/analysis: add Config, Run in single/multichecker

hyangah commented 2 years ago

We can forward information from packages.Modules to the Pass when a flag is set on the Analyzer (like RunDespiteErrors). This would still be per *Package, but it would be more information. Would that suffice? If this would help, that should probably be its own proposal. (I vaguely recall this being discussed before. I did not find it.)

The kind of initialization I want to do should run outside the analysis - for example, accessing the network to load extra info such as vulndb queries. We don't want to do that in the middle of some analyzer step is running in arbitrary time. We want the network access to run in predictable manner.

https://go-review.googlesource.com/c/tools/+/410125/1 demonstrates a simple example analyzer that controls package loading and injects custom initialization logic using a proposed api (option 1)

At this moment, unitchecker isn't my focus, but I expect that can be also extended in a similar style (each distributed process runs a kind of initialization)

I am not seeing an advantage of doing Option 2 over Option 1 yet. One could just pass Load as parameter.

Smaller API surface - Instead of two ways (Run and Main), there is only one way to start analysis. I expect most analyzers will not need to switch to the new API.

adonovan commented 2 years ago

I'm not convinced that either of these options is actually the best solution to the specific problem at hand (that we discussed in person), which is that you have an analysis (similar to https://go-review.googlesource.com/c/tools/+/408715) you wish to apply to all the packages of a program in topological order, but your analysis also needs global information about the program (a set of vulndb entries) that forces you to call packages.Load before doing the analysis. I think this indicates that the analysis simply isn't an instance of the kind of modular analysis that the go/analysis framework was designed to support. It would be relatively straightforward to modify the analysis to perform its own parallel depth-first traversal over the go/packages.Package graph. It wouldn't need to use Facts because it could assume a single address space. (Indeed, I developed the sketch in the CL above using go/packages directly and then converted it to go/analysis once it was working. It wasn't a difficult change and it would be easy to undo.)

An alternative approach would be to design an interface to the vulndb that allows each go/analysis Pass to query the database piecemeal as it encounters the need. This would of course make more calls to the database (one per package, or perhaps as few as one per package on the critical path) but they could be cached in the local file system to avoid repeat calls from developer machines.

hyangah commented 2 years ago

It wouldn't need to use Facts because it could assume a single address space. (Indeed, I developed the sketch in the CL above using go/packages directly and then converted it to go/analysis once it was working. It wasn't a difficult change and it would be easy to undo.)

Module approach in single address space is still beneficial because it allows incremental updates. For example, when a user is editing a package, we don't need to recompute the whole program again but can use the facts from previous analysis for unchanged packages.

We can invent our own wheel to allow such incremental analysis but writing this in a custom way would make it harder to be adopted by existing linters or gopls.

An alternative approach would be to design an interface to the vulndb that allows each go/analysis Pass to query the database piecemeal as it encounters the need.

This is what I want to avoid. In gopls, for example, diagnostics and analysis run can occur at arbitrary time while users edits a file. I expect the vulndb query and local copy update to be completely decoupled from the actual analysis (either updated periodically, or only when go.mod change is detected)

hyangah commented 2 years ago

@adonovan and I had a chat and we decided to explore a different approach where the analyzer takes the input from local source specified with analyzer-specific flags, and initializes within the Run if necessary.

https://go-review.googlesource.com/c/tools/+/410126

is an attempt to demonstrate this approach.

I think we can optimize the local source format further.

hyangah commented 2 years ago

Analyzer-specific flags that specify input files or state to sideload (e.g. --vulns-file in https://go-review.googlesource.com/c/tools/+/410126) allows analyzers to load state lazily as their run is invoked first time.

Still this doesn't solve the original problem -

Different package load mode
Non-default build tags & build flags controlled by go/packages.Config.BuildFlags. Users can still set this using GOFLAGS env var, but that's less accessible.
Include tests or not, controlled by go/packages.Config.BuildFlags

I agree that it's not ideal to expose the dependency on go/packages from singlechecker/multichecker's APIs, but given that singlechecker/multichecker didn't work without go/packages API for a while, I wonder if it's time to embrace go/packages in the singlechecker/multichecker (note: this issue is about singlechecker/multichecker APIs, not go/analysis package).

adonovan commented 2 years ago

I think it would be reasonable to expose an API for running analyzers on a whole program loaded using go/packages, that allows the client to parse flags, load the program, and analyze it under its control, and to process the results without necessarily printing them. The underlying code would be shared with singlechecker and multichecker. Some care is required to ensure that the program is loaded using a configuration that is acceptable to the analysis that follows. And the Analyzer / Pass API should never depend on go/packages.

It would also be ok to expose go/packages configuration options through the single/multichecker CLI, such as the -test flag you added today. But single/multichecker are really not intended to be used as anything other than the main function of a standalone program.

timothy-king commented 2 years ago

+1 to everything Alan said (except calling it "whole program"). I think it is reasonable to expose what is effectively a lower level API for Run() from a different package than {single,multi}checker. As an example a check.Run(, []analysis.Analyzers, []*packages.Package) could produce a ([]Facts, Result, []Diagnostics) for each provided top level package. Essentially generate the actions+passes needed from preload packages, execute those actions, and then expose the results of the directly requested passes. There are a number of details to debate for such an API so this not a real suggestion yet. But it would solve the problem I am seeing in https://go-review.googlesource.com/c/tools/+/410125/1 and https://go-review.googlesource.com/c/tools/+/410126. What this would likely need to give up is providing a modular mode that is compatible with go vet -vettool=....

timothy-king commented 2 years ago

FWIW a hack you could do shove this into a modular analyzer is to split into 2 processes that: 1) fetch the remote db, determine if an update happened compared to the last run (likely with manually managed files), and generate a flag value --timestamp representing the last update, and 2) run a {single,multi}checker with the externally computed --timestamp flag.

This should force the modular analysis machinery to discard all of the previous results. The resulting analyzer would be tricky to run from command line but this may not be an issue for you. This does not solve custom loading.

hyangah commented 2 years ago

I think exposing part of checker API addresses many issues that were closed without actually being addressed

https://github.com/golang/go/issues/30231 https://github.com/golang/go/issues/30219 https://github.com/golang/go/issues/31007 https://github.com/golang/go/issues/31897 https://github.com/golang/go/issues/50265 https://github.com/golang/go/issues/53063 ...

We've seen many lint-like tools came up with their own solutions and copying of checker package internals (which I think is totally fine, but would be nice if we can help people avoiding such copying when possible.)

Re: https://github.com/golang/go/issues/53215#issuecomment-1147893924 That is what https://go-review.googlesource.com/c/tools/+/410126 is already doing - except --timestamp is an implementation detail specific to the type of analyzers and I think it's out of scope.

ianlancetaylor commented 2 years ago

What is the current proposal? Thanks.

rsc commented 2 years ago

I suspect both of these should be deprecated in favor of a better package. I couldn't use them the last time I tried, and forking them was a pain due to use of internal packages.

gopherbot commented 2 years ago

Change https://go.dev/cl/411907 mentions this issue: go/analysis/checker: a go/packages-based driver library

adonovan commented 2 years ago

A sketch of a new API of components for building go/packages-based analysis drivers can be found in https://go.dev/cl/411907.

This CL contains an Analyze function that takes an extensible Input struct, initially listing packages and analyzers, and returns an extensible Output struct, initially containing the roots of the action graph. An action is a step of work done by the checker, identified by the pair (analyzer, package). The rest of the action graph can be found by chasing pointers, analogous to go/packages. Each action exposes its analyzer, package, error, result, diagnostics, and facts.

The API in that CL is still not sufficient to implement all of multichecker. We still need functions for:

(It is possible to test most analyzers today with analysistest, but it too is a monolith that could be broken into reusable chunks of functionality--to load packages, run the checkers, verify the expectations, verify the suggested fixes--while also allowing ad hoc test logic to access all of the result data structures that are currently encapsulated.)

I have not sketched the API for these four features yet. My initial instinct was to publish a set of functions similar in form to the existing ones, but @hyangah raised the possibility that several of these operations might usefully be considered instances of a general post-processing interface, perhaps similar to the one in golangci-lint, which is used for a range of tasks including:

filtering diagnostics:

skip diagnostics in generated files
adjust cgo file names and skip some fishy cgo diagnostics
skip diagnostics with //nolint annotations
limit the number of diagnostics from a given analyzer or file
deduplicate diagnostics
discard diagnostics of low severity
skip diagnostics in certain files or directories

transforming diagnostics:

add a prefix to each file name
retrofit quotation marks around %s portions of specific diagnostics that should have been %q
remove unnecessary prefix from each file name
sort
attach text of source line to each diagnostic

and side effects:

apply quick fixes

It's not obvious to me that we need the full generality of another extensible plug-in mechanism, but clearly the question warrants more thought.

At this stage, the main questions to decide are:

is the initial API sketched in the CL acceptable?
should we proceed with it, or first produce a complete proposal for the other features listed above?

zpavlinovic commented 2 years ago

FWIW, I recently stumbled on what I believe is a good use case of this: in a server environment, run vet checkers and store the results in a database. I believe this could be accomplished using this driver: use packages.Load to load the packages, run vet analyzers on it, and in the post-processing step save the results to the db.

adonovan commented 2 years ago

That does seem like a totally reasonable program to want to write, but the way I would write it today without the checker API is by fork+execing the multichecker executable and saving its JSON output. The overheads of process startup compared to everything else are negligible, and there are isolation benefits to using multiple processes.

timothy-king commented 2 years ago

but the way I would write it today without the checker API is by fork+execing the multichecker executable and saving its JSON output.

This is a fair point. But it does impose some complexity on users. Most likely in the form of bundling an extra executable. Not insurmountable, but comparatively inconvenient.

golang / go