presidentbeef commented 5 years ago

Background

In this document, "incremental scan" means "scanning a subset of files with accuracy comparable to a full scan".

Brakeman performs (essentially) a "full program" or "global" analysis. This means it uses information about the entire application when reporting warnings. This is in contrast to most "linters" which tend to do analysis at the file level (or even more granularly) and can therefore easily analyze one file at a time without negatively affecting results.

Brakeman also needs to know the root of the Rails application. This is because in Rails directories carry semantic information (e.g., app/views/blah/ is used by app/controllers/blah_controller.rb).

Along with that, there is a lot of Brakeman behavior based on the version of Rails in use. This information is most often contained in e.g. Gemfile but can also be based on directory structure if no Gemfile or equivalent is present.

Rescanning

Since 2012, Brakeman has been able to do incremental rescans. However, this required Brakeman to keep all program data in memory. The only tool I'm aware of which can take advantage of this is guard-brakeman.

This means all the logic of conditionally rescanning based on a set of changed files has been around for quite some time (although it is not especially well-tested).

When rescanning, only the changed files are re-parsed. Based on which file changed, Brakeman will redo dataflow analysis on different parts of the application. After that, all checks are re-run and the results are diffed against the previous scan.

`--only-files`

The --only-files option was added as a complement to --skip-files. The intent was to make it easier to avoid parsing/scanning files unrelated to the analysis.

Unfortunately, what has happened is that people try to use --only-files to scan single files or small sets of files. Limiting Brakeman's view this way often leads to unsatisfying and misleading results.

See #580, #1267, #1366, https://github.com/w0rp/ale/pull/2182, etc.

Use Cases

The main use cases for incremental scans are IDE plugins and Git commit hooks.

I am aware of:

A secondary use case would be incremental scans in CI.

Any others?

Requirements

Naturally, the impetus for incremental scans is speed. The slowest parts of a Brakeman scan are reading/parsing files and the dataflow analysis.

As I understand it, folks want the following behavior:

Do minimal work to scan a small set of files or a single file
Avoid false negatives/positives caused by not doing global analysis
Only report findings in the set of files provided

Proposal

My proposal is the serialize the information needed for rescans as they are currently implemented. This information can then be loaded later and the regular rescan can run. The work for this was largely already done a long time ago.

The logical steps would look like this:

Perform a full scan
Serialize information to a file
On subsequent scans:
- Load serialized information
- Rescan only given set of files
- Update and dump the serialized information

Implementation

Here's a proposed API:

brakeman --incremental YOUR_DUMP_FILE --changed-files FILE1,FILE2,FILE3

The files will need to be either absolute paths or relative to the root of the Rails application.

Open Questions

What should the incremental scan return? a. Should it always filter results to the input set of files or should that be a separate option?
Where should the dumped file live? a. Should there be a default directory?
How should the dumped file be protected? a. How/when should the file be cleaned up?
Are there other use cases not covered here?

geoffharcourt commented 5 years ago

@presidentbeef this looks like it could be a great fit for the issues we've been discussing in worp/ale#2182. For the ALE case, this would likely mean almost all of the code from that PR could be removed.

The one issue that caused me to aggressively use --only-files was the point at which problems were reported (namely the file that is flagged as the problem) rather than where the call site of the problem lies. I realize that this can make sense when you're looking at a holistic report, but it is confusing and noisy when in the scope of a single file in an editor buffer. Incremental scans would certainly help the use case of speed in a linter for a text editor, but I'm not sure it resolves this issue.

The most common example we found was unescaped user input being flagged in the view or controller when the point at which the problem is created (and therefore fixable) is the other file. This results in the linter showing a problem as an issue in the current working file, but the fix for these issues is something that needs to be implemented somewhere else.

presidentbeef commented 5 years ago

The most common example we found was unescaped user input being flagged in the view or controller when the point at which the problem is created (and therefore fixable) is the other file

Do you have an example? I suspect this may come down to some personal preference and individual codebases. I can't think of a general solution that's always going to point folks to the code location they want to see in order to understand and fix the vulnerability.

As far as I know, it's pretty standard to point to the use of a dangerous value (the sink). Since "contextually escape on output/use" is also the consensus opinion of the security community, I think that makes the most sense.

Since Brakeman doesn't track sources (in the free version...), the only other information available is the "render path" from controller->view. This information is included in the JSON (and HTML) report but may or may not be relevant.

w0rp commented 5 years ago

I don't actually know too much about Ruby, but it sounds like support for incrementally scanning individual files would be ideal for ALE.

presidentbeef / brakeman

Feedback request: Incremental scans #1368