This change makes several tactical adjustments to the current threading model in order to maximize CPU utilization and limit memory consumption, while preserving the deterministic output required by certain current client tool users.
@kelewis has pointed out that building analysis around a limited set of flattened records, designed to enable efficient results streaming, would be much more efficient. This seems like the right approach to me; I am only uncertain about how we would retrofit determinism for those who need it. We could create a post-processing sorting API, or we could design a new binary format for serialized records, optimized for reconstruction into sorted SARIF. Looking forward to the conversation.
As it turns out, our runaway memory usage was due to a simple bug: the context-creation channel wasn't throttled on write. Oops.
Tactical changes include:
Throttled context creation to, at most, 25k objects. I will make this limit configurable in the short term.
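This is the same mechanism implicated in the bug above: a bounded channel makes writers wait once the cap is reached. A minimal sketch, assuming `System.Threading.Channels`; the `AnalyzeContext` and `ContextProducer` names are illustrative stand-ins, not the driver's actual types:

```csharp
using System.Threading.Channels;
using System.Threading.Tasks;

// Illustrative stand-in for the driver's per-file context object.
public record AnalyzeContext(string FilePath);

public class ContextProducer
{
    // A bounded channel caps in-flight contexts at 25,000, matching the
    // limit described above (the capacity will become configurable).
    private readonly Channel<AnalyzeContext> _channel =
        Channel.CreateBounded<AnalyzeContext>(
            new BoundedChannelOptions(25_000)
            {
                // Writers await when the channel is full, throttling
                // context creation rather than letting it run unbounded.
                FullMode = BoundedChannelFullMode.Wait
            });

    public async Task EnqueueAsync(string filePath) =>
        await _channel.Writer.WriteAsync(new AnalyzeContext(filePath));
}
```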
Converted failure level and result kind to immutable sets rather than list instances. We now prefer constant values for the defaults. These allocations were unnecessary and cluttered profiling results, so they were good to clear.
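A minimal sketch of the pattern, assuming the SDK's `FailureLevel` and `ResultKind` enums; the containing class, field names, and the specific default values shown are illustrative assumptions:

```csharp
using System.Collections.Immutable;
using Microsoft.CodeAnalysis.Sarif;

internal static class AnalysisDefaults
{
    // A single shared immutable instance replaces a fresh List<T>
    // allocation on every use.
    public static readonly ImmutableHashSet<FailureLevel> FailureLevels =
        ImmutableHashSet.Create(FailureLevel.Error, FailureLevel.Warning);

    public static readonly ImmutableHashSet<ResultKind> ResultKinds =
        ImmutableHashSet.Create(ResultKind.Fail);
}
```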
Added a simple measure of working set consumed at analysis time.
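One straightforward way to capture this, using `System.Diagnostics.Process` (where exactly the driver samples and reports the value is a detail not shown here):

```csharp
using System;
using System.Diagnostics;

// Sample the process's peak working set when analysis completes.
long peakWorkingSetMB =
    Process.GetCurrentProcess().PeakWorkingSet64 / (1024 * 1024);
Console.WriteLine($"Peak working set: {peakWorkingSetMB} MB");
```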
Eliminated HashUtilities.FileSystem, an unnecessary duplication of the FileSystem.Instance test knob. Taking this change required adding an IFileSystem instance to the relevant APIs (a non-breaking change).
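The non-breaking shape is an optional parameter that defaults to the production instance. A sketch, with minimal local stand-ins for the SDK's `IFileSystem`/`FileSystem` types so it compiles on its own; the `ComputeSha256` method is illustrative:

```csharp
using System.IO;
using System.Security.Cryptography;

// Minimal stand-ins for the SDK's file system test knob.
public interface IFileSystem { Stream FileOpenRead(string path); }

public class FileSystem : IFileSystem
{
    public static readonly FileSystem Instance = new FileSystem();
    public Stream FileOpenRead(string path) => File.OpenRead(path);
}

public static class HashUtilities
{
    // Optional parameter keeps the change non-breaking: existing callers
    // silently get FileSystem.Instance; tests inject a fake.
    public static byte[] ComputeSha256(string path, IFileSystem fileSystem = null)
    {
        fileSystem = fileSystem ?? FileSystem.Instance;
        using (Stream stream = fileSystem.FileOpenRead(path))
        using (var sha = SHA256.Create())
        {
            return sha.ComputeHash(stream);
        }
    }
}
```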
Lowered the individual 'file skipped due to size' message to a note. Generally, we have a couple of issues with diagnostics reporting: these values can't be individually controlled via the failure levels, and they likely should be separated from them entirely, as trace values. Trace messages arguably should always be emitted when configured. Would love to discuss this further with the team.
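For illustration, the change amounts to emitting the notification at `Note` level rather than `Warning`; a sketch assuming the SDK's `Notification`, `Message`, and `FailureLevel` types (the message text and file path are illustrative):

```csharp
using Microsoft.CodeAnalysis.Sarif;

string filePath = "example-large-file.bin";

// The per-file skip message now goes out at Note, not Warning.
var notification = new Notification
{
    Level = FailureLevel.Note,
    Message = new Message
    {
        Text = $"'{filePath}' was skipped because it exceeds the size limit."
    }
};
```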
Updated the 'one or more files skipped' warning to include a count of the files skipped. Also moved this warning to the end of analysis (rather than embedding it in the middle of analysis results, at the point when file enumeration completes).
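A sketch of the mechanics (names are illustrative): a thread-safe counter accumulates skips during scanning, and a single summary is emitted after analysis completes:

```csharp
using System;
using System.Threading;

public class SkippedFileTracker
{
    private long _filesSkipped;

    // Called from any scanner thread when a file is skipped.
    public void RecordSkip() => Interlocked.Increment(ref _filesSkipped);

    // Called once, after all analysis has completed.
    public void EmitSummary()
    {
        long count = Interlocked.Read(ref _filesSkipped);
        if (count > 0)
        {
            Console.WriteLine($"{count} file(s) were skipped due to size.");
        }
    }
}
```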
This change aggressively flushes SARIF JSON to disk, so that a concentration of results doesn't increase memory consumption. The writing occurs on a single thread, and the incremental loss of IO efficiency is unlikely to be observable, as it's amortized alongside the multithreaded scanning.
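A sketch of the flush-as-you-go pattern, assuming Newtonsoft.Json (the writer shape and names are illustrative, not the driver's actual implementation):

```csharp
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

static void WriteResults(IEnumerable<object> results, string logPath)
{
    using (var streamWriter = new StreamWriter(logPath))
    using (var jsonWriter = new JsonTextWriter(streamWriter))
    {
        var serializer = new JsonSerializer();

        jsonWriter.WriteStartArray();
        foreach (object result in results)
        {
            serializer.Serialize(jsonWriter, result);

            // Push buffered JSON to disk after every result, bounding
            // memory at a small, amortized IO cost.
            jsonWriter.Flush();
        }
        jsonWriter.WriteEndArray();
    }
}
```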
We also no longer take an exclusive lock on the SARIF log. This allows users to open an in-progress log, or to use a tool like tail.exe to watch output as it's generated.
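Concretely, this is the difference between opening the log with `FileShare.None` and `FileShare.Read`; a sketch (the path is illustrative):

```csharp
using System.IO;

// FileShare.Read lets other processes (editors, tail.exe) read the log
// while analysis is still writing it; FileShare.None would lock them out.
var logStream = new FileStream(
    "analysis.sarif",
    FileMode.Create,
    FileAccess.Write,
    FileShare.Read);
```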
I have disabled the caching/replay of file results by hash. This functionality isn't of general value, and it definitely shouldn't be tied to the configuration that emits file hashes for files referenced in SARIF. This capability will therefore come back behind a new config knob. In the meantime, we've dropped the hashing channel, which helps lower memory usage.