Handle large (>4GB) SARIF results files

aeisenberg commented 3 years ago

When trying to open the interpreted results of a query run that has produced a SARIF results file larger than 4GB, we get an error like this:

[2021-01-28 18:21:22] CSV_IMB_QUERIES: Query,edges#query#ffffffffffffff nodes#query#fffffffff #select#query#ffffffffffffffffffffff,padlockws2-2.ql,26,Success,291.651,407918,291939
Exception during results interpretation: Reading output of interpretation failed: RangeError [ERR_FS_FILE_TOO_LARGE]: File size (6638382197) is greater than possible Buffer: 4294967295 bytes. Will show raw results instead.

Node limits the size of strings and buffers to 4294967295 bytes, even on machines that have enough RAM to support more.
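As a rough illustration of that limit, the sketch below checks a file size against the limits Node reports at runtime. `buffer.constants.MAX_LENGTH` and `buffer.constants.MAX_STRING_LENGTH` are real Node constants, but their values differ between Node versions and platforms, so the 4294967295 figure from the error above should not be assumed everywhere. The helper names `fitsInSingleRead` and `canReadWholeFile` are hypothetical and are not part of the extension.

```typescript
import { constants } from "buffer";
import { statSync } from "fs";

// Conservative check (hypothetical helper): a UTF-8 file never decodes to more
// UTF-16 code units than it has bytes, so comparing the byte size against both
// limits errs on the safe side.
export function fitsInSingleRead(sizeInBytes: number): boolean {
  return (
    sizeInBytes <= constants.MAX_LENGTH &&
    sizeInBytes <= constants.MAX_STRING_LENGTH
  );
}

// Hypothetical helper: decide up front whether fs.readFile + JSON.parse can work.
export function canReadWholeFile(path: string): boolean {
  return fitsInSingleRead(statSync(path).size);
}

console.log(`Largest Buffer: ${constants.MAX_LENGTH} bytes`);
console.log(`Largest string: ${constants.MAX_STRING_LENGTH} UTF-16 code units`);
console.log(`6638382197-byte file fits: ${fitsInSingleRead(6638382197)}`);
```

The string limit is usually the tighter of the two, which is why decoding the file into one string can fail even when a Buffer of that size could be allocated.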

The parsed version of the SARIF results could fit in memory, even if the string cannot. It's possible that a streaming JSON parser, like JSONStream, could work, but I need to explore this library in more detail and make sure it is safe and stable before we can use it.

I don't think it is a good idea to roll our own streaming parser if there is a suitable OSS one available, since there would be a fair amount of work involved and getting the edge cases right is tricky.

Suggested breakdown:

[ ] Get an example (from the team) of a large SARIF file
[ ] Add JSONStream as a dependency
[ ] Use JSONStream when reading the SARIF file produced by results interpretation
- [ ] either do this unconditionally
- [ ] or use it as a fallback only when we hit the RangeError
[ ] Ensure we have tests for both regular and large SARIF files
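
The "fallback only when we hit the RangeError" option from the breakdown could look roughly like the sketch below. It assumes the streaming reader is supplied by the caller (whichever streaming library ends up being chosen), so `readSarif`, `readWhole`, and the `Reader` type are hypothetical names. `ERR_FS_FILE_TOO_LARGE` is the code from the error above; `ERR_STRING_TOO_LONG` is a second Node error code added here on the assumption that files just under the Buffer limit could fail at the string-decoding step instead.

```typescript
import { promises as fs } from "fs";

// A reader takes a path and resolves to the parsed SARIF object.
type Reader = (path: string) => Promise<unknown>;

// Regular path: read the whole file into one string, then parse it.
async function readWhole(path: string): Promise<unknown> {
  return JSON.parse(await fs.readFile(path, "utf8"));
}

// True for the size-related failures that should trigger the fallback.
function isTooLargeError(e: unknown): boolean {
  const code = (e as { code?: string } | null | undefined)?.code;
  return code === "ERR_FS_FILE_TOO_LARGE" || code === "ERR_STRING_TOO_LONG";
}

// Try the regular reader first; switch to the streaming reader only when the
// file is too large. Any other error (bad JSON, missing file) is rethrown.
export async function readSarif(
  path: string,
  streamingReader: Reader,
  wholeReader: Reader = readWhole,
): Promise<unknown> {
  try {
    return await wholeReader(path);
  } catch (e) {
    if (!isTooLargeError(e)) {
      throw e;
    }
    return streamingReader(path);
  }
}
```

Taking the readers as parameters keeps the fallback logic testable without a multi-gigabyte file: a test can pass a fake reader that throws an error carrying the `ERR_FS_FILE_TOO_LARGE` code.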

aeisenberg commented 2 years ago

Alternative streaming library is https://www.npmjs.com/package/JSONStream.

After a quick look, JSONStream seems to be better at grabbing pieces of the JSON through a regex-like expression, while stream-json uses SAX-like events and may be better at creating the entire SARIF as a JavaScript object. To handle large SARIF, we will need to read in the entire file, so JSONStream may be more appropriate.
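
To make the "SAX-like events" idea concrete, here is a minimal sketch of assembling a complete object from a stream of parser tokens, without ever holding the source text as one string. The `Token` type is a simplified stand-in whose names are modeled loosely on the token names stream-json emits; it is an assumption for illustration, not that library's actual API, and `JsonAssembler` and `assemble` are hypothetical names.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };
type Container = Json[] | { [key: string]: Json };

// Simplified SAX-like tokens (assumed shape, for illustration only).
type Token =
  | { name: "startObject" | "endObject" | "startArray" | "endArray" }
  | { name: "nullValue" | "trueValue" | "falseValue" }
  | { name: "keyValue" | "stringValue" | "numberValue"; value: string };

class JsonAssembler {
  private readonly stack: Container[] = []; // containers that are still open
  private pendingKey: string | undefined;   // key waiting for its value
  public result: Json | undefined;

  public consume(token: Token): void {
    switch (token.name) {
      case "startObject": this.open({}); break;
      case "startArray": this.open([]); break;
      case "endObject":
      case "endArray": this.stack.pop(); break;
      case "keyValue": this.pendingKey = token.value; break;
      case "stringValue": this.add(token.value); break;
      case "numberValue": this.add(Number(token.value)); break;
      case "nullValue": this.add(null); break;
      case "trueValue": this.add(true); break;
      case "falseValue": this.add(false); break;
    }
  }

  // Attach a new container to its parent, then make it the current container.
  private open(container: Container): void {
    this.add(container);
    this.stack.push(container);
  }

  // Attach a value to the innermost open container, or record it as the root.
  private add(value: Json): void {
    const top = this.stack[this.stack.length - 1];
    if (top === undefined) {
      this.result = value;
    } else if (Array.isArray(top)) {
      top.push(value);
    } else {
      if (this.pendingKey === undefined) {
        throw new Error("Value token without a preceding key");
      }
      top[this.pendingKey] = value;
      this.pendingKey = undefined;
    }
  }
}

export function assemble(tokens: readonly Token[]): Json | undefined {
  const assembler = new JsonAssembler();
  for (const token of tokens) {
    assembler.consume(token);
  }
  return assembler.result;
}

// Tokens for a skeleton SARIF log with one run and no results.
export const sampleTokens: Token[] = [
  { name: "startObject" },
  { name: "keyValue", value: "version" },
  { name: "stringValue", value: "2.1.0" },
  { name: "keyValue", value: "runs" },
  { name: "startArray" },
  { name: "startObject" },
  { name: "keyValue", value: "results" },
  { name: "startArray" },
  { name: "endArray" },
  { name: "endObject" },
  { name: "endArray" },
  { name: "endObject" },
];
console.log(JSON.stringify(assemble(sampleTokens)));
```

Only the parsed object and a small stack of open containers are kept in memory, which is why this style can work when the raw text exceeds the string limit but the parsed result still fits.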

aeisenberg commented 2 years ago

Ensure we have tests for both regular and large SARIF files

Note that creating a test for a large SARIF file will require either generating one on the fly or downloading it from a known location. We cannot check a 4GB file into the repository. It might be nicer to generate the file rather than download it, to save bandwidth during local development.

Also, note that 4GB is not a hard limit; the size limit is probably more related to the memory available in the local environment.

adityasharad commented 2 years ago

Agree with generating it on the fly. A SARIF file with excessively long path explanations should do the trick.
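
One way to generate such a file on the fly is sketched below. It streams a synthetic SARIF log to disk, repeating one result whose code flow has many steps, so the generator itself never holds the whole file in memory. The function names, rule ID, and file URI are invented for illustration, and the SARIF shape is a minimal assumption rather than something validated against the schema.

```typescript
import { createWriteStream, mkdtempSync, readFileSync } from "fs";
import { once } from "events";
import { tmpdir } from "os";
import { join } from "path";

// Writes `resultCount` copies of a result whose path explanation has
// `stepsPerPath` steps. Raise both numbers to reach multi-gigabyte sizes.
export async function writeLargeSarif(
  path: string,
  resultCount: number,
  stepsPerPath: number,
): Promise<void> {
  const out = createWriteStream(path, { encoding: "utf8" });
  // Respect backpressure so pending chunks do not pile up in memory.
  const write = async (chunk: string): Promise<void> => {
    if (!out.write(chunk)) {
      await once(out, "drain");
    }
  };

  const step = {
    location: {
      physicalLocation: {
        artifactLocation: { uri: "src/example.js" }, // placeholder URI
        region: { startLine: 1 },
      },
    },
  };
  const result = JSON.stringify({
    ruleId: "example/synthetic-rule", // placeholder rule ID
    message: { text: "Synthetic result" },
    codeFlows: [{ threadFlows: [{ locations: new Array(stepsPerPath).fill(step) }] }],
  });

  await write('{"version":"2.1.0","runs":[{"tool":{"driver":{"name":"synthetic"}},"results":[');
  for (let i = 0; i < resultCount; i++) {
    await write(i === 0 ? result : "," + result);
  }
  await write("]}]}");
  out.end();
  await once(out, "finish");
}

// Sanity check with a tiny file: generate it, parse it, count the results.
export async function demo(): Promise<number> {
  const file = join(mkdtempSync(join(tmpdir(), "sarif-")), "large.sarif");
  await writeLargeSarif(file, 3, 2);
  return JSON.parse(readFileSync(file, "utf8")).runs[0].results.length;
}
```

Each serialized result is still built as one string, so `stepsPerPath` should stay small enough for a single result to fit comfortably in memory; the overall file size is better scaled through `resultCount`.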

edoardopirovano commented 2 years ago

I think a better way to test this might be to generate a relatively small SARIF file but somehow constrain the memory available to the test so that we would crash if we weren't streaming it.
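
A possible way to constrain memory, sketched under assumptions: run the code under test in a child Node process started with V8's `--max-old-space-size` flag, which caps the JavaScript heap in megabytes. The helper name `runWithHeapLimit` is hypothetical. One caveat to verify before relying on this: Buffers are allocated outside the V8 heap, so the cap mainly affects strings and parsed objects rather than the raw bytes read from disk.

```typescript
import { spawnSync } from "child_process";

// Runs a JavaScript snippet in a child Node process with a capped heap.
// A non-streaming read of a file larger than the cap would be expected to
// crash the child, while a streaming read should complete.
export function runWithHeapLimit(script: string, maxOldSpaceMb: number) {
  return spawnSync(
    process.execPath,
    [`--max-old-space-size=${maxOldSpaceMb}`, "-e", script],
    { encoding: "utf8" },
  );
}

// Probe: ask the child which heap limit it actually sees.
export const probe = runWithHeapLimit(
  "console.log(require('v8').getHeapStatistics().heap_size_limit)",
  64,
);
console.log(`exit status ${probe.status}, heap limit ${probe.stdout.trim()} bytes`);
```

A test built on this would assert a zero exit status for the streaming code path; how an out-of-memory crash surfaces (exit code or signal) varies by platform, so checking for "not a clean exit" is safer than matching a specific code.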

marcnjaramillo commented 2 years ago

The repo for JSONStream is archived. Should we still use it?

aeisenberg commented 2 years ago

Hmmm...I hadn't seen that. Thanks for pointing it out. We don't want to use an archived project. There are other alternatives.

How about this library? https://www.npmjs.com/package/stream-json I haven't explored it too much, but on a superficial level at least, it looks promising.

marcnjaramillo commented 2 years ago

Perhaps it was only recently archived?

I'll take a look at this one. Thanks!

JacquesLeRoux commented 2 years ago

Hi,

I'm not sure how to handle that: https://github.com/apache/ofbiz-framework/actions/runs/1402011029. I mean how to reduce the file.

I also have a problem generating a SARIF file using SpotBugs on the command line. So I'm somewhat stuck :/

Any other ideas how to generate a SARIF file?

TIA

adityasharad commented 2 years ago

@JacquesLeRoux I think you are facing a different problem -- this repo and issue involve the CodeQL extension for VS Code, but I believe your issue is with uploading CodeQL results to GitHub's code scanning service.

So that we can help you debug why the SARIF file produced by your CodeQL workflow is so large, please:

rerun your workflow with the following flag

- uses: github/codeql-action@v1
  with:
    debug: true

create an issue in https://github.com/github/codeql-action with a link to the workflow run

For non-CodeQL tools, see https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/uploading-a-sarif-file-to-github for information on uploading a SARIF file to GitHub.

JacquesLeRoux commented 2 years ago

Hi @adityasharad ,

Got this: https://github.com/apache/ofbiz-framework/actions/runs/1466549612

What do you suggest? TIA

aeisenberg commented 2 years ago

Hi @JacquesLeRoux,

It looks like you are not referencing the correct path to the action. It should be:

uses: github/codeql-action/analyze@v1

But the workflow file has:

uses: github/codeql-action@v1

The uses directive accepts an action specification in the following syntax:

org/repo/path@ref

Here, org is a GitHub organization or username, repo is the name of the repo to get the action from, path is the file path in that repository to the folder containing the action.yml file, and ref is a git ref or SHA to check out to get that action.
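
The syntax above can be illustrated with a small parser. This is only a sketch of the `org/repo/path@ref` form described here; `parseUses` and `ActionRef` are hypothetical names, and other forms that `uses` accepts (such as local `./` paths or `docker://` images) are deliberately not handled.

```typescript
interface ActionRef {
  org: string;
  repo: string;
  path: string; // empty when the action.yml sits at the repository root
  ref: string;
}

export function parseUses(uses: string): ActionRef {
  const at = uses.lastIndexOf("@");
  if (at < 0) {
    throw new Error(`Missing @ref in "${uses}"`);
  }
  const [org, repo, ...pathParts] = uses.slice(0, at).split("/");
  if (!org || !repo) {
    throw new Error(`Expected org/repo in "${uses}"`);
  }
  return { org, repo, path: pathParts.join("/"), ref: uses.slice(at + 1) };
}

// The two specifications discussed in this thread:
console.log(parseUses("github/codeql-action/analyze@v1"));
console.log(parseUses("github/codeql-action@v1"));
```

The second call yields an empty path, which points at the repository root rather than the `analyze` folder where that action's action.yml lives.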

JacquesLeRoux commented 2 years ago

Hi @aeisenberg ,

Yes I followed Aditya's comment above. I must say I'm quite new to CodeQL and even YAML.

Sorry but it did not work either: https://github.com/apache/ofbiz-framework/actions/runs/1467782229

What could be missing? TIA

aeisenberg commented 2 years ago

Ah...apologies for that. The debug option should be added to the init action, at this line: https://github.com/apache/ofbiz-framework/blob/72f86558575a54e5c16748756f7e0a1b7cd82da8/.github/workflows/codeql-analysis.yml#L67 (add debug: true there and remove it from the analyze step).

This will not fix the problem you are seeing, but it will retain information about your run that will help us triage the problem.

As @adityasharad suggests, after you do this, please create an issue in https://github.com/github/codeql-action with a link to the workflow run.

JacquesLeRoux commented 2 years ago

Thanks Andrew,

Checking that...

JacquesLeRoux commented 2 years ago

In case it helps someone else, here is the workflow run with the (2.77 GB!) log inside: https://github.com/apache/ofbiz-framework/actions/runs/1470420767

github / vscode-codeql

Handle large (>4GB) SARIF results files #735