github / codeql

CodeQL: the libraries and queries that power security researchers around the world, as well as code scanning in GitHub Advanced Security
https://codeql.github.com
MIT License

Python/JS: Running CodeQL CLI against large datasets #9675

Open alech97 opened 2 years ago

alech97 commented 2 years ago

Description of the issue

I'm currently trying to use CodeQL CLI to run a fixed set of queries against a very large number of individual JavaScript and Python files. These files are completely independent of each other, and any dependency linkage between the individual files would be a false positive.

Currently, I am splitting these files into folders of a fixed size and running codeql database create and codeql database analyze against each folder. But this process is very inefficient, and many of the steps seem redundant.

Question

How can I make this process as efficient as possible? For example, is there a way to clear a dataset and repopulate it with new files?

smowton commented 2 years ago

There definitely is some redundancy in arbitrary batching, mainly in extracting information relating to the standard library and common dependencies, but the way to overcome this would usually be to extract the files as one big project.

So what happens if you make the batch sizes bigger? Do you in fact see query results that depend on spurious dependencies between supposedly independent files, and if so, what do those spurious results look like?

alech97 commented 2 years ago

I've done some profiling, and it seems that I get the best performance at ~4.2k files per batch. If I go much higher, it starts to approach memory errors.

I'm not seeing many spurious dependencies in the results, which is good.
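
For anyone landing here with the same setup, the batching workflow discussed in this thread could be sketched roughly as below. This is only a minimal sketch under assumptions, not a recommended or official approach: the helper names (batches, codeql_commands, run_batch), the folder and database paths, and the query suite name are all hypothetical, and it assumes each batch of files has already been copied into its own source folder. The only CLI pieces it relies on are codeql database create with --language and --source-root, and codeql database analyze with --format and --output.

```python
import subprocess
from pathlib import Path
from typing import Iterator, List, Sequence


def batches(files: Sequence[Path], size: int) -> Iterator[Sequence[Path]]:
    """Yield consecutive slices of at most `size` files."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(files), size):
        yield files[start:start + size]


def codeql_commands(source_root: Path, db_dir: Path, language: str,
                    queries: str, sarif_out: Path) -> List[List[str]]:
    """Build the create and analyze invocations for one batch folder.

    `queries` is a hypothetical query suite or pack name; substitute
    whatever fixed set of queries is being run.
    """
    create = [
        "codeql", "database", "create", str(db_dir),
        f"--language={language}",
        f"--source-root={source_root}",
    ]
    analyze = [
        "codeql", "database", "analyze", str(db_dir), queries,
        "--format=sarif-latest",
        f"--output={sarif_out}",
    ]
    return [create, analyze]


def run_batch(source_root: Path, db_dir: Path, language: str,
              queries: str, sarif_out: Path) -> None:
    """Run create, then analyze, stopping on the first failing command."""
    for cmd in codeql_commands(source_root, db_dir, language,
                               queries, sarif_out):
        subprocess.run(cmd, check=True)
```

With roughly 4.2k files per batch, as reported above, list(batches(all_files, 4200)) would give the groups to stage into folders, and run_batch would then be called once per folder. Keeping command construction separate from execution makes it easy to inspect or log the exact invocations before running them.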