github / codeql

CodeQL: the libraries and queries that power security researchers around the world, as well as code scanning in GitHub Advanced Security
https://codeql.github.com
MIT License
7.51k stars 1.49k forks source link

Horrible performance of Python PrintAst query on large databases #6964

Closed kjcolley7 closed 2 years ago

kjcolley7 commented 2 years ago

I'm working on a large Python database that has >10,000 Python source files. I'm using the VSCode extension to write and run queries as I'm developing them. I'd like to be able to use the AST Viewer. However, I have discovered (after hours of debugging) a horrible performance issue that results in this query attempting to compute global flow and points-to analysis, which shouldn't be required to simply print out the AST graph.

The reason this happens is due to the feature that displays the AST of regex strings. Here is the flow of references that causes this to happen:

printAst.ql:

import semmle.python.PrintAst

PrintAst.qll:

import semmle.python.RegexTreeView

RegexTreeView.qll:

private import semmle.python.regex

regex.qll:

/** A StrConst used as a regular expression */
class Regex extends RegexString {
  Regex() { used_as_regex(this, _) }
  //...
}
/**
 * Holds if `s` is used as a regex with the `re` module, with the regex-mode `mode` (if known).
 * If regex mode is not known, `mode` will be `"None"`.
 */
predicate used_as_regex(Expr s, string mode) {
  //...
  // Implementation uses DataFlow and the API graph
  //...
}

The impact of this issue is that attempting to view the AST of a small Python file doesn't complete running after more than 12 hours. By simply patching out the regex support from PrintAST.qll, the same query on the same database completes in under 1 minute.

I suggest simply removing the functionality for dumping the AST of regex strings. I feel that it is completely unnecessary, and it is certainly not worth the performance impact.

hmakholm commented 2 years ago

Thanks for the report. It has spawned some internal discussion about what exactly the AST viewer is for, and how best to support its usability for all the various stakeholders (including our own people who develop queries for regex vulnerabilities). There's no definite conclusion yet, so this is just to say that your complaint has been heard and we're working on it. :slightly_smiling_face:

kjcolley7 commented 2 years ago

Maybe users who want this functionality could enable it in Customizations.qll by extending an abstract class and overriding a predicate like predicate shouldPrintAstForRegexString(<string literal type>). Alternatively, it could be on by default but automatically disable when the database is large, maybe by counting the number of Python source files or lines of code.

tausbn commented 2 years ago

We discussed this internally in the CodeQL Python team, and have agreed that the best approach for now is to disable the printing of regex ASTs.

I'm a bit curious about your use-case, however. If you're unable to run most of our analyses on your database, then you're really missing out on most of the power of our libraries.

Does this database contain multiple copies of the same files? (e.g. one or more copies of the Python standard library.)

This is one case where our analysis gets a lot slower, since it may not be able to disambiguate these copies properly, leading to a lot of apparent cross-talk between them.

kjcolley7 commented 2 years ago

This large database was created during an initial test of setting up CodeQL on a monorepo. I have since created a reduced size database by setting the LGTM_INDEX_INCLUDE env var before invoking the Python extractor via the codeql database trace-command --index-traceless-dbs command. So in reality I'll probably use smaller database slices of that repo, or if I try the single database path again I'll do so on a beefy build machine to see if that can handle it.