github / codeql

CodeQL: the libraries and queries that power security researchers around the world, as well as code scanning in GitHub Advanced Security
https://codeql.github.com
MIT License
7.61k stars 1.52k forks source link

C++: Improve simple range analysis #4294

Closed gsingh93 closed 4 years ago

gsingh93 commented 4 years ago

I've implemented a number of SimpleRangeAnalysisExprs and I'd like to discuss if any of them could be merged upstream into the main library, or if they could be included as optional "extensions" that could be included in individual queries if needed. We've been running queries with these extensions on a number of big (internal) projects and they've proven useful.

The files below may be missing some included files or helper functions (some are here), they're just for reference.

Supporting more operators in more cases

I've improved what is supported for &, &=, <<, <<=, >>, >>=, \, and \=. (Sidenote: I'm wondering if we could support implementing only the operator and having the assignment version of that operator share that same implementation. I did this myself using this class, but it could be part of the library itself). I'm not going to discuss all the differences here, but as an example, the built in & operator only supports cases where both arguments are either unsigned or a non-negative constant. My changes support "constants" that come from variables, i.e. unsigned a = 7; 42 & a, as well as signed operands where the bounds are non-negative. The other operands "expand" the cases where we compute bounds similar to this.

Shift operators

For the shift operators, there are some interesting considerations around what to do in certain cases of undefined or implementation-defined behavior. I've noted some of these cases in the code, and it seems for the cases CodeQL supports we don't treat this as undefined behavior but do what most common compilers do. This is fine, but just worth calling out.

Division

For \, I chose to not consider division by zero in the bounds, as CodeQL only supports continuous ranges, so even if the code verifies zero isn't a possible value, it's quite common for zero to be in the range. And since this case isn't important unless you're looking for DoS (which we're not), we assume division by zero doesn't happen. This could be make configurable depending on the usecase.

Signed multiplication

I've also implemented support for signed multiplication for * and *=. This does have a significant performance hit, but that hit is acceptable in our use cases. It would be nice if an extension like this was included in the library and then could be imported into queries that were willing to take that performance hit. (Sidenote: I'm unsure why there's a performance hit here. Taking the max/min of four different values doesn't seem like something that should be slow).

std::min and std::max

We found implementing the resultant range for min/max was useful for reducing some FPs and not a huge performance hit: https://github.com/gsingh93/codeql/blob/b7d39f7aa41258927c1d7259f9ddd283554e8123/cpp/ql/src/experimental/semmle/code/cpp/rangeanalysis/rangeextensions/ContainerMethodRange.qll#L217-L261

Container ranges

Probably not something we can merge as-is, but might be worth a discussion. A common FP we've seen is str.size() + 1, or something like that with any method that returns a value relating to the container's size. In practice, this will never overflow on a 64-bit system, as no 64-bit system can support a container with this many elements. Since we run on 64-bit systems, it was helpful to limit the max size for many of these methods to some arbitrary number (we chose INT_MAX - 1). You can see these changes here. Maybe it's worth allowing users to customize this somehow?

I will point out though that std::array can be properly modeled since the size is fixed, the implementation from the above file could be merged upstream.

Handling basic guards

I haven't tested this out yet, but it would reduce real world FP's we've seen and @lcartey has provided some code to test out. The basic goal is to filter out issues like this:

// Example 1
uint32_t a = std::min(b, c);
sink(c - a); // No underflow, a is guaranteed to be <= c

// Example 2
if (n <= size_function()) {
    sink(size_function() - n);
}

The untested implementation to filter this out is here, but we can implement a proper range analysis expression if https://github.com/github/codeql/pull/4273 is merged.

Interprocedural analysis

I have some extensions here that do some rudimentary interprocedural analysis. These range analysis expressions propagate from callers to the function parameters, and also support assigning to reference parameters (thanks @lcartey). This extension propagates the return value of a function to the caller only if the return value is tainted by user controlled data. The restriction of the data being user controlled is to improve performance, but I don't think this should be hardcoded in a general implementation. An Extension class could be provided, and users could override methods there to customize when to propagate arguments or return values.

There's a significant performance hit from these, but we're running them on a number of large internal projects, they're very helpful, and we haven't seen any timeouts. This would be another example of something that could be not included by default, but enabled after including a particular file.

Displaying Ranges in Paths

Triaging path queries that rely on range analysis is much easier when you can see the ranges at each step of the path. Here I've extended DataFlow::ExprNode, DataFlow::ExplicitParameterNode, and DataFlow::DefinitionByReferenceNode to provide range information when viewing the path in vscode. This has made triaging issues much faster. This is something that wouldn't be enabled by default, but would be enabled after including a file.

jbj commented 4 years ago

Thanks for the detailed write-up. I'll try to address the individual items as well as the overall direction.

With a few exceptions that rarely come up in practice, the SimpleRangeAnalysis library is sound: values can never be outside the range that it reports. We have a couple of queries, most notably cpp/constant-comparison, where the lack of soundness translates directly into false positives. Maintaining soundness as well as performance requires extensive testing on both hand-written cases and real-world code. The worst-performing databases are often crypto libraries or image decoding libraries, where macros are used to unroll repetitive code into enormous expressions.

While I'd like to accept most of your contributions into SimpleRangeAnalysis, each one of them will require that extensive testing, review, and possibly reimplementation. We won't always have the resources to do that, and so I'm discussing with @lcartey how we can create a staging area for your contributions so they can get merged into there without waiting for us to validate every corner case.

Overall, we'd like to add support for languages features when the performance overhead of that support is proportional to the use of that language feature in a code base. Most of what you're proposing falls into that category: binary operators on unsigned values, std::min and std::max, container ranges, and possibly "Handling basic guards". The main friction here is that our existing queries are working well without this support, so it's hard for me to prioritise the work that must go into testing.

Some of the other features you're proposing will inevitably make range analysis too slow to be feasible on certain codebases. That includes signed multiplication, signed division, and interprocedural analysis. The Achilles heel of SimpleRangeAnalysis is how loops lead to an explosion in the number of intermediate bounds, and this explosion is made worse when a signed operation produces four different bounds. When many signed operations are chained, it can lead a combinatorial explosion.

In the near term, I propose that I'll create a experimental.semmle.code.cpp.ExtendedRangeAnalysis module that imports SimpleRangeAnalysis along with those of your extension modules that you think are ready. Because it's in the experimental directory, we can evolve the library without maintaining strict backwards compatibility. Over time, some of the extension modules can migrate from the experimental libraries into the core of SimpleRangeAnalysis. This can be driven by you or by us, but it requires writing tests and measuring performance.

In the long term, I hope we can improve the IR-based range analysis so much that it can become the default range analysis. There are still many unknowns in this endeavour, but the IR-based range analysis has two core advantages: loops don't lead to an explosion of (inexact) bounds, and ranges can be determined relative to other variables (x < y) rather than just constants (x < 10).

jbj commented 4 years ago

Sidenote: I'm wondering if we could support implementing only the operator and having the assignment version of that operator share that same implementation.

That'll come for free in the IR range analysis and all other IR-based libraries. In the IR, all of +, += and prefix/postfix ++ all get translated to an AddInstruction surrounded by various loads and stores.

gsingh93 commented 4 years ago

Thanks for the response! I think creating an experimental.semmle.code.cpp.ExtendedRangeAnalysis module is fine for now. I'd be happy to submit proper pull requests with the extension modules and tests that could then go through review. I don't think I'm comfortable enough with QL to modify and test the core range analysis library, so I'm happy enough with this compromise for now until someone can get the non-performance intensive modules integrated properly.

jbj commented 4 years ago

I've opened https://github.com/github/codeql/pull/4332 to create ExtendedRangeAnalysis.

jbj commented 4 years ago

Now that #4332 is merged, we're open for PRs that add new modules to be imported into it. I've made an example module as part of #4332, to show where tests should go.

jbj commented 4 years ago

I'll close this issue and let further work happen in separate per-feature issues and PRs. @gsingh93, I invite you to submit PRs that create extension modules and import them from ExtendedRangeAnalysis.qll.