CPP: Flow Into Barrier Guards

bdrodes commented 2 years ago

I'm looking for general assistance on how to properly use codeql with barrier guards when the guard condition may not be computed in the guard itself, but instead also data traces into a guard:

Consider these two cases, where use(x) is a sink, but if it is guarded by MyGuard(x) it is considered safe. The first case is the easy case where BarrierGuard (or SanitizerGuard) can be used fairly easily, and the latter case seems to require yet another dataflow:

//Easy case
if(MyGuard(x))
    use(x);

//harder case
res = MyGuard(x)
if(res)
   use(x);

For the latter/harder case, what is the CodeQL paradigm to associate the guard to x such that I know use(x) is safe? Seems to me, out of the box, there is no flow from x to res. I could add an additional step, in a dataflow/taint analysis, but that would still not let me define barrier guard check conditions to associate that value with the originating variable. Perhaps there is a way to pass around misc. metadata too that will feed into a BarrierGuard check in these cases?

MathiasVP commented 2 years ago

Hi @bdrodes,

There are a couple of ways to handle this. Sadly, the BarrierGuard doesn't always do what you want in C++. Here's an example of how I'd do it using isBarrier instead:

override predicate isBarrier(Node node) {
  exists(GuardCondition guard |
    // The node is only reached when the value of `guard` is true.
    guard.controls(node.asExpr().getBasicBlock(), true) and
    // and the value of the guard will always evaluate to a value given by a call to `MyGuard`.
    globalValueNumber(guard).getAnExpr().(Call).getTarget().hasName("MyGuard")
  )
}

Note that this uses a couple of libraries that you need to import explicitly:

import semmle.code.cpp.controlflow.Guards
import semmle.code.cpp.valuenumbering.GlobalValueNumbering

For a full example, this query:

/**
 * @kind path-problem
 */

import cpp
import semmle.code.cpp.dataflow.DataFlow::DataFlow
import semmle.code.cpp.controlflow.Guards
import semmle.code.cpp.valuenumbering.GlobalValueNumbering
import PathGraph

class Conf extends Configuration {
  Conf() { this = "Conf" }

  override predicate isSource(Node source) { source.asExpr().(Call).getTarget().hasName("source") }

  override predicate isSink(Node sink) {
    exists(Call c | c.getTarget().hasName("use") | c.getAnArgument() = sink.asExpr())
  }

  override predicate isBarrier(Node node) {
    exists(GuardCondition guard |
      guard.controls(node.asExpr().getBasicBlock(), true) and
      globalValueNumber(guard).getAnExpr().(Call).getTarget().hasName("MyGuard")
    )
  }
}

from Conf conf, PathNode source, PathNode sink
where conf.hasFlowPath(source, sink)
select sink.getNode(), source, sink, ""

won't raise any alerts on the following example:

int source();
void use(int);

bool MyGuard(int);

void test() {
  int x = source();
  if(MyGuard(x)) {
    use(x);
  }

  int y = source();
  bool res = MyGuard(y);
  if(res) {
    use(y);
  }
}

I hope that helps!

bdrodes commented 2 years ago

Thanks @MathiasVP I'll experiment with this a bit. I'm totally unfamiliar with the GlobalValueNumbering module. Sounds like SSA. I'll have to do a bit of research.

bdrodes commented 2 years ago

My only general concern is if this approach will work for more complex control flow with respect to calculating the guard. The example I gave was purposefully simplified.

MathiasVP commented 2 years ago

Let me know if you discover that the suggested approach doesn't work for your use case. In that case, I'm sure we can find something else that will work :)

bdrodes commented 2 years ago

Here's an example of what I mean by more complex flow where I'm not sure how to apply the global value numbering solution yet.

In my solution I've defined a way to parse guard expressions to determine if a particular use of MyGuard evaluates to actually performing a sanitizer/barrier guard or not. For example, I might compare the result to 0, which is a known flag of success for MyGuard.

I've defined an isValid and isInvalid check, so I know which branch of the guard is the 'sanitizer/barrier'. The signature basically looks like this:

/**
 * Checks if `checkExpr` guard contains an expression `targ`
 * and when `checkExpr` evalutes `true`, `targ` is valid.
 */
cached 
predicate isValid(Expr checkExpr, Expr targ)

This is problematic for cases like this below:

int source();
void use(int);
int SUCCESS = 0;

bool MyGuard(int);

void test()
{
  int y = source();
  bool res = MyGuard(y);
  if(res == SUCCESS) {
    use(y);
}

I'm not sure how to use a GlobalValueNumbering here, or if its the right general solution for cases like this. If I would guess, I'd say I need to find a sub-expression of the guard where the number value is MyGuard, then write a new predicate that would parse the computed subexression in a new isValid and isInvalid check, but this is starting to make me think I'm headed down the wrong path again.

bdrodes commented 2 years ago

For more context, this is the style of barrier guard I was using. I had to define my own extension of GuardCondition to ensure I got the full guarding condition, as opposed to a nested expression.

class MyConditionalGuard extends GuardCondition 
{
    MyConditionalGuard() 
    {
        exists(IfStmt i | this = i.getControllingExpr())
        or
        exists(Loop l | this = l.getControllingExpr())
        or
        exists(ConditionalExpr c | this = c.getCondition())
    }
}

class MyBarrierGuard extends DataFlow::BarrierGuard
{
    override predicate checks(Expr e, boolean b)
    {
        exists(MyConditionalGuard mcg |
            this = mcg
        |
            (isValid(mcg, e) and b = true)
            or
            (isInvalid(mcg, e) and b = false)
        )
    }
}

bdrodes commented 2 years ago

Thinking on this more, it just seems like the global value solution will eventually require me to make a separate data trace since the values in my examples may flow arbitrarily into a guard (consider the example below). In these examples I may have go more than 1 step back in the global value numbering to determine if the guard is indeed guarding a use correctly.

I can try to make a recursive global value look up, but it at this point I think I'm re-inventing dataflow. I suppose I could put a dataflow inside my dataflow, but this also seems awkward.

int source();
void use(int);
int SUCCESS = 0;

bool MyGuard(int);

void test()
{
  int y = source();
  bool res = MyGuard(y);
  bool res2 = res == SUCCESS
  if(res2) 
    // GOOD USE
    use(y);

  if(!res2)
    // BAD USE
    use(y)
}

bdrodes commented 2 years ago

Final thought. If a data path must be considered into the guard as well, then it is also unclear if there is any way to establish the isValid or isInvalid on the guarding condition (i.e., which branch is safely validating the use), since the full expression needed to be evaluated is found in the composite of the data path.

To clarify our use case, I'm constantly hitting this pattern of guard and need to make a more robust flow solution. Each use case might have different assumptions we can make on common use, but generally speaking, I have to solve this same problem over and over for different projects with different guarding conditions, so I'm looking for how robust I can make a query in general to fix the general problem of flow into a barrier/sanitizer guard.

bdrodes commented 2 years ago

Alright, I think I have a way forward here. I've applied your global value numbering solution into my isValid and isInvalid predicates I mentioned. I've updated them to not only recurse over 'children' of the expression, but also to consider a child of the expression anything that is in the global value number set for a subexpression. In effect, the isValid and isInvalid do a backwards data trace in evaluating a guard's expression as sanitizing on the false or true branch.

As a result, I have to add it to the isSanitizer for the trace config, not barrier guard as you suggested, but then this renders the barrier guard moot, so barrier guard can be removed entirely from the tracer.

This sounds like the 'right' direction for a general solution given you original suggestion, but I have some concerns about the implied data trace within a data trace (i.e., the sanitizer now does a backwards data trace every invocation). I've used the cached annotation on the isValid and isInvalid predicates which I hope would mean prior computed results are stored to amortize the efficiency a bit.

I also have concerns that to address barrier gaurds generally, we actually don't use any notion of barrier guard, which seems counterintuitive. All that said, if this generally sounds like the 'codeql pardigm' for this general problem, we can call the thread closed, and I'll share this concept with my team.

bdrodes commented 2 years ago

@MathiasVP Continuing to play with this, and I think this direction is very good and also solves associating aliases with guarded conditions. The issues I see now is not every language has a global value numbering library (as far as I can tell), but I assume there is a similar solution with SSA. I'm going to close this thread for now. Thanks.

MathiasVP commented 2 years ago

Hi @bdrodes,

Sorry I'm only responding now! I'm happy you were able to find a satisfying solution that worked for you :)

Indeed, not every language has a global value numbering library, and I'm sure you could do a similar solution with SSA for those languages (you could conceivable do it for C/C++ as well). Let me try to comment on a couple of points you raised:

Alright, I think I have a way forward here. I've applied your global value numbering solution into my isValid and isInvalid predicates I mentioned. I've updated them to not only recurse over 'children' of the expression, but also to consider a child of the expression anything that is in the global value number set for a subexpression. In effect, the isValid and isInvalid do a backwards data trace in evaluating a guard's expression as sanitizing on the false or true branch.

If this works for you that's great! There might be a potential performance problem by considering every expression with the same value number as a subexpression of some expression p to be a subexpression of p (wow, what a mouthful to say out loud). Intuitively, this is because you'd normally have a table child_expr(parent, index, child) with entries like:

child_expr(foo(x, y, z), 0, foo)
child_expr(foo(x, y, z), 1, x)
child_expr(foo(x, y, z), 2, y)
child_expr(foo(x, y, z), 3, z)

(i.e., foo is the first child of foo(x, y, z), x is the second child of foo(x, y, z), etc.).

But with your value number encoding, the table will look like:

child_expr(foo(x, y, z), 0, foo1)
child_expr(foo(x, y, z), 0, foo2)
...
child_expr(foo(x, y, z), 0, fooN)
child_expr(foo(x, y, z), 1, x1)
child_expr(foo(x, y, z), 1, x2)
...
child_expr(foo(x, y, z), 1, xM)
child_expr(foo(x, y, z), 2, y1)
child_expr(foo(x, y, z), 2, y2)
...
child_expr(foo(x, y, z), 2, yP)
child_expr(foo(x, y, z), 3, z1)
child_expr(foo(x, y, z), 3, z2)
...
child_expr(foo(x, y, z), 3, zQ)

where foo1, foo2, ..., fooN all have the same value number (and similarly for x, y, and z).

Once again, if it works for you that's great! But be wary of your analysis exploding :)

This sounds like the 'right' direction for a general solution given you original suggestion, but I have some concerns about the implied data trace within a data trace (i.e., the sanitizer now does a backwards data trace every invocation). I've used the cached annotation on the isValid and isInvalid predicates which I hope would mean prior computed results are stored to amortize the efficiency a bit.

You shouldn't really need a cached annotation in this case, I think. The cached annotation is mostly used when you want a predicate to be evaluated once and then used across multiple queries.

The CodeQL compiler does cache the results automatically, but this is sometimes broken by optimizations: One of the optimizations that CodeQL does is that it specializes your predicates based on the context it's used in, and that might prevent caching of intermediate results. The cached annotation tells the compiler (among other things) to not specialize the predicate.

I also have concerns that to address barrier gaurds generally, we actually don't use any notion of barrier guard, which seems counterintuitive. All that said, if this generally sounds like the 'codeql pardigm' for this general problem, we can call the thread closed, and I'll share this concept with my team.

Yes, barrier guards doesn't work super well for C/C++ at the moment, and we're actively working on improving this situation. I hope that you'll soon be able to write the barrier guard you expect and get the right behavior automatically. I'll be sure to note this thread down in our internal issue about this 🙇.

bdrodes commented 2 years ago

@MathiasVP It's interesting you say CPP in particular is bad with barrier guards, but from what I can tell, the problem I raised is the same in every language. Is there a language where this problem doesn't exist? In particular, I need to address this problem in C#, and Java immediately, but likely will need it addressed in every other language in the near future. Java is one case where we don't have global value numbering.

bdrodes commented 2 years ago

@MathiasVP Is there a way to get involved in the discussion and experimentation with barrier guard improvements? This, in my opinion, is the primary bottleneck in effectively using codeql in almost every query I've come across, created, or requests to assess/fix other people's queries. Is there someone at GitHub I can grab some time with to hash out some of the concerns and timelines?

bdrodes commented 2 years ago

@MathiasVP (attempting to split my comments into separate comments to keep the threads easier to follow).

If this works for you that's great! There might be a potential performance problem by considering every expression with the same value number as a subexpression of some expression p to be a subexpression of p (wow, what a mouthful to say out loud). Intuitively, this is because you'd normally have a table child_expr(parent, index, child) with entries like:

I was worried about this too, so I defined something like this to avoid re-evaluating an expression that is identical in hopes I would just get the definition (which is why I had hoped an SSA version of this would work):

Expr getDerivedExpr(Expr e, boolean strict)
{
    (strict = false and result = e) or 
    (   
        (strict = true or strict = false) and
        result = globalValueNumber(e).getAnExpr() and 
        result != e
    )
}

The idea is when I recurse parsing a guard, I get a child expression based on the semantics of the expression. I want to get the original child back or something in the globalValueNumber set that isn't the same expression. The 'strict' boolean just allows me to force using globalValueNumbering.

Example use inside a predicate called isValid:

   exists(NotExpr notexpr, Expr child |
        checkExpr = notexpr and
        getDerivedExpr(notexpr.getOperand(), false) = child and
        isValid(child, targ)
    )

I'm not sure if this addresses the explosion issue you are describing, but my intent here was to avoid an explosion.

bdrodes commented 2 years ago

@MathiasVP I was looking at how to do this similar trick for csharp. I found this is a GVN module but it is private. Do you have any suggestions for csharp?

github / codeql

CPP: Flow Into Barrier Guards #10011