github / codeql

CodeQL: the libraries and queries that power security researchers around the world, as well as code scanning in GitHub Advanced Security
https://codeql.github.com
MIT License
7.53k stars 1.5k forks source link

CPP:Some questions about Control flow analyse and query time #10411

Open icy17 opened 2 years ago

icy17 commented 2 years ago

I just start learning Codeql, and I want to analyze Data flow and Control flow of a program. Now I learn getASuccessor and Global DataFlow analyze. And I have some questions:

  1. Is there any way (api)to do global control flow analysis? I find that getASuccessor can only get reachability within A function. How to know if a node in a function can reach node in some other functions?
  2. I write a function to do global control flow analysis, but the query takes a long long time to run.... It's a function that recursively calls getASuccessor, I wonder why it's taking so long.....getASuccessor?or recursive?Is there any way to do this faster? Part of the code: image
icy17 commented 2 years ago

And I found that the time to run the same query is different... What affects the speed of running query?

MathiasVP commented 2 years ago

Is there any way (api)to do global control flow analysis? I find that getASuccessor can only get reachability within A function. How to know if a node in a function can reach node in some other functions?

We don't have a library for that, unfortunately. We've experimented with it some time ago in a Hackathon and it worked quite well, but we didn't bring it into production yet as we didn't have any queries that were using it.

For now, you can roll your own ad-hoc version of it by doing something like:

import cpp

ControlFlowNode getAnInterproceduralSuccessor(ControlFlowNode n) {
  result = n.(Call).getTarget().getEntryPoint()
  or
  not n instanceof Call and
  result = n.getASuccessor()
  or
  not exists(n.getASuccessor()) and
  exists(Function f |
    f = n.getBasicBlock().getEnclosingFunction() and
    result = f.getACallToThisFunction().getASuccessor()
  )
}

This predicate will behave like getASuccessor, but in case of a function call it will transfer control-flow to the body of that function.

Note that tracking control-flow into a function is super easy, but tracking flow out of a function is a lot more difficult since there might be multiple places to return to. In our dataflow library we add a small amount of call context so that we often know which function call we should return to. But obviously, my 10-line predicate here doesn't do that :)

I write a function to do global control flow analysis, but the query takes a long long time to run.... It's a function that recursively calls getASuccessor, I wonder why it's taking so long.....getASuccessor?or recursive?Is there any way to do this faster?

I see some issues with the snippet you posted. Here are some comments:

And I found that the time to run the same query is different... What affects the speed of running query?

The first time you run a query that uses some of the standard library predicates CodeQL will evaluate all the necessary library predicates and cache them. This means that the second run of the same code will be a lot quicker.

But obviously, if you then change something that the library depends on (for instance, if you add a new implementation of an abstract class that's used by a library, or if you you modify the libraries) then CodeQL will once again re-evaluate all of the libraries.

I hope that helps!

icy17 commented 2 years ago

Is there any way (api)to do global control flow analysis? I find that getASuccessor can only get reachability within A function. How to know if a node in a function can reach node in some other functions?

We don't have a library for that, unfortunately. We've experimented with it some time ago in a Hackathon and it worked quite well, but we didn't bring it into production yet as we didn't have any queries that were using it.

For now, you can roll your own ad-hoc version of it by doing something like:

import cpp

ControlFlowNode getAnInterproceduralSuccessor(ControlFlowNode n) {
  result = n.(Call).getTarget().getEntryPoint()
  or
  not n instanceof Call and
  result = n.getASuccessor()
  or
  not exists(n.getASuccessor()) and
  exists(Function f |
    f = n.getBasicBlock().getEnclosingFunction() and
    result = f.getACallToThisFunction().getASuccessor()
  )
}

This predicate will behave like getASuccessor, but in case of a function call it will transfer control-flow to the body of that function.

Note that tracking control-flow into a function is super easy, but tracking flow out of a function is a lot more difficult since there might be multiple places to return to. In our dataflow library we add a small amount of call context so that we often know which function call we should return to. But obviously, my 10-line predicate here doesn't do that :)

I write a function to do global control flow analysis, but the query takes a long long time to run.... It's a function that recursively calls getASuccessor, I wonder why it's taking so long.....getASuccessor?or recursive?Is there any way to do this faster?

I see some issues with the snippet you posted. Here are some comments:

  • The basic block resbb isn't bound to any particular value. Remember that CodeQL evaluates your program by joining together relations. When you have an unused variable like resbb, CodeQL will perform a cartesian product between parts of your program and all BasicBlocks in the database.
  • There are a lot of control-flow nodes in a database, and a transitive closure over the getASuccessor relation is very very large. So if you need a transitive closure over getASuccessor, I suggest you instead use the getASuccessor member predicate on BasicBlocks. There are a lot fewer basic blocks in a program than control-flow nodes, so this will be much faster.

And I found that the time to run the same query is different... What affects the speed of running query?

The first time you run a query that uses some of the standard library predicates CodeQL will evaluate all the necessary library predicates and cache them. This means that the second run of the same code will be a lot quicker.

But obviously, if you then change something that the library depends on (for instance, if you add a new implementation of an abstract class that's used by a library, or if you you modify the libraries) then CodeQL will once again re-evaluate all of the libraries.

I hope that helps!

Thanks! I remove some unused value. It's helpful!

icy17 commented 2 years ago

When I run a query, most of the time it shows: image that means getASuccessor cost a lot? Is there any way to make it run faster?(it costs more than 3 hours now :( Part of code: image

MathiasVP commented 2 years ago

There are probably ways to make it faster, yes.

First of all: What are you trying to do? Can you paste in all of your QL code (preferably not as an image as I'd like to be able to run it myself as well)? And can you show an example piece of code that demonstrate which code patterns you're looking for?

icy17 commented 2 years ago

This code is a little long, and I want to detect memory leak in a global scope, like: malloc-free

memory leak:

#include <stdio.h>
#include <stdlib.h>

void right(void)
{
    char* a;
    int i = 0;
    a = malloc(4);
    free(a);
}

void right_with_return(void)
{
    char* a;
    int i = 0;
    a = malloc(4);
    free(a);
    return;
}

void wrong(void)
{
    char* a;
    int i = 0;
    a = malloc(4);
}

void wrong_with_return(void)
{
    char* a;
    int i = 0;
    a = malloc(4);
    return;
}

char* gen_ptr(void)
{
    char* a = malloc(1);
    return a;
}

void right_taint(char* p)
{
    free(p);
    // return
}

void wrong_taint(char* p)
{
    return;
}

void test_if_leak(void)
{
    char* a;
    int i = 1;
    a = malloc(2);
    int rand_num = rand();
    if (i == rand_num)
    {
        return;
    }
    free(a);
    return ;
}

void test_if_leak5(void)
{
    char* a;
    int i = 1;
    a = malloc(2);
    free(a);
    int rand_num = rand();
    if (i == rand_num)
    {
        return;
    }
    free(a);
    return ;
}

char* test_if_leak2(char* p)
{
    char* a;
    int i = 1;
    // a = malloc(2);
    int rand_num = rand();
    if (i == rand_num)
    {
        return p;
    }
    free(p);
    return NULL;
}

void test_if_leak3(char* p)
{
    int i = 1;
    int rand_num = rand();
    p = malloc(10);
    if (i == rand_num)
    {
        return;
    }
    if (i + 10 == rand_num)
    {
        rand_num = rand();
        if(rand_num == 7)
        {
            free(p);
             return;
        }
        else{
            printf("wrong");
        }
        return;

    }
    else{
        free(p);
    }
    free(p);
    return ;
}

int main(void)
{
    char* a;
    char* b;
    char* c;
    char* d;
    int i = 0;
    if(i == rand())
    {
        printf(1);
        return 0;
    }
    a = gen_ptr();
    test_if_leak();
    test_if_leak2(a);
    test_if_leak3(a);
    a = gen_ptr();
    right_taint(a);
    a = gen_ptr();
    wrong_taint(a);
    a = malloc(4);
    c = a;
    d = a + 1;
    b = malloc(3);
    free(c);
    return 0;
}

And My QL code:

import cpp
import semmle.code.cpp.dataflow.TaintTracking
import semmle.code.cpp.dataflow.DataFlow
import semmle.code.cpp.security.Security
import DataFlow::PathGraph
class TestConfiguration extends DataFlow::Configuration {
  TestConfiguration() { this = "TestConfiguration" }

  override predicate isSource(DataFlow::Node source) {
    exists(FunctionCall fc |
      fc.getTarget().hasName("malloc")
      and (fc = source.asExpr())      
    )
  }
  override predicate isSink(DataFlow::Node sink) {
    // sink.asExpr()
    exists(FunctionCall fc |
      fc.getTarget().hasName("free")
      and fc.getAnArgument() = sink.asExpr()
    )
  }
}

class EvilConfiguration extends DataFlow::Configuration {
  EvilConfiguration() { this = "EvilConfiguration" }

  override predicate isSource(DataFlow::Node source) {
    exists(FunctionCall fc |
      fc.getTarget().hasName("malloc")
      and (fc = source.asExpr())

    )
  }

  override predicate isSink(DataFlow::Node sink) {
    exists(Expr rt | 
      rt = sink.asExpr()

    )
  }

}

class AllConfiguration extends DataFlow::Configuration {
  AllConfiguration() { this = "AllConfiguration" }

  override predicate isSource(DataFlow::Node source) {
    exists(Expr expr | expr = source.asExpr()          
    )
  }

  override predicate isSink(DataFlow::Node sink) {
    // sink.asExpr()
    exists(Expr rt | rt = sink.asExpr()

    )
  }

}

predicate isConditionFree(FunctionCall free)
{
  exists(BasicBlock parent, BasicBlock freebb, BasicBlock child |
  freebb.getANode() = free
  and freebb.getAPredecessor*() = parent
  and freebb.getASuccessor*() = child
  and child.getAPredecessor() = parent
  and not child = freebb
  and not parent = freebb
  and not child = parent)
}
FunctionCall getRightFree(FunctionCall malloc) {
  exists(TestConfiguration cfg ,FunctionCall free| result = free and free.getTarget().hasName("free") and cfg.hasFlow(DataFlow::exprNode(malloc), DataFlow::exprNode(free.getAnArgument())))
}
predicate isdataFlow(Expr exp1, Expr exp2) {
  exists(AllConfiguration cfg | 
    cfg.hasFlow(DataFlow::exprNode(exp1), DataFlow::exprNode(exp2)))
}

BasicBlock getLeakBlock(FunctionCall malloc_parent, FunctionCall malloc)
{
  exists( BasicBlock leak, ExitBasicBlock exit | 
    (exit = leak or getGlobalReachBlock(leak) = exit)
    and (leak = getGlobalReachBlock(malloc_parent) or leak.getANode() = malloc_parent)
    and not exists(FunctionCall free | ((free.getTarget().hasName("free")
        and free = getRightFree(malloc))
         or (free.getTarget() = getAllFreeFunction(malloc) and isdataFlow(malloc_parent, free.getAnArgument()))
         )
        and not isConditionFree(free)
        and (free = leak.getANode() 
        or getGlobalReachBlock(free) = leak
        or getGlobalReachBlock(leak).getANode() = free
        // or (fc = leak.getANode() and fc = getparentCall(free))
    ))
    and not exists(ReturnStmt rt, EvilConfiguration cfg| 
          ((exit.getANode() = rt or leak.getANode() = rt)
          and cfg.hasFlow(DataFlow::exprNode(malloc), DataFlow::exprNode(rt.getExpr())
          )) or
          (rt.getExpr() = malloc_parent)

          )
    and exists(BasicBlock tmp | 
            isinLife(malloc, tmp)
            and tmp.getEnclosingFunction() = leak.getEnclosingFunction())
    and result = leak
      )
}

    FunctionCall getFunctionAfter(ControlFlowNode node)
    {
      exists(BasicBlock bb , FunctionCall fcin| 
        (bb = node.getASuccessor*() or bb.getANode() = node)
        and fcin.getEnclosingFunction() = bb.getEnclosingFunction()
        and node.getLocation().isBefore(fcin.getLocation())
        and bb.getANode() = fcin
        and (result = fcin))
    }
    BasicBlock getGlobalReachBlock(ControlFlowNode node)
  {
      result = node.getASuccessor*()
      or
      exists(FunctionCall fc, ControlFlowNode firstbb| 
        fc = getFunctionAfter(node)
        and firstbb = fc.getTarget().getEntryPoint()
        and result = getGlobalReachBlock(firstbb))
  }
  FunctionCall getparentCall(FunctionCall fc)
  {
    exists(Function func , FunctionCall re| 
      fc.getEnclosingFunction() = func
      and re.getTarget() = func
      and result = re)
  }

Function aFreeFunction(FunctionCall malloc)
{
  exists(FunctionCall free | (free.getTarget().hasName("free") 
  and free = getRightFree(malloc)
  )
  or free.getTarget() = getAllFreeFunction(malloc)
  and result = free.getTarget()
  )
}

Function getAllFreeFunction(FunctionCall malloc)
{
  exists(Function func | 
    not exists( BasicBlock leak, ExitBasicBlock exit | 
      (exit = leak or leak.getASuccessor*() = exit)
      and (leak.isReachable() )
      and not exists(FunctionCall free | free.getTarget() = aFreeFunction(malloc)
          and free = getRightFree(malloc)
          and not isConditionFree(free)
          and (free = leak.getANode() 
          or free.getASuccessor*() = leak
          or leak.getASuccessor*().getANode() = free
      ))

      and func = leak.getEnclosingFunction()
        )
        and not exists(ReturnStmt rt, EvilConfiguration cfg, BasicBlock exit| 
          exit.getANode() = rt and exit.getEnclosingFunction() = func
          and cfg.hasFlow(DataFlow::exprNode(malloc), DataFlow::exprNode(rt.getExpr()))
          )
        and result = func
        and exists(BasicBlock bb | bb.getEnclosingFunction() = func and isinLife(malloc, bb))
        )

}
predicate isinLife(FunctionCall malloc, BasicBlock block) {
  exists(EvilConfiguration cfg, Expr expr , Function funclife| 
    cfg.hasFlow(DataFlow::exprNode(malloc), DataFlow::exprNode(expr))
    and expr.getEnclosingFunction() = funclife
    and block.getEnclosingFunction() = funclife)

}

from FunctionCall malloc, BasicBlock leak, FunctionCall fc
where 

malloc.getTarget().hasName("malloc")
and fc = getparentCall*(malloc)
and leak = getLeakBlock(fc, malloc)

select malloc, fc

I'm not familiar with Github's comments feature, and I'm not familiar with these formats, sorry . Thank you for your prompt response. 👍

icy17 commented 2 years ago

I run a query on postgre, and It didn't finish in eight hours... I wonder if this is because the project is too large, and if there is a way to reduce the size of the project, it will reduce the query time. I think of a possible way to reduce the query time, but I don't know if it will work. If I restrict all basicblocks to a particular file, will this reduce query time? Like:

Change:

FunctionCall getFunctionAfter(ControlFlowNode node)
    {
      exists(FunctionCall fcin| 
        node.getASuccessor*() = fcin
        and result = fcin)
    }

TO:

FunctionCall getFunctionAfter(ControlFlowNode node)
    {
      node.getControlFlowScope().getAFile().getBaseName().toString() = "small.c"
      and
      exists(FunctionCall fcin| 
        node.getASuccessor*() = fcin
        and result = fcin)
    }