github / codeql

CodeQL: the libraries and queries that power security researchers around the world, as well as code scanning in GitHub Advanced Security
https://codeql.github.com
MIT License
7.75k stars 1.56k forks source link

How to Reason about Merged Taints? #14042

Closed scottconstable closed 1 year ago

scottconstable commented 1 year ago

Suppose that my taint analysis encounters an expression such as:

int z = x + y;

Is there a way to reason about the DataFlow::FlowState of x, y, and z simultaneously? For example, suppose that:

I'm currently using this interface: https://codeql.github.com/codeql-standard-libraries/cpp/semmle/code/cpp/ir/dataflow/internal/tainttracking1/TaintTrackingImpl.qll/type.TaintTrackingImpl$Configuration.html, which does allow from reasoning about two FlowStates at a time, but apparently not 3.

MathiasVP commented 1 year ago

Hi @scottconstable,

This is a great question which unfortunately doesn't have an easy answer since dataflow doesn't allow you to track several nodes "simultaneously". Without knowing exactly the use-case you're considering, here's how I would solve this problem:

  1. Define a dataflow configuration for finding flow to one of the operands of the + so that you can answer the question "what is the set of possible flow states that can reach (say) the left operand of +?", and then
  2. Define another dataflow configuration that tracks the flow-state of the right operand, and use the dataflow information from step 1 to implement your logic.

In code this could look something like[^1]:

import cpp
import semmle.code.cpp.dataflow.new.DataFlow

predicate sourceForLeftConfig(DataFlow::Node source) {
  none() // TODO: Where should the flow for the future left-operand of the add expression start?
}

predicate sourceForRightConfig(DataFlow::Node source) {
  none() // TODO: Where should the flow for the future right-operand of the add expression start?
}

// This is step 1: Finding which states can reach the left operand of an `AddExpr`
module LeftConfig implements DataFlow::StateConfigSig {
  newtype FlowState =
    additional TNoState() or
    additional TState()

  predicate isSource(DataFlow::Node source, FlowState state) {
    sourceForLeftConfig(source) and state = TNoState()
  }

  predicate isAdditionalFlowStep(
    DataFlow::Node node1, FlowState state1, DataFlow::Node node2, FlowState state2
  ) {
    none() // TODO: Specify how to go from "NoState" to "State" for the future left-operand of the add expression.
  }

  predicate isSink(DataFlow::Node source, FlowState state) {
    exists(AddExpr add | source.asExpr() = add.getLeftOperand()) and
    exists(state)
  }
}

module LeftFlow = DataFlow::GlobalWithState<LeftConfig>;

predicate leftOperandHasPossibleState(DataFlow::Node left, LeftConfig::FlowState state) {
  exists(LeftFlow::PathNode pathLeft |
    pathLeft.getNode() = left and
    LeftFlow::flowPath(_, pathLeft) and
    state = pathLeft.getState()
  )
}

// This is step 2: Finding which states can reach the right operand of an `AddExpr`, and transfer (flow, state) to the result accordingly
module Config implements DataFlow::StateConfigSig {
  newtype FlowState =
    additional TNoState() or
    additional TState() or
    additional TCombinedState()

  predicate isSource(DataFlow::Node source, FlowState state) {
    sourceForRightConfig(source) and state = TNoState()
  }

  predicate isAdditionalFlowStep(
    DataFlow::Node node1, FlowState rightState1, DataFlow::Node node2, FlowState rightState2
  ) {
    // This is where we implement the logic from the issue
    exists(AddExpr add, LeftConfig::FlowState leftState |
      // z = x + y
      node1.asExpr() = add.getRightOperand() and
      leftOperandHasPossibleState(DataFlow::exprNode(add.getLeftOperand()), leftState) and
      node2.asExpr() = add
    |
      // "If x has FlowState equal to Taint1 and y does not have a FlowState, then Taint1 should propagate to z"
      leftState = LeftConfig::TState() and
      rightState1 = TNoState() and
      rightState2 = TState()
      or
      // "If y has FlowState equal to Taint2 and x does not have a FlowState, then Taint2 should propagate to z"
      rightState1 = TState() and
      leftState = LeftConfig::TNoState() and
      rightState2 = TState()
      or
      // "If both x and y have FlowStates that differ from one another, then my taint analysis needs to assign a
      // FlowState to z either by choosing between the two FlowStates or by assigning a new one, for example Taint1+2".
      leftState = LeftConfig::TState() and
      rightState1 = TState() and
      rightState2 = TCombinedState()
    )
  }

  predicate isSink(DataFlow::Node source, FlowState state) {
    none() // TODO: Specify where the flow ends
  }
}

As you can see, this is far from trivial. There are also a couple of TODOs in the code above that depends on domain-specific knowledge. If you have any questions to the above, feel free to ask away!

[^1]: I'm using the new API for writing dataflow/taint-flow configurations. The blogpost this make it clear how to convert between the two if you need to. The old-style only supports string-based flow states, so you'll need to swap out the newtypes with strings if you explicitly want to use the old "class-based" approach to writing configurations.

scottconstable commented 1 year ago

Hi @MathiasVP,

Thank you for the very detailed response. This certainly looks like something I could try to implement, but first I have two clarifying questions:

  1. I notice that in your example you import semmle.code.cpp.dataflow.new.DataFlow. Several other cpp examples on the main repo use semmle.code.cpp.ir.dataflow.DataFlow. What is the difference between these two libraries?
  2. It looks like Config depends on LeftConfig via the predicate leftOperandHasPossibleState. Does this mean that I would also have to import semmle.code.cpp.dataflow.new.DataFlow2? And how would I use both libraries together?
MathiasVP commented 1 year ago

I notice that in your example you import semmle.code.cpp.dataflow.new.DataFlow. Several other cpp examples on the main repo use semmle.code.cpp.ir.dataflow.DataFlow. What is the difference between these two libraries?

There's no difference. semmle.code.cpp.dataflow.new.DataFlow is the "official" import path that alias the same library as the one you get from semmle.code.cpp.ir.dataflow.DataFlow. Eventually, import semmle.code.cpp.dataflow.new.DataFlow will go away and semmle.code.cpp.dataflow.DataFlow will become an alias for semmle.code.cpp.ir.dataflow.DataFlow.

It looks like Config depends on LeftConfig via the predicate leftOperandHasPossibleState. Does this mean that I would also have to import semmle.code.cpp.dataflow.new.DataFlow2? And how would I use both libraries together?

That's one of the benefits of using the new module-based API for dataflow: You never need to import anything other than semmle.code.cpp.dataflow.new.DataFlow (or equivalently: semmle.code.cpp.ir.dataflow.DataFlow)!

Back when we used the class-based dataflow API you needed to use DataFlow2, DataFlow3, etc. when you had multiple configuration in scope and they had to "interact" (i.e., by using the result of one dataflow traversal to define the source and/or sink in another traversal) because they'd relied on the semantics of abstract classes. However, the module-based API doesn't use abstract classes to define the dataflow configuration, and thus you can create as many dataflow modules as you want without them interfering with each other in unexpected ways.

scottconstable commented 1 year ago

Thanks @MathiasVP! This was very helpful, and I was able to refactor my analysis into a StateConfigSig and reason about the merged taint.