Closed scottconstable closed 1 year ago
Hi @scottconstable,
This is a great question which unfortunately doesn't have an easy answer since dataflow doesn't allow you to track several nodes "simultaneously". Without knowing exactly the use-case you're considering, here's how I would solve this problem:
+
so that you can answer the question "what is the set of possible flow states that can reach (say) the left operand of +
?", and thenIn code this could look something like[^1]:
import cpp
import semmle.code.cpp.dataflow.new.DataFlow
predicate sourceForLeftConfig(DataFlow::Node source) {
none() // TODO: Where should the flow for the future left-operand of the add expression start?
}
predicate sourceForRightConfig(DataFlow::Node source) {
none() // TODO: Where should the flow for the future right-operand of the add expression start?
}
// This is step 1: Finding which states can reach the left operand of an `AddExpr`
module LeftConfig implements DataFlow::StateConfigSig {
newtype FlowState =
additional TNoState() or
additional TState()
predicate isSource(DataFlow::Node source, FlowState state) {
sourceForLeftConfig(source) and state = TNoState()
}
predicate isAdditionalFlowStep(
DataFlow::Node node1, FlowState state1, DataFlow::Node node2, FlowState state2
) {
none() // TODO: Specify how to go from "NoState" to "State" for the future left-operand of the add expression.
}
predicate isSink(DataFlow::Node source, FlowState state) {
exists(AddExpr add | source.asExpr() = add.getLeftOperand()) and
exists(state)
}
}
module LeftFlow = DataFlow::GlobalWithState<LeftConfig>;
predicate leftOperandHasPossibleState(DataFlow::Node left, LeftConfig::FlowState state) {
exists(LeftFlow::PathNode pathLeft |
pathLeft.getNode() = left and
LeftFlow::flowPath(_, pathLeft) and
state = pathLeft.getState()
)
}
// This is step 2: Finding which states can reach the right operand of an `AddExpr`, and transfer (flow, state) to the result accordingly
module Config implements DataFlow::StateConfigSig {
newtype FlowState =
additional TNoState() or
additional TState() or
additional TCombinedState()
predicate isSource(DataFlow::Node source, FlowState state) {
sourceForRightConfig(source) and state = TNoState()
}
predicate isAdditionalFlowStep(
DataFlow::Node node1, FlowState rightState1, DataFlow::Node node2, FlowState rightState2
) {
// This is where we implement the logic from the issue
exists(AddExpr add, LeftConfig::FlowState leftState |
// z = x + y
node1.asExpr() = add.getRightOperand() and
leftOperandHasPossibleState(DataFlow::exprNode(add.getLeftOperand()), leftState) and
node2.asExpr() = add
|
// "If x has FlowState equal to Taint1 and y does not have a FlowState, then Taint1 should propagate to z"
leftState = LeftConfig::TState() and
rightState1 = TNoState() and
rightState2 = TState()
or
// "If y has FlowState equal to Taint2 and x does not have a FlowState, then Taint2 should propagate to z"
rightState1 = TState() and
leftState = LeftConfig::TNoState() and
rightState2 = TState()
or
// "If both x and y have FlowStates that differ from one another, then my taint analysis needs to assign a
// FlowState to z either by choosing between the two FlowStates or by assigning a new one, for example Taint1+2".
leftState = LeftConfig::TState() and
rightState1 = TState() and
rightState2 = TCombinedState()
)
}
predicate isSink(DataFlow::Node source, FlowState state) {
none() // TODO: Specify where the flow ends
}
}
As you can see, this is far from trivial. There are also a couple of TODOs in the code above that depends on domain-specific knowledge. If you have any questions to the above, feel free to ask away!
[^1]: I'm using the new API for writing dataflow/taint-flow configurations. The blogpost this make it clear how to convert between the two if you need to. The old-style only supports string-based flow states, so you'll need to swap out the newtype
s with strings if you explicitly want to use the old "class-based" approach to writing configurations.
Hi @MathiasVP,
Thank you for the very detailed response. This certainly looks like something I could try to implement, but first I have two clarifying questions:
semmle.code.cpp.dataflow.new.DataFlow
. Several other cpp
examples on the main repo use semmle.code.cpp.ir.dataflow.DataFlow
. What is the difference between these two libraries?Config
depends on LeftConfig
via the predicate leftOperandHasPossibleState
. Does this mean that I would also have to import semmle.code.cpp.dataflow.new.DataFlow2
? And how would I use both libraries together?I notice that in your example you import semmle.code.cpp.dataflow.new.DataFlow. Several other cpp examples on the main repo use semmle.code.cpp.ir.dataflow.DataFlow. What is the difference between these two libraries?
There's no difference. semmle.code.cpp.dataflow.new.DataFlow
is the "official" import path that alias the same library as the one you get from semmle.code.cpp.ir.dataflow.DataFlow
. Eventually, import semmle.code.cpp.dataflow.new.DataFlow
will go away and semmle.code.cpp.dataflow.DataFlow
will become an alias for semmle.code.cpp.ir.dataflow.DataFlow
.
It looks like Config depends on LeftConfig via the predicate leftOperandHasPossibleState. Does this mean that I would also have to import semmle.code.cpp.dataflow.new.DataFlow2? And how would I use both libraries together?
That's one of the benefits of using the new module-based API for dataflow: You never need to import anything other than semmle.code.cpp.dataflow.new.DataFlow
(or equivalently: semmle.code.cpp.ir.dataflow.DataFlow
)!
Back when we used the class-based dataflow API you needed to use DataFlow2
, DataFlow3
, etc. when you had multiple configuration in scope and they had to "interact" (i.e., by using the result of one dataflow traversal to define the source and/or sink in another traversal) because they'd relied on the semantics of abstract classes. However, the module-based API doesn't use abstract classes to define the dataflow configuration, and thus you can create as many dataflow modules as you want without them interfering with each other in unexpected ways.
Thanks @MathiasVP! This was very helpful, and I was able to refactor my analysis into a StateConfigSig
and reason about the merged taint.
Suppose that my taint analysis encounters an expression such as:
Is there a way to reason about the
DataFlow::FlowState
ofx
,y
, andz
simultaneously? For example, suppose that:x
hasFlowState
equal to"Taint1"
andy
does not have aFlowState
, then"Taint1"
should propagate toz
y
hasFlowState
equal to"Taint2"
andx
does not have aFlowState
, then"Taint2"
should propagate toz
x
andy
haveFlowState
s that differ from one another, then my taint analysis needs to assign aFlowState
toz
either by choosing between the twoFlowState
s or by assigning a new one, for example"Taint1+2"
.I'm currently using this interface: https://codeql.github.com/codeql-standard-libraries/cpp/semmle/code/cpp/ir/dataflow/internal/tainttracking1/TaintTrackingImpl.qll/type.TaintTrackingImpl$Configuration.html, which does allow from reasoning about two
FlowState
s at a time, but apparently not 3.