github / codeql

CodeQL: the libraries and queries that power security researchers around the world, as well as code scanning in GitHub Advanced Security
https://codeql.github.com
MIT License
7.32k stars 1.46k forks source link

How to get each Node in the dataflow paths with PathGraph #16881

Open tylergu opened 5 days ago

tylergu commented 5 days ago

Description of the issue

Use Case

We are trying to use the dataflow analysis to get the explicitly accessed fields in a struct in Golang. To do that, we use the TaintTracking module, with source being the root struct, and sink being any(). The background is that we are trying to do some static analysis for Kubernetes controllers, which define complex nested struct types (e.g. RabbitMQ operator).

The taint analysis works like a charm. However, we are struggling to compute the path for the sink nodes. To get the struct field path, we need to get each DataFlow Node on the dataflow path from source to the sink, for all possible paths.

Example

Given the following Golang program, with a nested definition of MyType and MySubType:

type MyType struct {
    subtype MySubType
}

type MySubType struct {
    name string
    id   int
}

func main() {
    foo := MyType{subtype: MySubType{name: "myName", id: 1}}
    bar := foo.subtype.id
    _ = bar
}

our objective is given taint source foo, we want to reach to sink bar (this works well), and get the struct field path subtype.id.

Is there a clean way to get all field selection nodes on the path given the source and the sink?

Thanks a lot!

smowton commented 4 days ago

Does using SsaWithFields help?

from SsaWithFields ssaf
select ssaf, ssaf.getBaseVariable(), ssaf.getAUse()

This for example tells us that the dataflow node foo.subtype.id is a read of .subtype.id.

The tricky thing about walking flow paths and trying to accumulate the field-accesses seen along them is twofold: first there can be a large number of possible paths for a simple-seeming program, and second that there can be field-accesses that don't match how struct types are laid out. For example, if the program went

foo := MyType{subtype: MySubType{name: "myName", id: 1}}
bar := OtherType(otherField: foo.name.id}
baz := bar.otherField

Then the sequence of field accesses along the path from foo to baz is name, id, otherField, which of course doesn't match how the static struct types are laid out.

Could you perhaps give some other examples of what you'd want your query to return in this sort of case?

tylergu commented 4 days ago

Thanks for the prompt reply! @smowton

Then the sequence of field accesses along the path from foo to baz is name, id, otherField, which of course doesn't match how the static struct types are laid out.

Yeah, we would expect .subtype.id in the case you mentioned. Perhaps we can get away by setting field write as the barrier. I think in our use cases such scenarios rarely happen.


Thanks for pointing us the direction!

from SsaWithFields ssaf
select ssaf, ssaf.getBaseVariable(), ssaf.getAUse()

Though one struggle we have is to separate different paths. For example, in the following query, we want two separate paths: .subtype.id and .subtype.name

foo := MyType{subtype: MySubType{name: "myName", id: 1}}
bar := foo.subtype.id
baz := foo.subtype.name

The path-problem seems to be able to automatically group the nodes on the same path, but we did not find a good way to do that in the query itself.

Any insight would be greatly appreciated!

smowton commented 2 days ago

How about doing something like:

import go

Expr getAStructLiteralValue(StructLit e) {
  exists(Expr element | element = e.getAnElement() |
    if element instanceof KeyValueExpr
    then result = element.(KeyValueExpr).getValue()
    else result = element
  )
}

string describeStructLiteralValue(Expr e) {
  exists(Write w, Field f, DataFlow::Node base, DataFlow::Node value |
    w.writesField(base, f, value) and value.asExpr() = e |
    result = describeStructLiteralValue(base.asExpr()) + "." + f.getName()
    or
    result = f.getName()
  )
}

string longestStuctLiteralValueDescription(Expr e) {
  result = max(string s | s = describeStructLiteralValue(e) | s order by s.length())
}

module Blah implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node n) {
    n.asExpr() = getAStructLiteralValue(_) and not n.asExpr() instanceof StructLit
  }

  predicate isSink(DataFlow::Node n) { any() }
}

module Flow = DataFlow::Global<Blah>;

from Flow::PathNode source, Flow::PathNode sink
where Flow::flowPath(source, sink)
and source != sink
select source, sink, longestStuctLiteralValueDescription(source.getNode().asExpr())

Note that by making the sources struct-literal members that are not themselves struct literals (i.e., "myName" and 1), and by using DataFlow instead of TaintTracking, we require that to find a result, there must have been a corresponding set of field accesses to extract the value. That is, you would not get a result for baz2 := foo.subtype, because we haven't fully unpacked the source value "myName" or 1 yet.

That means there's no need to look at the flow path explicitly -- we know if dataflow has found a result, the source has been fully extracted from its surrounding struct, so the field accesses required are exactly the fields it is nested within when the struct literal is created.

On the other hand if you want to highlight the places where the fields are accessed, then we need to do something different -- let me know if so.