Customizing string representation of data flow nodes in SARIF or CSV results for Taint Tracking

saikatG commented 3 months ago

Hi,

Currently, when running a cwe query such as TaintedPath (cwe 22) on a java project, I retrieve the CodeFlow for each result in the SARIF files as shown below. Is there an easy way to customize the string representation of nodes in the output, such as, in the message["text"] part for the node? For instance, for a method call I would like the string to have the format "package:class:methodname" instead of "methodname(...)" -- which is the default. Would I need to override the data flow node for this?

Thanks in advance!

...
 "codeFlows" : [ {
        "threadFlows" : [ {
          "locations" : [ {
            "location" : {
              "physicalLocation" : {
                "artifactLocation" : {
                  "uri" : "one-java-agent/src/main/java/com/alibaba/oneagent/AgentImpl.java",
                  "uriBaseId" : "%SRCROOT%",
                  "index" : 0
                },
                "region" : {
                  "startLine" : 81,
                  "startColumn" : 35,
                  "endColumn" : 69
                }
              },
              "message" : {
                "text" : "getFile(...) : String"
              }
            }
          },
....

mbg commented 3 months ago

Hi @saikatG 👋

I am not sure there is a super easy way of doing that. Overriding the toString predicate may be possible here to make this work. However, since you have the location information and, presumably, also have the source code, you could retrieve the information you are after directly from the source code with the help of a Java parser. This is probably the easiest.

saikatG commented 3 months ago

Thanks @mbg. Overriding the toString of DataFlow Node works well!

saikatG commented 3 months ago

Hi CodeQL devs, re-opening this issue because I ran into a problem.

I tried something like this:

bindingset[n]
string getFullName(DataFlow::Node n){

  result = (n.asExpr().(Call)).getCallee().getDeclaringType().getQualifiedName() + "." + (n.asExpr().(Call)).getCallee().getName()
  or 
  result = (n.asExpr().(Argument)).getCall().getCallee().getDeclaringType().getAnAncestor().getQualifiedName() + "." +  (n.asExpr().(Argument)).getCall().getCallee().getName() + "::arg::" + (n.asExpr().(Argument)).getPosition()
  or
  result = (n.(DataFlow::ParameterNode).asParameter().getCallable().getDeclaringType().getQualifiedName() + "." + (n.(DataFlow::ParameterNode).asParameter().getCallable().getDeclaringType().getName()) + "::par::" + (n.(DataFlow::ParameterNode).asParameter().getPosition())) 
  or
  result = n.toString()
}
 class MyNode extends DataFlow::Node {
   override string toString() {
     result = getFullName(this)
   }
 }

It works well, but its turning out to be too expensive to compute in some cases. Any suggestions on how to optimize this?

mbg commented 3 months ago

its turning out to be too expensive to compute in some cases

Could you provide more information? How does this manifest for you (long analysis time / out of memory errors)? How big is the Java project you are running this against?

saikatG commented 3 months ago

So codeql just takes a very long time to generate the csv/sarif files after the "Interpreting results" stage. I waited for 10-15 mins and then killed it. Ofc, it finishes fast w/o the string conversion code above. The project has 97k lines of code. I am running the XSS query, but that shouldn't be an important factor i think?

Running queries.
Compiling query plan for .../XSS.ql.
[1/1 comp 33.6s] Compiled .../XSS.ql.
Starting evaluation of .../XSS.ql.
[1/1 eval 10s] Evaluation done; writing results to .../XSS.bqrs.
Shutting down query evaluator.
Interpreting results.

mbg commented 3 months ago

I assume you're running the CodeQL CLI directly on the command-line. What is the command you used and which version of CodeQL are you using?

I tried running the XSS query with your additions on the database for the repository you linked to and this worked fine for me using the VSCode extension (although I note that there were no results).

saikatG commented 3 months ago

Sorry about the missing details.

Yes, I am using the codeql cli 2.15.3. I use database analyze --rerun [db_dir] --format=sarif-latest --output=results.sarif [query_path]

I am using a custom version of the XSS query that uses my own list of 100 sources and 38 sinks. I think this as well may be causing the difference in execution. So, I have a MySources.qll and MySinks.qll file that looks like this:

predicate isMySourceMethod(Method m)
{
(
    m.getName() = "get" and
    m.getDeclaringType().getSourceDeclaration().hasQualifiedName("java.util", "Map<String,String>")
)
or
(
    m.getName() = "put" and
    m.getDeclaringType().getSourceDeclaration().hasQualifiedName("java.util", "Map<String,String>")
)
or 
...

and I have a MyXSSQuery.qll file that uses it

module MyXssConfig implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node source) { 
    isMySourceMethod(source.asExpr().(MethodCall).getMethod()) 
  }
...

(Same for sinks)

I can't share the full files right now. But I hope this gives you an idea. Do you think this may be causing the slowdown? Interestingly, this still works ok when I don't use "toString". Note that I put the "toString" code in this MyXSSQuery.qll file and not the XSS.ql file.

mbg commented 3 months ago

I am using the codeql cli 2.15.3.

This is a fairly old version. It may be worth upgrading to the latest version to see if that resolves the issue for you. (We release new versions every two weeks.)

I am using a custom version of the XSS query that uses my own list of 100 sources and 38 sinks. I think this as well may be causing the difference in execution.

This is very much possible of course. You could try to run the XSS query without your custom sources and sinks, but still with the overriden toString predicate, and see if that works OK. If that does work, I'd introduce a source/sink that you know will lead to a result being found and try it with just those.

Let me know what the outcomes of those experiments are and we can go from there.

saikatG commented 3 months ago

Thanks a lot @mbg! I will try it out.

While we are on this, another question for you: Is there an easy way to specify summaries with a custom tag? For instance, i see yml files in ext folder that contain specs for many libraries, but they only have "taint" or "value" as tags. Is it possible to use my own custom tag and only consider them as summaries during execution? I see that it can be done for sources and sinks with isSinkNode(node, tag) and isSourceNode(node, tag) predicates. Anything similar for summaries?

mbg commented 3 months ago

While we are on this, another question for you: Is there an easy way to specify summaries with a custom tag? For instance, i see yml files in ext folder that contain specs for many libraries, but they only have "taint" or "value" as tags. Is it possible to use my own custom tag and only consider them as summaries during execution? I see that it can be done for sources and sinks with isSinkNode(node, tag) and isSourceNode(node, tag) predicates. Anything similar for summaries?

There is some documentation in the code which comments on this:

The kind column is a tag that can be referenced from QL to determine to which classes the interpreted elements should be added. For example, for sources "remote" indicates a default remote flow source, and for summaries "taint" indicates a default additional taint step and "value" indicates a globally applicable value-preserving step. For neutrals the kind can be summary, source or sink to indicate that the neutral is neutral with respect to flow (no summary), source (is not a source) or sink (is not a sink).

So in other words, taint and value have special meanings that affect how the models are used.

What would you hope to accomplish with a custom tag?

github / codeql

Customizing string representation of data flow nodes in SARIF or CSV results for Taint Tracking #16143