github / codeql

CodeQL: the libraries and queries that power security researchers around the world, as well as code scanning in GitHub Advanced Security
https://codeql.github.com

Tracing Parameter Provenance with CodeQL in VSCode Codebase #16191

Closed YanLitao closed 7 months ago

YanLitao commented 7 months ago

I'm exploring the capabilities of CodeQL for tracking the provenance of parameters in large codebases, specifically focusing on the positionDelta parameter within the Visual Studio Code (VSCode) repository. My goal is to understand how effectively CodeQL can trace this parameter back to its origin across multiple files and functions.

In the VSCode codebase, the parameter positionDelta is used in vscode/src/vs/editor/browser/widget/codeEditor/codeEditorWidget.ts line 1778 as follows:

compositionType: (text: string, replacePrevCharCnt: number, replaceNextCharCnt: number, positionDelta: number) => {
    if (replaceNextCharCnt || positionDelta) {
        ...
    }
    ...
}

By manually following references and definitions, I traced the usage of positionDelta through several files, such as viewController.ts and textAreaHandler.ts, eventually reaching textAreaInput.ts where it's part of an event firing method:

this._register(this._textArea.onCompositionUpdate((e) => {
    ...
    this._onType.fire(typeInput);
    ...
}));

Given this chain of usages, I'm curious whether CodeQL can automatically trace the positionDelta parameter from its usage in codeEditorWidget.ts to its origin and through all the intermediate steps. How deep can CodeQL go in uncovering the flow of this parameter in a complex codebase like VS Code's? Can it connect all the dots as I did manually?

I tried the following query:

import javascript

class PositionDeltaFlowConfig extends DataFlow::Configuration {
  PositionDeltaFlowConfig() { this = "PositionDeltaFlowConfig" }

  override predicate isSource(DataFlow::Node source) { any() }

  override predicate isSink(DataFlow::Node sink) {
    exists(Expr expr, Function f |
      f.getName() = "compositionType" and
      f.getFile().getBaseName() = "codeEditorWidget.ts" and
      expr = f.getBody().getAChild*().(Expr) and
      expr.toString() = "positionDelta" and
      sink = DataFlow::exprNode(expr) and
      sink.getStartLine() = 1753
    )
  }
}

from PositionDeltaFlowConfig config, DataFlow::Node source, DataFlow::Node sink
where config.hasFlow(source, sink)
select source, source.getFile().getAbsolutePath() + " in line: " + source.getStartLine(), sink,
  sink.getFile().getAbsolutePath() + " in line: " + sink.getStartLine()

With this query, I can successfully get the line of compositionType in vscode/src/vs/editor/browser/widget/codeEditor/codeEditorWidget.ts line 1778. From here, I can keep tracking compositionType:

import javascript

predicate isCompositionTypeFunction(Function f) {
  f.getName() = "compositionType" and
  f.getNumParameter() = 4 and
  f.getParameter(0).getName() = "text" and
  f.getParameter(1).getName() = "replacePrevCharCnt" and
  f.getParameter(2).getName() = "replaceNextCharCnt" and
  f.getParameter(3).getName() = "positionDelta"
}

from Function f
where isCompositionTypeFunction(f)
select f

But it cannot find call sites such as this._viewController.compositionType(e.text, e.replacePrevCharCnt, e.replaceNextCharCnt, e.positionDelta);

mbg commented 7 months ago

Hi @YanLitao 👋

Function will find you function definitions, while this._viewController.compositionType(e.text, e.replacePrevCharCnt, e.replaceNextCharCnt, e.positionDelta); is a function call.

I was able to find it with:

predicate isCompositionTypeCall(DataFlow::CallNode call) {
  isCompositionTypeFunction(call.getACallee())
}

YanLitao commented 7 months ago

Hi @mbg,

First, I want to express my gratitude for your prompt and helpful response that made the initial query work. With your solutions, I am now exploring further into the codebase and have come across specific functions and methods related to the variable positionDelta.

I am interested in understanding how data flows through different nodes to positionDelta, particularly in scenarios like this one in vscode-main/src/vs/editor/browser/controller/textAreaInput.ts:

this._register(this._textAreaInput.onType((e: ITypeData) => { // <- now, I want to find some nodes like this: this._textAreaInput.onType(...) ...
    if (e.replacePrevCharCnt || e.replaceNextCharCnt || e.positionDelta) {
        ...
        this._viewController.compositionType(e.text, e.replacePrevCharCnt, e.replaceNextCharCnt, e.positionDelta); // <- this is what I am looking for on my previous question.
    }
}));

And also in method invocations such as .onType(), which is actually related to this._textAreaInput.onType example above:

this._register(this._textArea.onCompositionEnd((e) => {
    ...
    this._onType.fire(typeInput); // <- this one 
    ...
}));
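
The two snippets above are linked through an event emitter rather than a direct call, which is why the flow is hard to follow by hand. A minimal, self-contained TypeScript sketch of that shape follows. The Emitter class, TextAreaInputModel, simulateComposition, and the received array are simplified, hypothetical stand-ins for illustration, not VS Code's actual implementations; only the ITypeData field names and the compositionType signature are taken from the snippets in this thread.

```typescript
// Simplified stand-ins for illustration; these are NOT VS Code's real classes.
interface ITypeData {
  text: string;
  replacePrevCharCnt: number;
  replaceNextCharCnt: number;
  positionDelta: number;
}

type Listener<T> = (e: T) => void;

// Minimal event emitter: `event` registers a listener, `fire` invokes all of them.
class Emitter<T> {
  private listeners: Listener<T>[] = [];
  readonly event = (listener: Listener<T>): void => {
    this.listeners.push(listener);
  };
  fire(e: T): void {
    for (const listener of this.listeners) {
      listener(e);
    }
  }
}

// Plays the role of the input layer: positionDelta originates here, inside the fired event.
class TextAreaInputModel {
  private readonly _onType = new Emitter<ITypeData>();
  readonly onType = this._onType.event;

  // Hypothetical helper standing in for a real composition event.
  simulateComposition(typeInput: ITypeData): void {
    this._onType.fire(typeInput);
  }
}

// Plays the role of the final callback; records every call that passes the guard.
const received: ITypeData[] = [];
const compositionType = (
  text: string,
  replacePrevCharCnt: number,
  replaceNextCharCnt: number,
  positionDelta: number
): void => {
  if (replaceNextCharCnt || positionDelta) {
    received.push({ text, replacePrevCharCnt, replaceNextCharCnt, positionDelta });
  }
};

// Plays the role of the handler: unpacks event fields into call arguments.
const input = new TextAreaInputModel();
input.onType((e) => {
  if (e.replacePrevCharCnt || e.replaceNextCharCnt || e.positionDelta) {
    compositionType(e.text, e.replacePrevCharCnt, e.replaceNextCharCnt, e.positionDelta);
  }
});

input.simulateComposition({ text: "a", replacePrevCharCnt: 0, replaceNextCharCnt: 0, positionDelta: 2 });
console.log(received);
```

In this model the value passes through three hops: the object handed to fire, the parameter e of the registered listener, and the fourth argument of compositionType. An automated analysis has to connect the fire call to the listener registration in order to reproduce the manual trace.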

Could you advise if it's possible to create a CodeQL query that tracks all intermediate nodes, like the ones mentioned above, and other related flows that eventually interact with positionDelta? This variable is specifically used in vscode-main/src/vs/editor/browser/widget/codeEditor/codeEditorWidget.ts at line 1778. This may be similar to the "Data Flow to Here" feature in IntelliJ IDEA's dataflow analysis, which analyzes data upstream of a given point.

I am aiming to map out all occurrences and flows that influence or utilize positionDelta throughout the project. Any guidance on setting up such a comprehensive tracking query would be greatly appreciated.

Thank you for your assistance!

mbg commented 7 months ago

Hi @YanLitao,

Yes, this is possible with a suitable taint tracking configuration. If I understand your question correctly, the following modifications to your code should get you the results you are looking for:

/**
 * @kind path-problem
 */

import javascript
import DataFlow
import DataFlow::PathGraph

class PositionDeltaFlowConfig extends TaintTracking::Configuration {
  PositionDeltaFlowConfig() { this = "PositionDeltaFlowConfig" }

  override predicate isSource(DataFlow::Node source) { any() }

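  override predicate isSink(DataFlow::Node sink) {
    // The sink is the fourth argument (index 3, positionDelta) of any call
    // that may resolve to a compositionType function.
    exists(DataFlow::CallNode call |
      isCompositionTypeFunction(call.getACallee()) and
      sink = call.getArgument(3)
    )
  }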
}

predicate isCompositionTypeFunction(Function f) {
  f.getName() = "compositionType" and
  f.getNumParameter() = 4 and
  f.getParameter(0).getName() = "text" and
  f.getParameter(1).getName() = "replacePrevCharCnt" and
  f.getParameter(2).getName() = "replaceNextCharCnt" and
  f.getParameter(3).getName() = "positionDelta"
}

from PositionDeltaFlowConfig config, DataFlow::PathNode source, DataFlow::PathNode sink
where config.hasFlowPath(source, sink)
select sink.getNode(), source, sink, "Data flow to compositionType"

Note that we do have a lot of documentation and tutorials for CodeQL at https://codeql.github.com/docs/index.html which cover many aspects of how to write queries like this and what the capabilities of the system are.

YanLitao commented 7 months ago

Hi @mbg,

Thank you for your prompt and insightful response! Your suggestions and the linked documentation are proving incredibly helpful as I continue to explore CodeQL's capabilities.

To refine my earlier question: I am interested in tracking the usage of any given variable within our project, including all methods, functions, or conditions that directly utilize this variable (let's call this set A). Additionally, I am looking to identify all the methods, functions, or conditions that invoke the members of set A (let's call this set B). Essentially, the goal is to establish a comprehensive map of direct and indirect interactions with the specified variable across different files within the project.

Would it be possible to structure a CodeQL query that not only identifies set A but also traces back to set B, capturing the full scope of interactions? Any further guidance or examples on configuring such a query would be immensely valuable.

Thank you once again for your support and guidance!

mbg commented 7 months ago

> To refine my earlier question: I am interested in tracking the usage of any given variable within our project, including all methods, functions, or conditions that directly utilize this variable (let's call this set A). Additionally, I am looking to identify all the methods, functions, or conditions that invoke the members of set A (let's call this set B). Essentially, the goal is to establish a comprehensive map of direct and indirect interactions with the specified variable across different files within the project.
>
> Would it be possible to structure a CodeQL query that not only identifies set A but also traces back to set B, capturing the full scope of interactions? Any further guidance or examples on configuring such a query would be immensely valuable.

Yes, this is also possible. You would have to modify the definition of what sources and sinks are to match your sets A and B respectively.
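
As a conceptual illustration only (this is not a CodeQL query), the relationship between sets A and B can be sketched in TypeScript as reverse reachability over a call graph. Everything below is hypothetical: the function names, the calls table, and the contents of setA are made up for the example, and treating an event fire as a call edge to its listener is an assumption of this toy model.

```typescript
// Hypothetical toy call graph (caller -> callees); names are illustrative only.
const calls: Record<string, string[]> = {
  onCompositionEndListener: ["fireOnType"],
  fireOnType: ["onTypeListener"],
  onTypeListener: ["compositionType"],
  compositionType: [],
  unrelated: [],
};

// Hypothetical set A: functions whose bodies use the variable directly.
const setA: string[] = ["onTypeListener", "compositionType"];

// Direct callers of any function in `targets` (one step back in the call graph).
function callersOf(targets: string[]): string[] {
  return Object.keys(calls).filter((caller) =>
    calls[caller].some((callee) => targets.indexOf(callee) !== -1)
  );
}

// Set B extended transitively: keep asking for callers until nothing new appears.
function transitiveCallersOf(targets: string[]): string[] {
  const seen: string[] = [];
  let frontier = callersOf(targets);
  while (frontier.length > 0) {
    // Only functions not visited before are explored further, so cycles terminate.
    const fresh = frontier.filter((f) => seen.indexOf(f) === -1);
    for (const f of fresh) {
      seen.push(f);
    }
    frontier = callersOf(fresh);
  }
  return seen;
}

const setB = callersOf(setA);
const indirect = transitiveCallersOf(setA);
console.log(setB, indirect);
```

A function can belong to both sets: in this toy graph onTypeListener uses the variable itself and also calls compositionType. In a real query, the direct users and their callers would be expressed through the source and sink definitions, as described above.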