joernio / joern

Open-source code analysis platform for C/C++/Java/Binary/Javascript/Python/Kotlin based on code property graphs. Discord https://discord.gg/vv4MH284Hc
https://joern.io/
Apache License 2.0

[Bug] Weird results for file inclusion in c2cpg #4924

Open UESuperGate opened 2 weeks ago

UESuperGate commented 2 weeks ago

Describe the bug

Given a directory with two files, fetch.h and fetch.c, as below:

fetch.h

/* If set, the ref filename to write the target value to. */
extern const char *write_ref;

fetch.c

#include "fetch.h"
#include "cache.h"

const char *write_ref = NULL;

void pull_say(const char *fmt, const char *hex) 
{
    if (get_verbosely)
        fprintf(stderr, fmt, hex);
}

Running importCode from the Joern command line produces two unexpected results:

R1: If we parse the whole directory (i.e., both fetch.h and fetch.c at the same time), the following file nodes will be generated:

joern> cpg.file.l
val res3: List[io.shiftleft.codepropertygraph.generated.nodes.File] = List(
  File(
    code = "<empty>",
    columnNumber = None,
    content = "<empty>",
    hash = None,
    lineNumber = None,
    name = "fetch.h",
    order = 0
  ),
  File(
    code = "<empty>",
    columnNumber = None,
    content = "<empty>",
    hash = None,
    lineNumber = None,
    name = "fetch.h",
    order = 0
  ),
  File(
    code = "<empty>",
    columnNumber = None,
    content = "<empty>",
    hash = None,
    lineNumber = None,
    name = "<includes>",
    order = 1
  ),
  File(
    code = "<empty>",
    columnNumber = None,
    content = "<empty>",
    hash = None,
    lineNumber = None,
    name = "<unknown>",
    order = 0
  ),
  File(
    code = "<empty>",
    columnNumber = None,
    content = "<empty>",
    hash = None,
    lineNumber = None,
    name = "fetch.c",
    order = 0
  )
)

While I fully understand the fetch.c file node, I'm confused about the other four file nodes. Why are there two file nodes for fetch.h instead of one? Is the fetch.h node a subset of the <includes> node? What does the <unknown> file node mean?
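To make the duplication in R1 easy to spot, the file names can be grouped and counted. The following is a minimal sketch in plain Scala that uses the names copied from the output above. The object and method names are hypothetical, and the comment about obtaining the list with `cpg.file.name.l` in the Joern shell is an assumption about how one would feed real data into it.

```scala
object DuplicateFileNames {
  // Count how often each name occurs and keep only names seen more than once.
  def duplicates(names: List[String]): Map[String, Int] =
    names.groupBy(identity).view.mapValues(_.size).filter(_._2 > 1).toMap

  def main(args: Array[String]): Unit = {
    // Names copied from the `cpg.file.l` output above; in the Joern shell the
    // same list could presumably be obtained with `cpg.file.name.l`.
    val fileNames = List("fetch.h", "fetch.h", "<includes>", "<unknown>", "fetch.c")
    println(duplicates(fileNames)) // prints Map(fetch.h -> 2)
  }
}
```

Only fetch.h is reported, which matches the observation that it is the one name attached to two separate file nodes.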


R2: If we only parse fetch.c, the file nodes are shown as below:

joern> cpg.file.l
val res5: List[io.shiftleft.codepropertygraph.generated.nodes.File] = List(
  File(
    code = "<empty>",
    columnNumber = None,
    content = "<empty>",
    hash = None,
    lineNumber = None,
    name = "./fetch.h",
    order = 0
  ),
  File(
    code = "<empty>",
    columnNumber = None,
    content = "<empty>",
    hash = None,
    lineNumber = None,
    name = "<includes>",
    order = 1
  ),
  File(
    code = "<empty>",
    columnNumber = None,
    content = "<empty>",
    hash = None,
    lineNumber = None,
    name = "<unknown>",
    order = 0
  ),
  File(
    code = "<empty>",
    columnNumber = None,
    content = "<empty>",
    hash = None,
    lineNumber = None,
    name = "fetch.c",
    order = 0
  )
)

It seems that Joern automatically includes fetch.h in the analysis even though I did not specify it. However, the imports at lines 1-2 of fetch.c are missing from the result of cpg.file.nameExact("fetch.c").ast.isImport.l (which is empty). They show up under cpg.file.nameExact("./fetch.h").ast.isImport.l instead, as shown below:

joern> cpg.file.nameExact("./fetch.h").ast.isImport.l
val res9: List[io.shiftleft.codepropertygraph.generated.nodes.Import] = List(
  Import(
    code = "#include \"fetch.h\"",
    columnNumber = Some(value = 1),
    explicitAs = None,
    importedAs = Some(value = "fetch.h"),
    importedEntity = Some(value = "fetch.h"),
    isExplicit = None,
    isWildcard = None,
    lineNumber = Some(value = 1),
    order = 1
  ),
  Import(
    code = "#include \"cache.h\"",
    columnNumber = Some(value = 1),
    explicitAs = None,
    importedAs = Some(value = "cache.h"),
    importedEntity = Some(value = "cache.h"),
    isExplicit = None,
    isWildcard = None,
    lineNumber = Some(value = 2),
    order = 2
  )
)

joern> cpg.file.nameExact("fetch.c").ast.isImport.l
val res10: List[io.shiftleft.codepropertygraph.generated.nodes.Import] = List()

To Reproduce

Steps to reproduce the behavior of R1:

  1. Run joern in a shell.
  2. Run importCode("<path_to_directory>", "test") in Joern's command line, where <path_to_directory> is the path to a directory that contains only fetch.h and fetch.c.
  3. Run cpg.file.l in Joern's command line.
  4. See the output.

Steps to reproduce the behavior of R2:

  1. Run joern in a shell.
  2. Run importCode("<path_to_fetch_c>", "test") in Joern's command line, where <path_to_fetch_c> is the path to fetch.c.
  3. Run cpg.file.nameExact("fetch.c").ast.isImport in Joern's command line to see the import nodes of fetch.c.
  4. Run cpg.file.l in Joern's command line to get the name of fetch.h's file node.
  5. Run cpg.file.nameExact("<path_to_fetch_h>").ast.isImport in Joern's command line, where <path_to_fetch_h> is the file node name obtained in step 4, to see the import nodes of fetch.h.

Expected behavior

For R1, I expected only three file nodes to be generated (fetch.h, fetch.c and <includes>).

For R2, I expected the output of step 3 to contain the imports of fetch.h and cache.h, and the output of step 5 to be empty, because fetch.h does not include any file.


UESuperGate commented 2 weeks ago

Here I found the meaning of the <unknown> file node. But I am still confused about the problems in R1 and R2.

UESuperGate commented 2 weeks ago

I think I found where it goes wrong.

In joern-cli/frontends/c2cpg/src/main/scala/io/joern/c2cpg/astcreation/AstCreator.scala (line 51), the createAst method creates the file node with the name fileName(cdtAst):

def createAst(): DiffGraphBuilder = {
    val fileContent = if (!config.disableFileContent) Option(cdtAst.getRawSignature) else None
    val fileNode    = NewFile().name(fileName(cdtAst)).order(0) // create file node with `fileName(cdtAst)`
    fileContent.foreach(fileNode.content(_))
    val ast = Ast(fileNode).withChild(astForTranslationUnit(cdtAst))
    Ast.storeInDiffGraph(ast, diffGraph)
    diffGraph
}

The fileName method is defined in joern-cli/frontends/c2cpg/src/main/scala/io/joern/c2cpg/astcreation/AstCreatorHelper.scala (lines 89-92). I guess it returns the name of the file that the given IASTNode belongs to, via the nullSafeFileLocation method.

protected def fileName(node: IASTNode): String = {
    val path = nullSafeFileLocation(node).map(_.getFileName).getOrElse(filename)
    SourceFiles.toRelativePath(path, config.inputPath)
}

For the R2 example, path ends up being fetch.h's path instead of fetch.c's. I believe the problem comes from nullSafeFileLocation, but I am not familiar enough with Eclipse's AST parser to tell what is going wrong.

As an alternative, I think calling getContainingFilename on the IASTNode directly should work. According to the [docs](https://help.eclipse.org/latest/topic/org.eclipse.cdt.doc.isv/reference/api/org/eclipse/cdt/core/dom/ast/IASTNode.html#getContainingFilename()), this method reports which file the AST node is in.

protected def fileName(node: IASTNode): String = {
    // val path = nullSafeFileLocation(node).map(_.getFileName).getOrElse(filename)
    val path = node.getContainingFilename()
    SourceFiles.toRelativePath(path, config.inputPath)
}

This at least works for the scenario discussed in R2. But I'm not sure whether it would cause other errors in Joern.
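One way to limit the risk of regressions might be to prefer the containing file name but keep the existing lookup and the `filename` fallback for nodes where it is unavailable. The sketch below models that idea in plain Scala without depending on CDT. `NodeLike`, `FakeNode`, and `resolvePath` are hypothetical stand-ins rather than Joern or CDT APIs, and the assumption that getContainingFilename may yield null or an empty string for some nodes has not been verified against c2cpg.

```scala
// Hypothetical stand-in for the parts of CDT's IASTNode used by fileName.
trait NodeLike {
  def containingFilename: String       // models node.getContainingFilename()
  def locationFileName: Option[String] // models nullSafeFileLocation(node).map(_.getFileName)
}

// Simple test double so the sketch can run without a parsed translation unit.
final case class FakeNode(containingFilename: String, locationFileName: Option[String])
    extends NodeLike

object FileNameSketch {
  // Prefer the containing file name; fall back to the file location, then to
  // the name of the file currently being processed (assumed fallback order).
  def resolvePath(node: NodeLike, fallback: String): String =
    Option(node.containingFilename)
      .filter(_.nonEmpty)
      .orElse(node.locationFileName)
      .getOrElse(fallback)

  def main(args: Array[String]): Unit = {
    println(resolvePath(FakeNode("fetch.c", Some("fetch.h")), "fetch.c")) // prints fetch.c
    println(resolvePath(FakeNode("", Some("fetch.h")), "fetch.c"))        // prints fetch.h
    println(resolvePath(FakeNode("", None), "fetch.c"))                   // prints fetch.c
  }
}
```

In the real method the resolved path would still be passed through SourceFiles.toRelativePath(path, config.inputPath), as in both versions above.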