joernio / joern

Open-source code analysis platform for C/C++/Java/Binary/Javascript/Python/Kotlin based on code property graphs. Discord https://discord.gg/vv4MH284Hc
https://joern.io/
Apache License 2.0

[Bug] wrong LINE_NUMBER_END of control structure "ELSE" in cpg .dot files #4825

Closed molepi40 closed 1 week ago

molepi40 commented 1 month ago

Describe the bug

For example, here is the C code, named main.c:

#include <stdio.h>

int main() 
{
    int x = 1;
    if (x == 1)
    {
        printf("x == 1");
    }
    else
    {
        printf("x != 1");
    }
}

And this is the corresponding dot graph statement for "else", which is at line 10 in the original C code:

47244640257[label=CONTROL_STRUCTURE ARGUMENT_INDEX="-1" CODE="else" COLUMN_NUMBER="5" CONTROL_STRUCTURE_TYPE="ELSE" LINE_NUMBER="11" ORDER="3" PARSER_TYPE_NAME="CASTCompoundStatement"]

I am confused about the LINE_NUMBER, which is not the line where "else" actually is, but that line number plus one. I have also tried similar examples with other C and C++ code, and they all show the same problem, except when "else" and the left brace are on the same line. Could you fix this issue? Thank you.
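To compare the reported value against the source, the attributes of the node statement can be pulled out with a few lines of Python. This is only a minimal sketch: the helper name `parse_node_attributes` is made up for illustration, and the regular expression assumes attribute values contain no escaped double quotes, which holds for the statement above but not necessarily for every exported node.

```python
import re

# The node statement from the exported main.dot, copied from the report above.
DOT_NODE = (
    '47244640257[label=CONTROL_STRUCTURE ARGUMENT_INDEX="-1" CODE="else" '
    'COLUMN_NUMBER="5" CONTROL_STRUCTURE_TYPE="ELSE" LINE_NUMBER="11" '
    'ORDER="3" PARSER_TYPE_NAME="CASTCompoundStatement"]'
)


def parse_node_attributes(statement):
    """Return the KEY="value" pairs of one node statement as a dict.

    Hypothetical helper: assumes values contain no escaped double quotes.
    """
    return dict(re.findall(r'(\w+)="([^"]*)"', statement))


attrs = parse_node_attributes(DOT_NODE)
print(attrs["CODE"], attrs["LINE_NUMBER"], attrs["PARSER_TYPE_NAME"])
```

Note that PARSER_TYPE_NAME is CASTCompoundStatement, so the node's position appears to come from the block rather than from the keyword itself.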

To Reproduce

Steps to reproduce the behavior:

  1. Run the joern-parse command on the C code above
  2. Run the joern-export command to export the cpg .dot files
  3. Look at the main.dot file

Expected behavior

The exported dot file contains the correct LINE_NUMBER for else.

Desktop (please complete the following information):

OS: Linux 5.15.0-48-generic, Ubuntu 22.04
Joern Version: v4.0.27
Java version: openjdk 19.0.2 2023-01-17, OpenJDK Runtime Environment (build 19.0.2+7-Ubuntu-0ubuntu322.04), OpenJDK 64-Bit Server VM (build 19.0.2+7-Ubuntu-0ubuntu322.04, mixed mode, sharing)

Additional context

I am also confused by the name of the node in the dot file, which is obviously a much bigger number than the one generated by Joern version 2.0.448, which I used before. Is this a new issue in Joern?

max-leuthaeuser commented 1 month ago

The problem here is the following: Eclipse CDT AST elements, such as the else keyword, are mapped directly to their content after parsing. In this case, that content is a block (compound statement). The block begins after the line break, hence the line number of the else + 1. Keywords are only present as tokens and do not carry any line/column number information. (You will see the same problem wherever the code contains a newline directly after a keyword.)
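The offset described here can be reproduced on the reporter's main.c with a plain text scan. The following is a rough sketch, not how Joern or CDT work internally: the function name `else_and_block_lines` is invented for illustration, and the scan is naive (it would be fooled by an `else` inside a comment or string literal).

```python
import re

# main.c from the report, reproduced line for line.
SOURCE = """#include <stdio.h>

int main() 
{
    int x = 1;
    if (x == 1)
    {
        printf("x == 1");
    }
    else
    {
        printf("x != 1");
    }
}
"""


def else_and_block_lines(source):
    """Return (line of the first 'else' keyword, line of the '{' after it).

    Hypothetical helper using a naive text scan, not a real C tokenizer.
    """
    keyword = re.search(r"\belse\b", source)
    brace_offset = source.index("{", keyword.end())

    def line_of(offset):
        # 1-based line number of a character offset.
        return source.count("\n", 0, offset) + 1

    return line_of(keyword.start()), line_of(brace_offset)


else_line, block_line = else_and_block_lines(SOURCE)
print(else_line, block_line)
```

For this input, the keyword is on line 10 and the block starts on line 11, which matches the LINE_NUMBER="11" in the exported node. With `else {` written on one line, both values coincide, consistent with the exception the reporter noticed.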

I will be unavailable for the next ~2 weeks. I will try to find a solution afterwards if no one else comes up with something in the meantime.

max-leuthaeuser commented 1 month ago

There is also a confusion about the name of the node in dot file which is obviously a much bigger number than that generated by Joern of version 2.0.448 I used before. Is it a new issue of Joern?

This is due to the change to flatgraph. @mpollmeier Looks weird when exporting to .dot. Anything we can do here?
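As a side observation on the size of the node name: the ID from the report splits cleanly at bit 32. The arithmetic below is certain; the interpretation is not. That the high part identifies something like a node kind and the low part a per-kind sequence number is an assumption that nothing in this thread confirms.

```python
NODE_ID = 47244640257  # node name from the exported .dot statement above

# Split the ID into the part above bit 32 and the low 32 bits.
# Treating these as "kind" and "sequence" is an unconfirmed assumption.
high, low = divmod(NODE_ID, 1 << 32)
print(high, low)  # prints "11 1"
```

If that reading is right, the large numbers are a side effect of the ID layout rather than a sign that the graph contains billions of nodes.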

max-leuthaeuser commented 1 week ago

So I had some time for another look at it. As I said, keywords like else are only present as tokens, but do not carry any line/column number information. Retrieving/calculating them manually is a hell of a hack and at the moment I'm against adding that.

Unless CDT decides to add some helper function for that (unlikely, as CDT is in maintenance mode afaik), I am closing this issue here.