joernio / joern

Open-source code analysis platform for C/C++/Java/Binary/Javascript/Python/Kotlin based on code property graphs. Discord https://discord.gg/vv4MH284Hc
https://joern.io/
Apache License 2.0

Joern cannot find method from the source code #3136

Open nekon02 opened 1 year ago

nekon02 commented 1 year ago

Hi, I am trying to use Joern for a project on vulnerability detection from source code graph representations with machine learning. For some of the source code in the dataset, Joern cannot find the primary method: the method name list contains only <global>, and the graph has only 18 nodes. Is there any way to fix this? This is the C code I am trying to extract the graph from:

static time_t asn1_time_to_time_t(ASN1_UTCTIME * timestr TSRMLS_DC)
    {
        time_t ret;
        struct tm thetime;
        char * strbuf;
        char * thestr;
        long gmadjust = 0;

        if (timestr->length < 13) {
                php_error_docref(NULL TSRMLS_CC, E_WARNING, "extension author too lazy to parse %s correctly", timestr->data);
                    return (time_t)-1;
            }

        strbuf = estrdup((char *)timestr->data);

            memset(&thetime, 0, sizeof(thetime));

        thestr = strbuf + timestr->length - 3;

            thetime.tm_sec = atoi(thestr);
            *thestr = '\0';
        thetime.tm_mon = atoi(thestr)-1;
        *thestr = '\0';
        thestr -= 2;
        thetime.tm_year = atoi(thestr);

        if (thetime.tm_year < 68) {
            thetime.tm_year += 100;
        }

        thetime.tm_isdst = -1;
        ret = mktime(&thetime);

        return ret;
    }

This is the output from Joern:

joern> cpg.method.name.l
val res10: List[String] = List("<global>", "<global>")
joern> cpg.method.dotAst.l
val res11: List[String] = List(
  """digraph "&lt;global&gt;" {
"5" [label = <(METHOD,&lt;global&gt;)<SUB>1</SUB>> ]
"6" [label = <(BLOCK,&lt;empty&gt;,&lt;empty&gt;)<SUB>1</SUB>> ]
"7" [label = <(UNKNOWN,static time_t asn1_time_to_time_t(ASN1_UTCTIME *timestr TSRMLS_DC)\012{\012    time_t ret;\012    struct tm thetime;\012    char *strbuf;\012    char *thestr;\012    long gmadjust = 0;\012\012    if (timestr-&gt;length &lt; 13) {\012        php_error_docref(NULL TSRMLS_CC, E_WARNING, &quot;extension author too lazy to parse %s correctly&quot;, timestr-&gt;data);\012        return (time_t)-1;\012    }\012\012    strbuf = estrdup((char *)timestr-&gt;data);\012    memset(&amp;thetime, 0, sizeof(thetime));\012\012    thestr = strbuf + timestr-&gt;length - 3;\012\012    thetime.tm_sec = atoi(thestr);\012    *thestr = '\0';\012    thetime.tm_mon = atoi(thestr) - 1;\012    *thestr = '\0';\012    thestr -= 2;\012    thetime.tm_year = atoi(thestr);\012\012    if (thetime.tm_year &lt; 68) {\012        thetime.tm_year += 100;\012    }\012\012    thetime.tm_isdst = -1;\012    ret = mktime(&amp;thetime);\012\012#if HAVE_TM_GMTOFF\012    gmadjust = thetime.tm_gmtoff;\012#else\012    gmadjust = -(thetime.tm_isdst ? 
(long)timezone - 3600 : (long)timezone + 3600);\012#endif\012\012    ret += gmadjust;\012\012    efree(strbuf);\012\012    return ret;\012},static time_t asn1_time_to_time_t(ASN1_UTCTIME *timestr TSRMLS_DC)\012{\012    time_t ret;\012    struct tm thetime;\012    char *strbuf;\012    char *thestr;\012    long gmadjust = 0;\012\012    if (timestr-&gt;length &lt; 13) {\012        php_error_docref(NULL TSRMLS_CC, E_WARNING, &quot;extension author too lazy to parse %s correctly&quot;, timestr-&gt;data);\012        return (time_t)-1;\012    }\012\012    strbuf = estrdup((char *)timestr-&gt;data);\012    memset(&amp;thetime, 0, sizeof(thetime));\012\012    thestr = strbuf + timestr-&gt;length - 3;\012\012    thetime.tm_sec = atoi(thestr);\012    *thestr = '\0';\012    thetime.tm_mon = atoi(thestr) - 1;\012    *thestr = '\0';\012    thestr -= 2;\012    thetime.tm_year = atoi(thestr);\012\012    if (thetime.tm_year &lt; 68) {\012        thetime.tm_year += 100;\012    }\012\012    thetime.tm_isdst = -1;\012    ret = mktime(&amp;thetime);\012\012#if HAVE_TM_GMTOFF\012    gmadjust = thetime.tm_gmtoff;\012#else\012    gmadjust = -(thetime.tm_isdst ? (long)timezone - 3600 : (long)timezone + 3600);\012#endif\012\012    ret += gmadjust;\012\012    efree(strbuf);\012\012    return ret;\012})<SUB>1</SUB>> ]
"8" [label = <(METHOD_RETURN,ANY)<SUB>1</SUB>> ]
  "5" -> "6"
  "5" -> "8"
  "6" -> "7"
}
""",
  """digraph "&lt;global&gt;" {
"13" [label = <(METHOD,&lt;global&gt;)<SUB>1</SUB>> ]
"14" [label = <(BLOCK,&lt;empty&gt;,&lt;empty&gt;)> ]
"15" [label = <(METHOD_RETURN,ANY)> ]
  "13" -> "14"
  "13" -> "15"
}
"""
)
joern> cpg.method.dotPdg.l
val res12: List[String] = List(
  """digraph "&lt;global&gt;" {

}
""",
  """digraph "&lt;global&gt;" {

}
"""
)

I am using Joern version 2.0.19 on Ubuntu under WSL. Thank you in advance.

max-leuthaeuser commented 1 year ago

There are a lot of macros/defines in that code. Are they available, i.e. included or in your system's C/C++ compiler include path? Otherwise, parsing might fail.

nekon02 commented 1 year ago

Thank you for the quick reply, but sadly no. The dataset I use (Bigvul) includes only the function source code, without any header files or defines. Do you have any recommendations on what I should do in this situation?

I also have another question: when using joern-parse and joern-export, is there a way to export only the main method of the source code, so I can save it to the correct data row in the dataset?

max-leuthaeuser commented 1 year ago

It looks like these macros/defines are from openssl (e.g., see: https://docs.huihoo.com/doxygen/openssl/1.0.1c/crypto_2ossl__typ_8h_source.html).

The code static time_t asn1_time_to_time_t(ASN1_UTCTIME * timestr TSRMLS_DC) { ... } is not even valid C/C++ without the define for TSRMLS_DC which is something like:

#define TSRMLS_D    void ***tsrm_ls
#define TSRMLS_DC   , TSRMLS_D

Maybe using c2cpg with --with-include-auto-discovery or --include <path-to-openssl>, with openssl installed on your system, helps.

nekon02 commented 1 year ago

Thank you for the suggestion. After putting the #define TSRMLS_DC , TSRMLS_D in the source code, it works and returns the graph representation now. But when I try to use

joern-parse <source.c> --frontend-args --with-include-auto-discovery
joern-parse <source.c> --frontend-args --include <path-to-openssl>

with OpenSSL installed in my WSL Ubuntu, nothing changes from the original output. In addition, is there a way to use --with-include-auto-discovery in the Joern interactive shell, or is it only possible with joern-parse?

max-leuthaeuser commented 1 year ago

What's the output of gcc -xc -E -v /dev/null -o /dev/null on your system? Is the path to the openssl header files in that? --with-include-auto-discovery will only look at these folders.

Where are the openssl header files installed and did you provide the correct path to --include <path-to-openssl> if you used that argument?

Frontend args may be supplied like this: joern> importCode.c("/path/to/your/code", args=List("--something"))
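
For batch processing from a script rather than from the shell, the flags discussed in this thread can be assembled into command lines. The following is a minimal sketch in Python; the helper names joern_parse_command and joern_export_command are hypothetical, and only flags that appear in this thread are used.

```python
import shlex


def joern_parse_command(source_path, frontend_args=()):
    """Build a joern-parse command line; frontend flags go after --frontend-args."""
    cmd = ["joern-parse", str(source_path)]
    if frontend_args:
        cmd += ["--frontend-args", *frontend_args]
    return cmd


def joern_export_command(representation, out_dir):
    """Build a joern-export command line for one representation (e.g. "pdg")."""
    return ["joern-export", "--repr", representation, "--out", str(out_dir)]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works without Joern installed.
    print(shlex.join(joern_parse_command("test.c", ["--with-include-auto-discovery"])))
    print(shlex.join(joern_export_command("pdg", "testpdg")))
```

Each returned list can be handed to subprocess.run(cmd, check=True) once Joern is on the PATH.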

nekon02 commented 1 year ago

I believe I installed it correctly. This is the output of gcc -xc -E -v /dev/null -o /dev/null:

Using built-in specs.
COLLECT_GCC=gcc
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 11.3.0-1ubuntu1~22.04.1' --with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-11 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-11-aYxV0E/gcc-11-11.3.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-11-aYxV0E/gcc-11-11.3.0/debian/tmp-gcn/usr --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-serialization=2
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04.1)
COLLECT_GCC_OPTIONS='-E' '-v' '-o' '/dev/null' '-mtune=generic' '-march=x86-64'
 /usr/lib/gcc/x86_64-linux-gnu/11/cc1 -E -quiet -v -imultiarch x86_64-linux-gnu /dev/null -o /dev/null -mtune=generic -march=x86-64 -fasynchronous-unwind-tables -fstack-protector-strong -Wformat -Wformat-security -fstack-clash-protection -fcf-protection -dumpbase null
ignoring nonexistent directory "/usr/local/include/x86_64-linux-gnu"
ignoring nonexistent directory "/usr/lib/gcc/x86_64-linux-gnu/11/include-fixed"
ignoring nonexistent directory "/usr/lib/gcc/x86_64-linux-gnu/11/../../../../x86_64-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/lib/gcc/x86_64-linux-gnu/11/include
 /usr/local/include
 /usr/include/x86_64-linux-gnu
 /usr/include
End of search list.
COMPILER_PATH=/usr/lib/gcc/x86_64-linux-gnu/11/:/usr/lib/gcc/x86_64-linux-gnu/11/:/usr/lib/gcc/x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/11/:/usr/lib/gcc/x86_64-linux-gnu/
LIBRARY_PATH=/usr/lib/gcc/x86_64-linux-gnu/11/:/usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/11/../../../../lib/:/lib/x86_64-linux-gnu/:/lib/../lib/:/usr/lib/x86_64-linux-gnu/:/usr/lib/../lib/:/usr/lib/gcc/x86_64-linux-gnu/11/../../../:/lib/:/usr/lib/
COLLECT_GCC_OPTIONS='-E' '-v' '-o' '/dev/null' '-mtune=generic' '-march=x86-64'
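
The include directories in a report like the one above sit between the "#include <...> search starts here:" line and "End of search list.". As a rough sketch, they can be pulled out with a few lines of Python; the function names parse_include_dirs and gcc_include_dirs are hypothetical, and the parser assumes the GCC report layout shown here.

```python
import subprocess


def parse_include_dirs(verbose_output):
    """Return the <...> include search directories from a `gcc -v` report."""
    dirs = []
    in_list = False
    for line in verbose_output.splitlines():
        if line.startswith("#include <...> search starts here:"):
            in_list = True
        elif line.startswith("End of search list."):
            break
        elif in_list and line.strip():
            dirs.append(line.strip())
    return dirs


def gcc_include_dirs(compiler="gcc"):
    """Run the command from this thread and parse its report (gcc prints it on stderr)."""
    result = subprocess.run(
        [compiler, "-xc", "-E", "-v", "/dev/null", "-o", "/dev/null"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_include_dirs(result.stderr)
```

For the report above this yields the four directories listed, including /usr/include, which is the parent of /usr/include/openssl.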

The OpenSSL header files are in /usr/include/openssl, so I tried using that path with --include, but I still get the same result with

joern-parse 177740.c --frontend-args --include /usr/include/openssl/

I think this might be because the source code in the dataset is from an older version (2018~).

max-leuthaeuser commented 1 year ago

To check that you could have a look into /usr/include/openssl/. Grep for the defines that are missing and see if they are there.

If so, it should be sufficient to run c2cpg with --with-include-auto-discovery as /usr/include is available.
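
The grep check suggested above can also be scripted when many macros need to be looked up. This is a small sketch, assuming the headers are plain text files ending in .h; the function name find_defines is hypothetical.

```python
import re
from pathlib import Path


def find_defines(root, macro):
    """Return (path, line number) pairs for every `#define <macro>` under root."""
    pattern = re.compile(rf"^\s*#\s*define\s+{re.escape(macro)}\b")
    hits = []
    for path in sorted(Path(root).rglob("*.h")):
        try:
            text = path.read_text(errors="replace")
        except OSError:
            continue  # unreadable entry (broken symlink, directory named *.h, ...)
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.match(line):
                hits.append((str(path), lineno))
    return hits


if __name__ == "__main__":
    # Header directory taken from this thread; an empty list means the define is not there.
    print(find_defines("/usr/include/openssl", "TSRMLS_DC"))
```

An empty result for a macro means no header under that directory defines it, so neither --include nor --with-include-auto-discovery can resolve it from there.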

nekon02 commented 1 year ago

It seems that this C header content is no longer in OpenSSL, as I used grep to search for TSRMLS_DC and could not find it. I think the best solution for now is to add the #define TSRMLS_DC , TSRMLS_D to the source code directly. Thank you very much.
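
Adding the missing defines by hand does not scale to a whole dataset, so the workaround could be scripted. The sketch below prepends stub defines only for macros a snippet uses but does not define. The first two stubs are the ones quoted earlier in this thread; the TSRMLS_C / TSRMLS_CC pair is an assumption, added because the snippet also uses TSRMLS_CC. To my understanding these macros come from PHP's thread-safety layer rather than from OpenSSL, which would explain why grep finds nothing under /usr/include/openssl. The function name prepend_stub_defines is hypothetical.

```python
import re

# Stub macros, listed so that a macro appears before the macros whose body uses it.
# TSRMLS_D / TSRMLS_DC are quoted in this thread; TSRMLS_C / TSRMLS_CC are assumed.
STUBS = [
    ("TSRMLS_D", "void ***tsrm_ls"),
    ("TSRMLS_DC", ", TSRMLS_D"),
    ("TSRMLS_C", "tsrm_ls"),
    ("TSRMLS_CC", ", TSRMLS_C"),
]


def prepend_stub_defines(source, stubs=STUBS):
    """Return source with #define lines for stub macros it uses but does not define."""
    needed = []
    text = source  # grows with selected stub bodies so their dependencies are found too
    for name, body in reversed(stubs):
        used = re.search(rf"\b{re.escape(name)}\b", text)
        defined = re.search(
            rf"^[ \t]*#[ \t]*define[ \t]+{re.escape(name)}\b", source, re.M
        )
        if used and not defined:
            needed.append((name, body))
            text += "\n" + body
    needed.reverse()  # emit dependencies before the macros that use them
    header = "".join(f"#define {name} {body}\n" for name, body in needed)
    return header + source
```

Note that the prepended lines shift every line number in the resulting graph by the number of defines added, which matters if line numbers are later matched against dataset labels.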

I also have a question regarding joern-export. As I want to extract the graph representation and use it for machine learning in Python, is there a way to export only the main method (e.g. only the AST of asn1_time_to_time_t)?

max-leuthaeuser commented 1 year ago

Maybe https://docs.joern.io/export/ helps?

nekon02 commented 1 year ago

Thank you for the link. I tried to follow the method described there, but I still cannot get the result I expect. For example, with this example code, test.c:

int myfunc(int b)
{
    int a = 42;
    if (b > 10) {
        foo(a);
    }
    bar(a);
}

and running joern-parse test.c followed by joern-export --repr pdg --out testpdg, I get 6 PDG files. Is there a way to get only the PDG for "myfunc", or to show which method each graph belongs to in the file name?
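
One possible approach on the Python side is to read the graph name from each exported DOT file and keep only the file for the wanted method. This is a sketch under an assumption: that exported files start with the same digraph "<name>" { header as the dotAst/dotPdg output shown earlier in this thread, with the method name HTML-escaped. The extra files are likely graphs for other method nodes such as <global> and stubs for callees and operators. The function names graph_name and dot_files_for_method are hypothetical. In the interactive shell, filtering before rendering, e.g. cpg.method.name("myfunc").dotPdg.l, should give a similar result.

```python
import html
import re
from pathlib import Path


def graph_name(dot_text):
    """Return the unescaped name from a leading `digraph "<name>"` header, or None."""
    match = re.match(r'\s*digraph\s+"([^"]*)"', dot_text)
    return html.unescape(match.group(1)) if match else None


def dot_files_for_method(export_dir, method_name):
    """Return the .dot files in export_dir whose graph name equals method_name."""
    matches = []
    for path in sorted(Path(export_dir).glob("*.dot")):
        if graph_name(path.read_text(errors="replace")) == method_name:
            matches.append(path)
    return matches


if __name__ == "__main__":
    # "testpdg" is the output directory used in the command above.
    for path in dot_files_for_method("testpdg", "myfunc"):
        print(path)
```

Because the match is on the graph name rather than the file name, the same helper can be used to store each graph in the dataset row of the function it belongs to.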