joernio / joern

Open-source code analysis platform for C/C++/Java/Binary/Javascript/Python/Kotlin based on code property graphs. Discord https://discord.gg/vv4MH284Hc
https://joern.io/
Apache License 2.0
2.1k stars, 288 forks

Poor CPG representation #5098

Open llooFlashooll opened 3 days ago

llooFlashooll commented 3 days ago

Hi folks, I really appreciate your work, and I realize that Joern depends heavily on this package.

However, I find this CPG representation poor. For example, here is what I ran into in practice:

I want to implement heavy static analysis based on your work, but I got stuck at an early stage. If you could offer some suggestions, I would really appreciate it! Thank you very much!

For comparison, see https://github.com/Fraunhofer-AISEC/cpg, which supports further static analysis.

max-leuthaeuser commented 1 day ago

  • This package cannot neatly return a tree for the CFG such as cpg.cfgNode.toList. If we want the tree, we need to select a starting node and then perform a DFS/BFS to get a tree on our own;

Joern’s CFG representation is designed to model program flow as it actually occurs in software, which is inherently not tree-like. CFGs are directed graphs that often contain cycles due to loops and other control flow constructs, so they naturally resist a strict hierarchical tree structure. Instead, Joern provides flexible traversal methods that give you control over how to walk the CFG, including DFS, BFS, and other strategies that suit different analysis goals. Joern’s API is modular, so you can build custom traversals for the specific needs of your analysis.
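To illustrate why a tree only appears once you pick a start node and a traversal order, here is a minimal sketch in plain Scala. It does not use the Joern API: the toy graph, its node names, and the helper `dfsTree` are all hypothetical. It walks a small cyclic CFG depth-first and keeps only the edges that discover a new node.

```scala
import scala.collection.mutable

// Toy CFG for a loop: entry -> cond -> body -> cond (back edge), cond -> exit.
// Node names and this adjacency map are hypothetical, not Joern API.
val cfg: Map[String, List[String]] = Map(
  "entry" -> List("cond"),
  "cond"  -> List("body", "exit"),
  "body"  -> List("cond"), // back edge that makes the graph cyclic
  "exit"  -> Nil
)

// Depth-first walk from `root`. Returns the tree edges (parent -> child)
// in discovery order. Edges into already-visited nodes are skipped,
// which is what turns the cyclic graph into a tree.
def dfsTree(graph: Map[String, List[String]], root: String): List[(String, String)] = {
  val visited = mutable.Set(root)
  val edges   = mutable.ListBuffer.empty[(String, String)]

  def visit(node: String): Unit =
    graph.getOrElse(node, Nil).foreach { next =>
      if (visited.add(next)) { // add returns false if `next` was already seen
        edges += (node -> next)
        visit(next)
      }
    }

  visit(root)
  edges.toList
}

val tree = dfsTree(cfg, "entry")
```

The back edge from `body` to `cond` is dropped, so the result depends on the chosen root and visiting order. That is why a CFG has no single canonical tree. In Joern, the successor lookup would come from the CFG's own edges rather than from a hand-written map.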

  • The AST representation is too simple. Compared to the Python native AST library, this package only supports limited syntaxes. For instance, if there is an assignment statement, this package cannot return the left and right hand side of the statement.

Joern’s AST representation prioritizes cross-language support and compatibility with its CPG model. This model emphasizes program structure in a way that generalizes across languages, which sometimes requires trading away language-specific AST details. For example, assignments are represented as atomic operations to normalize them across languages, especially for cross-language analysis tasks. Joern’s CPG model includes both data flow and control flow edges, which lets you distinguish between the LHS and RHS of an assignment by exploring data dependencies rather than by syntactic parsing. By combining Joern’s AST with data flow and control flow analysis, you can run rich, multi-level queries that go beyond what a traditional AST alone would provide. Pure AST-based queries and traversals are also available.

  • The traversal on CPG is not good. This package just dumps all the things without enough logical connection on any representation such as AST, CFG, call graph.

Quite the opposite is true. Joern’s CPG model is intentionally designed as an interconnected graph that combines AST, CFG, call graphs, and data flow into a unified structure. This model is actually one of Joern’s key advantages, as it allows traversing across these different representations in a single query. If you wish to isolate specific components, like only an AST or only a CFG, Joern provides API calls to retrieve each representation individually or combined as required. The flexibility of Joern’s DSL and the modular CPG design means you can perform specific queries across AST, CFG, and data flows, going beyond the capabilities of isolated graphs.

DavidBakerEffendi commented 1 day ago

Just to add to what @max-leuthaeuser says:

This package cannot neatly return a tree for the CFG such as cpg.cfgNode.toList. If we want the tree, we need to select a starting node and then perform a DFS/BFS to get a tree on our own;

CFGs aren't rooted trees, as they can have loops. In any case, something like this can trace a simple path in the CFG:

cpg.method.cfgNode.enablePathTracking.simplePath.map(x => x.label -> x.code).toList

There is an open PR on these traversals lifted from the graph database's source code: https://github.com/joernio/flatgraph-docs/pull/2

For instance, if there is an assignment statement, this package cannot return the left and right hand side of the statement.

Have a look at the Assignment trait. A call that is an assignment operator can be cast via the .assignment step; thereafter you can access the target (LHS) and source (RHS) steps. E.g., cpg.method("foo").assignment.target.l will give you the LHS of every assignment in a method called "foo".
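As a rough sketch of the steps mentioned above, the following is meant to be pasted into the Joern shell, not run standalone. It assumes a CPG is already loaded and bound to `cpg`, and that it contains a method named "foo". The pairing via `.map` and `.code` is an assumption about how the target and source nodes can be combined.

```scala
// Joern shell snippet: assumes a loaded CPG bound to `cpg` that contains
// a method named "foo" (name taken from the example above).

// Left-hand sides of all assignments in foo
val lhs = cpg.method("foo").assignment.target.l

// Right-hand sides of the same assignments
val rhs = cpg.method("foo").assignment.source.l

// Assumption: each assignment node exposes target/source directly, so the
// two sides can be paired as source text per assignment.
val pairs = cpg.method("foo").assignment.map(a => a.target.code -> a.source.code).l
```

Querying each assignment once and reading both sides from it keeps the LHS and RHS aligned, which two separate lists do not guarantee if you filter one of them later.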


That said, I do think the AISEC work has merit. Last I checked, they were exploring novel work that we simply don't do, such as weighted pushdown systems (maybe even typestate analysis); however, I could not get that running when I pulled the project locally.

It is largely aimed at research, so scalability and practicality are not at the forefront of their concerns. For example, they use Neo4j as their storage backend, which imposes an inherent resource limitation compared to this package's backend, flatgraph.