joernio / joern

Open-source code analysis platform for C/C++/Java/Binary/Javascript/Python/Kotlin based on code property graphs. Discord https://discord.gg/vv4MH284Hc
https://joern.io/
Apache License 2.0
2.06k stars 284 forks source link

[Bug] Type loss/erasure in compiled domain classes severely hurts interoperability #4796

Open tajobe opened 3 months ago

tajobe commented 3 months ago

Describe the bug When invoking joern as a library (like in joernio/standalone-ext), the loss of type information across many domain classes breaks JVM-langauge interop, particularly with the switch to the EMT to provide property/traversal implicits.

For example, the generated Method node is compiled as (from javap, removing the fully qualified names for readability):

public class Method extends StoredNode implements MethodBase, CfgNode, Declaration

Because it doesn't have StaticType[MethodEMT] when compiled, it lacks all of those traits like HasFilenameEMT or HasFullNameEMT. This means there's no way for a Method node to satisfy the type constraints on traversals like

public final <NodeType extends StoredNode & StaticType<HasFullNameEMT>> Iterator<String> fullName$extension(Iterator);

There are other losses of type information, not exclusive to the EMTs.

For example, AstNode becomes

public interface AstNode extends AstNodeBase

...lacking StoredNode, and thus lacking the .file implicit/extension.

Also, there are several cases where traversals' types get widened from Iterator<Int> or Iterator<Boolean> to Iterator<Object>:

  public Iterator<Object> offsetEnd(); // should return Iterator<int>...
  public Iterator<NodeType> offsetEnd(int); // param correctly an int
  public Iterator<NodeType> offsetEnd(Seq<Object>); // param is a seq of Objects?

To Reproduce Steps to reproduce the behavior:

  1. Look at classes in codepropertygraph-domain-classes_3-1.7.1.jar or generate classes using the flatgraph domain-classes-generator and compile
  2. Use javap or an IDE which decompiles the classes to check signatures/types

Expected behavior Enough information is available in the compiled classes to be able to reasonably use the nodes/traversals/etc

Screenshots N/A

Desktop (please complete the following information): Should be irrelevant here but:

Additional context Add any other context about the problem here.

bbrehm commented 3 months ago

That is a good observation: The new way the APIs are set up suck when used from java.

To give a little context on why we set it up this way:

It is very much intended that a lot of type info is erased. This is because all nodes are ultimately just GNode, i.e. contain a reference to the graph, and a couple of indices to access their fields; both the class pointer and the graph pointer are, ultimately, convenience and the proper encoding of nodes would be Long.

Now suppose you are in a situation where you need to access e.g. cfgNext of CFG nodes -- and the dynamic type (class) of the node is highly variable, as typical when doing a CFG traversal. The old overflowdb had member functions for this, which means dynamic (virtual) dispatch with megamorphic callsite.

The way we set this up, there is no dynamic dispatch depending on the node type. Yay!

=================================

Do I understand your situation right that you want to use joern as a library from java code?

I think the appropriate thing would be for us to have a chat and figure out nice java bindings / a java API? We could generate them along with the scala domain classes.

Is there any small sample code how you use joern from java that you could share?

The way I imagine it, this would be near unusable, due to the prevalence of extension methods / implicit conversions. I.e. most of our APIs look like someTraversal.someComplexStep(), instead of SomeStaticClass.someComplexStep(someTraversal).

This "use dot-access syntax instead of free-standing functions" style is also not my favorite, but it is where we are, for better or worse.

=================================

public Iterator<NodeType> offsetEnd(Seq<Object>); // param is a seq of Objects?

For some unfathomable reason, scala3 decides that Seq[Int] should translate to Seq<Object> instead of Seq[java.lang.Integer]. This is painful! I hate it! It has bitten us as well at some points.

Any ideas @mpollmeier ?

tajobe commented 3 months ago

Thanks for the explanation!

We are actually using Kotlin, though I think Java bindings would be great/better to ensure broad compatibility. Generally we have been wrapping the scala implicits with kotlin extensions (with some intention to eventually codegen this), but there are cases where this is actually just impossible to have compiler checking, eg <NodeType extends StoredNode & StaticType<HasFullNameEMT>> where no node can match the type bounds because no nodes have the StaticType<HasFullNameEMT>. We've had to suppress:

@Suppress(
    "BOUNDS_NOT_ALLOWED_IF_BOUNDED_BY_TYPE_PARAMETER",
    "TYPE_MISMATCH",
    "INCONSISTENT_TYPE_PARAMETER_BOUNDS",
    "UPPER_BOUND_VIOLATED",
    "RETURN_TYPE_MISMATCH",
    "TYPE_INFERENCE_EXPECTED_TYPE_MISMATCH",
    "TYPE_INFERENCE_INCORPORATION_ERROR",
    "ARGUMENT_TYPE_MISMATCH",
    "NEW_INFERENCE_NO_INFORMATION_FOR_PARAMETER",
)

And now we can call it on any T extends StoredNode without compiler warnings/errors (which is a bummer), eg

fun <T : StoredNode> Iterator<T>.fullName(): Iterator<String> = `TraversalPropertyFullName$`.`MODULE$`.`fullName$extension`(this)

There is also TraversalMethodBase$.MODULE$.fullName$extension(this) which gives us the type checking, but it been difficult to use these across the board for all nodes and properties (esp. with things like AstNode not being a StoredNode and some other type loss), so we've had to use the property traversals rather than the node traversals (which would be more convenient to not need to duplicate extensions if the types were there).


Here are some more samples for some context.

Set of extension wrappers:

// <editor-fold desc="AccessNeighborsForMethodTraversal extension implicits">
fun Iterator<Method>.cfgFirst(): Iterator<CfgNode> = `AccessNeighborsForMethodTraversal$`.`MODULE$`.`cfgFirst$extension`(this)
fun Iterator<Method>.methodReturn(): Iterator<MethodReturn> = `AccessNeighborsForMethodTraversal$`.`MODULE$`.`methodReturn$extension`(this)
fun Iterator<Method>.literal(): Iterator<Literal> = `AccessNeighborsForMethodTraversal$`.`MODULE$`.`literal$extension`(this)
fun Iterator<Method>.parameter(): Iterator<MethodParameterIn> = `AccessNeighborsForMethodTraversal$`.`MODULE$`.`parameter$extension`(this)
fun Iterator<Method>.block(): Iterator<Block> = `AccessNeighborsForMethodTraversal$`.`MODULE$`.`block$extension`(this)
fun Iterator<Method>.astIn(): Iterator<AstNode> = `AccessNeighborsForMethodTraversal$`.`MODULE$`.`astIn$extension`(this)
fun Iterator<Method>.astOut(): Iterator<AstNode> = `AccessNeighborsForMethodTraversal$`.`MODULE$`.`astOut$extension`(this)
fun Iterator<Method>.callIn(): Iterator<Call> = `AccessNeighborsForMethodTraversal$`.`MODULE$`.`callIn$extension`(this)
fun Iterator<Method>.cfgOut(): Iterator<AstNode> = `AccessNeighborsForMethodTraversal$`.`MODULE$`.`cfgOut$extension`(this)
fun Iterator<Method>.containsIn(): Iterator<AstNode> = `AccessNeighborsForMethodTraversal$`.`MODULE$`.`containsIn$extension`(this)
fun Iterator<Method>.containsOut(): Iterator<CfgNode> = `AccessNeighborsForMethodTraversal$`.`MODULE$`.`containsOut$extension`(this)
fun Iterator<Method>.dominateOut(): Iterator<CfgNode> = `AccessNeighborsForMethodTraversal$`.`MODULE$`.`dominateOut$extension`(this)
fun Iterator<Method>.postDominateIn(): Iterator<CfgNode> = `AccessNeighborsForMethodTraversal$`.`MODULE$`.`postDominateIn$extension`(this)
fun Iterator<Method>.reachingDefOut(): Iterator<CfgNode> = `AccessNeighborsForMethodTraversal$`.`MODULE$`.`reachingDefOut$extension`(this)
fun Iterator<Method>.refIn(): Iterator<StoredNode> = `AccessNeighborsForMethodTraversal$`.`MODULE$`.`refIn$extension`(this)
fun Iterator<Method>.sourceFileOut(): Iterator<File> = `AccessNeighborsForMethodTraversal$`.`MODULE$`.`sourceFileOut$extension`(this)
fun Iterator<Method>.taggedByOut(): Iterator<Tag> = `AccessNeighborsForMethodTraversal$`.`MODULE$`.`taggedByOut$extension`(this)
// </editor-fold>

...and here's a snippet of one of our "parsers" illustrating the usage of our extension wrappers which is attempting to build out a route tree from Spring annotations

    /**
     * Build route tree
     *
     * Hierarchy is composed of:
     * 1. (Optional) Global prefix via application.properties -> context path
     *   - uses first application.properties found, first looking for one with `src/main` in the path
     * 2. (Optional) Controller prefixes (via RequestMapping on Controllers)
     * 3. Endpoint routes (Via *Mapping Spring web annotations
     */
    override fun buildRouteTree(cpg: Cpg): PathNode =
        pathNode {
            val firstAppProperties = resolveAppPropertiesFile(cpg)

            path = firstAppProperties?.getContextPath().orEmpty()

            // build routes for controllers
            cpg.annotation()
                .where(Iterator<Annotation>::typeDecl)
                .fullNameExact(
                    "org.springframework.stereotype.Controller",
                    "org.springframework.web.bind.annotation.RestController",
                ).typeDecl()
                .foreach { controllerRoutes(it) }
        }
bbrehm commented 3 months ago

Thanks for the examples!

Some minor thoughts on the issue:

  1. public interface AstNode extends AstNodeBase -- the issue is that java StoredNode is a class, so in java the interface AstNode cannot extend it. The appropriate type you want is to StoredNode with AstNode. I'm not sure how well kotlin supports type aliases?
  2. If you wrap in this style anyways, you can insert casts for moving Seq<Object> -> Seq<Integer> and the like, to work around the terrible Seq[Int] -> Seq<Object> lowering by the scala compiler.
  3. The StaticType[...] types: They are fundamentally needed for ad-hoc extensions. To give an example, look at the Assigment / Operator extensions. The general idea is that you can make more fine-grained distinctions in the static type system, and don't need them to be part of the original model / class system: You can have Call with StaticType[Assignment] as a static type, even though the clazz doesn't know about it. So this is not a type that is runtime-checkable. Nobody prevents you from having a node: StoredNode and writing node.asInstanceOf[StoredNode with StaticType[HasLineNumberEMT]].lineNumber -- if your node happens to be e.g. a FINDING, then it has no line number. This will nevertheless not throw with a classcast exception since everything is erased, instead you will get None.

I think you can get away with pretty OK wrappers even without the StaticType. This is because in joern, for every property access that is valid in the type system, there is always one unique reason why the access is valid. Say x.name: This works on Declaration and on CallRepr. So you can have one extension method for Declaration and another one for CallRepr and be happy.

We cannot do that on the commercial shiftleft / qwiet side: Our proprietary extensions to the domain classes have clashes. To give an example: We have an abstraction LocalLike that guarantees name and encompasses (i.e. is extended by) both Identifier and Local, among others. Now if we simply added extension methods at all points where the name property is guaranteed, we would get an ambiguity error on Local: This is both a Declaration and a LocalLike, and both provide the name property.

However, I don't think we are willing to commit to preserving that nice "everything for a single reason" property of the open-source joern / codepropertygraph domain schema in the future -- or what do you think @mpollmeier @fabsx00 ?

How would you @tajobe feel about generating kotlin wrappers in the flatgraph codegen? I think you would be stuck writing that code, but if you write good unit tests we can probably maintain it.

If you want to get fancy, we might also go for macro / annotation based wrappers for non-autogenerated code (i.e. traversal steps / extensions in e.g. semanticcpg)?

mpollmeier commented 3 months ago

Thank you for bringing up this very valid point!

My current thoughts on this: generating additional wrappers for other languages or macros etc. sounds like additional complexity and maintenance burden I'd very much prefer to avoid.

The StaticType trick with erased marker traits is very neat, but it only works for Scala and is only used in a few (albeit useful and showcasey-cool) cases. I haven't tried this (and also not properly thought this through), but why don't we move the marker traits for the schema-defined types back to the non-erased traits, i.e. instead of Method extends StoredNode with StaticType[HasFullNameEMT] we do this, like we did in overflowdb: Method extends StoredNode with HasFullName

And then the code generator could define extension methods for both HasFullName and StaticType[HasFullNameEMT] that define the .fullName steps.

That way, we have these marker traits available in the bytecode, and still support the neat usecase for ad-hoc extensions. Thoughts?

tajobe commented 2 months ago

@mpollmeier That would certainly improve the situation for us. I do think there's probably some case for Java bindings for models/traversals for broad JVM interoperability (there are still issues around scala's type system, implicit parameters and functions, etc), but there is certainly added complexity to building and maintaining that.

If the types situation can be improved, we'll likely go the route of code-genning our own wrapper extensions rather than forking the schema/domain classes generator entirely. I'm thinking about making it more generic, eg a scala-kotlin-extension-generator that finds static ...$exension(T $this, ...) functions and generates a Kotlin extension for it (with some way to inject behavior/overrides for places where the generated function might not be compilable or usable). I'll try to follow-up here with more once we have something to show.