veracitylab / provenance-injector

inject provenance into JEE applications
Apache License 2.0
0 stars 0 forks source link

Handle `EntityRef.THIS` entity sources for tracked entity creation #28

Closed wtwhite closed 1 month ago

wtwhite commented 1 month ago

Currently we assume that whenever a tracked entity creation occurs, the entity target must be the method's return value (EntityRef.RETURN), which is picked up by injecting a call to captureTarget() before all non-void RETURN JVM instructions (ARETURN, etc.). But this is not necessarily the case -- e.g., constructors (which have method name <init>) are actually passed in the freshly-allocated this as the 0-th parameter like a regular non-static method and have a void return, despite the x = new Foo() syntax suggesting otherwise.

This only considers entity creation for now, since entity propagation is not yet implemented: #27. Will involve changes to approv because of https://github.com/veracitylab/approv/issues/2.

Needed for #26.

wtwhite commented 1 month ago

A related wrinkle is that recordActivity(), which is called not only to record an activity but also to association an entity with it, currently also assumes that the relevant target is stored in the tracked method's return value. I think sorting this out properly probably requires a redesign, or more time for me to understand what the intended semantics actually are, and how they differ from the current implementation, as the Right Thing to do isn't clear to me:

  1. Add target and targetIndex fields to ActivityMappings, just like for EntityMappings? But this appears to (a) duplicate most of what's in EntityMapping (so why have separate ActivityMappings at all?) and also (b) it's tricky to inject bytecode to handle EntityRef.ARG at the callsite (since TTBOMK there's no random access into the stack, and calling a special utility method beforehand to make arguments them accessible as local variables erases them from the stack).
  2. Get the right location of the target by looking at corresponding EntityMapping entries, which is possible now that EntityCreations also keep track of target and targetIndex? This seems nice, and avoids duplication. But it again calls into question why ActivityMappings exist at all, since everything can now be done with just EntityMappings.
  3. Something else?

However, provided we only want to deal with activities that either return their associated entity (as is already handled) or create it using a constructor (i.e., if we don't care about EntityRef.ARG), this is moot: the JVM happens to handles both cases in such a way that the stack looks the same after the method or constructor returns, i.e., the desired target object is on the top of the stack either way. IOW, we get lucky here and don't need to fix this properly just yet 🤪

wtwhite commented 1 month ago

First attempt to collect this (f42a1e006fd6d579debbcd4e336b9e8b440ddeba) fails JVM verification because we try to do it too early, before it's been initialised (which I thought would be fine since the address wouldn't change):

--snip--
java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    org/apache/hc/client5/http/classic/methods/HttpGet.<init>(Ljava/net/URI;)V @11: invokestatic
  Reason:
    Type uninitializedThis (current frame, stack[0]) is not assignable to 'java/lang/Object'
  Current Frame:
    bci: @11
    flags: { flagThisUninit }
    locals: { uninitializedThis, 'java/net/URI' }
    stack: { uninitializedThis, 'java/lang/String' }
  Bytecode:
    0000000: 1226 1228 2bb8 002e 2a12 28b8 0034 2a12
    0000010: 022b b700 03b1                         

    at nz.ac.canterbury.dataprovenancedemo.controllers.LibraryController.getHtmlUsingApacheHttpClient(LibraryController.java:190) ~[classes!/:na]
    at nz.ac.canterbury.dataprovenancedemo.controllers.LibraryController.rateMovie(LibraryController.java:167) ~[classes!/:na]
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:na]
--snip--
wtwhite commented 1 month ago

This might be tricky -- I expect that the JVM won't update the type of the object from uninitializedThis to the final object type until the constructor returns, which means that attempting to capture it just before a RETURN will still be "too early". That's one way to interpret the relevant part of the JVM spec, and, e.g., this warns against doing so.

But if that were the case, then the following class should fail verification:

public class SomeClass {
    public SomeClass() {
        System.out.println("SomeClass ctor");
        aMethod(this);
        System.out.println("Returned from aMethod(this).");
    }

    public static void aMethod(SomeClass s) {
        System.out.println("aMethod(" + s + ") called.");
    }
}

But it compiles fine (relevant javap -v -p output):

  public org.example.SomeClass();
    descriptor: ()V
    flags: (0x0001) ACC_PUBLIC
    Code:
      stack=2, locals=1, args_size=1
         0: aload_0
         1: invokespecial #1                  // Method java/lang/Object."<init>":()V
         4: getstatic     #2                  // Field java/lang/System.out:Ljava/io/PrintStream;
         7: ldc           #3                  // String SomeClass ctor
         9: invokevirtual #4                  // Method java/io/PrintStream.println:(Ljava/lang/String;)V
        12: aload_0
        13: invokestatic  #5                  // Method aMethod:(Lorg/example/SomeClass;)V
        16: getstatic     #2                  // Field java/lang/System.out:Ljava/io/PrintStream;
        19: ldc           #6                  // String Returned from aMethod(this).
        21: invokevirtual #4                  // Method java/io/PrintStream.println:(Ljava/lang/String;)V
        24: return

and runs fine, so maybe everything is OK verification-wise after the nested call to the java.lang.Object constructor completes. In that case, it should be possible to safely grab this with an aload_0 just before each return 🤞

wtwhite commented 1 month ago

Finally with 73953e7a124bc6cb3d2792dcefa2187a86b9a2c2 we are seeing the URL passed to HttpGet's constructor get pulled down in the provenance data as associated entity data, although for some reason it appears twice with different IDs:

wtwhite@wtwhite-vuw-vm:~$ curl -i http://localhost:8080/prov/26
--snip--
    {
        "activities": [
            {
                "id": "7f6af131-d727-4573-aa02-1f3e77250bc1",
                "type": "MakeHttpRequestApacheNew",
                "endTime": "2024-05-29T09:40:48.178480Z"
            }
        ],
        "associatedEntities": [
            {
                "id": 589882250,
                "type": "HttpRequestApacheGet",
                "value": "https://app.veracity.homes/omar-notifications-main-menu.html?name=jens&movieId=35&stars=4",
                "target": {
                    "headers": [],
                    "method": "GET",
                    "path": "/omar-notifications-main-menu.html?name=jens&movieId=35&stars=4",
                    "scheme": "https",
                    "authority": {
                        "userInfo": null,
                        "port": -1,
                        "hostName": "app.veracity.homes"
                    },
                    "version": null,
                    "requestUri": "/omar-notifications-main-menu.html?name=jens&movieId=35&stars=4",
                    "entity": null,
                    "cancelled": false,
                    "config": null,
                    "aborted": false,
                    "uri": "https://app.veracity.homes/omar-notifications-main-menu.html?name=jens&movieId=35&stars=4"
                }
            },
            {
                "id": 1082006511,
                "type": "HttpRequestApacheGet",
                "value": "https://app.veracity.homes/omar-notifications-main-menu.html?name=jens&movieId=35&stars=4",
                "target": {
                    "headers": [],
                    "method": "GET",
                    "path": "/omar-notifications-main-menu.html?name=jens&movieId=35&stars=4",
                    "scheme": "https",
                    "authority": {
                        "userInfo": null,
                        "port": -1,
                        "hostName": "app.veracity.homes"
                    },
                    "version": null,
                    "requestUri": "/omar-notifications-main-menu.html?name=jens&movieId=35&stars=4",
                    "entity": null,
                    "cancelled": false,
                    "config": null,
                    "aborted": false,
                    "uri": "https://app.veracity.homes/omar-notifications-main-menu.html?name=jens&movieId=35&stars=4"
                }
            }
        ]
    },
--snip--
wtwhite commented 1 month ago

The problem is caused by the fact that we call an HttpGet constructor that takes a String, which itself builds a URI from that and calls another constructor that takes a URI, and captureTarget() gets called on both. The return values of the two nested constructor calls are the same object, so even though recordActivity() only gets called after the top-level constructor returns (because we don't inject them for calls between methods in the same class), it finds two copies in the AssociationCache's cache somehow.

Changing the top-level constructor call to the version that takes a URI directly eliminates the problem:

--snip--
    {
        "activities": [
            {
                "id": "a2f3e78c-ad67-4122-8ddb-8365235acf27",
                "type": "MakeHttpRequestApacheNew",
                "endTime": "2024-05-29T10:01:00.623219Z"
            }
        ],
        "associatedEntities": [
            {
                "id": 133914510,
                "type": "HttpRequestApacheGet",
                "value": "https://app.veracity.homes/omar-notifications-main-menu.html?name=jens&movieId=47&stars=4",
                "target": {
                    "headers": [],
                    "method": "GET",
                    "path": "/omar-notifications-main-menu.html?name=jens&movieId=47&stars=4",
                    "scheme": "https",
                    "authority": {
                        "userInfo": null,
                        "port": -1,
                        "hostName": "app.veracity.homes"
                    },
                    "version": null,
                    "requestUri": "/omar-notifications-main-menu.html?name=jens&movieId=47&stars=4",
                    "entity": null,
                    "cancelled": false,
                    "aborted": false,
                    "config": null,
                    "uri": "https://app.veracity.homes/omar-notifications-main-menu.html?name=jens&movieId=47&stars=4"
                }
            }
        ]
    },
--snip--

I'm not yet sure what the best way to solve this in general would be, but for now we can just use this hack. (A different workaround would be to add a "descriptor": "(Ljava/lang/String;)V" field to the entity-mapping spec for HttpGet.<init> in bind-jdbc.json to only capture a target for the top-level constructor call, but then if we were to call the URI constructor, the URI argument for this call would not be tracked.)

wtwhite commented 1 month ago

The reason 2 copies of the HttpGet object get stored in the AssociationCache's targetsToEntitiesCache when the constructor that takes a String is called is because each entity has an id, which is always set to the hash value of the parameter used to create that entity -- and that hash value will be different for the 2 HttpGet constructor calls, since one is a String and the other is a URI, even though they both return the same object!