Call Graph seems to be missing a lot of "obvious" entries?

ChuckFoo commented 6 years ago

Whilst debugging problems in some code that someone else wrote that uses Soot to do some analysis, I discovered what appears to be a fair number of "obvious" methods missing from the CallGraph (and of course ReachableMethods). And by "obvious" I mean if I look at all of the ReachableMethods bodies for InvokeExpr and check those methods, a fair number of those methods in some of my test cases end up not being in ReachableMethods themselves...

I began digging into the Spark code to try and figure out why, but that code is fairly complex so I figured I should probably write an issue and provide code & a test case so someone more familiar with Soot internals than I am can take a likely more effective look as well. :)

The code (analysis pass and simple main driver) is below, and I ran it (using a recent Java 8 under a moderately up to date Ubuntu 16.04) thusly against the jEdit 5.4.0 installer (a convenient simple test case) like so:

java -cp ~/Downloads/sootclasses-trunk-jar-with-dependencies-20180323.jar:. test.CGTestMain -w -app -p cg verbose:true -p cg.spark enabled:true,verbose:true -verbose -debug -soot-class-path ~/Downloads/jedit5.4.0install.jar:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jce.jar -process-dir ~/Downloads/jedit5.4.0install.jar >& cgtest.txt

And I got 2502 "directly missed" InvokeExpr methods (11 of which are definitely from the application), and a similar amount "indirectly missed" calls (the latter being InvokeExpr called from the "missing" methods directly called from known ReachableMethods, and less surprising to have been missed because of that). Doing a similar run against the actual installed jedit.jar file gave similar results (and showed me that I needed more error handling in the code).

I have a workaround in the form of a variant of the (simplified) code I'm using to show the problem here (and I needed to do something similar to propagate some other info I needed to work with anyways) but wanted to report this and see if someone more knowledgeable could figure out what is going on here and if it's a real problem in Soot or not, since I imagine a lot of people could benefit from a more complete/correct CallGraph.

Thanks! Chuck

Here's the code, assuming this attachment works (github didn't like me trying to upload the raw Java files so I had to zip them up): cgtest.zip

ericbodden commented 6 years ago

Hi. This is likely not a bug but a misuse on your side and therefore this is not the best place to discuss this. I'd rather recommend using our mailing list of StackOverflow.

You did use -p cg.spark verbose:true What output did Soot generate on the command line? That should help you understand the reasons for the effect you are seeing?

ChuckFoo commented 6 years ago

Hi Eric, I tired the Soot mailing list a while back when I first started investigating oddities with the code I was given to work with, and I didn't get any response at the time (although back then my misunderstanding of what was happening made me think that the Jimple bodies weren't being generated for some methods prior to whole program analysis phases firing, which turned out to be wrong).

So I tried digging through Soot (and the code calling it) by myself for quite a while (not continuously though, been revisiting off & on as I left and returned to working with this particular project), and didn't get a lot of traction with figuring out what was happening (either through turning up the verbosity or digging through the Spark code), which is why I figured I should just write an issue now. And much like with the last issue I reported, I tried to provide a code snippet and simple example to run on to make it easy to reproduce by others who understand the inner workings of Soot better than I.

I suppose that it might be a "misuse" on my side (there certainly was some of that in the code I was given to work with initially and I very well may have made some incorrect assumptions about how things are supposed to work in Soot as well with the changes I've made since then), and I'd be fine with that as an answer if it's true. :)

But my test driver code isn't really doing anything overly complex, just examining the data structures that Soot is building and looking at InvokeExpr in method bodies to see if they do or don't appear in ReachableMethods. Let me point at something specific here that makes me still believe there's a Soot/Spark bug here...I just made a minor change to my test code to help make digging through the jEdit installer source for a location that I believe exhibits this issue by printing out what ReachableMethods seem to be calling unreachable ones...okay, here's one example from analyzing the jEdit installer like I referenced above in this issue:

[CGTest] - REACHABLE method <installer.TarOutputStream: void write(byte[],int,int)> apparently calls UNREACHABLE method <installer.TarBuffer: void writeRecord(byte[])> [CGTest] - REACHABLE method <installer.TarOutputStream: void writeEOFRecord()> apparently calls UNREACHABLE method <installer.TarBuffer: void writeRecord(byte[])>

And looking at the jEdit 5.4.0 source code file installer/TarOutputStream.java at the reachable methods above, which do appear in ReachableMethods and CallGraph, I can see that:

The "reachable" write method above isn't overly complicated and contains a call to the supposedly "unreachable" this.buffer.writeRecord, and buffer is simply an installer.TarBuffer instance...so there's no weird indirection or anything like that happening that I can see, just looks to be a relatively straightforward call to a method that for some reason is absent from the CallGraph and ReachableMethods.
The "reachable" writeEOFRecord code is even simpler, and also just calling the "unreachable" this.buffer.writeRecord.

Perhaps I am doing something wrong, but from this simple test and manual inspection it still looks like a bug to me. I'd be happy to be proven wrong however. I've attached the minor update to my test code below, and the URL for the jEdit installer I ran against is here and the source for it is here if you (or anyone else) would like to inspect it closer yourself (both for any mistakes I might be making in my test driver code and/or reasons why the jEdit source code might have things being legitimately excluded from the CallGraph/ReachableMethods).

Thanks, Chuck

Here's my updated test code: cgtest.zip

ericbodden commented 6 years ago

@ChuckFoo I had a brief look at the method <installer.TarOutputStream: void write(byte[],int,int)> and the surrounding class. It is called by this statement:

this.buffer.writeRecord( wBuf, wOffset );

I am quite positive that the believed unreachability of <installer.TarBuffer: void writeRecord(byte[])> is caused by the fact that Soot (respectively its points-to analysis framework Spark) thinks that the points-to set of this.buffer. Spark things that this.buffer is null at this point, which is why the call can never happen. Whether the assessment that this.buffer is null is correct or not, that I cannot answer straight away.

Looking at your command line, though, I see that you do not explicitly set a main-class for Soot's call-graph analysis. Soot's command-line output should show you which main class Soot selected. Is this the correct one? Without the correct main class, Soot/Spark will analyse the code in the wrong context, which would explain the results you see,

ChuckFoo commented 6 years ago

Doh! I sometimes forget to specify the main class because when I first started doing this I assumed that Soot would pull the info out of the manifest in the jar given to -process-dir, but at some point with various tests I realized that it was picking an incorrect starting point and so I tried to remember to specify it explicitly each time after that...but as you see I do occasionally forget still.

So I just reran with that explicitly specified (-main-class installer.Install) and based on the statistics I output (eg: same numbers for "missed" methods) and looking specifically for the above methods again it appears to get the same results as without it. In previous tests I saw the specific problematic things change, but not disappear, when I specified the correct one explicitly so I'm guessing that Soot guessed correctly this time.

I could try to dig deeper into the Spark analysis, but as that stuff is moderately complex, are there any hints you can give me as to either options to specify to get more ideas of what's happening here, or point me to likely good places to add diagnostic output in the lower level Spark code? I actually started down that latter path before writing this issue, but figured/hoped that someone who understood Spark already might look at the info I provided and be able to come to a conclusion about what was going on significantly faster than I could flail my way through it so I wrote this issue before continuing down that path. :)

Thanks.

ChuckFoo commented 6 years ago

So, I have some more data on this...I was looking at the jEdit installer source to try and wrap my head around it some, and that buffer object is instantiated by 1 of the 3 constructors in TarOutputStream (and the other 2 call that one) so it shouldn't be null unless none of the constructors are called...which actually appears to be the case.

So why are any of it's methods ending up in the CallGraph & ReachableMethods? My guess is it's because TarOutputStream extends FilterOutputStream, and the CallGraph shows a couple of JDK methods calling into a couple of the TarOutputStream methods, so perhaps the solver decides that those could be possible targets so they get added to the CallGraph even though they aren't really? But then, since none of the constructors are found in the CallGraph, this.buffer isn't "set", so that's why the methods on buffer that those methods call aren't seen as being reachable?

That sound like reasonable guesses as to what's happening here? If so, then the problem is slightly different than I originally reported....

mbenz89 commented 6 years ago

This sounds like if spark cannot find an allocation side for TarOutputStream. Can you check if spark is able to find an allocation side of the object and propagate it to the stated member call?

After doing a short grep on the source code I could not find any instanciations of the TarOutputStream class. If this really is the case, spark can of course not find out that the internal buffer is not null and thus won't find a call here. On the other hand, it is questionable why there are calls to TarOutputStream.write in the call graph if it is never instantiated. Can you double check this?

ChuckFoo commented 6 years ago

Yeah, that's basically what I was trying to say I now believed was happening in my last comment above, and that I was theorizing that Spark using a solver to help reason about potential call targets for virtual and interface type calls possibly being part of the reason why they end up in the CallGraph.

As far as I can tell a TarOutputStream isn't directly instantiated anywhere in the code.

As for checking object allocations, how can I get Spark to tell me about them? Because despite having all those verbose and debug options on (see below for the command line I used) the only messages I see before my test code (attached in a previous comment above) fires in the wjtp phase are about 70 lines that are all variants of these 3 messages:

[main] WARN soot.jimple.toolkits.callgraph.OnFlyCallGraphBuilder - Method <sun.security.jca.Providers: java.security.Provider getSunProvider()> is reachable, and calls Class.newInstance; graph will be incomplete! Use safe-newinstance option for a conservative result.
[main] WARN soot.jimple.toolkits.callgraph.OnFlyCallGraphBuilder - Class sun.security.provider.VerificationProvider is a dynamic class, and you did not specify it as such; graph will be incomplete!
[main] WARN soot.jimple.toolkits.callgraph.OnFlyCallGraphBuilder - call to java.lang.reflect.Method: invoke() from <java.lang.Class: java.lang.Object[] getEnumConstantsShared()>; graph will be incomplete!

I doubted that safe-newinstance option would make much of a difference in the problem I'm seeing here, but just in case I did just redo the test again with that specified like so:

java -cp ../lib/sootclasses-trunk-jar-with-dependencies-20180323.jar:. test.CGTestMain -w -app -p cg verbose:true,safe-newinstance:true -p cg.spark enabled:true,verbose:true -verbose -debug -soot-class-path ~/Downloads/jedit5.4.0install.jar:/usr/lib/jvm/java-8-open
jdk-amd64/jre/lib/rt.jar:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/jce.jar -process-dir ~/Downloads/jedit5.4.0install.jar -main-class installer.Install >& cgtest.txt

That reduced the number of messages that come out before my test diagnostics to only about 20 of the other two warning types mentioned above, and did change the numbers of my diagnostic output a little, but as I expected the same methods on TarOutputStream are still marked as reachable (as are some methods on a couple of other classes that also I don't believe are actually instantiated anywhere).

So any suggestions on how to get more diagnostics out of Spark (options to use, good places to insert new diagnostic messages, etc) to help figure out what is happening would be appreciated.

mbenz89 commented 6 years ago

As far as I remember Spark's implementation, it needs an allocation side of an object to resolve the object's call sites at all. Maybe it can also handle factory methods in some cases (if they are phantom?), I am not quite sure about this. Maybe @ericbodden can clarify this?

Regarding some tips for debugging, I would certainly start with having a look at the points-to-set of the object that is used as the base object of the call to TarOutputStream.write. You should be able to find the origin of that object by querying the points-to-set.

ericbodden commented 6 years ago

I am afraid I also don't know of a good way not debug such things. It's always painful, manual labour.

What I can tell you is that Spark will omit call edges when the points-to set for the receiver is empty. But I cannot really say why this is the case here. Sorry

soot-oss / soot

Call Graph seems to be missing a lot of "obvious" entries? #923