rohanpadhye / vasco

An inter-procedural data-flow analysis framework using value-based context sensitivity
GNU Lesser General Public License v2.1

Choices of Implementation wrt your Original Paper #7

Closed: oparisy closed this issue 4 years ago

oparisy commented 7 years ago

While studying your "Interprocedural Data Flow Analysis in Soot using Value Contexts" paper, I took the liberty of comparing the algorithm described in Figure 1 with the Java implementation you provided in this repository.

Most differences are Java implementation details, extensions (such as the multiple entry/exit points, hinted at in the original paper), or classically implied behaviors (such as the topological sort to ensure fast convergence). Two of them struck me as more fundamental, though.

The first one is the choice of associating a worklist with each context, while retaining the global worklist of contexts. This significantly impacts the algorithm by introducing a null guard object and an analyzed flag on contexts. This flag, as I understand it, alters the criterion on line 28 of the original algorithm by excluding X' contexts whose worklist has not already been processed.

Is this change merely for implementation convenience, or does it have a deeper meaning? It seems to relate to the "intraprocedurally eager" criterion of the "Efficiency, Precision, Simplicity and Generality in Interprocedural Data Flow Analysis" paper.

Could you shed some light on this?

A second, minor difference concerns the lines of code stating "Merge with previous OUT to force monotonicity". I do not know of this result (at least I did not notice it in the "Data Flow Analysis: Theory and Practice" book).

Do you have any pointer to a paper describing this technique?

Regards, Olivier.

rohanpadhye commented 7 years ago

Hi Olivier,

Thanks for looking through the implementation. You are spot-on about the differences between the code and the algorithm in the paper. The differences are mainly optimizations, which we deemed too tedious to explain in a 6-page workshop paper. Let me address each of the points you brought up.

  1. In the code, we maintain a global worklist of contexts and context-local worklists of nodes as opposed to a global worklist of (context, node) tuples as described in the paper.

If you look closely, the workings are not very different from what is described in the paper:

For work-list removal, we will use lexicographical ordering of contexts (newer first) before nodes (reverse post-order).

Each context-local worklist uses reverse post-order (i.e. topological sort) and the worklist of contexts is a set of ordered items, from which the newest item is removed for processing. This makes the two worklist implementations equivalent.
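If it helps, here is a rough sketch of the equivalence (hypothetical types, not the actual VASCO classes):

import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch: a single worklist of (context, node) pairs, ordered by newest
// context first and then by reverse post-order of nodes. Removing items in this order
// behaves the same as a worklist of contexts (newest first) where each context holds
// its own node worklist in reverse post-order.
class WorklistOrderingSketch {
    record Work(int contextId, int rpoIndex) {}   // larger contextId = newer context

    public static void main(String[] args) {
        Comparator<Work> order = Comparator.comparingInt((Work w) -> w.contextId())
                .reversed()                            // newest context first
                .thenComparingInt(w -> w.rpoIndex());  // then nodes in reverse post-order
        PriorityQueue<Work> worklist = new PriorityQueue<>(order);
        worklist.add(new Work(1, 3));   // a node in an older context
        worklist.add(new Work(2, 5));   // nodes in a newer context...
        worklist.add(new Work(2, 1));   // ...are removed first, in reverse post-order
        while (!worklist.isEmpty()) {
            System.out.println(worklist.poll());   // (2,1), then (2,5), then (1,3)
        }
    }
}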

  2. There's a null guard in the implementation.

The null guard is an implementation detail that was introduced simply because Soot's control-flow graphs have multiple exit nodes, called tail nodes. By adding an artificial null node to the worklist after any tail node is processed and giving it the lowest priority, we can (in most cases) set the exit value only after all tail nodes have been processed. You can get rid of this optimization and compute the exit value after each tail node is processed, but that's just a lot of redundant joins.
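Just to illustrate the idea (a hypothetical sketch, not the actual VASCO code): a lowest-priority sentinel is re-added whenever a tail node is processed, so the join over the tails happens only once, at the end.

import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the null-guard trick. Every time a tail node is processed, a
// lowest-priority sentinel is (re-)added to the worklist; the exit value is therefore
// joined only once, after all currently pending tail nodes have been processed.
class NullGuardSketch {
    public static void main(String[] args) {
        TreeMap<Integer, String> worklist = new TreeMap<>();    // key = priority (smaller first)
        final int SENTINEL = Integer.MAX_VALUE;                 // lowest priority of all
        worklist.put(3, "tail-1");
        worklist.put(5, "tail-2");
        while (!worklist.isEmpty()) {
            Map.Entry<Integer, String> entry = worklist.pollFirstEntry();
            if (entry.getKey() == SENTINEL) {
                System.out.println("compute exit value (single join over all tails)");
            } else {
                System.out.println("process " + entry.getValue());
                worklist.put(SENTINEL, "null-guard");            // defer the exit-value computation
            }
        }
    }
}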

  3. The analysed flag on contexts and the use of the isAnalysed() method alter X' on line 28 of Figure 1 in the VASCO paper.

You are absolutely right about this one. We did not use this flag in the paper at all. However, this is an optimization and not crucial to the correctness of the algorithm. Consider a mutually recursive program with call chain main -> f(0) -> g(1) -> f(0). When analyzing the call from g to f(0), we notice that such a context already exists but its exit value is still the default TOP value. In what order should we analyze the contexts corresponding to f(0) and g(1)? The answer is that it doesn't matter: since they are in a cycle, we have to keep computing their effect on each other until a fixed point is reached. The paper uses the straightforward approach of treating all contexts the same to simplify presentation.

In the implementation, we note that if we have not yet analyzed at least one path from the entry of f to its exit in the context corresponding to f(0), then the exit value is quite useless (i.e. it will be TOP), so we wait until such a path has been analyzed. We assume that there is at least one such path that does not again call g(1) before returning from f(0); otherwise the program is an unconditional recursive loop, in which case data-flow analysis is pointless. Note that when such a path to the exit of f(0) is analyzed, it will add the return node inside g(1) back to the worklist due to the edge in the context-transition graph.
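For concreteness, the call chain above could come from a program shaped like this (a hypothetical example, not taken from the paper):

// Hypothetical program with call chain main -> f(0) -> g(1) -> f(0). When the analysis
// reaches the recursive call to f(0) inside g, the value context for f(0) already exists,
// but no path from f's entry to its exit has been analysed yet, so its exit value is still TOP.
class MutualRecursionExample {
    static boolean done = false;   // a condition the analysis cannot resolve statically

    static int f(int a) {
        if (done) return a;        // a path from f's entry to its exit that does not call g
        return g(a + 1);           // when a == 0, this calls g(1)
    }

    static int g(int b) {
        done = true;
        return f(b - 1);           // calls back into the already-existing context f(0)
    }

    public static void main(String[] args) {
        System.out.println(f(0));  // main -> f(0) -> g(1) -> f(0)
    }
}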

  4. The code performs out = meet(out, prevOut).

For convergence of data-flow analysis, the OUT value at each node should only change in one direction (i.e. it can become weaker -- change towards BOTTOM). This monotonicity is crucial to proving convergence, and is described in the book you referenced. If the flow functions are themselves monotonic, this operation doesn't change anything because meet(out, prevOut) is equal to out if out is equal to or weaker than prevOut, which is the case for monotonic flow functions.

However, statically proving that flow functions implemented in Java are monotonic is infeasible. By forcing the meet with prevOut, we ensure that the VASCO algorithm terminates even if the provided flow functions are not monotonic -- though the analysis will probably not be very useful. Without this step, a non-monotonic flow function may cause OUT values at a node to oscillate between two values and the worklists would never become empty.
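To illustrate on a toy lattice (a hypothetical sketch, not the VASCO API): take a powerset lattice where the meet is set union and "weaker" means "larger set"; the self-meet keeps OUT from ever shrinking back towards TOP, even if a flow function misbehaves.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the "merge with previous OUT" step on a powerset lattice
// (meet = union, weaker = larger set). Even if a buggy, non-monotonic flow function
// returns a stronger value than before, the self-meet forces OUT towards BOTTOM only.
class ForcedMonotonicitySketch {
    static Set<String> meet(Set<String> a, Set<String> b) {
        Set<String> result = new HashSet<>(a);
        result.addAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<String> prevOut = new HashSet<>(Arrays.asList("x", "y")); // OUT from the previous visit
        Set<String> out = new HashSet<>(Arrays.asList("x"));          // non-monotonic: smaller than before
        out = meet(out, prevOut);                                     // forced back to {x, y}
        System.out.println(out);                                      // OUT can never oscillate
    }
}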

Hope this helps. Let me know if something was not clear or if something that I said seems incorrect.

Rohan

oparisy commented 7 years ago

Hi Rohan,

thanks for your swift, detailed answer!

I definitely did not notice the description of the worklist removal order in the example's description. I now understand that this is an implementation of the "interprocedurally eager" constraint.

The "forced monotonicity" meet is an interesting trade off, considering dataflow problems are arbitrarily specified in Java. I suppose this would not have been needed should you have provided a closed set of provably-monotonic problems, but then this would not be a general framework anymore... Out of curiosity, did you evaluate the cost of those extra meets when using an actually monotonic Java specification (which as you describe should be the general case)?

My understanding of the "isAnalyzed" extra criterion is that you want to limit the number of useless invocations of meet, callLocalFlowFunction and callExitFlowFunction in the presence of a default TOP. I reckon that explicitly testing for TOP instead of isAnalyzed would be too strong a criterion (since the exit value of this context could actually be TOP after analysis). In this line of thought, would modifying the lattice and distinguishing between a "not analyzed" (the new TOP) and a "no information" (the current TOP) value simplify the algorithm (at the expense of more complex flow functions and meet operator)?

I suppose an "on demand" (instead of "newest") evaluation criterion for contexts would not cut it either, due to the recursive nature of the example you are providing. By this I mean performing an analysis of f(0) during the analysis of g(1), upon noticing that g(1) invokes f(0), which has not been analyzed so far. Since this analysis of f(0) would be dependent on g(1), that would just mean a more complex, partially recursive algorithm, which the worklist(s) approach avoids.

I have an extra question with respect to call strings. The "Efficiency, Precision, Simplicity, and Generality..." paper and the "Data Flow Analysis" book go to great lengths to describe criteria for "representing" (factorizing at entries) and "regenerating" (expanding at exits) related call strings, by taking advantage of value-based equivalence classes.

Even if the number of call strings is massively reduced thanks to this approach, it seems like they are still kept as actual strings (i.e., explicit lists of call nodes).

Interestingly, no such explicit lists seem present in your paper and implementation. Instead, you are tabulating X=<method, entryValue> value contexts (where "method" is not a list but what would be the call string tail). Is the "transitions" map some kind of graph representation of those call strings? Or did you determine that explicit call strings were not mandatory, and that invoked-method-based tabulating would yield the same results as an "explicit call strings" approach?

Olivier.

rohanpadhye commented 7 years ago

Hi Olivier,

Out of curiosity, did you evaluate the cost of those extra meets when using an actually monotonic Java specification (which as you describe should be the general case)?

No, I have not done that. It would be interesting to see what the performance impact of that would be.

I should point out that there is a subtle case that actually requires the self-meet on return values. The problem arises when the value context just before a function invocation changes (perhaps the invocation was inside a loop). The new context could have a stronger value at its exit (TOP by default, or something else even if some path to its exit was analyzed) than the exit value of the old context.

I'm not sure I can come up with a detailed example off the top of my head, but the general pattern is something like:

// analyzing when a = BOTTOM
function f(a) {
  x = 0;
  while (...) {
     f(x); // First time calls f(0); second time calls f(BOTTOM).
     x++;
  }
}

To summarize, there is likely an error on line 32 in Figure 1 of the VASCO paper. The correct statement should be OUT(X, n) = OUT(X, n) ∩ b1 ∩ b2.

This meet can be avoided by creating new contexts using a clone operation instead of fresh construction whenever the calling context at a given invocation node changes (i.e. becomes weaker). The clone operation would initialize all nodes in a value context's control-flow graph with data-flow values from its source context, thereby preserving monotonicity throughout.

In this line of thought, would modifying the lattice and distinguishing between a "not analyzed" (the new TOP) and a "no information" (the current TOP) value simplify the algorithm (at the expense of more complex flow functions and meet operator)?

Possibly. The resulting effect on the order of pulling nodes out of the worklist appears to be the same as the current implementation of VASCO. However, you would now need client analyses to be aware of these two separate values in their definition of the data-flow value domain (generic type A in the implementation of the VASCO classes). We've tried to avoid going down that route, but there certainly may be a way of implementing this more elegantly.

Is the "transitions" map some kind of graph representation of those call strings? Or did you determine that explicit call strings were not mandatory, and that invoked-method-based tabulating would yield the same results as an "explicit call strings" approach?

You are absolutely right.

The value-contexts formulation is an implementation of the functional approach, which is a dual of the call strings approach, as outlined by Sharir and Pnueli in the classic (aptly-named) 1978 article Two approaches to interprocedural data flow analysis.

However, the reason I prefer the value-contexts approach is that it gives you the best of both worlds. The context-transition graph constructed at the end of the analysis is a data structure that allows querying for the context for any call string, which can be arbitrarily long. Think of the graph as a finite state machine where each node represents a regular expression of all call strings to which it corresponds.
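As a small sketch of what such a query could look like (hypothetical types, not the VASCO API):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the context-transition graph viewed as a finite state machine.
// Walking the transitions of a call string -- however long -- ends at the value context
// that represents it, even though only finitely many contexts exist.
class ContextTransitionSketch {
    // transitions: context -> (call-site -> callee context)
    static Map<String, Map<String, String>> transitions = new HashMap<>();

    static String contextFor(String entryContext, List<String> callString) {
        String ctx = entryContext;
        for (String callSite : callString) {
            ctx = transitions.getOrDefault(ctx, Map.of()).get(callSite);
            if (ctx == null) return null;   // this call string is not realizable in the program
        }
        return ctx;
    }

    public static void main(String[] args) {
        // main --c1--> f(0) --c2--> g(1) --c3--> f(0)   (a cycle in the transition graph)
        transitions.put("main", Map.of("c1", "f(0)"));
        transitions.put("f(0)", Map.of("c2", "g(1)"));
        transitions.put("g(1)", Map.of("c3", "f(0)"));
        // An arbitrarily long call string still maps onto one of the finitely many contexts:
        System.out.println(contextFor("main", List.of("c1", "c2", "c3", "c2", "c3")));  // f(0)
    }
}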

I believe this is more powerful (in terms of querying power) than the extended call strings approach (Khedker & Karkare, 2008). You don't necessarily have to do a meet over all interprocedurally valid paths -- you can keep the contexts and their analysis results separate until you really need them. For example, if you were to use the results of pointer analysis in a subsequent analysis, you don't have to ask "do x and y ever alias in method M?"; instead, you can ask "do x and y ever alias in method M when called from some calling context X?".

This ability to query the context for any call string can also be powerful if you were to use the results of data-flow analysis at run-time. For example, if you were to pause a program in a debugger and query its call stack, you would know exactly what value context that stack frame belongs to because you could walk the context-transition graph starting from main() all the way up the stack. This was partly the motivation for the invention of value contexts, as I was trying to use ahead-of-time data-flow analysis to guide dynamic optimizations. Unfortunately, that did not end up being very practical.

oparisy commented 7 years ago

Rohan,

thanks for having taken the time to detail those points for me.

However, you would now need client analyses to be aware of these two separate values in their definition of the data-flow value domain (generic type A in the implementation of the VASCO classes). We've tried to avoid going down that route but there certainly may be a way of implementing this more elegantly.

Now that I have represented basic problems in Vasco (#8), I can definitely appreciate the simplicity of the framework, and the fact that optimizations are not a concern when implementing solvers.

The value-contexts formulation is an implementation of the functional approach, which is a dual of the call strings approach

Does this mean that the Vasco algorithm is equivalent to the "full" call strings approach, and, as a consequence, is able to solve the non-separable problems that "restricted contexts" approaches cannot handle (as discussed on page 301 of "Data Flow Analysis")?

I am asking this because superficially, the "value contexts" approach feels like "restricting the length of calling context [sigma] to 1" (Data Flow analysis, p 296), but I assume the transitions map makes up for this by compactly representing the full call strings set, since you state that:

The context-transition graph constructed at the end of the analysis is a data structure that allows querying for the context for any call string, which can be arbitrarily long. Think of the graph as a finite state machine where each node represents a regular expression of all call strings to which it corresponds.

Frankly, those results feel like a significant improvement over previous functional and call string approaches (especially from a usability/ease of implementation point of view), and would deserve more exposure. I don't mean to pry, but is there any chance you may publish more about those? Or are you considering other approaches/problems now?

Olivier.

rohanpadhye commented 7 years ago

Now that I have represented basic problems in Vasco (#8), I can definitely appreciate the simplicity of the framework, and the fact that optimizations are not a concern when implementing solvers.

Thanks.

Does this mean that the Vasco algorithm is equivalent to the "full" call strings approach, and, as a consequence, is able to solve the non-separable problems that "restricted contexts" approaches cannot handle (as discussed on page 301 of "Data Flow Analysis")?

I am asking this because superficially, the "value contexts" approach feels like "restricting the length of calling context [sigma] to 1" (Data Flow analysis, p 296) ...

I don't have the book with me at the moment to check those references, but I can tell you that the value contexts approach is definitely not the 1-calling-context approach (also known as 1-CFA in functional programming), which is much simpler and far more imprecise.

The value contexts approach is indeed fully context-sensitive: it is equivalent to a hypothetical ∞-CFA or unbounded call strings in the classical case, or to full call strings when using value-based termination. It is designed to work with non-separable problems. If your data-flow problems have distributive flow functions, you are much better off using an implementation of the IFDS framework, such as Heros, since it terminates in asymptotically fewer steps.

Frankly, those results feel like a significant improvement over previous functional and call string approaches (especially from a usability/ease of implementation point of view), and would deserve more exposure. I don't mean to pry, but is there any chance you may publish more about those? Or are you considering other approaches/problems now?

You are right in that the value-context formulation makes the functional approach more usable, but that is subjective and more of an analysis implementor's concern than a theoretical contribution; therefore, I do not have plans to publish more than the material that is already out there (i.e. the SOAP paper and this GitHub repo).

In my opinion, the main limitation with using either the value-context approach or the functional approach or the full call-strings approach is that in the worst case, the run-time is proportional to the size of the data-flow lattice, which is often exponential in the size of the program (or its variables). It also requires the whole program to be analyzed at once, which is sometimes infeasible when using dynamic linking or reflection or partial JIT-compilation.

This type of framework is suitable for performing heavy-weight offline analysis (e.g. where you don't mind spending several hours on analyzing important security properties) for small, self-contained programs (e.g. Android apps or WebAssembly modules).

oparisy commented 7 years ago

I can tell you that the value contexts approach is definitely not the 1-calling-context approach (also known as 1-CFA in functional programming), which is much simpler and far more imprecise.

Nice, thanks for the confirmation.

If your data-flow problems have distributive flow functions, you are much better off using an implementation of the IFDS framework, such as Heros, since it terminates in asymptotically fewer steps. (...) In my opinion, the main limitation with using either the value-context approach or the functional approach or the full call-strings approach is that in the worst case, the run-time is proportional to the size of the data-flow lattice, which is often exponential in the size of the program (or its variables).

Oh! Sadly I tried Heros before, and I am investigating other interprocedural analysis approaches since it did not scale well on the kind of problems I tried to solve. I'll see how it goes...

Regards, Olivier.

manasthakur commented 6 years ago

Hi Rohan,

I too have a doubt about the case where we do not have the results for a method because it hasn't been analyzed completely yet (i.e., we are in an SCC involving the method, and isAnalyzed for that method is false). Since you have a nice example above (main -> f(0) -> g(1) -> f(0)), I am posting it in this thread itself.

While analyzing g, since I do not have the exitValue for f yet, I would continue with the analysis of the next statements in g. However, it seems the out in such a case will be TOP (looking at the implementation), resulting in an empty in for the successors of the call statement.

I am not sure, but can this create a problem (as such an analysis might make unsound assumptions for the successors)? Or is it the case that we expect that g will definitely be analyzed again during the fixed-point computation, and a monotonic analysis will eventually update the unsound results for the successors? I think the latter is true, but a second opinion will clear away my doubts.

Regards, Manas

rohanpadhye commented 6 years ago

Hi Manas,

You were right when you said:

Or is it the case that we expect that g will definitely be analyzed again during the fixed-point computation, and a monotonic analysis will eventually update the unsound results for the successors?

A cycle in the context transition graph is of course quite unfortunate from a performance perspective, but there is really nothing we can do other than just start with an optimistic assumption in one of the nodes and propagate information back and forth, weakening the data-flow values, until a fixed point is reached.

To see why g will definitely be analyzed again, note that when a new context is created, all of its nodes are added to the work-list. Therefore, at some point we will analyze the remaining nodes from f(0) and reach its tail. This is when the isAnalysed flag will be set to true for the context corresponding to f(0), and the call-site at g(1) will be added back to the work-list (through the return edge).

The only case where this might appear to be unsound is something like the following:

function f(x) {
   g(x+1);
   // everything is TOP here
}

function g(y) {
   f(y-1);
   // everything is TOP here
}

Here, the call chain on a call to f(0) would be f(0) -> g(1) -> f(0) -> g(1) -> f(0) -> ... infinitely. In this case, the data-flow analysis terminates, but with what appears to be unsound results, because in both contexts we would assume the optimistic value (i.e. TOP) after the call-sites and we would never get any useful data back. However, note that this program itself is non-terminating, and therefore the CFG nodes where we made "unsound" assumptions will never be executed. Therefore, the analysis is not unsound.

Hope this makes sense.

manasthakur commented 6 years ago

Thanks a lot for the clear explanation, Rohan! This really helps.