Difference between llvmslicer and dg

twang15 commented 8 years ago

Initial question:

I am using llvmslicer. Except llvmslicer is based on cfg and dg is based on pdg, what are the other major differences between dg and llvmslicer ? Does dg generate smaller slice ?

twang15 commented 8 years ago

Marek's reply

You mean LLVMSlicer? (https://github.com/jirislaby/LLVMSlicer). The major difference is that dg has more precise and sound analyses (which can lead to smaller slices). The slicing algorithm itself slices more or less the same, since I haven't implemented the two-pass algorithm of Horwitz nor any other more precise version (e.g. specialization slicing) yet. Actually, dg was designed as a replacement for LLVMSlicer in the tool Symbiotic (https://github.com/staticafi/symbiotic). You can find more detailed information about dg here: http://is.muni.cz/th/396236/fi_m/thesis.pdf

twang15 commented 8 years ago

Yes, it is LLVMSlicer. LLVMSlicer is based on Mark Weiser's classic algorithm. As noted by others(Interprocedural Slicing Using Dependence Graphs), this CFG-based algorithm has "calling context" issue, but it is fixable (K. B. Gallagher. Some notes on interprocedural program slicing. In Source Code Analysis and Manipulation, 2004.) and the result would be equivalent to Horwitz's PDG/SDG-based algorithm.

When you say dg has more precise and sound analyses, it is in what sense ? Is the pointer analysis of dg more precise ?

mchalupa commented 8 years ago

Yeah, well our PDG based algorithm has the same calling context issue at this moment. Yes, the pointer analysis is more evolved.

Dg has two pointer analyses: flow-insensitive and flow-sensitive. The flow-sensitive is more precise by design, but it is still not well debugged (there's at least one bug I know about). The flow-insensitive is basically the same as in LLVMSlicer, with these differences:

it uses UNKNOWN_OFFSET, which allows it to handle for example this issue in LLVMSlicer: https://github.com/jirislaby/LLVMSlicer/issues/6
the UNKNOWN_OFFSET makes it also faster in some cases (e.g when analyzing arrays), since we can use UNKNOWN_OFFSET to stand for all offsets from some range, which is a sound approximation. So where LLVMSlicer would generate that p can point to x + 0, x + 4, x + 8, ..., x + 64 (if the offset is > 64, then the analysis goes unsound), we just say p can point to x + UNKNOWN_OFFSET, staying sound and fast (since our points-to set is of size 1 instead of some big number)
it handles correctly greater portion of LLVM than LLVMSlicer's one, for example, we have better support for memset and memcpy calls (copying whole blocks of memory with pointers), where the LLVMSlicer can be unsound
we run on LLVM 3.4 - 3.9

twang15 commented 8 years ago

Thanks for sharing these details. I have the following further questions.

For pointer analysis, it is well-known Andersen algorithm is practically precise.

As noted here http://pages.cs.wisc.edu/~fischer/cs701.f08/lectures/Lecture26.4up.pdf,

"Both context-sensitive points-to analysis and flow-sensitive points-to analysis give little improvement over Andersen’s analysis."

So, which pointer analysis algorithm are we using in dg ? Was there a research paper to systematically show the pointer analysis algorithm in dg doing better ?

Does dg support concurrent program slicing ? Or do we plan to support it ? Both Mark weiser algorithm (which is also the implementation in LLVMSlicer) and the traditional PDG-based approach are for sequential program slicing. There are also concurrent slicing algorithms(Nanda 2006 http://doi.acm.org/10.1145/1186632.1186636, Krinke 2003 http://doi.acm.org/10.1145/940071.940096).
Why does the PDG algorithm in dg also have the "calling context" problem ? I may resort to your thesis, however, it is great if you can explain in several sentence or give me a quick reference to understand the PDG algorithm you are using in dg.

On Fri, Nov 11, 2016 at 8:29 AM, Marek Chalupa notifications@github.com wrote:

Yeah, well our PDG based algorithm has the same calling context issue at this moment. Yes, the pointer analysis is more evolved.

Dg has two pointer analyses: flow-insensitive and flow-sensitive. The flow-sensitive is more precise by design, but it is still not well debugged (there's at least one bug I know about). The flow-insensitive is basically the same as in LLVMSlicer, with these differences:

it uses UNKNOWN_OFFSET, which allows it to handle for example this issue in LLVMSlicer: jirislaby/LLVMSlicer#6 https://github.com/jirislaby/LLVMSlicer/issues/6

the UNKNOWN_OFFSET makes it also faster in some cases (e.g when analyzing arrays), since we can use UNKNOWN_OFFSET to stand for all offsets from some range, which is a sound approximation. So where LLVMSlicer would generate that p can point to x + 0, x + 4, x + 8, ..., x + 64 (if the offset is > 64, then the analysis goes unsound), we just say p can point to x + UNKNOWN_OFFSET, staying sound and fast (since our points-to set is of

size 1 instead of some big number)

it handles correctly greater portion of LLVM than LLVMSlicer's one, for example, we have better support for memset and memcpy calls (copying

whole blocks of memory with pointers), where the LLVMSlicer can be unsound

we run on LLVM 3.4 - 3.9

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mchalupa/dg/issues/107#issuecomment-259957391, or mute the thread https://github.com/notifications/unsubscribe-auth/APd91hP2zbgl5N4B16sPPbvsv3k9WvPlks5q9G3SgaJpZM4KvO1y .

mchalupa commented 8 years ago

Ad 1) In dg we use field-sensitive Andersen's analysis (the field sensitivity is configurable). There's no paper about measuring only the pointer analysis, but in the theses I linked above are measurements of differences between LLVMSlicer and dg (I suppose the numbers may have changed a bit, since some work was done on dg since the thesis).
Ad 2) Dg does not support slicing of concurrent programs. There may be some plans, but they are really no priority now as far as I can see (it may change in the future, though)
Ad 3) The calling context problem is because we slice with simple backward reachability. That is to say, we just compute the dependencies and than go backward over the edges as far as we can. Compared to the two-pass algorithm of Horwitz et. al., they compute additional summary edges and then perform two-passes over PDG to prevent the calling context problem. Our PDG is the same as the Horwitz's PDG (apart from that we do not compute the summary edges, since we would not use them atm.), but we just do only one pass over all dependecy edges in the graph, not two (over different subsets of edges). That lead to that our algorithm coincides with the CFG algorithm now. Reason why we use only simple backward reachability is simple: I just didn't get to implementing the two-pass algorithm yet (specialization slicing could be also a very good feature).

twang15 commented 8 years ago

The paper "Which pointer analysis should I use ?" does comparison on five pointer analysis algorithms.

On Fri, Nov 11, 2016 at 2:36 PM, Marek Chalupa notifications@github.com wrote:

-

Ad 1) In dg we use field-sensitive Andersen's analysis (the field sensitivity is configurable). There's no paper about measuring only the pointer analysis, but in the theses I linked above are measurements of differences between LLVMSlicer and dg (I suppose the numbers may have changed a bit, since some work was done on dg since the thesis).

Ad 2) Dg does not support slicing of concurrent programs. There may be some plans, but they are really no priority now as far as I can see (it may change in the future, though)

Ad 3) The calling context problem is because we slice with simple backward reachability. That is to say, we just compute the dependencies and than go backward over the edges as far as we can. Compared to the two-pass algorithm of Horwitz et. al., they compute additional summary edges and then perform two-passes over PDG to prevent the calling context problem. Our PDG is the same as the Horwitz's PDG (apart from that we do not compute the summary edges, since we would not use them atm.), but we just do only one pass over all dependecy edges in the graph, not two (over different subsets of edges). That lead to that our algorithm coincides with the CFG algorithm now. Reason why we use only simple backward reachability is simple: I just didn't get to implementing the two-pass algorithm yet (specialization slicing could be also a very good feature).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mchalupa/dg/issues/107#issuecomment-260037746, or mute the thread https://github.com/notifications/unsubscribe-auth/APd91uXeCv0HGa7Qqvs518TZBr2SFaODks5q9MPBgaJpZM4KvO1y .

mchalupa commented 8 years ago

Oh yeah, I meant there's no research paper comparing our pointer analyses.

mchalupa / dg

Difference between llvmslicer and dg #107

size 1 instead of some big number)

whole blocks of memory with pointers), where the LLVMSlicer can be unsound

Ad 2) Dg does not support slicing of concurrent programs. There may be some plans, but they are really no priority now as far as I can see (it may change in the future, though)