Open pfalcon opened 7 years ago
Related to function boundary detection, some inspiration may be sought from rev.ng
Talk at https://youtu.be/5CbuU4KwBCE
From slides at http://llvm.org/devmtg/2016-11/Slides/DiFederico-rev.ng.pdf
Skipping jump instruction in this case refers to a jump instruction (think tail call), which jumps over the basic blocks of another function.
@mewmew , Thanks, I noticed rev.ng, though was surprised (not really) that instead of being "suite of tools for static binary analysis", it's currently just a binary translator. Didn't see those slides, looked thru. Well... I saw a similar paper recently (https://syssec.mistakenot.net/papers/eurosp-2017.pdf), that one caught my attention with a title "Compiler-Agnostic Function Detection in Binaries". "Compiler-agnostic, really? How else could it be?"
So well, both that paper and rev.ng's try to deal with detection of all functions in a binary. Well, good (luck). That's something completely different than what ScratchABit does - it's purposed is to be a dumb tool, and detect only what guaranteedly can be detected. Beyond that, everything is offloaded to a human (which can do further analysis manually, or using heuristic plugins).
The topic of this ticket is how to do that automatic detection in a better way. Depth-first was implemented btw, I don't close this ticket because I'd like to review if there're some TODOs still left.
On related note, I recently did switch basic blocks in ScratchABlocks to end on a call (https://github.com/pfalcon/ScratchABlock/commit/1446a8380de3e6074b21ed423b039b491926b311), following the idea that in machine code, a call is much closer to a jump, than to a mathematical definition of function. That's of course how even Cifuentes suggested to do it, I just still find that other people like the idea that call == function too, e.g. http://x64dbg.com/blog/2016/07/27/Control-flow-graph.html
https://github.com/pfalcon/ida-xtensa2/issues/1 represents an issue with the current code traversal process of SAB: it tries to follow all code paths of the current function, before considering calls the current function makes. Unfortunately, some calls are effectively jumps and never return. Trying to decode bytes after such a call may lead to conflicts in byte types, which may affect decoding of other functions following such instructions. So, instead, should try to emulate how a real CPU does - do a depth-first traversal of as many functions as possible first, and consider return addresses from calls only after that. This should allow to detect natural stop-gaps for unconstrained and incorrect code propagation. E.g. if following a call instruction there's a literal data belonging to another function, then it doesn't make sense to try to interpret this data as an instruction.
This approach will however have problems with function boundary detection (such detection was the reason to adopt breadth-first scheme). So, it will need to be reworked.