Open maximumspatium opened 6 years ago
Well, where do I start? ;-)
Especially, where arguments and return value are expected to be specified.
Well, of course, it's "expected" that a decompiler would figure parameters, arguments, and returns itself. Do I need to say that's a bit of a hard problem? ;-) As you might notice, SABl focuses around intra-procedural processing (in a sense that whenever possible, it tries to handle functions one by one with intra-procedural processing).
But params/returns processing is inherently inter-procedural, requiring a whole program (defined as a closed set of functions). So, I definitely tackled that, and there're even some tests, run separately via run_tests_inter
. However, approach employed by that script was found to be not scalable, see #2. So, around May this year, there was a big assault on the problem, leading to https://github.com/pfalcon/ScratchABlock/blob/master/inter_dataflow.py . That work is not complete, and I believe it was left in a state of not properly converging on a real-world dataset, i.e. on different runs, it produced different results. I believe analyzing it, I stumbled at the fact that stack parameters aren't handled, then possibly also affected by not supporting aggregate variables on stack, which in turn leads to the problem of alias analysis (for the stack frame in this case), which is known to require type analysis for adequate handling. At which point we hit a typical self-recursive processing loop in decompilation. At that time, I accepted demise. And I restarted more or less active work on SABl only now, and approaching from the side of reaping benefits from intra-procedural processing.
The PseudoC doc doesn't tell anything regarding this matter.
PseudoC is just a spec for machine-independent assembler, decompilation is outside of its scope.
Global information about functions in a program is kept in funcdb.yaml
. Per above, that's supposed to be auto-collected during processing (and thus, being updated during processing). As we aren't yet there, I added support for read-only, seed funcdb.yaml.in
. It didn't work as expected, and you watched it being fixed in https://github.com/pfalcon/ScratchABlock/pull/9#issuecomment-354355928 . The bug there was that funcdb.yaml.in specified that $a2
is a return, but funcdb.yaml.in
wasn't used, so assignment to $a2
was considered dead and was eliminated.
I suggest reviewing testcases run by run_tests_inter
. For example, looking at https://github.com/pfalcon/ScratchABlock/blob/master/tests/interproc-args2/func2.lst locally, you can't know anything about its parameters (estimate_params
pass would return empty list). But global analysis correctly deduces that this function passes thru $a2 it receives down to func1
, thus $a2 its param: https://github.com/pfalcon/ScratchABlock/blob/master/tests/interproc-args2/funcdb.yaml.exp#L18
Here's the expected SABl C-output: memcpy(arg0, arg1, 4);
Argument rewriting is not yet implemented, because proper param handling isn't, per above. It wouldn't be too hard to do adhoc rewriting based on current limitations and heuristics, but general approach is much more effort.
see #2
Also #3 and https://github.com/pfalcon/ScratchABlock/blob/master/docs/func-params-returns.md cover various facets.
Thank you your exhaustive answer raising many interesting points regarding function arguments and returns.
I agree with you that an attempt to discover function arguments and returns using a context-free inter-procedural analysis is hard and won't converge in some cases. That's why we need a way to specify some context for that analysis. Obviously, the most helpful things are to be considered are:
From my RE experience I know that the problem of determining function arguments is not as hard as the recognition of returns. In the most cases, no inter-procedural analysis is needed to solve the former but merely the liveliness analysis coupled with the knowledge about the calling convention used during compilation. The only exception are variadic functions causing big troubles for argument identification.
In short, we need to employ certain heuristics in order to capture the most common use cases. Fully unsupervised decompilation is impossible but anything that saves person's time is highly appreciated.
The most appreciated feature to me is the possibility to specify input arguments and returns manually. That could be considered a last resort for the cases where the automatic estimation of arguments/returns fails.
Is it currently possible to manually specify args/rets in SABl? Is there any test case I need to look at?
and won't converge in some cases
It should converge, if implemented correctly ;-).
In the most cases, no inter-procedural analysis is needed
Inter-procedural analysis is always needed as soon as you step away from leaf functions. Otherwise, you need to assume that any unknown function call uses all registers and modifies all registers (because indeed, it can). That can be pruned somewhat by applying calling convention filters, but the remaining set is still big. That throws away (quality of decompilation) very much.
In short, we need to employ certain heuristics in order to capture the most common use cases. Fully unsupervised decompilation is impossible but anything that saves person's time is highly appreciated.
Well, decompilation without automatic recovering of params/returns isn't decompilation. On the other hand, one problem of my development of ScratchABlock is a "perfectionist" attitude, like "no, I want it to do it all for me", but that approach is elusive, because there's always something missing "to do it all". So, I'm more than welcome more pragmatic approach of leveraging what it can already do, and using that experience for the next small step (all captured as tests, to see the progress, and avoiding regressions).
Is it currently possible to manually specify args/rets in SABl? Is there any test case I need to look at?
Yes, and my previous comment should have all the references. Let's summarize it: https://github.com/pfalcon/ScratchABlock/blob/master/tests/decomp/_xtos_set_exception_handler/funcdb.yaml.in#L3 specifies a return register for the function in the "seed" function database. More fields of a function database can be seen at https://github.com/pfalcon/ScratchABlock/blob/master/tests/interproc-args2/funcdb.yaml.exp, of which you're interested in the params
field.
Well, decompilation without automatic recovering of params/returns isn't decompilation. On the other hand, one problem of my development of ScratchABlock is a "perfectionist" attitude, like "no, I want it to do it all for me", but that approach is elusive, because there's always something missing "to do it all"...
Such a "perfectionist attitude" has been for a long time my personal problem, too. I finally recognized that it neither makes me happy nor it does increase my productivity.
Back to the topic.
I tried to specify function arguments using params
. It seems to work - the related expression won't be scrambled by DCE anymore.
I couldn't verify returns because a proper function rewriting isn't yet implemented. What should be done in order to implement argument rewriting? Where should I try to hook up?
(With holidays over, I have less time for SABl, so sorry for increasing delays with responses.)
Such a "perfectionist attitude" has been for a long time my personal problem, too. I finally recognized that it neither makes me happy nor it does increase my productivity.
Yep, but that means growing humble and digging technical debt of already implemented things, not embarking on new big things, and I'm exactly on that route, so if you're interested in new big things, you should be prepared to scratch them on your own for now ;-). I'd however encourage to help with solidifying the base - e.g., growing the test corpus, etc. (thanks for help with the CI!).
I tried to specify function arguments using params. It seems to work - the related expression won't be scrambled by DCE anymore.
You see such an effect because Xtensa calling convention is used for your PowerPC code: https://github.com/pfalcon/ScratchABlock/blob/master/arch.py . I'm in the process of refactoring that to allow to support "hints" for multiple arch's. Otherwise, the effect should be opposite: without params specified, all registers per the calling conventions should be treated as params, leading to junk being not DCE'ed.
What should be done in order to implement argument rewriting?
Something I wanted to ask - how well do you know/remember content of Van Emmerik's thesis? That's the bible of decompilation and contains answers to many questions. To implement gully general argument rewriting, first all kinds of arguments should be recognized, e.g. passed on stack. Also, a first stage of rewriting is like foo($foo=1, $bar=*(u32*)($var + 5))
, i.e. using "keyword" arguments, because in the general case, you can't know which register holds "first" param and which "second". Only later (at C output time) you can apply some order (by convention or arbitrarily) and get rid of "reg keywords".
Where should I try to hook up?
I'd suggest to start with looking at the existing case of rewriting implemented in SABl: https://github.com/pfalcon/ScratchABlock/blob/master/tests/stack_vars1.lst
Something I wanted to ask - how well do you know/remember content of Van Emmerik's thesis? That's the bible of decompilation and contains answers to many questions.
Well, it's been a while since I've read Van Emmerik's thesis. Because this work is quite large I don't remember very exact detail. Moreover, IIRC, there are several big areas of decompilation which were only touched on briefly without providing a concrete algorithm/answer.
The switch statement is just such an example. Quoting page 208 of Van Emmerik's thesis:
It should be possible to recognise this high level pattern [switch as binary tree]
in the CFG. There are no control flow edges from other parts of the program to
the conditional tree, and no branches outside what will become the switch statement.
The follow node (the first node that will follow the generated switch statement) is
readily found as the post dominator of the nodes in the conditional tree. The switch
cases are readily found by searching the conditional tree for instructions comparing
with a constant value. Range analysis would be better, since it would allow for ranges
of switch cases [...]
Again, it should be possible to recognise this high level pattern in the CFG and recreate
close to the original source code.
Although the main observations are correct, this description is too general and omit many important details like default
node or case
statements without break at the end. In short, it cannot be considered an exhaustive answer to that switch
problem.
Anyway, I'm prepared to recall his stuff anytime. Just give me the hint...
I keep asking many things because I'm not very familiar with SABl project structure. I'm eager to scratch things on my own but I often don't know the right location for them.
Where should that argument rewriting feature go into? In a separate high-level script (script_xxx) or in an existing lower level one?
For the switch dilemma (see #13), I lean towards creating a separate transformation script that runs after script_decompile
...
BTW, are you interested in getting your code better documented? I'm writing things down while experimenting with your project. I could try to bundle this experience (together with your comments scattered across commits and issues) to docstrings...
BTW, are you interested in getting your code better documented?
Sure, I'm interested in all things ScratchABlock, and fairly speaking, the only way I can see it can compete with other projects is by providing "clearer" (in some senses) code, ultimately, well documented. I have limited bandwidth though. But you might have noticed https://github.com/pfalcon/ScratchABlock/commit/02b17221ef802e43303565672e72eff7f3e65cbe#diff-52ba0208814fb33a43492cf9e1b7e316 and following commits for example.
I'm writing things down while experimenting with your project. I could try to bundle this experience (together with your comments scattered across commits and issues) to docstrings...
Sure, you're welcome to. Mind that things are in flux and WIP, and for me, the code (and higher-level docs) are of more priority. That's not the reason not to try to document things now (with the assumption that it all will need to be completely rewritten ;-)).
But I'd like to extend my hint that one of the best usefulness/effort ratio contributions you can make is to add testcases, both "integration" style (decompilation with script_decompile) and "unit" style (individual passes). Of course, feel free to share anything you'd like, that's appreciated (the usual suggestion is to avoid big change-drops, and start lean and mean, being ready to adjust it as we go).
I keep asking many things because I'm not very familiar with SABl project structure.
You're welcome to, and as time permits, I'm happy to answer them.
I'm eager to scratch things on my own but I often don't know the right location for them.
The best idea is to indeed look around how existing stuff is done and do it in a similar way. Fairly speaking, that's how I do it after each half-year hiatus from the project ;-).
Where should that argument rewriting feature go into? In a separate high-level script (script_xxx) or in an existing lower level one?
SABl consists of a collection of analysis/transformation passes. Per the current design, the passes are as independent from each other as possible. So, to implement new processing, you never add code to existing pass, always create a new pass. This is not efficient runtime-wise, but the problem we fight is domain complexity, so stuff should be kept as decoupled as possible, until proven to work robustly. After that, I can quote README:
The current approach of ScratchABlock is to grow a collection of relatively loosely-coupled algorithms for program analysis and transformation, have them covered with tests, and allow easy user access to them. The magic of decompilation consists of applying these algorithms in the rights order and right number of times. Then, to improve the performance of decompilation, these passes usually require more tight coupling. Exploring those directions is the next priority after implementing the inventory of passes as described above.
For the switch dilemma (see #13), I lean towards creating a separate transformation script that runs after script_decompile...
As far I saw from #13, you started doing it like it should be. The same applies here - each should be a separate pass (maybe few passes). script_*
is nothing but a simple algorithm to apply passes to achieve high-level processing (it's algorithm, not just a declarative list, because the former isn't powerful enough, but script_* itself should still just orchestrate other passes in high-level ways, not e.g. dig into CFG itself).
Not exactly function calls, but some get-going in that direction: https://github.com/pfalcon/ScratchABlock/commit/b2d2a204ee638364eebddce2447655d8fad4d858 , https://github.com/pfalcon/ScratchABlock/commit/bdc209cecdc954b22f98d3902fd32bd8b324a140 As can be seen, it's almost cheating, with no type analsys and stuff, and yet with it looks cuter ;-).
A usual notice about param order - there's no inherent order, and sorting in increasing reg order is nothing but heuristics. One can imagine a calling convention where 1st arg is passed in a3 and 2nd - in a2. Of course, realistically such a case could occur only in obfuscated code, but that's pretty a usecase about which not to forget.
Hello Paul,
can you tell me off the top of your head how ScratchABlock deals with function calls? Especially, where arguments and return value are expected to be specified. The PseudoC doc doesn't tell anything regarding this matter.
Below an example (again for my beloved arch - PowerPC):
where $r3, $r4 and $r5 are memcpy's arguments. Here's the expected SABl C-output:
memcpy(arg0, arg1, 4);