pfalcon opened this issue 7 years ago
@mewmew : If you're interested in others' fiascoes ;-). Interprocedural analysis is godawfully hard and doesn't crack at the first push. I thought there was light at the end of the tunnel and that pushing somewhat more would lead to "useful" results, but nah. I'm probably ready to take a break from it (especially as there's a lot of refactoring to do with the more trivial analysis passes).
And there's a particularly awful situation on RISC, where there are too many registers. So even a leaf function defines many expressions in registers, and unless all callers kill those registers, they are considered (potential) returns. And if such a temporary expression register is defined along just one branch in a function, it also becomes an argument (because that's the only way to return such a register when it may sometimes be defined by this function, and sometimes not). This in turn leads to the non-killing mentioned above (because a potential return from one function may be a potential arg to another function).
This all leads to wild argument and return lists which don't converge in a rather sparse call graph (the use case I'm working on is actually a library of functions, not a complete binary). So, safe, overestimating dataflow analysis sucks a lot, and heuristics should be used to prune the majority of hallucinated args/returns :-(
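To illustrate the "defined along just one branch" case, here's a minimal sketch (toy sets and invented names, nothing from the actual ScratchABlock code) of how a safe, overestimating analysis inflates both lists:

```python
# Toy model: registers defined on each path through a leaf function.

def merge_paths(defined_per_path):
    # Definitely defined on exit: intersection over all paths.
    must = set.intersection(*defined_per_path)
    # Possibly defined on exit: union over all paths.
    may = set.union(*defined_per_path)
    return must, may

# A leaf function that sets a4 on one branch only:
must, may = merge_paths([{"a2", "a4"}, {"a2"}])

# a4 is "maybe defined": unless every caller kills a4, a safe analysis must
# list it as a potential return; and since the other path leaves a4
# untouched, its incoming value has to be treated as a parameter too.
assert may - must == {"a4"}
```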
Thanks for documenting your progress, especially the time-consuming setbacks which others may encounter as well, but which few ever decide to document.
Hats off.
So, safe, overestimating dataflow analysis sucks a lot, and heuristics should be used to prune the majority of hallucinated args/returns :-(
Even though it hurts to say, sometimes pragmatism beats formalism (and solid theory).
Ok, a few sleepless nights, and a dependency-guided, ordered algorithm cracks those 380 functions in 3 minutes. But boy, was that some debugging, with atrociously hilarious bugs keeping it spinning in a loop within a loop, with a loop inside; and when you isolate a specific looping function and simplify it, it starts to loop where it's impossible to loop - all because of missed consistent refactoring (the initial case was fixed a year ago, but it just keeps popping up in other places) and missed CFG preprocessing for dataflow analysis. Well, it's all behind now.
Well, not all. When it stopped looping and started to run in 3 mins, it turned out that it converges, but to a different solution every time, with nondeterministic sets being used. Which of course meant that I couldn't refactor the unbelievably hairy code from all these nights.
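For the record, the textbook fix for this class of bug (assuming the cause is Python's per-run-randomized set iteration order - a guess on my part, not a statement about the actual bugs) is to iterate in a canonical order, sketched below:

```python
# A minimal fixpoint skeleton (all names invented) that processes nodes and
# predecessors in sorted order, so ties are broken identically on every run
# and the analysis converges to the same solution each time.

def fixpoint(nodes, preds, transfer):
    state = {n: frozenset() for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in sorted(nodes):  # deterministic node order
            inp = frozenset().union(*(state[p] for p in sorted(preds[n])))
            new = transfer(n, inp)
            if new != state[n]:
                state[n] = new
                changed = True
    return state
```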
And well, I started to look at the most unrealistic results (which belong to cute functions like printf), and figured I need to decode jumptables after all to get a complete CFG, as having those indirect jumps dangling kills all analysis. What a life! ;-)
Glad to hear you've managed to make progress and track down the issues causing infinite loops.
and figured I need to decode jumptables after all to get a complete CFG, as having those indirect jumps dangling kills all analysis. What a life! ;-)
I know; separating concerns is, and continues to be, of fundamental importance for solving the more difficult problems of decompilation in a clean fashion.
Rough phases are outlined below, reflecting what I currently believe to be a good separation of concerns. Any input would be much appreciated :)
Any obvious steps you see that are missing?
For recovery of complete CFGs, take a look at the PhD talk by Alessandro Di Federico, rev.ng: a QEMU- and LLVM-based static binary analysis framework. He provides some good inspiration, mostly using symbolic execution to figure out the branch targets of indirect branches. There are also interesting notes on function (and tail call) discovery, using a notion of candidate functions, which are later promoted to functions based on a set of constraints (e.g. single entry point). The most interesting aspect is the discovery of tail calls, where a jmp to another function is identified as a tail call when it jumps past other functions.
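As a toy sketch (invented names, not rev.ng code), the core of that tail-call heuristic could look like this: a jump whose target lies outside the current function's address range is flagged as a tail call rather than an ordinary intra-procedural branch:

```python
def classify_jump(target, func_start, func_end):
    if func_start <= target < func_end:
        return "branch"     # stays within the current function
    return "tail call"      # leaves the function: candidate tail call

print(classify_jump(0x4010, 0x4000, 0x4080))  # branch
print(classify_jump(0x5200, 0x4000, 0x4080))  # tail call
```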
Any obvious steps you see that are missing?
You don't want to hear "you've got it all wrong", do you? ;-). So, instead let me tell you that I finally went to "announce" ScratchABlock (which before was in kinda stealth mode): https://www.reddit.com/r/ReverseEngineering/comments/66pa97/scratchablock_yet_another_crippled_decompiler/ . Generally, all this monkey business is dedicated to the 2nd anniversary of the project, where I conclude that a lot of time and effort went into it, and it still doesn't do anything useful, hence knee-jerk (re)actions to make it do something useful. So, if you forgot the orthodox outlook I maintain on decompilation, the dramatic discussion at Reddit above will remind you that I don't treat stuff like "ABI parsing (e.g. PE, ELF, ...)" as part of any realistic high-quality decompilation project. Because every decompilation project out there did that and produced low-quality results. So, time to prune the trivia out of decompilation.
That was one point. The 2nd point - why do you ask me? There are smarter people than me who have answered all these questions already; we just need to understand (and reconsider) the answers. The bible of decompilation, Van Emmerik's thesis, page 197 (of the PDF). And here you'll see that you indeed got it wrong: the dataflow analysis is done first (the parts in yellow are, again, not decompilation), type analysis is an advanced stage of it, and control flow analysis comes last.
It may not be immediately clear why that is so, and that's why I wrote that we should understand the answers. By now I've learnt it by heart: there's no need to structure unreachable code, it just needs to be removed, and you will know if it's reachable or not after dataflow analysis. Or dataflow analysis may show that some loop just copies memory; then you won't structure it as a loop, but rather as a memcpy().
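As a minimal sketch of the first point (toy CFG as a dict, invented names): once dataflow, e.g. constant propagation, proves some branch is never taken, its edge is dropped, and blocks no longer reachable from the entry are simply deleted before structuring ever sees them:

```python
def prune_unreachable(cfg, entry):
    # cfg: block -> list of successors (dead edges already removed by
    # the dataflow pass)
    seen, stack = set(), [entry]
    while stack:
        b = stack.pop()
        if b not in seen:
            seen.add(b)
            stack.extend(cfg[b])
    return {b: [s for s in succs if s in seen]
            for b, succs in cfg.items() if b in seen}

cfg = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": [], "dead": ["exit"]}
print(prune_unreachable(cfg, "entry"))  # "dead" is gone
```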
On the same page, you'll read:
Control flow analysis, ..., is a largely solved problem.
Said the guy whose decompiler (Boomerang) can't structure trivial constructs like infinite loops, or infinite loops with an if inside. That's why I say that we should reconsider the answers, and the biggest matter which requires reconsideration is the usage of SSA.
YMMV!
You don't want to hear "you've got it all wrong", do you? ;-).
Actually I do :) How else would you learn, if not by trying, making mistakes, and learning from people with more experience?
By now I've learnt it by heart: there's no need to structure unreachable code, it just needs to be removed, and you will know if it's reachable or not after dataflow analysis. Or dataflow analysis may show that some loop just copies memory; then you won't structure it as a loop, but rather as a memcpy().
Thanks! It's interesting that both of these make intuitive sense, but only after hearing them and thinking about them. Before, I mostly considered data flow and control flow analysis to be separate processes. But it is indeed true that a data flow analysis may prune and simplify parts of the control flow graph (either by functional-equivalence rewriting, e.g. recovery of memcpy, or by pruning unreachable code, thus simplifying the CFG).
2nd anniversary of the project
Glad to see you staying determined to work on non-trivial problems :) Wrote a reply in your reddit announcement thread.
Cheers /u
both of these make intuitive sense, but only after hearing them and thinking about them.
Re the first comment of this ticket: all (well, almost all) points there are described in Van Emmerik. It's just that, indeed, only after you start to peer at the results of those analyses do you get that "oh, now I understand what that guy was talking about" moment.
Btw, I turned this ticket into a doc to add to docs/; cross-referencing it to the Van Emmerik thesis sections is the only todo left, and I'd like to do that...
For recovery of complete CFGs, take a look at the PhD talk by Alessandro Di Federico, rev.ng: a QEMU- and LLVM-based static binary analysis framework.
I read the slides before; I'd consider them too bare to provide much useful content. Didn't find time to watch 40 minutes of YouTube video ;-).
He provides some good inspiration, mostly using symbolic execution to figure out the branch targets of indirect branches.
And that's exactly what I'm trying to avoid! The initial phases, like code-vs-data separation or CFG preparation, which aren't even decompilation per se, require all the advanced passes of decompilation, and actually more. But that's far too big and too deep a feedback cycle. If done that way, we'll never even approach the interesting problems of decompilation, being drowned in the slough of those matters, still finding the results aren't good. ("Passes over 80% of coreutils' testsuite" means 20% of tests fail. Assuming 1:1 coverage of tests and functions/basic blocks, that means every fifth of them is handled incorrectly by his analysis. But who said 1:1? Maybe the tests are biased towards testing trivial things, and maybe 50% of his results are incorrect.)
In my case, CFGs are already hand-prepared (thanks to my ScratchABit tool) by a human who can assess 100% correctness on a case-by-case basis; all basic blocks are there, the CFG just has connectivity problems where indirect jumps are dangling.
So, working on this analysis (whose implementation is necessarily ad hoc and heuristic) required some shift in ScratchABlock. Before this, it was completely symbolic and code-based. As jump tables are effectively data, I had to add support for dealing with binary data. And yep, code up the heuristic analysis, which, thanks to the already existing inventory of passes, is relatively easy.
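A hypothetical sketch of such a heuristic (the function, table format, and address ranges are all invented for illustration): once an indirect jump is matched against a jump-table pattern, read word-sized entries from the binary image at the table address, accept them as CFG successors while they look like code addresses, and stop at the first one that doesn't:

```python
import struct

def expand_jumptable(image, image_base, table_addr, code_start, code_end):
    targets = []
    off = table_addr - image_base
    while off + 4 <= len(image):
        (entry,) = struct.unpack_from("<I", image, off)  # 32-bit LE entry
        if not (code_start <= entry < code_end):
            break  # not a plausible code address: assume the table ended
        targets.append(entry)
        off += 4
    return targets  # becomes new CFG edges out of the jump's basic block

# Toy image: two code addresses followed by garbage terminating the table.
image = struct.pack("<3I", 0x40001000, 0x40001040, 0xFFFFFFFF)
print([hex(t) for t in
       expand_jumptable(image, 0x40000000, 0x40000000, 0x40000400, 0x40002000)])
```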
Btw, I turned this ticket into a doc to add to docs/; cross-referencing it to the Van Emmerik thesis sections is the only todo left, and I'd like to do that...
Will look forward to the upload.
So, here's "AI pass" I wanted to add, which oracles about function parameter set computed by dataflow analysis.
params: [a2, a3, a4, a5, a7, a8, a9, a10, sp]
params_why:
  a2: Likely a genuine param
  a3: Likely a genuine param
  a4: Likely a genuine param
  a5: Param because modified only along some paths
  a7: Param because modified only along some paths
  a8: Param because modified only along some paths
  a9: Param because modified only along some paths
  a10: Param because modified only along some paths
  sp: Param because address of object on stack is taken
The overall idea, as a reminder, is that dataflow analysis is, in its function, similar to a neural network. A neural network can do specific high-level things, but it does so using a myriad of senseless steps, and is completely devoid of understanding of how it arrives at the result. Consequently, nobody else has an idea how a specific NN works either, so if it does everything well except for one thing, nothing can be done about it. It can be retrained to take that thing into account, but nobody knows how that will influence other cases, and nobody will know until it actually hits them.
Dataflow analysis is not as moronic as neural networks - at least each step is understood and makes sense (even if oversimplified), and the number of repetitions is manageable for a human - but it's still possible to arrive at a surprising result in the end. And looking at specific parts of the result, it's not clear how they got there.
So, the pass above looks at that high-level result, and at other high-level results computed differently, and tries to annotate the high-level meaning of each piece of the result, which was computed using the low-level means of dataflow analysis.
Trying to do such annotation leads to surprising conclusions. For example, Van Emmerik argues - and it's absolutely correct - that only overestimating analyses are correct. So, I have underestimating passes in ScratchABlock, and effectively started to dismantle their use, to keep the problem space manageable. But using an overestimating analysis, it's easy to arrive at a result which is "absolutely correct and absolutely useless", like the result above where there are 9 param registers, including sp. On the other hand, there can be an underestimating analysis, which gives only a partial answer, but with a 100% guarantee for the parts of its result. Using such a result alone is incorrect, but it may be very useful for annotating the results of a "correct, but too big" analysis. For example, the oracle's output for register a5 above is incorrect - it's actually a genuine parameter (and not "likely", but 100%).
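A minimal sketch of that annotation idea (names and criteria invented, mirroring the params_why output above): the broad, overestimating analysis supplies the full set of possible params; a precise, underestimating analysis supplies a smaller guaranteed set; auxiliary facts supply the remaining labels:

```python
def annotate_params(may_params, sure_params, one_path_defs, stack_addr_taken):
    why = {}
    for p in sorted(may_params):
        if p in sure_params:
            why[p] = "Genuine param (guaranteed by underestimating analysis)"
        elif p in stack_addr_taken:
            why[p] = "Param because address of object on stack is taken"
        elif p in one_path_defs:
            why[p] = "Param because modified only along some paths"
        else:
            why[p] = "Likely a genuine param"
    return why

print(annotate_params(
    may_params={"a2", "a5", "a7", "sp"},
    sure_params={"a5"},          # e.g. what a5 above should have been
    one_path_defs={"a7"},
    stack_addr_taken={"sp"},
))
```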
Of course, Van Emmerik talks about the issue above too, as he does about many important things. Specifically, he says that it may be useful to distinguish "guaranteed-something" from "maybe-something" from "guaranteed-not-something" (p. 90).
He doesn't go into details, so the above is the idea of using an underestimating but more precise analysis to annotate the results of a broad, overestimating analysis in such a way. Another way to do that would be to make dataflow analyses "fuzzy", by flowing all this "may" information together with the data values. But for all the seeming simplicity of standard dataflow analyses, I keep discovering grossly idiotic (*) issues when performing them, so the idea of complicating them doesn't sound exciting. The idea is actually the opposite - to explain, and thus verify, the results of the existing analysis.
(*) They are grossly idiotic only once caught. On their own, with the idea of the input code in mind, they're completely normal. They just don't account for all the possible transformations which may happen to the code after it's input.
Once again, thanks for documenting your thought process. I think keeping truth values is definitely made more valuable by differentiating guaranteed, formally verified truths from inferred and potentially over-optimistic ones. Whether it makes sense to do a full-blown fuzzy-logic constraint propagation and proof engine remains a topic for future exploration. Nonetheless, I'm happy to have gained insight into these thoughts. Thanks!
@mewmew : You're welcome. As these are complex matters, I decided to indeed leave a thought trail, in case I need to take a break from working on it and return later. Glad to hear it's useful beyond that.
Otherwise, can't wait for the end of May, when another year of defeat can be declared and I can get back to other projects ;-). In the meantime, fighting back however ;-).
turned out that it converges, but to a different solution every time, with nondeterministic sets being used.
This should now be fixed. Well, at least I found 2 stupid bugs involved in that. Can't retest fully so far, as I'm being hit by https://github.com/pfalcon/ScratchABlock/issues/3 again, and this time I hope to resolve it properly (not hackily, as was done so far).
Almost there. There's still a core of 3 functions which twitch back and forth, but they will surely fall too ;-). Interesting, however, what the remaining bug there is, with all the fixes already done...
Btw, I turned this ticket into a doc to add to docs/; cross-referencing it to the Van Emmerik thesis sections is the only todo left, and I'd like to do that...
I did some work on the docs, and because I still didn't find time to cross-reference the notes with Van Emmerik, I just pushed them as-is for now: https://github.com/pfalcon/ScratchABlock/compare/fa2d3dc2dc0d601b461b0333d4601eed7ee00354...03250e31bfe43b4e465344555d6c68022ee516a3
This is to document a milestone in ScratchABlock development. After a considerable amount of work over the last month, and an uncountable number of fixes, I was finally able to run an interprocedural analysis which recovered arguments and returns in a group of 3 functions.
However, trying to run it on my standard use case (the esp8266 boot ROM, ~380 loosely connected functions) ended in a fiasco - after 2 hours of runtime and 300 analysis iterations, the analysis didn't converge. That's of course because ScratchABlock started with context-free processing, and even for interprocedural analysis, functions are processed one by one, in randomized order.
This is about the same problem as with the structuring passes - while it's known that a particular processing order is beneficial (in most cases, merely in most cases), ScratchABlock so far does it simply, by "randomized" search. But what scales for searching within a typical couple of dozen basic blocks or fewer doesn't scale to "searching" within hundreds of functions.
So, the next step going forward is to implement context-led, ordered processing based on DFS postorder traversal of the call graph.
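A minimal sketch of that ordering (toy call graph as a dict; all names invented): DFS postorder yields callees before callers, so when a function's turn comes, the arg/return summaries of its (non-recursive) callees have already been computed:

```python
def analysis_order(callgraph, roots):
    order, seen = [], set()
    def dfs(f):
        if f in seen:
            return              # recursion/cycles are simply cut here
        seen.add(f)
        for callee in sorted(callgraph.get(f, [])):
            dfs(callee)
        order.append(f)         # postorder: after all callees
    for r in sorted(roots):
        dfs(r)
    return order

callgraph = {"main": ["helper", "printf"], "helper": ["memcpy"]}
print(analysis_order(callgraph, ["main"]))
# -> ['memcpy', 'helper', 'printf', 'main']
```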