How to decompile a binary?

rfalke commented 6 years ago

This my be a RTFM but I wasn't successful. Steps:

$ cd /tmp
$ git clone --recursive https://github.com/pfalcon/ScratchABit
$ cd ScratchABit/
$ cp /usr/bin/true .
$ file true
true: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=732174354e1961687406a4689546c00c58b8ef00, stripped
$ python3 ScratchABit.py --save true
Loading true...

$ ls true.scratchabit
project.aprops.55ab0000  project.aprops.55ab7000  project.aprops.55ac0000  project.aprops.55ac7000  project.aspace.55ab0000  project.aspace.55ab7000  project.aspace.55ac0000  project.aspace.55ac7000
project.aprops.55ab1000  project.aprops.55ab8000  project.aprops.55ac1000  project.aprops.55ac8000  project.aspace.55ab1000  project.aspace.55ab8000  project.aspace.55ac1000  project.aspace.55ac8000
project.aprops.55ab2000  project.aprops.55ab9000  project.aprops.55ac2000  project.aprops.55ac9000  project.aspace.55ab2000  project.aspace.55ab9000  project.aspace.55ac2000  project.aspace.55ac9000
project.aprops.55ab3000  project.aprops.55aba000  project.aprops.55ac3000  project.aprops.55aca000  project.aspace.55ab3000  project.aspace.55aba000  project.aspace.55ac3000  project.aspace.55aca000
project.aprops.55ab4000  project.aprops.55abb000  project.aprops.55ac4000  project.aprops.55cd0080  project.aspace.55ab4000  project.aspace.55abb000  project.aspace.55ac4000  project.aspace.55cd0080
project.aprops.55ab5000  project.aprops.55abc000  project.aprops.55ac5000  project.aprops.55cd1080  project.aspace.55ab5000  project.aspace.55abc000  project.aspace.55ac5000  project.aspace.55cd1080
project.aprops.55ab6000  project.aprops.55abf000  project.aprops.55ac6000  project.aprops.55cd11e0  project.aspace.55ab6000  project.aspace.55abf000  project.aspace.55ac6000  project.aspace.55cd11e0

$ cd ..
$ git clone https://github.com/pfalcon/ScratchABlock.git
$ cd ScratchABlock
$  mv ../ScratchABit/true.scratchabit/ .
$ ./apply_xform.py --script script_decompile --format c true.scratchabit/
INFO:decomp:Using funcdb: true.scratchabit//funcdb.yaml
$

There is no new file written. What do I miss here?

maximumspatium commented 6 years ago

There is no new file written. What do I miss here?

ScratchABlock expects its input to be in PseudoC format, see here.

You'll need to convert your assembly to PseudoC to be able to use ScratchABlock. IIRC, ScratchABit disassembler doesn't currently offer such a conversion. You'll probably need to roll out your own solution like I did for PPC...

pfalcon commented 6 years ago

@rfalke: Thanks for dropping by. As was said by by @maximumspatium, ScratchABlock works strictly with PseudoC program representation. To understand that choice better, it should be noted that ScratchABlock comes out from the fatigue with all the other decompilers out there, which start with all those binary format parsers, loaders, machine code lifters and other glorious crap which has little to do with the decompilation per se. It was an explicit choice to follow Unix paradigm and leave machine code parsing to outside tools.

(Btw, I know that in general case binary analysis is needed for decompilation, e.g. for distinguishing code from data, but that such a low-level analysis, and needed for other things like disassembly, that I explicitly leave it outside the scope of SABlock, to the other tools).

These points (without commentary) were now made explicit in f1828eb3390901bdaa76821c9d3416bf9be04983.

So, in my ideal world, decompilers would take a common and easy to generate/understand representation (which LLVM IR is not), and it will be easy to compare them instead of constantly hitting cases like: "oh, this DOS binary cannot be loaded" or "oh, this architecture is not supported". Of course, nobody cares, and I'm just like everyone ;-).

If you're interested in PseudoC corpus, there's a dozen megabytes in https://github.com/pfalcon/xtensa-subjects/blob/master/2.0.0-p20160809/out.lst

rfalke commented 6 years ago

I understand than separating these phases makes sense.

Than let me rephrase my question: what program is out there to convert a binary to a PseudoC representation? @maximumspatium doesn't think ScratchABit is doing this and the Readme of ScratchABit also doesn't mention it. But on the other side the file https://github.com/pfalcon/xtensa-subjects/blob/master/2.0.0-p20160809/out.lst was generated by some software. Maybe this can be clarified in some readmes.

pfalcon commented 6 years ago

https://github.com/pfalcon/xtensa-subjects/blob/master/2.0.0-p20160809/out.lst was generated by https://github.com/pfalcon/ida-xtensa2/blob/master/xtensa.py#L959 , and I specifically point to the lines which choose whether to generate native Xtensa assembly mnemonics or PseudoC mnemonics.

All this possible due to the fact that Xtensa is about the most pleasant RISC ISA around: it both doesn't have unRISCy stuff like ARM's flag registers, and doesn't have too-RISCy stuff like MIPS delay slots. Due to this, each single Xtensa instruction translates to single PseudoC instruction. Well, almost, there's some unRISCy stuff still, like conditional moves, but PseudoC cunningly covers those using "macro-like" functionality, https://github.com/pfalcon/ScratchABlock/blob/master/docs/PseudoC-spec.md#macro-like-functionality , so correspondence "1 Xtensa asm instruction == 1 PseudoC instruction" remains.

Most other ISA wouldn't be supported by ScratchABit plugins, as one asm instruction would be translated to more than one PseudoC instruction. But there ideas how to handle that, even on ScratchABit: a) allow multiple instructions per line; b) even more cunning macros, see text above. None of this was implemented.

But in general case you don't need ScratchABit. Start with a simple awk script to convert native arch's asm into PseudoC. For many archs, it should be possible to go quite along with context-free rewriting (i.e. convert each instruction looking at just that instruction). Such conversion would be verbose, but ScratchABlock expr propagation and dead code elimination would clean up that quickly.

Very bare x86 example is in https://github.com/pfalcon/ScratchABlock/tree/master/tests/decomp/020-cond-flags1

maximumspatium commented 6 years ago

But in general case you don't need ScratchABit. Start with a simple awk script to convert native arch's asm into PseudoC.

Although this sounds easy, handling of real-world programs is much harder than that. Almost every Asm need to be pre-processed to remove fancy ABI details ScratchABlock won't be able to cope with. I mean function prologues/epilogues, stack frames and machine code idioms (the opposite to macros). The problem is that no free software solution to solve these problems exists. IDA does them all but it's neither free nor extensible - low-level code transformations are closed and will be performed behind your back. Processing of machine code idioms requires sophisticated program analysis. Fortunately, PseudoC allows manual editing so the short-term solution is to replace idiomatic code sequences with high-level equivalents by hand. Unfortunately, it's very tedious and prone to errors. The long-term solution would be to roll out specialized scripts on top of an intelligent disassembler offering you program analysis and transformation capabilities. So we're back to the beginning: you'll probably need ScratchABit unless you want to perform low-level code transformations manually...

rfalke commented 6 years ago

I want to compare decompilers. If I look at the decompilers which are currently under active development (reko, boomerang, holdec, snowman, retdec, hex-rays, radeco, ...) I see that there is exactly one common input type which is an exe file with the code. More specifically 32bit intel (i386+) in an ELF container is a format which is supported by every decompiler from the list above. And because of this it is also the format I write my test programs in (for example https://github.com/rfalke/decompiler-subjects/tree/master/from_holdec/dmi/cfg).

I would like to also test ScratchABlock, include it in https://github.com/rfalke/decompiler-subjects and provide feedback but for this I need some automatic way to decompile a binary.

I understood that there is code for an automatic translation from Xtensa to PseudoC but none for any other architecture. So it looks to me that sadly ScratchABlock can't be included.

pfalcon commented 6 years ago

Although this sounds easy, handling of real-world programs is much harder than that.

For ugly and irregular archs like x86? Of course. That's why every month yet another microsoft pops up with yet another llvm-mctool - because it's hard, and previous such dudes can't be trusted to do it right, so everyone redoes it from scratch again and again ;-).

maximumspatium commented 6 years ago

I want to compare decompilers. If I look at the decompilers which are currently under active development (reko, boomerang, holdec, snowman, retdec, hex-rays, radeco, ...) I see that there is exactly one common input type which is an exe file with the code. More specifically 32bit intel (i386+) in an ELF container is a format which is supported by every decompiler from the list above.

That's the basic problem with the existing decompilers. You'll be struggling to make them to process some binary. Many of them will just fail due to ABI related issues.

Comparing decompilation results in some automatic way is almost impossible because:

dealing with ABI details (disassembling, pre-processing, idiom analysis) isn't decompilation (as repeatedly stated by Paul @pfalcon) but a really hard obstacle with real-world programs. The lack of an universal solution for that wastes development time and resources. Poorly implemented solutions will usually degrade decompilation results.
the most decompilers won't allow you to skip the ABI analysis and ASM->IR translation because their IR is private so you won't be able to compare decompiler transformations in a machine-independent way. You'll be comparing apples to oranges
decompilers working with SSA will output different code as those without SSA

Frankly speaking, a fully automatic decompilation is out of reach. I see decompilers as tools helping to minimize manual processing by rolling out various analysis and transformations guided by humans. A fair comparison should measure the helpfulness of a decompiler.

EDIT: I was able to decompile a real-world PowerPC binary using ScratchABlock. Kudos to Paul! So ScratchABlock was not only useful but my only option (no other decompiler in the world supports PowerPC). In this case, it doesn't matter if it can be directly compared to other decompilers...

maximumspatium commented 6 years ago

For ugly and irregular archs like x86? Of course.

Yes, but other archs (ARM, PPC, MIPS) require that too. Maybe Xtensa doesn't. Unfortunately, I never got a chance to work with such a great hardware, but mostly with boring one...

it's hard, and previous such dudes can't be trusted to do it right, so everyone redoes it from scratch again and again

It's indeed true and sad. An open and extensible solution is needed...

rfalke commented 6 years ago

While an automatic comparison of the decompiler outputs would be nice this isn't the goal.

I think there is a different understanding of what decompilation consists of. For me a compiler transforms source code into a binary. A decompiler transforms a binary into source code. Both of them are of course using multiple stages which may be external programs or internal phases. But these are implementation details.

For the compiler these phases may be preprocessor, the real compiler, assembler and linker. For the decompiler the phases are usually: loader, disassembler, transform to IR, CFG recovery, value propagation and expression simplification, type analysis, stack analysis / function calling and output.

So in my mental model ScratchABlock covers some actions/phases (maybe also the important ones) of a decompiler but something is missing to name it a proper decompiler.

rfalke commented 6 years ago

Regarding PowerPC: You may want to give Reko (https://github.com/uxmal/reko/wiki/Supported-binaries) a try.

maximumspatium commented 6 years ago

Regarding PowerPC: You may want to give Reko (https://github.com/uxmal/reko/wiki/Supported-binaries) a try.

Thanks. Reko doesn't support Apple PEF so it's useless in my case. That brings us back to the beginning: disassembling and decompilation need to be separated. Otherwise, we'll spend time and resources implementing the first three steps again and again, each time for another decompiler project, programming language and CPU architecture...

rfalke commented 6 years ago

You propose to use a new IR with only 2 supported instructions sets (Xtensa and PowerPC) and these two instruction set are easy to do. And it is unclear if the harder ones like i386 can be supported by the new IR and satisfy the goal of the IR ("Syntax should be automagically understood by any (qualified) programmer"). Sounds quite risky to me.

rfalke commented 6 years ago

Another note: There are cases where the second half requires data from the first half. So it isn't a one-way street. Examples are: jump table for indexed jumps (here also relocs have to be processed) and loading constants (here printf format strings are important to generate good output).

pfalcon commented 6 years ago

Another note: There are cases where the second half requires data from the first half. So it isn't a one-way street.

Given that it was said at the beginning of our conversation: https://github.com/pfalcon/ScratchABlock/issues/30#issuecomment-435650690

"(Btw, I know that in general case binary analysis is needed for decompilation, e.g. for distinguishing code from data, but that such a low-level analysis, and needed for other things like disassembly, that I explicitly leave it outside the scope of SABlock, to the other tools)."

- then you're very attentive interlocutor, and it's very nice to talk to you. To avoid any further confusion on your side how open-source projects work, let me make it clear: "Patches welcome".

Btw, an (adhoc, suitable for a particular usecase I had at hand) pass to decode jumptables is coming. Quite unsurprisingly it requires symbol table and raw binary data from sections of a binary. But nope, they don't come directly with ELF file, because ScratchABlock isn't concerned with ELF files and stuff. (For me) they come from ScratchABit, that's the tool which is concerned with decoding stuff like binary file formats. They may come from any other tool either.

maximumspatium commented 6 years ago

Btw, an (adhoc, suitable for a particular usecase I had at hand) pass to decode jumptables is coming.

Do we want to open a separate issue for that? I'd like to contribute to this topic as well based on my recent decompilation experience.

Quite unsurprisingly it requires symbol table and raw binary data from sections of a binary.

It would be interesting to see your solution. I used a very simple approach converting each (index, address) tuple to a separate PseudoC "if" statement. Later, their structuring was done manually resulting in a messy output. The long-term solution would be a full-fledged "switch" structuring pass...

rfalke commented 6 years ago

Thanks for the invitation but I have my own (closed source) decompiler (http://kronotai.com/wordpress/holdec/) which gets most of the time.

I will just leave you now. If ScratchABit+ScratchABlock can process x86 binaries drop me a line at https://github.com/rfalke/decompiler-subjects

pfalcon commented 6 years ago

@maximumspatium:

Do we want to open a separate issue for that? I'd like to contribute to this topic as well based on my recent decompilation experience.

Sure, if you have some ideas you want to share, go ahead.

It would be interesting to see your solution. I used a very simple approach converting each (index, address) tuple to a separate PseudoC "if" statement.

Well, what I remember of it (work on it this spring), is that I introduced generalized switch node indeed ;-). As usual, need to checked for (over)generality, testcase added, etc.

pfalcon commented 6 years ago

@rfalke

Thanks for the invitation but I have my own (closed source) decompiler (http://kronotai.com/wordpress/holdec/) which gets most of the time.

I know. I follow the scene for quite a while, in particular read your blog (all 3 posts in it or how many you have) years before I got sick of what happens in this scene and embarked on writing my own piece ;-).

pfalcon / ScratchABlock

How to decompile a binary? #30