Define human-readable IR of Reko and make it "round trippable"

mewmew commented 7 years ago

From @mewmew

another area I'd be interested in is for Reko to dump its information in between stages, in a well defined format, so other tools may hook into the phases of Reko and massage the data in certain ways, or extract useful information for other workflows.

From @uxmal

I have had this idea for a long while of making the reko IR "round trippable" so that it could be saved to a file. for another tool to process maybe a subset of LLVM IR

From @mewmew

then we could tie into KLEE and other tools

From @uxmal

I also want reko to be able to consume LLVM IR files as input. so you can take stuff a compiler front end has generated and go work on that

uxmal commented 7 years ago

Reko's format is "almost" round-trippable today, but needs that final push to make it happen.

uxmal commented 7 years ago

Turns out there is quite a semantic gap between LLVM IR and Reko IR. In particular, Reko's IR can support fullblown expression trees:

Mem0[eax:word32] = Mem0[eax + 8:word32] + 4

where LLVM appears not to. However, this doesn't preclude the possibility of consuming simple LLVM files and "lifting" them into Reko IR.

The Reko IR appears abundantly in the many unit tests and other test outputs (e.g. https://github.com/uxmal/reko/blob/master/src/tests/Analysis/DfaMutualTest.exp). Some challenges are:

numeric constants are not typed in the output, except by their length. This means that the constant 0x00 is a byte sized value, while 0x0000 is a halfword constant. If a numeric constant is of signed integer type, it is rendered as an ordinary decimal number. This means we can't tell if -1 is an int16, int32 or int64. A possible solution is to have a prefix or suffix that indicates the size of a constant. E.g.
```
Mem0[eax:word32] = Mem0[eax + 8:word32] + 4:int32
```
The Storage used for each identifier is not emitted in the output. Reko makes assumptions about storages when it discovers calling conventions etc. This could be solved by emitting in the output the architecture that was the origin of the code (similar to the LLVM triple directive) and then, in the beginning of each procedure, declare the storages of all identifiers:
```
def eax:word32 reg("eax")
Mem0[eax:word32] = Mem0[eax:word32] + 4:int32
```
Address and SegmentedAddress need to be able to round-trip as well. Suggested formats are:
```
eax = Mem0[0x00123400:ptr32:word32]
es = Mem0[(0x0C00:0x14CD):segptr32:word16]
```
or possibly changing numeric constants to a "constructor" syntax:
```
ecx = ecx + int32(1)
esi = Mem0[esi + word32(4):word32]
eax = Mem0[ptr32(0x00123400):word32]
es = Mem0[segptr32(0x0C00:0x14CD):word16]
```
A noticeable drawback is the legibility of the format suffers with all the type annotations, but if the format is to be persistable, that information needs to be present.
The current format has the advantage of having a pleasant C-like look. That makes it more difficult to parse, howerver. Going full hog to a S-expression syntax addresses that problem, at the expense of readability:
```
(proc
(block "l1"
    (let ecx (iadd ecx (int32 1)))
    (let esi (mem32 (iadd esi word(4))))
    (let eax (mem32 (ptr32 0x00123400))
    (let es (mem (segptr32 0x0C00 0x14CD) word16))))
```
which amusingly but not surprisingly looks very much like the C# code used in a great many unit tests (sample below is not 100% accurate but you get the drift I hope):
```
m.Label("l1");
m.Assign(ecx, m.IAdd(ecx,m.Int32(1)));
m.Assign(esi, m.Mem32(m.IAdd(esi, m.Word(4))));
m.Assign(eax, m.Mem32(Address.Ptr32(0x00123400)));
m.Assign(es, m.SegMem16(Address.SegPtr(0x0C00, 0x14CD)));
```
The cute thing about the C# format is that it could of course be compiled by the C# compiler .

I've created a wiki page (https://github.com/uxmal/reko/wiki/Internal-representation) to record what the IR representation should look like. I would appreciate feedback from users on this as well.

uxmal / reko

Define human-readable IR of Reko and make it "round trippable" #439