Open mewmew opened 7 years ago
Reko's format is "almost" round-trippable today, but needs that final push to make it happen.
Turns out there is quite a semantic gap between LLVM IR and Reko IR. In particular, Reko's IR can support fullblown expression trees:
Mem0[eax:word32] = Mem0[eax + 8:word32] + 4
where LLVM appears not to. However, this doesn't preclude the possibility of consuming simple LLVM files and "lifting" them into Reko IR.
The Reko IR appears abundantly in the many unit tests and other test outputs (e.g. https://github.com/uxmal/reko/blob/master/src/tests/Analysis/DfaMutualTest.exp). Some challenges are:
numeric constants are not typed in the output, except by their length. This means that the constant 0x00
is a byte sized value, while 0x0000
is a halfword constant. If a numeric constant is of signed integer type, it is rendered as an ordinary decimal number. This means we can't tell if -1
is an int16
, int32
or int64
. A possible solution is to have a prefix or suffix that indicates the size of a constant. E.g.
Mem0[eax:word32] = Mem0[eax + 8:word32] + 4:int32
The Storage
used for each identifier is not emitted in the output. Reko makes assumptions about storages when it discovers calling conventions etc. This could be solved by emitting in the output the architecture that was the origin of the code (similar to the LLVM triple
directive) and then, in the beginning of each procedure, declare the storages of all identifiers:
def eax:word32 reg("eax")
Mem0[eax:word32] = Mem0[eax:word32] + 4:int32
Address
and SegmentedAddress
need to be able to round-trip as well. Suggested formats are:
eax = Mem0[0x00123400:ptr32:word32]
es = Mem0[(0x0C00:0x14CD):segptr32:word16]
or possibly changing numeric constants to a "constructor" syntax:
ecx = ecx + int32(1)
esi = Mem0[esi + word32(4):word32]
eax = Mem0[ptr32(0x00123400):word32]
es = Mem0[segptr32(0x0C00:0x14CD):word16]
A noticeable drawback is the legibility of the format suffers with all the type annotations, but if the format is to be persistable, that information needs to be present.
The current format has the advantage of having a pleasant C-like look. That makes it more difficult to parse, howerver. Going full hog to a S-expression syntax addresses that problem, at the expense of readability:
(proc
(block "l1"
(let ecx (iadd ecx (int32 1)))
(let esi (mem32 (iadd esi word(4))))
(let eax (mem32 (ptr32 0x00123400))
(let es (mem (segptr32 0x0C00 0x14CD) word16))))
which amusingly but not surprisingly looks very much like the C# code used in a great many unit tests (sample below is not 100% accurate but you get the drift I hope):
m.Label("l1");
m.Assign(ecx, m.IAdd(ecx,m.Int32(1)));
m.Assign(esi, m.Mem32(m.IAdd(esi, m.Word(4))));
m.Assign(eax, m.Mem32(Address.Ptr32(0x00123400)));
m.Assign(es, m.SegMem16(Address.SegPtr(0x0C00, 0x14CD)));
The cute thing about the C# format is that it could of course be compiled by the C# compiler .
I've created a wiki page (https://github.com/uxmal/reko/wiki/Internal-representation) to record what the IR representation should look like. I would appreciate feedback from users on this as well.
From @mewmew
From @uxmal
From @mewmew
From @uxmal