lifting-bits / dds

Dr. Disassembler
35 stars 9 forks source link

Semantics integration, per-function throwaway databases for heavyweight dataflow analyses #13

Open pgoodman opened 3 years ago

pgoodman commented 3 years ago

This issue serves to track ideas and goals related to integrating instruction semantics.

Basic example:

EA_LEA:   lea eax, [edi + esi * 2 + 0xf00]
EA_MOV:   mov ecx, dword ptr [eax]

From this, we'd like to create the following approximate relations:

raw_operation(0, FuncEA, EA_LEA, READ_REGISTER_32, REG_EDI, _, _)  ; read edi
raw_operation(1, FuncEA, EA_LEA, READ_REGISTER_32, REG_ESI, _, _)  ; read esi
raw_operation(2, FuncEA, EA_LEA, CONSTANT_32, 0xf00, _, _)
raw_operation(3, FuncEA, EA_LEA, BASE_2xINDEX_DISP, 0, 1, 2) ; edi + (esi * 2) + 0xf00
raw_operation(4, EA_LEA, WRITE_REGISTER_32, REG_EAX, 3, _)

raw_operation(5, FuncEA, EA_MOV, READ_REGISTER_32, REG_EAX, _, _)
raw_operation(6, FuncEA, EA_MOV, READ_MEMORY_32, 5, _, _)  ; `dword ptr [eax]`
raw_operation(7, FuncEA, EA_MOV, WRITE_REGISTER_32, REG_ECX, 6, _)

In the above, things like REG_EDI would be the unique IDs of values for the registers. So probably raw_operation wouldn't start at 0.

From here, we'd want rules that do some basic things like copy propagation. This means definition the post-instruction register state. These rules would exist in a separare datalog db, with one instance per function. The idea would be to run these, have them publish messages that we'd store back into the main db, then destroy these instances until they're needed again. Key idea: throwaway databases.


; Orchestrated by creator of throwaway db, unique values for `ValOnFuncEntry`.
#message func_entry_reg_state(FuncEA, RegName, ValOnFuncEntry)

; Sent from the users of the main DB to us, based off of `transfer`.
#message local_transfer(FromEA, ToEA)

; !!!!
; Could have a bunch of messages, e.g. for constants or symbolic values that are
; user-provided, for interactive constant folding / symexec during a speculative execution
; or speculative specialization of a function.
; !!!!

; Base case for beginning of a function: each register has a tombstone `VAL_ON_FUNC_ENTRY` value
; on entry to a function.
post_inst_reg_state(FuncEA, RegName, ValOnFuncEntry)
    : func_entry_reg_state(FuncEA, RegName, ValOnFuncEntry).
    , !raw_operation(_, FuncEA, FuncEA, WRITE_REGISTER_32, RegName, _, _).

; Any operation that writes out the register value submits its value to the post-register state
; of that instruction. 
post_inst_reg_state(InsnEA, RegName, WrittenId)
    : raw_operation(_, FuncEA, InsnEA, WRITE_REGISTER_32, RegName, WrittenId, _).

; If we have a flow from `PredInsnEA` to `InsnEA`, and if `InsnEA` doesn't write to
; `RegName`, then propagate the value of `WrittenId` for `RegName`.
post_inst_reg_state(FuncEA, InsnEA, RegName, WrittenId)
    : local_transfer(PredInsnEA, InsnEA)
    , !raw_operation(_, FuncEA, InsnEA, WRITE_REGISTER_32, RegName, _, _)
    , post_inst_reg_state(FuncEA, PredInsnEA, RegName, WrittenId).

I think with the simple rules above, execution would converge toward the following:

For every point in a function, we would be able to express register values in terms of register values from another place in the function. This is something that the GrammaTech people mentioned in their ddisasm paper. Basically, we could say: there exists a path such that the register written at instruction EA1 is read by the instruction at EA2.

I think one thing that becomes apparent from this type of "raw operation" representation is that we'd want to represent conditional register writes that either preserve the register contents or alter them, that way we can model those data-centric flows.

pgoodman commented 3 years ago

post_inst_reg_state is basically reaching definitions :-)

pgoodman commented 3 years ago

I think it'd also be interesting to do a form of equality saturation over the IRs. This is be beneficial for loads/stores being converted into things like typed field accesses.