Open pgoodman opened 4 years ago
Here is a rough example of the patterns we want to recognize, and how we want to "re-interpret" them:
The following patterns describe "code redirection", that is, cases where we want to orchestrate the lifting of control flow so that it is redirected to something other than what is actually in the binary. In practice, the idea is to redirect to the intended target, rather than the actual mechanical target.
(1) Pattern: call [__libc_start_main@plt]
Replacement: call __libc_start_main
From an LLVM standpoint, this means the following: we see a remill::Instruction with kCategoryIndirectCall or something like that. Normally this triggers calling the CALL semantics function, then making a function call to __remill_function_call. What we want is a redirection entry saying that the target is __libc_start_main, so that we call its lifted function instead of calling __remill_function_call.
(2) Pattern: jmp [printf@plt]
Replacement: jmp printf
From an LLVM standpoint, this is similar to pattern and replacement (1). We want to have the same kind of redirection entry. Here, instead of lifting this as a call to the semantics followed by a __remill_jump, we want to lift it as a call to the semantics followed by a terminating tail call to the lifted external printf.
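As a rough illustration of the redirection entries described for patterns (1) and (2), here is a minimal Python sketch. It is not Anvill or Remill code: the `Inst` type, the category labels, the table layout, and the slot addresses are all assumptions made up for the example, and the "lifted" output is just text describing what would be emitted.

```python
from typing import Dict, List, NamedTuple, Optional


class Inst(NamedTuple):
    """Toy stand-in for a decoded instruction (not remill::Instruction)."""
    category: str              # "indirect_call" or "indirect_jump" (illustrative labels)
    slot_addr: Optional[int]   # address read by the [...] operand, if statically known


def lift_control_flow(inst: Inst, redirections: Dict[int, str]) -> List[str]:
    """Describe, as text, what a lifter could emit for `inst`."""
    ops = ["call semantics"]  # the instruction's semantics run in both cases
    target = None
    if inst.slot_addr is not None:
        target = redirections.get(inst.slot_addr)

    if inst.category == "indirect_call":
        if target is not None:
            ops.append("call lifted " + target)
        else:
            ops.append("call __remill_function_call")
    elif inst.category == "indirect_jump":
        if target is not None:
            ops.append("tail call lifted " + target)
        else:
            ops.append("call __remill_jump")
    else:
        raise ValueError("unsupported category: " + inst.category)
    return ops


# Hypothetical slot addresses standing in for the two [...@plt] operands.
REDIRECTIONS = {0x601018: "__libc_start_main", 0x601020: "printf"}
```

With this table, an indirect call through the first slot is described as a call to the lifted `__libc_start_main`, while an indirect call through an unknown slot keeps the normal `__remill_function_call` path.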
(3) Pattern:
_printf:
  jmp [printf@plt]
foo:
  ...
  call _printf
Replacement:
foo:
  ...
  call printf
This is similar to (1), but instead of calling the lifted version of the internal _printf, we want to redirect execution to the lifted external printf.
(4) Pattern:
_printf:
  jmp [printf@plt]
foo:
  ...
  jmp _printf
Replacement:
foo:
  ...
  jmp printf
Similar to pattern 3, but using a terminating tail call redirection.
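Patterns (3) and (4) both hinge on noticing that the internal function is only a thunk. The sketch below shows one way that check could look, assuming a toy model where a function body is a list of (mnemonic, operand) pairs and a table maps slot operands to external names. Both the model and the function names are hypothetical, not taken from Anvill.

```python
from typing import Dict, List, Optional, Tuple

# A function body as (mnemonic, operand) pairs. Illustrative model only.
Body = List[Tuple[str, str]]


def thunk_target(body: Body, plt_slots: Dict[str, str]) -> Optional[str]:
    """Return the external's name if `body` is exactly `jmp [slot]` for a known slot."""
    if len(body) != 1:
        return None
    mnemonic, operand = body[0]
    if mnemonic != "jmp":
        return None
    return plt_slots.get(operand)


def redirect_transfer(mnemonic: str, callee: str,
                      functions: Dict[str, Body],
                      plt_slots: Dict[str, str]) -> Tuple[str, str]:
    """Rewrite `call X` / `jmp X` to target the external behind X, if X is a thunk."""
    external = thunk_target(functions.get(callee, []), plt_slots)
    if external is None:
        return (mnemonic, callee)  # not a thunk: leave the transfer alone
    return (mnemonic, external)


# Assumed inputs mirroring the patterns above.
FUNCTIONS = {
    "_printf": [("jmp", "[printf@plt]")],
    "foo": [("mov", "rdi, rsi"), ("call", "_printf")],
}
PLT_SLOTS = {"[printf@plt]": "printf"}
```

Here `redirect_transfer("call", "_printf", FUNCTIONS, PLT_SLOTS)` gives `("call", "printf")`, and the `jmp` case is handled the same way, which matches the call versus tail-call split between patterns (3) and (4).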
The following patterns describe data relocation-based patterns. This means operating on the actual operands of a lifted instruction, and substituting them with something else. Here are some examples of what we want to deal with.
Pattern:
mov rax, [__libc_start_main@plt]
call rax
Replacement:
tmp = alloca
store __libc_start_main, tmp
state->rax = load tmp
This one is tricky. We want a relocation entry that says that a memory load of the address __libc_start_main@plt will load the address of the external __libc_start_main. By extending the instruction lifter class, in a nearly identical way to McSema, we can interpose on the operands and check whether they are used for memory reads or address generation, then try to figure out the effective loaded address and determine whether a relocation applies. If one does, we want to invent a new address to be loaded, based off of an alloca that we pre-fill with the address of the external __libc_start_main.
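The relocation lookup on memory reads can be sketched in the same toy style. This assumes a hypothetical table keyed by the effective address being read. The emitted lines are pseudo-IR text in the shape used above, not real LLVM IR, and the address is made up.

```python
from typing import Dict, List, Optional


def lift_memory_read(dest_reg: str, effective_addr: Optional[int],
                     relocations: Dict[int, str]) -> List[str]:
    """Return pseudo-IR lines for a load of `effective_addr` into `dest_reg`."""
    symbol = None
    if effective_addr is not None:
        symbol = relocations.get(effective_addr)

    if symbol is None:
        # No relocation applies: keep an ordinary load of the computed address.
        addr_text = "<unknown>" if effective_addr is None else hex(effective_addr)
        return ["state->%s = load %s" % (dest_reg, addr_text)]

    # A relocation applies: load from a fresh slot pre-filled with the external.
    return [
        "tmp = alloca",
        "store %s, tmp" % symbol,
        "state->%s = load tmp" % dest_reg,
    ]


# Hypothetical address for the __libc_start_main@plt slot.
RELOCATIONS = {0x601018: "__libc_start_main"}
```

The design point is that only the load is rewritten. The later `call rax` is lifted as usual and simply sees the external's address in `rax`.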
I think we can finally close this.
We can't close just yet. Parts of this issue are done, but not all parts. What remains to be done:
call [__libc_start_main]. This means adding control-flow target support into FunctionLifter::VisitIndirectFunctionCall.
The ELF thunk recognition code of McSema should be copied and adapted for Anvill so that if a function references an ELF thunk, then we go and follow through and find the referenced external and use its name in the prototype, rather than the name of the function itself, which may be prefixed with junk.
That is, instead of the prototype of this function having the name _signal or .signal, we should follow through to the .plt segment and take the info from there.
The relevant code to adapt from McSema is:
https://github.com/lifting-bits/mcsema/blob/master/tools/mcsema_disass/ida7/get_cfg.py#L334-L466
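For orientation only, the follow-through idea could look roughly like the sketch below. It is not the McSema code linked above and does not reproduce its logic. The `Symbol` record, the segment labels, the addresses, and the hop limit are all assumptions chosen for the example.

```python
from typing import Dict, NamedTuple, Optional


class Symbol(NamedTuple):
    name: str
    segment: str               # e.g. ".text", ".plt", "extern" (illustrative labels)
    refers_to: Optional[int]   # address this symbol forwards to, if it is a thunk


def prototype_name(addr: int, symbols: Dict[int, Symbol], max_hops: int = 8) -> str:
    """Follow forwarding references from `addr`; use the external's name if one is reached."""
    original = symbols[addr].name
    seen = set()
    cur = addr
    for _ in range(max_hops):
        if cur in seen or cur not in symbols:
            break  # cycle or dangling reference: give up
        seen.add(cur)
        sym = symbols[cur]
        if sym.segment == "extern":
            return sym.name
        if sym.refers_to is None:
            break  # ordinary function, not a thunk
        cur = sym.refers_to
    return original  # fall back to the function's own name


# Hypothetical layout: _signal in .text forwards to .signal in .plt, then to signal.
SYMBOLS = {
    0x1000: Symbol("_signal", ".text", 0x2000),
    0x2000: Symbol(".signal", ".plt", 0x3000),
    0x3000: Symbol("signal", "extern", None),
    0x4000: Symbol("main", ".text", None),
}
```

With this layout, `prototype_name(0x1000, SYMBOLS)` yields `signal` rather than `_signal`, while a function that is not a thunk keeps its own name.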