Open joxeankoret opened 5 years ago
Hello,
Let me elaborate more on this. It's good to have this here for reference purposes :)
In diaphora_ida.py one can see the following:
decoded_size, ins = diaphora_decode(x)
if ins.Operands[0].type in [o_mem, o_imm, o_far, o_near, o_displ]:
    decoded_size -= ins.Operands[0].offb
if ins.Operands[1].type in [o_mem, o_imm, o_far, o_near, o_displ]:
    decoded_size -= ins.Operands[1].offb
if decoded_size <= 0:
    decoded_size = 1
...
curr_bytes = GetManyBytes(x, decoded_size, False)
What happens here is that you remove operand bytes from the instructions and use only the opcode and prefixes to compute a signature, which you name function_hash. Another type of signature, named bytes_hash, takes into account all instruction bytes. So, normally, function_hash and bytes_hash should be different. This works fine for x86, but I've noticed that, on ARM, offb is always 0 (which makes sense, as operand encoding is interleaved with opcode encoding). In this case bytes_hash and function_hash are, most of the time, equal!
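To make the effect concrete, here is a minimal sketch that runs outside IDA. It only mirrors the arithmetic of the snippet above; `Operand`, `Insn`, `trimmed_size` and `hashes` are hypothetical stand-ins, MD5 is an arbitrary choice of digest, and the offb values are assumptions about what the decoder would report.

```python
import hashlib
from collections import namedtuple

# Hypothetical stand-ins for IDA's decoded instruction and operand objects.
Operand = namedtuple("Operand", ["type", "offb"])
Insn = namedtuple("Insn", ["raw", "operands"])

O_REG, O_IMM = 1, 5          # operand kinds used in this toy example
STRIPPABLE = {O_IMM}         # stands in for [o_mem, o_imm, o_far, o_near, o_displ]

def trimmed_size(insn):
    """Same arithmetic as the snippet: subtract offb, then clamp to 1."""
    size = len(insn.raw)
    for op in insn.operands[:2]:
        if op.type in STRIPPABLE:
            size -= op.offb
    return size if size > 0 else 1

def hashes(insns):
    """Return (hash over all bytes, hash over trimmed bytes)."""
    full, trimmed = hashlib.md5(), hashlib.md5()
    for insn in insns:
        full.update(insn.raw)
        trimmed.update(insn.raw[:trimmed_size(insn)])
    return full.hexdigest(), trimmed.hexdigest()

# x86: mov eax, 0x11223344 -> B8 44 33 22 11; offb assumed to be 1.
x86 = [Insn(b"\xb8\x44\x33\x22\x11", [Operand(O_REG, 0), Operand(O_IMM, 1)])]
# ARM: mov r0, #1 -> 01 00 A0 E3 (little-endian); offb is 0.
arm = [Insn(b"\x01\x00\xa0\xe3", [Operand(O_REG, 0), Operand(O_IMM, 0)])]

x86_full, x86_trimmed = hashes(x86)
arm_full, arm_trimmed = hashes(arm)
print(x86_full != x86_trimmed)   # the two hashes differ on x86
print(arm_full == arm_trimmed)   # nothing is trimmed when offb is 0
```

With offb equal to 0 the subtraction is a no-op, so both digests are fed exactly the same bytes.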
Let's have a look at two examples.
The following shows information exported from an ARM binary:
sqlite> SELECT COUNT(*) FROM functions WHERE bytes_hash != function_hash;
3845
sqlite> SELECT COUNT(*) FROM functions;
18424
While the following is from an IA-32 binary:
sqlite> SELECT COUNT(*) FROM functions WHERE bytes_hash != function_hash;
20877
sqlite> SELECT COUNT(*) FROM functions;
21034
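For reference, the two queries can be wrapped in a small helper that also reports the ratio. This is only a sketch: the `functions` table and the two column names come from the queries above, while `hash_mismatch_ratio`, `demo` and the three-row in-memory table are made up for illustration and do not reflect the full Diaphora schema.

```python
import sqlite3

def hash_mismatch_ratio(conn):
    """Return (differing, total, ratio) for the functions table."""
    cur = conn.cursor()
    cur.execute(
        "SELECT COUNT(*) FROM functions WHERE bytes_hash != function_hash")
    differing = cur.fetchone()[0]
    cur.execute("SELECT COUNT(*) FROM functions")
    total = cur.fetchone()[0]
    ratio = float(differing) / total if total else 0.0
    return differing, total, ratio

def demo():
    # Toy database; a real export has many more columns and rows.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE functions (name TEXT, bytes_hash TEXT, function_hash TEXT)")
    conn.executemany(
        "INSERT INTO functions VALUES (?, ?, ?)",
        [("a", "h1", "h1"), ("b", "h2", "h2"), ("c", "h3", "x3")])
    return hash_mismatch_ratio(conn)

differing, total, ratio = demo()
print("%d of %d functions differ (%.1f%%)" % (differing, total, ratio * 100))
```

Applied to the numbers above, that is 3845 of 18424 (roughly 21%) for the ARM binary against 20877 of 21034 (roughly 99%) for the IA-32 one.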
So in my ARM binary's Diaphora database, only 3845 functions have a bytes_hash which is different from function_hash, as opposed to the IA-32 binary, where most of the functions have different bytes_hash and function_hash values. After some investigation, it turned out that all of the 3845 functions have data elements (e.g. constants, jump tables etc.) interleaved with their instructions! I believe it's the following "fallback" code that eventually reads a single byte from data heads interleaved with standard function instruction heads, but I haven't verified this:
if decoded_size <= 0:
    decoded_size = 1
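A tiny sketch of that hypothesis, runnable outside IDA. It assumes, without having checked Diaphora's behaviour, that decoding a data head yields a size of 0 and that the other hash covers the whole item; `clamp_size` and `bytes_for_trimmed_hash` are hypothetical names.

```python
def clamp_size(decoded_size):
    """The fallback from the snippet above: never read fewer than 1 byte."""
    return decoded_size if decoded_size > 0 else 1

def bytes_for_trimmed_hash(item, decoded_size):
    """Bytes that would be read for an item, given the decoded size."""
    return item[:clamp_size(decoded_size)]

# A 4-byte constant sitting between instructions (e.g. a literal pool entry).
literal = b"\x78\x56\x34\x12"

# Assumption: the decoder fails on data and reports size 0.
picked = bytes_for_trimmed_hash(literal, 0)
print(picked == b"\x78")     # only the first byte is read
print(picked != literal)     # so the two hashes see different input
```

If this is what happens, any function containing such a data head gets two different hashes even though offb never trimmed anything.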
This tiny bug was verified using a simple IDA Python script like the following.
import idc
import idaapi
import idautils
# Operand types whose bytes are stripped in the snippet from diaphora_ida.py.
TYPES = [
    idaapi.o_mem,
    idaapi.o_imm,
    idaapi.o_far,
    idaapi.o_near,
    idaapi.o_displ
]
for segment in idautils.Segments():
    functions = idautils.Functions(idc.SegStart(segment), idc.SegEnd(segment))
    for function in functions:
        function = idaapi.get_func(function)
        for head in idautils.Heads(function.startEA, function.endEA):
            size = idaapi.decode_insn(head)
            if size == 0:
                # Not an instruction (e.g. data inside the function); skip it
                # so that a stale idaapi.cmd is not inspected below.
                print 'No instruction %#x' % head
                continue
            # Report every strippable operand that has a non-zero offb.
            if idaapi.cmd.Operands[0].type in TYPES:
                if idaapi.cmd.Operands[0].offb != 0:
                    print '%#x 0 %#x' % (idaapi.cmd.ea, idaapi.cmd.Operands[0].offb)
            if idaapi.cmd.Operands[1].type in TYPES:
                if idaapi.cmd.Operands[1].offb != 0:
                    print '%#x 1 %#x' % (idaapi.cmd.ea, idaapi.cmd.Operands[1].offb)
Here's a quick solution that can give similar results. Instead of relying on the instruction bytes, you can directly use information provided by the DecodeInstruction() API.
insn = idautils.DecodeInstruction(head)
itype = insn.itype
for i in xrange(6):
    op_type = getattr(insn, 'Op%d' % (i + 1)).type
    itype <<= 8
    itype |= op_type
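The packing above can be tried outside IDA with a small sketch. The operand type constants repeat IDA's numeric codes so the example is self-contained; `Insn`, `pack_signature` and the itype value 0x12 are hypothetical and only serve the illustration.

```python
from collections import namedtuple

# IDA's operand type codes (o_void .. o_near), repeated here so the sketch
# runs without IDA.
O_VOID, O_REG, O_MEM, O_PHRASE, O_DISPL, O_IMM, O_FAR, O_NEAR = range(8)

# Hypothetical instruction: an itype plus (operand type, operand value) pairs.
Insn = namedtuple("Insn", ["itype", "ops"])

def pack_signature(insn, max_ops=6):
    """Pack itype and up to max_ops operand types into one integer."""
    sig = insn.itype
    for i in range(max_ops):
        op_type = insn.ops[i][0] if i < len(insn.ops) else O_VOID
        sig = (sig << 8) | (op_type & 0xFF)
    return sig

# Same instruction kind, different immediates: the signature ignores values.
a = Insn(0x12, [(O_REG, 0), (O_IMM, 1)])
b = Insn(0x12, [(O_REG, 3), (O_IMM, 0x1000)])
# Same itype, but the second operand is a memory reference instead.
c = Insn(0x12, [(O_REG, 0), (O_MEM, 0x8000)])

print(hex(pack_signature(a)))
print(pack_signature(a) == pack_signature(b))
print(pack_signature(a) != pack_signature(c))
```

Because only the instruction type and operand kinds are packed, the result does not depend on how a given architecture lays out operand bits, which is the property the byte-trimming approach lacks on ARM.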
I had similar issues with PPC and TriCore. On my branch I added specific opcode masking. It's not scalable, but it worked and it's the only solution that I can think of.
D
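As a rough illustration of what opcode masking can look like on a fixed-width ISA (this is not the code from the branch mentioned above, which isn't shown in this thread), the sketch below keeps only the 6-bit primary opcode of each 32-bit big-endian PowerPC word. `mask_ppc_words` and the choice of mask are assumptions made for the example.

```python
import struct

# PowerPC primary opcode: the top 6 bits of each 32-bit instruction word.
PPC_PRIMARY_OPCODE_MASK = 0xFC000000

def mask_ppc_words(code, mask=PPC_PRIMARY_OPCODE_MASK):
    """AND every big-endian 32-bit word in code with mask."""
    if len(code) % 4:
        raise ValueError("code length must be a multiple of 4")
    count = len(code) // 4
    words = struct.unpack(">%dI" % count, code)
    return struct.pack(">%dI" % count, *[w & mask for w in words])

# addi r3, r3, 1 and addi r4, r4, 2 differ only in their operands.
addi_a = bytes(bytearray([0x38, 0x63, 0x00, 0x01]))
addi_b = bytes(bytearray([0x38, 0x84, 0x00, 0x02]))

print(mask_ppc_words(addi_a) == mask_ppc_words(addi_b))
```

A single mask like this is crude: many PowerPC instructions share primary opcode 31 and are told apart only by an extended opcode field, so a usable version needs per-format masks, which is where the scalability problem comes from.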
Reported by Huku.