Bytes hash and functions hash are too often the same hash in ARM

joxeankoret commented 5 years ago

Reported by Huku.

huku- commented 5 years ago

Hello,

Let me elaborate more on this. It's good to have this here for reference purposes :)

In diaphora_ida.py one can see the following:

decoded_size, ins = diaphora_decode(x)
if ins.Operands[0].type in [o_mem, o_imm, o_far, o_near, o_displ]:
  decoded_size -= ins.Operands[0].offb
if ins.Operands[1].type in [o_mem, o_imm, o_far, o_near, o_displ]:
  decoded_size -= ins.Operands[1].offb
if decoded_size <= 0:
  decoded_size = 1
...

curr_bytes = GetManyBytes(x, decoded_size, False)

What happens here is that you remove operand bytes from the instructions and only use the opcode and prefixes to compute a signature, which you name function_hash. Another type of signature, named bytes_hash, takes all instruction bytes into account. So, normally, function_hash and bytes_hash should be different. This works fine for x86, but I've noticed that, on ARM, offb is always 0 (which makes sense, as operand encoding is interleaved with opcode encoding). In that case decoded_size is never reduced, the full instruction is read, and bytes_hash and function_hash are, most of the time, equal!
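To make the effect concrete, here is a toy model of the size computation quoted above. It is only a sketch: the instruction tuples, the helper names stripped_bytes and toy_hashes, and the use of MD5 are assumptions for illustration, not Diaphora's actual code.

```python
import hashlib

def stripped_bytes(raw, operand_offsets):
    """Mimic the quoted logic: subtract each operand's offb from the size.

    operand_offsets holds the offb values of the first two operands whose
    type is in the stripped list (hypothetical input format).
    """
    size = len(raw)
    for offb in operand_offsets[:2]:
        size -= offb
    if size <= 0:
        size = 1  # same fallback as in the quoted code
    return raw[:size]

def toy_hashes(instructions):
    """Return (bytes_hash, function_hash) for a list of (raw, offsets) pairs."""
    all_bytes = b"".join(raw for raw, _ in instructions)
    kept = b"".join(stripped_bytes(raw, offs) for raw, offs in instructions)
    return hashlib.md5(all_bytes).hexdigest(), hashlib.md5(kept).hexdigest()

# x86-like: mov eax, 0x12345678 -> the immediate starts at byte offset 1.
x86_like = [(b"\xb8\x78\x56\x34\x12", [0, 1])]
# ARM-like: mov r0, #1 -> operands live inside the opcode word, so offb is 0.
arm_like = [(b"\x01\x00\xa0\xe3", [0, 0])]
```

In this model toy_hashes(x86_like) returns two different digests, while toy_hashes(arm_like) returns the same digest twice, which matches the behaviour described above.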

Let's have a look at two examples.

The following shows information exported from an ARM binary:

sqlite> SELECT COUNT(*) FROM functions WHERE bytes_hash != function_hash;
3845
sqlite> SELECT COUNT(*) FROM functions;
18424

While the following is from an IA-32 binary:

sqlite> SELECT COUNT(*) FROM functions WHERE bytes_hash != function_hash;
20877
sqlite> SELECT COUNT(*) FROM functions;
21034
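The two pairs of queries above can be folded into a small helper that reports the fraction of functions whose hashes differ. This is a minimal sketch: the functions table and its bytes_hash and function_hash columns come from the queries above, while the helper name differing_ratio and the in-memory stand-in database are assumptions for illustration.

```python
import sqlite3

def differing_ratio(conn):
    """Fraction of functions whose bytes_hash differs from function_hash."""
    cur = conn.cursor()
    cur.execute(
        "SELECT COUNT(*) FROM functions WHERE bytes_hash != function_hash")
    differing = cur.fetchone()[0]
    cur.execute("SELECT COUNT(*) FROM functions")
    total = cur.fetchone()[0]
    return differing / float(total) if total else 0.0

# Stand-in database with made-up rows; a real export would be opened
# with sqlite3.connect(path_to_export) instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE functions (bytes_hash TEXT, function_hash TEXT)")
conn.executemany("INSERT INTO functions VALUES (?, ?)",
                 [("a", "a"), ("b", "b"), ("c", "c"), ("d", "x")])
ratio = differing_ratio(conn)
```

Applied to the counts above, the same division gives roughly 0.21 for the ARM database (3845 / 18424) and roughly 0.99 for the IA-32 one (20877 / 21034).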

So in my ARM binary's Diaphora database, only 3845 of 18424 functions (about 21%) have a bytes_hash that differs from function_hash, as opposed to the IA-32 binary, where 20877 of 21034 functions (about 99%) have different bytes_hash and function_hash values. After some investigation, it turned out that all of the 3845 functions have data elements (e.g. constants, jump tables etc.) interleaved with their instructions! I believe it's the following "fallback" code that ends up reading a single byte from the data heads interleaved with the regular instruction heads of a function, but I haven't verified this:

if decoded_size <= 0:
  decoded_size = 1

I verified the offb behaviour using a simple IDAPython script like the following, which prints every relevant operand that has a non-zero offb.

import idc
import idaapi
import idautils

# Operand types whose offb Diaphora subtracts from the instruction size.
TYPES = [
    idaapi.o_mem,
    idaapi.o_imm,
    idaapi.o_far,
    idaapi.o_near,
    idaapi.o_displ
]

for segment in idautils.Segments():
    functions = idautils.Functions(idc.SegStart(segment), idc.SegEnd(segment))

    for function in functions:
        function = idaapi.get_func(function)

        for head in idautils.Heads(function.startEA, function.endEA):
            size = idaapi.decode_insn(head)

            # Heads that do not decode (e.g. inline data) have no operands.
            if size == 0:
                print 'No instruction %#x' % head
                continue

            # Report the address, operand index and offb when offb != 0.
            if idaapi.cmd.Operands[0].type in TYPES:
                if idaapi.cmd.Operands[0].offb != 0:
                    print '%#x 0 %#x' % (idaapi.cmd.ea, idaapi.cmd.Operands[0].offb)
            if idaapi.cmd.Operands[1].type in TYPES:
                if idaapi.cmd.Operands[1].offb != 0:
                    print '%#x 1 %#x' % (idaapi.cmd.ea, idaapi.cmd.Operands[1].offb)

Here's a quick solution that can give similar results. Instead of relying on the instruction bytes, you can directly use the information provided by the DecodeInstruction() API and combine the instruction type with the types of its operands:

insn = idautils.DecodeInstruction(head)

itype = insn.itype
for i in xrange(6):
    op_type = getattr(insn, 'Op%d' % (i + 1)).type
    itype <<= 8
    itype |= op_type
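The packing above can be tried outside IDA with stand-in objects. The sketch below is an illustration only: FakeOp, FakeInsn, insn_signature and function_signature_hash are hypothetical names, the numeric itype and operand type values are placeholders, and hashing the packed values with MD5 is an assumption rather than what Diaphora does.

```python
import hashlib
from collections import namedtuple

# Hypothetical stand-ins for the objects DecodeInstruction() returns.
FakeOp = namedtuple("FakeOp", ["type"])
FakeInsn = namedtuple("FakeInsn", ["itype", "ops"])

def insn_signature(insn, max_ops=6):
    """Pack the instruction type and up to max_ops operand types into an int."""
    sig = insn.itype
    for i in range(max_ops):
        # Missing operands contribute a zero byte (placeholder convention).
        op_type = insn.ops[i].type if i < len(insn.ops) else 0
        sig = (sig << 8) | (op_type & 0xFF)
    return sig

def function_signature_hash(insns):
    """Hash the per-instruction signatures of a whole function."""
    digest = hashlib.md5()
    for insn in insns:
        digest.update(("%x," % insn_signature(insn)).encode("ascii"))
    return digest.hexdigest()

# Placeholder values: itype 0x10 with a register-like (1) and an
# immediate-like (5) operand, and itype 0x20 with one register-like operand.
mov_like = FakeInsn(0x10, [FakeOp(1), FakeOp(5)])
ret_like = FakeInsn(0x20, [FakeOp(1)])
```

Because operand values never enter the signature, two instructions that differ only in an immediate or an address produce the same value, independently of how the architecture encodes its operands.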

djcatter commented 5 years ago

Had similar issues with PPC and TriCore. On my branch I added architecture-specific opcode masking. It's not scalable, but it worked and it's the only solution that I can think of.

D
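For readers wondering what opcode masking could look like, here is a minimal sketch for a single instruction class. It is not taken from the branch mentioned above: the function name mask_arm_branch_targets is hypothetical, and the code assumes little-endian 32-bit ARM (A32) code with no Thumb instructions, handling only the immediate branch encodings (B, BL, BLX), whose low 24 bits hold the target offset.

```python
import struct

def mask_arm_branch_targets(code):
    """Zero the 24-bit offset of A32 immediate branches; keep other words."""
    out = bytearray()
    usable = len(code) - len(code) % 4  # ignore a trailing partial word
    for off in range(0, usable, 4):
        (word,) = struct.unpack_from("<I", code, off)
        if (word >> 25) & 0b111 == 0b101:  # bits 27..25 == 101: branch class
            word &= 0xFF000000             # keep condition and opcode bits
        out += struct.pack("<I", word)
    return bytes(out)

# Two BL instructions with different targets, and a MOV r0, r0 for contrast.
bl_a = struct.pack("<I", 0xEB000001)
bl_b = struct.pack("<I", 0xEB0000FF)
mov = struct.pack("<I", 0xE1A00000)
```

Here mask_arm_branch_targets(bl_a) and mask_arm_branch_targets(bl_b) yield the same bytes, while the MOV word is returned unchanged. Every other instruction class would need its own mask, which is why this approach does not scale.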

On Fri, Jan 11, 2019, 7:42 AM Chariton Karamitas <notifications@github.com> wrote:

