Why are some of the registers stored as tuples in the python binding?

jt0dd commented 2 years ago

While using the Python binding for Unicorn, I noticed that some of the registers are returned as tuples, for example:

UC_X86_REG_FP0:(0, 0)
UC_X86_REG_GDTR:(0, 0, 0, 0)
UC_X86_REG_LDTR:(0, 0, 0, 0)
UC_X86_REG_TR:(0, 0, 0, 0)

At first I thought these registers might hold flags, so that storing and manipulating them in pieces would be more efficient or simply easier to work with. To check whether that guess made sense, I looked into some of them, particularly GDTR and LDTR:

Memory management registers — The GDTR, IDTR, task register, and LDTR specify the locations of data structures used in protected mode memory management. See Chapter 2, “System Architecture Overview,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

So, in that section:

2.1.1 Global and Local Descriptor Tables

When operating in protected mode, all memory accesses pass through either the global descriptor table (GDT) or an optional local descriptor table (LDT) as shown in Figure 2-1. These tables contain entries called segment descriptors. Segment descriptors provide the base address of segments as well as access rights, type, and usage information.

It goes on, and maybe I'm missing something but I don't see why these registers would benefit from being broken into pieces. So it seems like that theory is wrong.

Each segment descriptor has an associated segment selector. A segment selector provides the software that uses it with an index into the GDT or LDT (the offset of its associated segment descriptor), a global/local flag (determines whether the selector points to the GDT or the LDT), and access rights information.

To access a byte in a segment, a segment selector and an offset must be supplied. The segment selector provides access to the segment descriptor for the segment (in the GDT or LDT). From the segment descriptor, the processor obtains the base address of the segment in the linear address space. The offset then provides the location of the byte relative to the base address. This mechanism can be used to access any valid code, data, or stack segment, provided the segment is accessible from the current privilege level (CPL) at which the processor is operating. The CPL is defined as the protection level of the currently executing code segment.

See Figure 2-1. The solid arrows in the figure indicate a linear address, dashed lines indicate a segment selector, and the dotted arrows indicate a physical address. For simplicity, many of the segment selectors are shown as direct pointers to a segment. However, the actual path from a segment selector to its associated segment is always through a GDT or LDT.

The linear address of the base of the GDT is contained in the GDT register (GDTR); the linear address of the LDT is contained in the LDT register (LDTR).

2.1.1.1 Global and Local Descriptor Tables in IA-32e Mode

GDTR and LDTR registers are expanded to 64-bits wide in both IA-32e sub-modes (64-bit mode and compatibility mode). For more information: see Section 3.5.2, “Segment Descriptor Tables in IA-32e Mode.”

Global and local descriptor tables are expanded in 64-bit mode to support 64-bit base addresses, (16-byte LDT descriptors hold a 64-bit base address and various attributes). In compatibility mode, descriptors are not expanded.
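To make the quoted passage a bit more concrete, here is a minimal sketch of how a 16-bit segment selector splits into the three fields it mentions. The names `Selector` and `decode_selector` are made up for illustration; the bit positions (requested privilege level in bits 0-1, table indicator in bit 2, index in bits 3-15) follow the selector format described in the same Intel manual.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    index: int  # descriptor index within the GDT or LDT
    ti: int     # table indicator: 0 = GDT, 1 = LDT
    rpl: int    # requested privilege level, 0-3


def decode_selector(value: int) -> Selector:
    """Split a 16-bit segment selector into its three fields."""
    value &= 0xFFFF
    return Selector(index=value >> 3, ti=(value >> 2) & 1, rpl=value & 0b11)


sel = decode_selector(0x33)
print(sel)
# Each table slot is 8 bytes, so index * 8 is the byte offset into the table.
print(hex(sel.index * 8))
```

This only shows that a single "register-like" value can carry several independent fields; it says nothing about how Unicorn itself represents them.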

So if not to make it easier to work with registers that hold multiple separate pieces of information, why are some of these stored as tuples?

In case it's unclear how I'm arriving at these values, here's my code. The values are read and assigned in class CPUState:

# Keystone
# The Ultimate Assembler
from keystone import *

# Capstone
# The Ultimate Disassembler
from capstone import *

# Unicorn
# The ultimate CPU emulator
import unicorn
from unicorn import *
from unicorn.x86_const import *

from pprint import pprint

ks = Ks(KS_ARCH_X86, KS_MODE_64)
ks.syntax = KS_OPT_SYNTAX_ATT
md = Cs(CS_ARCH_X86, CS_MODE_64)
mu = Uc(UC_ARCH_X86, UC_MODE_64)

KB = 1024
MB = KB * KB

# push    $0x21       # '!'
# mov     $1, %rax    # sys_write call number 
# mov     $1, %rdi    # write to stdout (fd=1)
# mov     %rsp, %rsi  # use char on stack
# mov     $1, %rdx    # write 1 char
# syscall   
# add     $8, %rsp    # restore sp 

ASM = b''
ASM += b'PUSH $0x21;'
ASM += b'MOV $1, %rax;'
ASM += b'MOV $1, %rdi;'
ASM += b'MOV %rsp, %rsi;'
ASM += b'MOV $1, %rdx;'
ASM += b'syscall;'
ASM += b'ADD $8, %rsp;'

#print(f"Assembling code: {ASM}")
#BIN, count = ks.asm(ASM)

try:
   BIN, count = ks.asm(ASM)
   print("%s = %s (number of statements: %u)" %(ASM, BIN, count))
except KsError as e:
   print("ERROR: %s" %e)

START_ADDR = 0x0
BIN = bytes(BIN)
print(f"Instruction Count: {count}")
print(f"Binary:\n{BIN.hex()}")
print("Dissassembly:")
BIN_LEN = len(BIN)
END_ADDR = START_ADDR + BIN_LEN
#saved_context = mu.context_save()
# (sanity check) print disassembled machine code
for i in md.disasm(BIN, START_ADDR):
    print(f'0x{i.address}\t {i.mnemonic}\t {i.op_str}')

# enumerate registers
reg_keys = []
state_sequence = []
constants = dir(unicorn.x86_const)
for val in constants:
    if val[0:10] == 'UC_X86_REG':
        try:
            mu.reg_read(unicorn.x86_const.__getattribute__(val))
            reg_keys.append(val)
        except:
            print('skipped reg:', val)

class CPUState:
    def __init__(self, mu):
        self.regs = {}
        for val in reg_keys:
            # print(f"reading reg from constant ({val}): {unicorn.x86_const.__getattribute__(val)}")
            self.regs[val] = mu.reg_read(unicorn.x86_const.__getattribute__(val))

def compare_states(s1, s2):
    for reg_name in s1.regs:
        s1_val = s1.regs[reg_name]
        s2_val = s2.regs[reg_name]
        diff = s1_val - s2_val
        if diff != 0:
            print(f"changed: {reg_name}:{s1_val} => {s2_val} ({diff})")

def capture_reg_changes(mu):
    state_sequence.append(CPUState(mu))

def analyze(uc, address, size, idk):
    buf = uc.mem_read(address, size)
    for i in md.disasm(buf, address):
        print(f"0x{i.address}:\t{i.mnemonic} \t{i.op_str}")
    capture_reg_changes(uc)

# add single step emulation hook
mu.hook_add(UC_HOOK_CODE, analyze)

# map 2MB memory for this emulation
mu.mem_map(START_ADDR, 2 * MB)

# write machine code to be emulated to memory
mu.mem_write(START_ADDR, BIN)

# setup stack
mu.reg_write(UC_X86_REG_RSP, START_ADDR + MB)

print('Emulation:')

# emulate code in infinite time & unlimited instructions
mu.emu_start(START_ADDR, END_ADDR)

# now print out some registers
print("Emulation done. Below is the CPU context:")
#print(pprint(CPUState(mu).regs))

compare_states(state_sequence[0], state_sequence[1])
#print(Uc.__dict__.keys())

I noticed the discrepancy when I tried to track differences between register values by subtraction and it threw an error:

Instruction Count: 8
Binary:
6a2148b8010000000000000048bf01000000000000004889e648ba01000000000000000f054883c408
Dissassembly:
0x0  push    0x21
0x2  movabs  rax, 1
0x12     movabs  rdi, 1
0x22     mov     rsi, rsp
0x25     movabs  rdx, 1
0x35     syscall     
0x37     add     rsp, 8
skipped reg: UC_X86_REG_MSR
Emulation:
0x0:    push    0x21
0x2:    movabs  rax, 1
0x12:   movabs  rdi, 1
0x22:   mov     rsi, rsp
0x25:   movabs  rdx, 1
0x35:   syscall     
0x37:   add     rsp, 8
Emulation done. Below is the CPU context:
changed: UC_X86_REG_EIP:0 => 2 (-2)
changed: UC_X86_REG_ESP:1048576 => 1048568 (8)
Traceback (most recent call last):
  File "transform.py", line 143, in <module>
    compare_states(state_sequence[0], state_sequence[1])
  File "transform.py", line 90, in compare_states
    diff = s1_val - s2_val
TypeError: unsupported operand type(s) for -: 'tuple' and 'tuple'
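The TypeError comes from Python itself: tuples do not support the `-` operator. As a workaround, a comparison that tolerates both ints and tuples could look roughly like the sketch below. The helper names `diff_value` and `changed_registers` and the sample snapshot values are hypothetical; the dictionaries are only shaped like `CPUState.regs` in the script above.

```python
def diff_value(before, after):
    """Return before - after, element-wise when both values are tuples."""
    if isinstance(before, tuple) and isinstance(after, tuple):
        return tuple(b - a for b, a in zip(before, after))
    return before - after


def changed_registers(regs_before, regs_after):
    """Yield (name, before, after, diff) for every register whose value changed."""
    for name, before in regs_before.items():
        after = regs_after[name]
        if before != after:
            yield name, before, after, diff_value(before, after)


# Made-up snapshots, shaped like CPUState.regs (register name -> int or tuple).
s1 = {"UC_X86_REG_ESP": 1048576, "UC_X86_REG_GDTR": (0, 0, 0, 0)}
s2 = {"UC_X86_REG_ESP": 1048568, "UC_X86_REG_GDTR": (0, 4096, 39, 0)}

for name, before, after, diff in changed_registers(s1, s2):
    print(f"changed: {name}:{before} => {after} ({diff})")
```

Comparing with `!=` first avoids arithmetic entirely for unchanged registers, which is all the change detection needs; the element-wise difference is only computed for reporting.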

As a side note, UC_X86_REG_MSR is a constant that Unicorn exposes, but it is the only UC_X86_REG_ constant for which calling reg_read throws an error. See https://github.com/unicorn-engine/unicorn/issues/1518

wtdcode commented 2 years ago

You are asking at the right time. See https://github.com/unicorn-engine/unicorn/issues/406

jt0dd commented 2 years ago

> You are asking at the right time. See #406

I read through that and I didn't completely understand how it relates.

jt0dd commented 2 years ago

This issue is really breaking my project. Registers aren't tuples; they're bytes. I don't think it makes sense for Unicorn to do any special formatting of the registers. Let the developer convert to separate values if they want to. All the registers should read as bytes, not tuples. Why is it happening? Could someone please at least explain what the tuple represents, so I know how to convert it back to bytes? Or even just link me to the source code where it's happening, and I'll figure it out.

Registers I found returning tuples:

UC_X86_REG_FP0 UC_X86_REG_FP1 UC_X86_REG_FP2 UC_X86_REG_FP3 UC_X86_REG_FP4 UC_X86_REG_FP5 UC_X86_REG_FP6 UC_X86_REG_FP7 UC_X86_REG_GDTR UC_X86_REG_IDTR UC_X86_REG_LDTR UC_X86_REG_TR

wtdcode commented 2 years ago

> This issue is really breaking my project. Registers aren't tuples; they're bytes. I don't think it makes sense for Unicorn to do any special formatting of the registers. Let the developer convert to separate values if they want to. All the registers should read as bytes, not tuples. Why is it happening? Could someone please at least explain what the tuple represents, so I know how to convert it back to bytes? Or even just link me to the source code where it's happening, and I'll figure it out.
>
> Registers I found returning tuples:
>
> UC_X86_REG_FP0 UC_X86_REG_FP1 UC_X86_REG_FP2 UC_X86_REG_FP3 UC_X86_REG_FP4 UC_X86_REG_FP5 UC_X86_REG_FP6 UC_X86_REG_FP7 UC_X86_REG_GDTR UC_X86_REG_IDTR UC_X86_REG_LDTR UC_X86_REG_TR

I understand your confusion (I had the same reaction the first time I saw this). It's hard to explain exactly why it was designed this way, but you can find all the special tuples here: https://github.com/unicorn-engine/unicorn/blob/3184d3fcdf239c77857faacf5670d5b2d64a69cd/bindings/python/unicorn/unicorn.py#L346

For further background, the change was introduced in https://github.com/unicorn-engine/unicorn/commit/e59382e030aa14e62ebd3fd867889fffb5d64a07, though I also don't know why it was done this way.
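Going by the linked unicorn.py, the tuples appear to be the fields of small ctypes structures: `(mantissa, exponent)` for FP0-FP7 and `(selector, base, limit, flags)` for GDTR, IDTR, LDTR, and TR. Treat that layout as an assumption and check it against the binding version you have installed. Under that assumption, a rough sketch of turning a `reg_read()` result back into bytes with the standard `struct` module might look like this. The helper name `reg_to_bytes` and the little-endian, unpadded layout are choices made for illustration, not something Unicorn defines.

```python
import struct

# Assumed field layouts (little-endian, no padding); verify against your binding.
FP_FORMAT = "<QH"     # mantissa: uint64, exponent: uint16
MMR_FORMAT = "<HQII"  # selector: uint16, base: uint64, limit: uint32, flags: uint32


def reg_to_bytes(value, int_size=8):
    """Serialize a reg_read() result (int or tuple) to little-endian bytes.

    int_size is the width in bytes used for plain integer registers; wider
    registers (for example 128-bit ones) need a larger value.
    """
    if isinstance(value, tuple):
        if len(value) == 2:
            return struct.pack(FP_FORMAT, *value)
        if len(value) == 4:
            return struct.pack(MMR_FORMAT, *value)
        raise ValueError(f"unexpected tuple length: {len(value)}")
    return value.to_bytes(int_size, "little", signed=False)


# Made-up example values, not output captured from Unicorn.
print(reg_to_bytes((0, 0x1000, 0x27, 0)).hex())           # GDTR-style tuple
print(reg_to_bytes((0x8000000000000000, 0x3FFF)).hex())   # FP-style tuple
print(reg_to_bytes(1048576).hex())                        # plain integer register
```

With every register reduced to bytes this way, two snapshots can be compared with plain equality instead of subtraction.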

jt0dd commented 2 years ago

@wtdcode thanks, I hadn't noticed that code before. I'll put some thought into this. Would it be worth keeping the issue open for discussion, to decide whether it makes sense to keep it this way or whether it might be better to normalize the registers? I suspect it has to do with ease of use for the (Unicorn) developer but, in my opinion, it comes at the expense of the end user of the framework, who gets non-intuitive bindings. Others may have differing opinions, which I would welcome.

jt0dd commented 2 years ago

@cseagle I think it was your code that introduced this, or am I mistaken? Perhaps you would be the best to explain.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 15 days.

Why are some of the registers stored as tuples in the python binding? #1517