unicorn-engine / unicorn

Unicorn CPU emulator framework (ARM, AArch64, M68K, Mips, Sparc, PowerPC, RiscV, S390x, TriCore, X86)
http://www.unicorn-engine.org
GNU General Public License v2.0

IMUL mem/Ib DWORD; OPC_IMUL_GvEvIb; "imul eax, [ecx+0x41], 0x10" #364

Closed egberts closed 2 years ago

egberts commented 8 years ago

The Intel IMUL opcode has one mnemonic but several encodings with different operand forms:

When encountering an IMUL opcode on the i386 architecture, specifically one with an immediate 8-bit multiplier, the emulator engine does not properly multiply 0x5151494a by 0x10 to produce the expected 0x151494a0 result.

The interesting thing is that a standalone IMUL instruction works, but this stack-based AlphaMixed code snippet (derived from Metasploit) fails.

Will submit working proof-of-failure code soon.
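
In the meantime, here is a minimal standalone sketch of the passing case (this is not the attached testcase: the imul instruction and the 0x5151494a / 0x10 / 0x151494a0 values are the ones discussed in this issue, while the 0x60000000 mapping and the ECX choice are illustrative assumptions):

/* minimal_imul_sketch.c -- a hedged sketch, not the actual testcase.
 * Build: gcc minimal_imul_sketch.c -lunicorn && ./a.out
 */
#include <stdio.h>
#include <unicorn/unicorn.h>

int main(void)
{
    uc_engine *uc;
    const uint8_t code[] = { 0x6b, 0x41, 0x41, 0x10 };  /* imul eax, dword ptr [ecx+0x41], 0x10 */
    const uint32_t src   = 0x5151494a;                  /* the memory operand                   */
    uint32_t ecx = 0x60000100 - 0x41;                   /* so that [ecx+0x41] == 0x60000100     */
    uint32_t eax = 0;

    if (uc_open(UC_ARCH_X86, UC_MODE_32, &uc) != UC_ERR_OK)
        return 1;

    uc_mem_map(uc, 0x60000000, 0x1000, UC_PROT_ALL);
    uc_mem_write(uc, 0x60000000, code, sizeof(code));   /* the lone imul                 */
    uc_mem_write(uc, 0x60000100, &src, sizeof(src));    /* the dword it multiplies       */
    uc_reg_write(uc, UC_X86_REG_ECX, &ecx);

    uc_emu_start(uc, 0x60000000, 0x60000000 + sizeof(code), 0, 0);
    uc_reg_read(uc, UC_X86_REG_EAX, &eax);
    printf("EAX = 0x%08x (expected 0x151494a0)\n", eax);

    uc_close(uc);
    return eax == 0x151494a0 ? 0 : 1;
}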

aquynh commented 8 years ago

yes, please put the testcases under tests/. if your code is in C, the preferred place is under tests/unit.

thanks.

egberts commented 8 years ago

I was unable to produce a working FIX for the QEMU engine, but I have a clearly documented unit-test C file for troubleshooting this "imul eax, mem, Ib" case and establishing a pass/fail scenario.

test_i386_imul_eax_r_ib.c.zip

$ unzip test_i386_imul_eax_r_ib.c.zip
$ gcc test_i386_imul_eax_r_ib.c -lunicorn
$ ./a.out

(Just started refitting my C code for tests/unit style)

egberts commented 8 years ago

I'm staring at the tcg_gen_muls2_i32 function in GDB... perhaps this is @farmdve's cup of tea, because this signed IMUL operation gets messed up before it stores the corrupted value into the EAX register?

        case MO_32:
            tcg_gen_trunc_tl_i32(tcg_ctx, cpu_tmp2_i32, *cpu_T[0]);
            tcg_gen_trunc_tl_i32(tcg_ctx, cpu_tmp3_i32, *cpu_T[1]);
            tcg_gen_muls2_i32(tcg_ctx, cpu_tmp2_i32, cpu_tmp3_i32,
                              cpu_tmp2_i32, cpu_tmp3_i32);
            tcg_gen_extu_i32_tl(tcg_ctx, *cpu_regs[reg], cpu_tmp2_i32);
            tcg_gen_sari_i32(tcg_ctx, cpu_tmp2_i32, cpu_tmp2_i32, 31);
            tcg_gen_mov_tl(tcg_ctx, cpu_cc_dst, *cpu_regs[reg]);
            tcg_gen_sub_i32(tcg_ctx, cpu_tmp2_i32, cpu_tmp2_i32, cpu_tmp3_i32);
            tcg_gen_extu_i32_tl(tcg_ctx, cpu_cc_src, cpu_tmp2_i32);

https://github.com/unicorn-engine/unicorn/blob/master/qemu/target-i386/translate.c#L5473
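
For reference, here is my plain-C reading of what that MO_32 sequence computes (a sketch, not authoritative): a signed 32x32->64 multiply whose low half becomes the destination register, with cc_src arranged so that a nonzero value means "the product did not fit in 32 bits" (which is what feeds CF/OF for IMUL).

#include <stdio.h>
#include <stdint.h>

static void imul32_sketch(uint32_t a, uint32_t b,
                          uint32_t *reg, uint32_t *cc_dst, uint32_t *cc_src)
{
    int64_t  full = (int64_t)(int32_t)a * (int32_t)b;     /* muls2: low + high halves */
    uint32_t lo   = (uint32_t)full;                       /* -> cpu_regs[reg]         */
    uint32_t hi   = (uint32_t)((uint64_t)full >> 32);
    uint32_t sign = (lo & 0x80000000u) ? 0xFFFFFFFFu : 0; /* sari(lo, 31)             */

    *reg    = lo;
    *cc_dst = lo;          /* cc_dst = the 32-bit result                              */
    *cc_src = sign - hi;   /* zero iff the sign-extended low half equals the high half */
}

int main(void)
{
    uint32_t eax, cc_dst, cc_src;
    imul32_sketch(0x5151494a, 0x10, &eax, &cc_dst, &cc_src);
    printf("EAX=0x%08x, CF/OF %s\n", eax, cc_src ? "set" : "clear");  /* 0x151494a0, set */
    return 0;
}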

egberts commented 8 years ago

Hard to compare with the QEMU of old...

http://git.qemu.org/?p=qemu.git;a=blob;f=target-i386/translate.c;h=a3dd167a9b20c0d25d612659908f04c00e0a1ead;hb=HEAD#l4817

farmdve commented 8 years ago

The problem is not with the instruction decoder. Your code is self-modifying: it changes the IMUL multiplier from 0x51 to 0x10. QEMU usually handles self-modifying code, so this should work, but for some reason QEMU keeps using the old value 0x51 instead of 0x10 (despite what uc_mem_read returns), meaning the translation cache is not flushed.

0x5151494a * 0x51 = 0xBAB8306A
0x5151494a * 0x10 = 0x151494A0
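
A quick C check of those two truncated 32-bit products, for anyone following along:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t src = 0x5151494a;
    /* 32-bit truncated products, matching the two EAX values above */
    printf("src * 0x51 = 0x%08X\n", src * 0x51u);  /* 0xBAB8306A - stale immediate */
    printf("src * 0x10 = 0x%08X\n", src * 0x10u);  /* 0x151494A0 - expected result */
    return 0;
}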

There have been other instances of Unicorn failing with self-modifying code, so thanks for finding this.

@aquynh Somewhere we need to flush the translation block.

aquynh commented 8 years ago

@egberts, so can you put your testcase under tests/ directory?

egberts commented 8 years ago

Yes.

egberts commented 8 years ago

@aquynh , done.

The first attempt was incorrectly added under tests/regress and that 1st pull request has been closed; the 2nd attempt was recoded to fit the tests/unit style, and the 2nd pull request is open at #380.

farmdve commented 8 years ago

I traced the issue to the self-modified code check in translate-all.c. Even though the code is modified, this check in tb_invalidate_phys_page_range fails:

if (!(tb_end <= start || tb_start >= end))

and the page isn't invalidated.
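
Spelled out as a small sketch, that condition asks whether the TB's byte range overlaps the written range; if it never comes out true for the modified byte, the TBs in that page are never invalidated, which matches the symptom here.

#include <stdbool.h>

/* Overlap test between the TB's byte range [tb_start, tb_end) and the
 * written range [start, end) -- the same condition quoted above. */
static bool tb_overlaps_write(unsigned long tb_start, unsigned long tb_end,
                              unsigned long start, unsigned long end)
{
    return !(tb_end <= start || tb_start >= end);
}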

egberts commented 8 years ago

Setting a breakpoint at tb_invalidate_phys_page_range() isn't working for me, as that function isn't mapped yet when GDB stops at main()... Do we have an awesome .gdbinit or defines to pass to make.sh (or something) to help get to this threaded point faster?

lunixbochs commented 8 years ago

I've had luck just letting a program run once in gdb before trying to add the library breakpoint.

farmdve commented 8 years ago

@egberts

The functions have a suffix which depends on the arch. tb_invalidate_phys_page_range_x86_64 for X86 arch, and tb_invalidate_phys_page_range_arm for ARM.

farmdve commented 8 years ago

This is the most bothersome bug. I have not looked at it in a while, but it's the first time that self-modifying code is not detected properly and thus TB page/region is not invalidated.

aquynh commented 8 years ago

a quick glance, and this testcase seems a bit too complicated. can someone make a minimal testcase, so it is shorter & simpler?

egberts commented 8 years ago

Ummm, possibly... Are you looking for a smaller x86 code footprint (size of buffer)? Or are you looking to keep the test code simpler?

aquynh commented 8 years ago

Yes, please make the x86 code as small as possible, by just keeping relevant instructions. Thanks

egberts commented 8 years ago

Will do

egberts commented 8 years ago

OK. I got the RIP (PC) to start exactly at the self-modifying opcode (0x60000021) by using a snapshot of the register set contents and setting those new register values in the tests/unit/test_tb_x86.c test code:

0x021: 30 41 30    xor     byte ptr [ecx + 0x30], al      # modify immediate operand of imul opcode
0x024: 41          inc     ecx
0x025: 6b 41 41 51 imul    eax, dword ptr [ecx + 0x41], 0x51

Train your eye on the stack region dump in the tests/unit/test_tb_x86.c output after each instruction:

60000020: 50 30 41 30 41 6b 41 41 51 32 41 42 32 42 42 30

until the imul opcode's immediate byte operand gets modified at offset 0x60000028 from 0x51 to 0x10:

60000020: 50 30 41 30 41 6b 41 41 10 32 41 42 32 42 42 30

I wanted to keep the original code sequence for a later, fuller unit test of invalidating the translation cache... so a C preprocessor define, #define RIP_NEXT_TO_THE_SELFMODIFY_OPCODE, is inserted, set, and used. Will PR next.

egberts commented 8 years ago

Would it make sense to do this in a UC_MEM_WRITE hook routine (i.e., flush the translation block from there)?

If so, we would need an API for that, no?

egberts commented 8 years ago

Calling tb_flush() from a UC_MEM_WRITE hook didn't work. (Yeah, I made myself a new temporary API.)

At the time of executing the already self-modified code, cpu_ldub_code_x86_64()/glue() did not pick up the modified instructions at disas_insn:4794 as expected.

Going back to the modifying instruction to see what it takes to 'push' the modified bytes down to wherever cpu_ldub_code() reads from (perhaps looking for a cpu_stub_code_x86_64() equivalent, which doesn't exist).

egberts commented 8 years ago

The TCGv_i64 scratch temporaries in use:

0x15 - contains value 0x10 (AL register content)
0x16 - scratch for the XOR operation
0x17 - contains the indexing result of [ecx + 0x30]

This snippet is at translate.c:1519:

case OP_XORL:
 tcg_gen_xor_tl(tcg_ctx, *cpu_T[0], *cpu_T[0], *cpu_T[1]);  /* constructs the 0x10         */
 gen_op_st_rm_T0_A0(s, ot, d);                              /* stores the 0x10 into memory */
 gen_op_update1_cc(tcg_ctx);
 set_cc_op(s, CC_OP_LOGICB + ot);
 break;

gen_op_st_rm_T0_A0(DisasContext, 0, OR_TMP0) calls gen_op_st_v(s, idx=0, t0=0x15, a0=0x17), which calls tcg_gen_qemu_st_tl(s->uc, t0=0x15, addr=0x17, idx=0x2, memop=MO_8), which calls tcg_canonicalize_memop_x86_64(op=MO_8, is64=0x1, st=0x1).

Nothing is invalidating the TCG. It proceeds to run the TCG unhindered. I was expecting some kind of flush_icache_range() after passing the memory-region boundary test or something. (cpu_flush_icache_range() is ignored while in TCG mode.)

farmdve commented 8 years ago

I think you are looking at the wrong code; however, tb_flush should in fact flush the entire cache and retranslate, so... not sure why it isn't.

egberts commented 8 years ago

I flipped the DEBUG_FLUSH define on, and within my hook_write(), performed the following:

  1. check the range of the write memory-access region:
static void
hook_mem32_write(uc_engine *uc, uc_mem_type type, uint64_t address, int size, int64_t value, void *UNUSED(user_data))
{
    switch(type) {
      default: break;
      case UC_MEM_WRITE:
        if (0 != uc_mem_check(uc, address, size)) { // bool: True/False
            printf("SELF-MODIFYING CODE\n");
            uc_tb_flush(uc);
        }
    }
}
  2. call my new uc_tb_flush(), which in turn did the following in uc.c:
UNICORN_EXPORT
uc_err uc_tb_flush(uc_engine *uc)
{
    CPUState *cpu = uc->cpu;
    void *env = cpu->env_ptr;
    printf("FLUSHED!\n\n");
    tb_flush_x86_64(env);
    return 0;
}

and got this:

qemu: flush code_size=2240 nb_tbs=1 avg_tb_size=2240

A later uc_mem_read() retrieval of the modified instruction memory shows it to be correct; however, the actual TCG behavior still uses the old IMUL immediate operand, which was never updated by the 'self-modifying' xor instruction.
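
For completeness, a sketch of how such a write hook can be wired up, to drop in after uc_open()/uc_mem_map() in the test (assumptions: the current uc_hook_add() signature, and a 0x60000000/0x1000 range matching the test's mapping; uc_tb_flush() is my temporary API, not part of upstream Unicorn):

uc_hook wr_hook;
uc_err err = uc_hook_add(uc, &wr_hook, UC_HOOK_MEM_WRITE,
                         hook_mem32_write, NULL,           /* callback defined above */
                         0x60000000, 0x60000000 + 0x1000); /* watch the mapped page  */
if (err != UC_ERR_OK)
    printf("uc_hook_add failed: %s\n", uc_strerror(err));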

egberts commented 8 years ago

Having the following defines turned on:

#define DEBUG_TB_INVALIDATE 
#define DEBUG_TB_CHECK

Both reveal nothing.

egberts commented 8 years ago

A breakpoint on tb_invalidate_phys_page_fast_x86_64() is hit with the following backtrace at the point where the XOR instruction modifies the immediate operand of the IMUL instruction:

notdirty_mem_write()
memory_region_write_accessor_x86_64()
access_with_adjusted_size_x86_64()
memory_region_dispatch_write_x86_64()
io_mem_write_x86_64()
io_writeb_x86_64()
helper_ret_stb_mmu_x86_64()
static_code_gen_buffer()
?? ()
?? ()
?? ()
?? ()
tcg_out_opc()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Looks like I hit something esoteric for it passes valgrind and runs generally fine. (I'm compiling with -g3 -ggdb).

egberts commented 8 years ago

I see in qemu/translate-all.c:1414 that

  tb_invalidate_phys_page_range(uc, 0x28, 0x28 + 1, 1);

That's what we want.

Doesn't exceed SMC_BITMAP_USE_THRESHOLD (10), so no build_page_bitmap(p)

egberts commented 8 years ago

Going deeper, I see that:

qemu_get_ram_ptr(uc, 0x28) (of the stb_p(qemu_get_ram_ptr(uc, 0x28)) line) is returning the correct 0x10

Now at tlb_set_dirty(env, uc->current_cpu->mem_io_vaddr = 0x60000028)

Making a note of all dirty pages.

Physical memory has the correct self-modified imul operand of 0x10 as evidenced by gdb command line

(gdb) print qemu_get_ram_ptr_x86_64( uc, 0x28)
$99 = 0x51
# stepping over stb_p_x86_64()
(gdb) print qemu_get_ram_ptr_x86_64(uc, 0x28)
$100 = 0x10

So, physical memory has the updated info. (not getting far, but it is a lesson for me).

egberts commented 8 years ago

For those who need a QEMU refresher, Fabrice put out this awesome design architecture overview of QEMU.

http://archives.cse.iitd.ernet.in/~sbansal/csl862-virt/2010/readings/bellard.pdf

egberts commented 8 years ago

Excerpt from Fabrice's research paper.

3.6 Self-modifying code and translated code invalidation

On most CPUs, self-modifying code is easy to handle because a specific code cache invalidation instruction is executed to signal that code has been modified. It suffices to invalidate the corresponding translated code.

However on CPUs such as the x86, where no instruction cache invalidation is signaled by the application when code is modified, self-modifying code is a special challenge [3]

When translated code is generated for a TB, the corresponding host page is write protected if it is not already read-only. If a write access is made to the page, then QEMU invalidates all the translated code in it and reenables write accesses to it.

Correct translated code invalidation is done efficiently by maintaining a linked list of every translated block contained in a given page. Other linked lists are also maintained to undo direct block chaining.

When using a software MMU, the code invalidation is more efficient: if a given code page is invalidated too often because of write accesses, then a bitmap representing all the code inside the page is built. Every store into that page checks the bitmap to see if the code really needs to be invalidated. It avoids invalidating the code when only data is modified in the page

[3] For simplicity, QEMU will in fact implement this behavior of ignoring the code cache invalidation instructions for all supported CPUs.
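
Schematically, the mechanism described above boils down to something like the following; this is my paraphrase in C, not QEMU's actual code, and the field names and 4 KiB page size are illustrative only.

#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int           tb_count;          /* translated blocks covering this guest page    */
    bool          have_bitmap;       /* built only once the page is written too often */
    unsigned char code_bitmap[4096]; /* nonzero = byte belongs to translated code     */
} PageDescSketch;

/* Called when the guest writes [start, start+len) inside a write-protected code page. */
static void on_guest_code_page_write(PageDescSketch *p, size_t start, size_t len)
{
    if (p->have_bitmap) {
        bool touches_code = false;
        for (size_t i = start; i < start + len && i < sizeof p->code_bitmap; i++) {
            if (p->code_bitmap[i]) { touches_code = true; break; }
        }
        if (!touches_code)
            return;                  /* data-only write: keep the translations        */
    }
    p->tb_count = 0;                 /* invalidate every TB in the page; the next
                                        execution retranslates from the new bytes     */
}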

Before I go searching for the code fragment that performs the instruction-cache invalidation of the Translation Cache, specifically with regard to a write-protected block, and loosening up the "is it a write-protected block" logic, any pointers?

farmdve commented 8 years ago

You were close. I even mentioned a while back where the problem stemmed from: https://github.com/unicorn-engine/unicorn/issues/364#issuecomment-172326332

Your trace even pointed you in the right direction, notdirty_mem_write().

egberts commented 8 years ago

Can i say this on GitHub?

-->> #headdesk <<--

egberts commented 8 years ago

The bug has been narrowed down to tlb_set_page() in qemu/cputlb.c:255:

Recap: in stack-based (writable) memory, code modification occurs: an xor instruction modifies the imul immediate byte operand two instructions further down the code space. The fault is that QEMU does not properly regenerate the translation block from this self-modified target code; it still recreates a new translation block, but using the old host code space, untouched (meaning the TB is identical to what it was before).

Starting code sequence:

0x021: 30 41 30    xor     byte ptr [ecx + 0x30], al      # modify immediate operand of imul opcode
0x024: 41          inc     ecx
0x025: 6b 41 41 51 imul    eax, dword ptr [ecx + 0x41], 0x51

After the PC advanced to 0x024 (and 0x028 got modified), the ideal new code sequence would be:

0x021: 30 41 30    xor     byte ptr [ecx + 0x30], al      # modify immediate operand of imul opcode
0x024: 41          inc     ecx
0x025: 6b 41 41 10 imul    eax, dword ptr [ecx + 0x41], 0x10

Comparing the two code snippets above, the xor at 0x60000021 overwrites the imul's 0x51 immediate with a new value of 0x10 at address 0x60000028.

A real Unicorn test_tb_x86.c test run looks like this via uc_mem_read() after executing the 1st instruction:

0x021: 30 41 30    xor     byte ptr [ecx + 0x30], al      # modify immediate operand of imul opcode
0x024: 41          inc     ecx
0x025: 6b 41 41 10 imul    eax, dword ptr [ecx + 0x41], 0x10

But the actual TCG emulation was performed as:

0x021: 30 41 30    xor     byte ptr [ecx + 0x30], al      # modify immediate operand of imul opcode
0x024: 41          inc     ecx
0x025: 6b 41 41 51 imul    eax, dword ptr [ecx + 0x41], 0x51

So, uc_mem_region() is using qemu_get_ram_ptr()/RAMBlock for all its data retrieval/storage... CORRECTLY.

TCG, not so much.

We'll need to track this code change within the memory block whose old code space is not being updated. There are several places where our self-modifying instruction sequence gets stored (all of which I learned in advance, so I can direct your attention up front):

  1. uc_mem_write(...,buffer,...)
  2. (RAMBlock *)uc->ram_list.block via qemu_get_ram_ptr()
  3. env->iotlb[2][0]
  4. env->iotlb_v[2][0] = env->iotlb[2][0] = 0x60000000

Using "awatch", we can then monitor the progression of reading/writing of our 0x51 value(s).... but first, we need to determine those address candidates for debug watching.

For the uc_mem_write() buffer, a breakpoint on uc_mem_write is used to get its _bytes buffer argument, with the 0x28 offset added:

 b uc_mem_write
 continue
 print bytes[0x28]  # confirm that the value is 0x51
 print &bytes[0x28]  # get address of 0x51 value
 awatch -l 0x401a98

For the QEMU RAMBlock (that location also initially gets written by uc_mem_write()), a breakpoint on address_space_rw_x86_64 was used to intercept it, and then 0x28 was added to its value:

 b address_space_rw_x86_64
 continue
 print addr1
 awatch -location *0x7fffdd000028

Running this from the beginning of the code, we have a set of data watchpoint hits:

   0x401a98
   0x7fffdd000028

all before uc_emu_start()

After uc_emu_start(), initial creation of bytecode sequence via cpu_x86_exec/cpu_x86_gen_code/disas_insn loops begins:

  helper_uc_tracecode HOOK_INSN (0x60000021)  
  helper_ret_ldub_mmu_x86_64
  helper_ret_stb_mmu_x86_64   <----- iotlb Bug is here
  helper_uc_tracecode HOOK_INSN (0x60000021) # 'xor' opcode
  helper_ret_ldub_mmu_x86_64
  helper_ret_stb_mmu_x86_64
  helper_read_eflags
  helper_uc_tracecode HOOK_INSN(0x60000024) # 'inc' opcode
  helper_le_ldul_mmu_x86_64
  helper_read_eflags
  helper_uc_tracecode HOOK_INSN # 'imul' opcode (*ERROR DETECTED*)

Then emulation execution begins out of static_code_gen_buffer[].

At the moment of the xor self-modify step, a new TCG got regenerated, but the next awatch hit came too late... we had already begun regenerating the emulation of 'imul eax, [ecx+0x41], 0x51'...

I was expecting to see emulation steps that write the updated value of 0x10 into the correct MemoryRegion location, for later pickup by the TCG regeneration of the revised imul ...,...,0x10 instruction.

Things that invalidate pages (aka tb_invalidate_phys_page_range) due to the 'xor' writing 0x10 in place of 0x51 in physical RAM:
  static_code_gen_buffer
    helper_ret_stb_mmu_x86_64  # 'xor'
      io_writeb_x86_64
        io_mem_write_x86_64
          memory_region_dispatch_write_x86_64
            access_with_adjusted_size_x86_64 (wrote 0x10 to 0x7fffddabb658) # wrong location
              memory_region_write_accessor_x86_64
                notdirty_mem_write(uc, opaque=0x0, ram_addr=0x28, val=0x10, size=0x1)
                  tb_invalidate_phys_page_fast_x86_64
                    tb_invalidate_phys_page_range_x86_64(uc, start=0x28, end=0x29, is_cpu_write_access=0x1)
                      p = (PageDesc *) 0x7fffd80175f0 = {first_tb = 0x7fffddabd010, code_write_count = 0x0, code_bitmap = 0x0}
                      cpu_restore_state_from_tb
                        tcg_func_start
                          tcg_pool_reset
                          memclr(s->free_temps)
                          tcg_ctx->gen_opc_ptr = in s->gen_opc_buf
                          tcg_ctx->be = TCGBackendData[640]
                        gen_intermediate_code_pc_x86_64
                          x86_env_get_cpu
                          gen_intermediate_code_internal(gen_opc_cc_op=0x7ffff7e69b7c "", cpu=0x640180, tb=0x7fffddabd010, search_pc=0x1)
                            gen_tb_start
                            disas_insn(env, tcg_ctx, pc_start=0x60000021) # `xor`
                            disas_insn(env, tcg_ctx, pc_start=0x60000024) # `inc`
                            disas_insn(env, tcg_ctx, pc_start=0x60000025) # `imul`
                              b = 0x6b = cpu_ldub_code(env, s->pc=0x60000026)
                              modrm = 0x41 = cpu_ldub_code(env, s->pc=0x60000027)
                              insn_get(env, s, ot=MO_8)
                                cpu_ldub_code_x86_64(env, addr=ptr=0x60000028)
                                  page_index = 0
                                  mmu_idx = 0x2
                                  &env->tlb_table[2][0] = (CPUTLBEntry *) 0x64d1f0 = {addr_read = 0x60000000, addr_write = 0x60000010, addr_code = 0x60000000, addend = 0x7fff7d000000, dummy = 0x64d210 '\377' <repeats 200 times>...}
                                  hostaddr = 0x7fffdd000028
                                  *hostaddr = 0x51
                                  ldub_p_x86_64(ptr=0x7fffdd000028)
                                  res = 0x51 # WRONG!

and 0x51 did not get changed into the expected 0x10 immediate byte operand value for imul opcode.

While still in static_code_gen_buffer[] emulation mode, my focus is now on the first disas_insn() iteration step (the xor, which is also the first instruction being emulated), but it has been evolving:

I'm seeing two separate MemoryRegion *mr being used here... one by the iotlb (io_mem) and the other by pc.ram. The missing 0x10 write went to the OTHER MemoryRegion block.

aquynh commented 8 years ago

it seems you are getting very far. looking forward to the fix for this issue :-)

egberts commented 8 years ago

We have established with the previous call trace that the tb_invalidate_phys_page_fast-triggered regeneration of host code did not pick up the changed operand within its own memory page.

I've narrowed the first faulty code down to exactly one line in helper_ret_stb_mmu_x86_64, at softmmu_template.h:722:

#0 helper_ret_stb_mmu_x86_64(env=0x14f7e070, addr=0x60000028, val=0x10, mmu_idx=0x2, retaddr=0xf5bd08)
    at /tmp/unicorn-master/qemu/softmmu_template.h:722
722          if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {

The local variables are:

(gdb) info local
index = 0x0
tlb_addr = 0x60000010
haddr = 0x60000000
hook = 0x14f67240
handled = 0x51
uc = 0x14f5c130
mr = 0x14f6ad20
cur = 0x0

It was a mistake to treat stack-based "pc.ram" memory as IO memory. Once I faked the tlb_addr with a forged 0x60000000 (replacing 0x60000010, thereby removing the TLB_NOTDIRTY bit), it worked perfectly, placing the 0x10 over the 0x51 at line 787 (also in the same routine).

Now I need to step back a few more instructions to find the right placement; the failure is probably in getting the correct MemoryRegion pointer (we get io-mem-type memory, which uses notdirty_mem_ops(), instead of pc.mem, which probably provides the desired unassigned_mem_ops_x86_64()).

The forging of tlb_addr resulted in the new 0x10 operand being punched into the "pc.mem" memory region properly. Lines 787 and 789 are the only places in the whole QEMU code that back-update pc.ram with self-modified code.

But it didn't fix the whole problem. Stay tuned.

egberts commented 8 years ago

How about that.... env->tlb_table[][].addr_write is never marked 'dirty' by the removal of its TLB_NOTDIRTY flag.

UPDATED: found it, through a cryptically named and coded function: tlb_set_dirty1().

egberts commented 8 years ago

@aquynh, I have to ask this question (not seeking a change here, though)... Unicorn is making a RedHat-labeled "qemu:memory-region" MemoryRegion structure in qemu/memory.c:953.

Yet, it treats it as an io.mem as denoted in memory_region_init_io:1200.

Too bad, we can't label memory precisely /rant.

egberts commented 8 years ago

Question for a QEMU whiz: what is the proper logic for determining an io-mem/qemu:memory-region? Is it a fall-through?

    if (unlikely(tlb_addr & ~TARGET_PAGE_MASK))

The above logic certainly isn't working for a user-supplied uc_mem_write ("pc.ram") memory region.

But it is what is being used also in the tlb_vaddr_to_host().

Perhaps, it was MARKED as an "io.mem" by accident.

Yep, according to include/exec/cpu-defs.h:

    typedef struct CPUTLBEntry {
        /* bit TARGET_LONG_BITS to TARGET_PAGE_BITS : virtual address
            bit TARGET_PAGE_BITS-1..4 : Nonzero for accesses that should not go directly to ram.

User-supplied "pc.mem" via uc_mem_write was MARKED as TLB_NOTDIRTY (bit 4), perhaps by accident? As in tlb_addr = 0x60000010, where 0x10 part represent the TLB_NOTDIRTY.

The real question is: are we supposed to be supporting a virtual-address scheme for the user-supplied uc_mem_write() memory regions?
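
For my own notes, that check boils down to testing the flag bits below the page boundary. A sketch, assuming 4 KiB pages and the bit-4 TLB_NOTDIRTY from the cpu-defs.h excerpt above (constants renamed so they don't collide with QEMU's own):

#include <stdint.h>

#define SKETCH_PAGE_MASK     (~(uint64_t)0xFFF)
#define SKETCH_TLB_NOTDIRTY  (1u << 4)

/* Any flag bit set below the page boundary forces the slow (io/notdirty) path. */
static int takes_slow_write_path(uint64_t tlb_addr_write)
{
    return (tlb_addr_write & ~SKETCH_PAGE_MASK) != 0;
}

/* e.g. tlb_addr = 0x60000010 -> flag bits 0x010 (SKETCH_TLB_NOTDIRTY) -> slow
 * path, which is why the store to 0x60000028 lands in notdirty_mem_write(). */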

egberts commented 8 years ago

Using the GDB command awatch -l env->tlb_table[2][0].addr_write, I re-ran the whole thing to see which code paths set it up, particularly that TLB_NOTDIRTY bit (0x10). The read/write hits are:

  R/W  address         callstack
  write 0xffffffffffffffff uc_open()
  write 0xffffffffffffffff uc_open()
  write 0xffffffffffffffff uc_mem_map()
  read  0xffffffffffffffff qemu_tcg_cpu_thread_fn_x86_64/get_page_addr_code_x86_64/tlb_set_page_x86_64
  write 0x        60000000 qemu_tcg_cpu_thread_fn_x86_64/get_page_addr_code_x86_64/cpu_ldub_code_x86_64/helper_ldb_cmmu_x86_64/helper_ret_ldb_cmmu_x86_64/tlb_fill_x86_64/x86_cpu_handle_mmu_fault/tlb_set_page_x86_64
    # 1st-time code generation triggered-by-CPU-page-fault
  read  0x        60000000 qemu_tcg_cpu_thread_fn_x86_64/tb_find_slow_x86_64/tb_gen_code_x86_64/tb_link_page_x86_64/tb_alloc_page_x86_64/tlb_protect_code_x86_64/cpu_physical_memory_reset_dirty_x86_64/cpu_tlb_reset_dirty_all_x86_64/tlb_reset_dirty_range_x86_64/tlb_is_dirty_ram_x86_64/
    # Marked as TLB_NOTDIRTY after uc_emu_start()-time via start_thread()
  write 0x        60000010 qemu_tcg_cpu_thread_fn_x86_64/tcg_exec_all_x86_64/tcg_cpu_exec_x86_64/cpu_x86_exec/tb_find_fast_x86_64/tb_find_slow_x86_64/tb_gen_code_x86_64/tb_link_page_x86_64/tb_alloc_page_x86_64/tlb_protect_code_x86_64/cpu_physical_memory_reset_dirty_x86_64/tlb_reset_dirty_range_all_x86_64/cpu_tlb_reset_dirty_all_x86_64/tlb_reset_dirty_range_x86_64

Maybe QEMU forgot to use (tlb_entry->addr_write & TARGET_PAGE_MASK), let's see what the next awatch hit shows:

  read  0x        60000010 helper_ret_stb_mmu_x86_64

Seeing that 0x10 in 0x60000010 means TLB_NOTDIRTY is set and that the "io.mem" MemoryRegion is to be used during the actual TB microcode emulation at the 'xor' instruction. I was half expecting "pc.ram" and no TLB_NOTDIRTY (0x10).

Too bad helper_ret_stb_mmu cannot (yet) make the distinction of what kind of memory it should really be. This means it has to be determined well before the new emulation is done.

Checkpoint, still at the right spot:

helper_ret_stb_mmu_x86_64 (env=0x14f7e070, addr=0x60000028, val=0x10, mmu_idx=0x2, retaddr=0xf5bd08)
        at /tmp/unicorn-master/qemu/softmmu_template.h:703

egberts commented 8 years ago

Looks like there isn't much of a "write-through" mechanism for updating "io.mem/qemu:memory-region" as well as "pc.ram" MemoryRegion.

New code logic, it is going to be.

aquynh commented 8 years ago

@aquynh, I have to ask this question (not seeking a change here, though)... Unicorn is making a RedHat-labeled "qemu:memory-region" MemoryRegion structure in qemu/memory.c:953.

Yet, it treats it as an io.mem as denoted in memory_region_init_io:1200.

good question, but i am also confused here. the high level overview of Qemu internals is sorely lacking, and sometimes reading source code is not enough to get the whole picture.

you might want to post this question to the Qemu mailing list to get a quick answer.

egberts commented 8 years ago

I think the current memory stack design looks like this:

TARGET VIRTUAL MEMORY (0x60000028)                "pc.ram"
TLB MEMORY (TCG view) (0x028)                     "qemu:memory-region" / "io.mem"   iotlb_to_region(ioaddr=0xffffffffa00001)
HOST VIRTUAL (REAL RAM) ADDRESS (0x7fffdd000028)

At the moment, the current design APPEARS to me to make "io.mem" "writeable" but not back-flushable to "pc.ram".

Until I understand the distinction between the "pc.ram" and "io.mem" MemoryRegion concepts with regard to Unicorn's executive function, I am not going to be blazing forward with a fix here anytime soon.

farmdve commented 8 years ago

@egberts

Are you sure force flushing the translation block does not fix the issue? It should re-translate the new code.

egberts commented 8 years ago

Oh, it does do a tb_invalidate_phys_page_fast() in notdirty_mem_write() at the time of emulating the self-modifying xor; it's just that during the regeneration of the TranslationBlock, it picked up the old 0x51 immediate operand byte value from "pc.mem", as evidenced by this sequence in qemu/exec.c:1378.

As you can see from that sequence, the new 0x10 immediate operand byte value gets written AFTER the tb_invalidate_phys_page_fast(), hence the TCG's earlier retrieval of the old 0x51. Thanks, @farmdve. I've since summarized it and posted on qemu-devel.
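
To make the ordering concrete, here is a stub sketch (not QEMU's real code) of the sequence just described: invalidation and regeneration run before the modified byte is actually stored, so the regenerated code still sees the old operand.

#include <stdio.h>
#include <stdint.h>

static uint8_t guest_ram[0x100];

static void invalidate_and_retranslate(unsigned off)
{
    /* retranslation re-reads guest RAM here and still sees the OLD byte */
    printf("retranslating around 0x%02x: immediate = 0x%02x\n", off, guest_ram[0x28]);
}

static void notdirty_write_sketch(unsigned off, uint8_t val)
{
    invalidate_and_retranslate(off);   /* (1) invalidate + regenerate first     */
    guest_ram[off] = val;              /* (2) ...only then store the new value  */
}

int main(void)
{
    guest_ram[0x28] = 0x51;            /* the original imul immediate           */
    notdirty_write_sketch(0x28, 0x10); /* prints 0x51 -- the stale operand      */
    return 0;
}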

egberts commented 8 years ago

No, it's not a simple matter of switching the logic around in notdirty_mem_write(). There are multiple invalidation interactions in play there.

egberts commented 8 years ago

During TranslationBlock regeneration (cpu_restore_state_from_tb()), a 3rd breakpoint at disas_insn() (where the imul is, and where it needs to retrieve the immediate byte operand in question) shows that it lifted the 0x51 value directly from "pc.mem" via a host memory pointer (0x7fffdc00000028).

Only after the regeneration does it store the new 0x10 value into that same "pc.mem" location.

egberts commented 8 years ago

@farmdve, here's what Paolo Bonzini (of QEMU) said:

Newer versions (I think 2.5) have dispensed with code regeneration altogether.

OK... perhaps it's time to move forward to 2.5 (gonna spend a little time on getting redundancies into the current QEMU 2.2).

@aquynh, about that 2.5 branch...

farmdve commented 8 years ago

Yeah, the 2.5 branch sounds like a better start, buuut it won't be easy at all.

Furthermore, if that does happen, we need to think of a better way to separate Unicorn and QEMU, so as to be able to port fixes more easily.

JonathonReinhart commented 8 years ago

we need to think of a better way to separate Unicorn and QEMU, so as to be able to port fixes more easily.

I think the only way to do this is to get the necessary hooks for Unicorn (or the likes) in the mainline Qemu code.

egberts commented 8 years ago

Methinks the bug still exists in QEMU 2.5 (exec.c is largely unchanged).

aquynh commented 8 years ago

@egberts i am working on the 2.5 sync, but that is slow. i believe you are much closer to the solution for this issue than waiting for my work to be done. i can imagine that we release v1.0 without this sync.

@farmdve @JonathonReinhart perhaps it is possible to work towards merging a part of Unicorn into Qemu, but that is also a lot of effort. many parts of Unicorn are not really in line with Qemu, so we must still maintain our fork.