unicorn-engine / unicorn

Unicorn CPU emulator framework (ARM, AArch64, M68K, Mips, Sparc, PowerPC, RiscV, S390x, TriCore, X86)
http://www.unicorn-engine.org
GNU General Public License v2.0
7.34k stars 1.31k forks source link

Ignore page_collection_lock #1838

Open tunz opened 1 year ago

tunz commented 1 year ago

Unicorn is already ignoring page locks, so we also don't need to manage the page collection and its tree objects. It helps us to avoid redundant object allocations and speed up the execution 3x faster for my case.

wtdcode commented 1 year ago

For the speedup, I suggest you sharing the backtrace and pprof reports if possible.

tunz commented 1 year ago

Here is a snippet of profiling. Profiled by WPR and tested on Windows.

Dev branch

- helper_le_stq_mmu_x86_64 (68.38%)
  - store_helper (67.94%)
    - notdirty_write (33.63%)
      - page_collection_lock_x86_64 (30.94%)
        - _malloc_base (15.25%)
        - g_tree_new_full (13.45%)
        - page_find_alloc (1.35%)
        - ...
      - tb_invalidate_phys_page_fast_x86_64 (1.57%)
      - ...
    - page_collection_unlock_x86_64 (15.25%)
    - _free_base (9.64%)
    - helper_ret_stb_mmu_x86_64 (3.81%)
    - find_memory_region (1.35%)
    - ...
  - ...
- helper_ret_stb_mmu_x86_64 (7.17%)
  - store_helper (7.17%)
  - ...
- helper_le_stl_mmu_x86_64 (4.04%)
- helper_lookup_tb_ptr_x86_64 (2.47%)
- helper_le_ldq_mmu_x86_64 (1.79%)
- ...

With this fix:

- helper_le_stq_mmu_x86_64 (32.70%)
  - store_helper (31.35%)
    - store_helper <itself> (15.13%)
    - notdirty_write (10.81%)
      - tb_invalidate_phys_page_fast_x86_64 (7.57%)
      - ...
    - ...
- helper_lookup_tb_ptr_x86_64 (7.30%)
- helper_le_ldq_mmu_x86_64 (5.14%)
- helper_uc_tracecode (3.78%)
- helper_ret_stb_mmu_x86_64 (3.24%)
  - store_helper (2.97%)
  - ...
- ...

Percentage means a proportion of CPU usage. Execution time is reduced from 370s to 116s.

wtdcode commented 1 year ago

Here is a snippet of profiling. Profiled by WPR and tested on Windows.

Dev branch


- helper_le_stq_mmu_x86_64 (68.38%)
  - store_helper (67.94%)
    - notdirty_write (33.63%)

That's it and it matches my guess, this is slow because unicorn (qemu) has to ensure previous writes does invalidate any previous dirty code pages. This is specially handled due to Self Modifying Code (refer to qemu whitepaper if you wish).

  - page_collection_lock_x86_64 (30.94%)
    - _malloc_base (15.25%)
    - g_tree_new_full (13.45%)
    - page_find_alloc (1.35%)
    - ...
  - tb_invalidate_phys_page_fast_x86_64 (1.57%)
  - ...
- page_collection_unlock_x86_64 (15.25%)
- _free_base (9.64%)
- helper_ret_stb_mmu_x86_64 (3.81%)
- find_memory_region (1.35%)
- ...
  • ...
    • helper_ret_stb_mmu_x86_64 (7.17%)
  • store_helper (7.17%)
  • ...
    • helper_le_stl_mmu_x86_64 (4.04%)
    • helper_lookup_tb_ptr_x86_64 (2.47%)
    • helper_le_ldq_mmu_x86_64 (1.79%)
    • ...

With this fix:

- helper_le_stq_mmu_x86_64 (32.70%)
  - store_helper (31.35%)
    - store_helper <itself> (15.13%)
    - notdirty_write (10.81%)
      - tb_invalidate_phys_page_fast_x86_64 (7.57%)
      - ...
    - ...
- helper_lookup_tb_ptr_x86_64 (7.30%)
- helper_le_ldq_mmu_x86_64 (5.14%)
- helper_uc_tracecode (3.78%)
- helper_ret_stb_mmu_x86_64 (3.24%)
  - store_helper (2.97%)
  - ...
- ...

Percentage means a proportion of CPU usage. Execution time is reduced from 370s to 116s.

I think a better solution is to check how qemu solves this because I feel that it shouldn't have too much overhead. Maybe we miss some core mechanism when manipulating qemu code.

tunz commented 1 year ago

Oh, thanks for the hint. Now I see this problem: Unicorn always returns false for cpu_physical_memory_get_dirty and true for cpu_physical_memory_is_clean. So, tlb_set_dirty is never called in notdirty_write, and Unicorn always takes the slow path.

wtdcode commented 1 year ago

Oh, thanks for the hint. Now I see this problem: Unicorn always returns false for cpu_physical_memory_get_dirty and true for cpu_physical_memory_is_clean. So, tlb_set_dirty is never called in notdirty_write, and Unicorn always takes the slow path.

Thanks for your hint. This is an old known issue I notice when I developed Unicorn2. The root cause is a bit long story but keep it for short: for uc1 compatibility . However, now we have some ctl for flushing code cache so this could be improved in some way but I need to figure it out.