ptitSeb / box64

Box64 - Linux Userspace x86_64 Emulator with a twist, targeted at ARM64 Linux devices
https://box86.org
MIT License
3.85k stars 279 forks

How to Count the Number of Executions of Each Dynablock in Box64? #1689

Open wangguidong1999 opened 3 months ago

wangguidong1999 commented 3 months ago

Hello, I would like to contribute to the Box64 project and would like to count the number of executions of each dynablock. Is there an existing method to achieve this? Alternatively, is it possible to add some logging in the code to track the execution count? Thank you for your help!

ksco commented 3 months ago

There is no such thing in box64. DynaRec generates code pretty fast, so we don't really need hot code statistics. If you still want this for some reason, it should be straightforward to add a small header before every dynablock to do the work.

ptitSeb commented 3 months ago

I thought about doing something like that before (mainly to track used blocks for deletion), using the Dynablock structure to store that information and some prolog at each dynablock entry point (don't forget blocks can statically or dynamically jump to other blocks), but I didn't implement it for multiple reasons:

  1. It's a waste of CPU cycles
  2. Handling the CALLRET optimization is harder
  3. There are multithreading issues, especially on hardware without proper atomic support, leading to even more spent CPU cycles and memory barriers
  4. For tracking whether a block is in use or not: it's also tricky to handle native calls that don't return (probably not an issue for tracking the execution count, though)

What would be the use of those counters?
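
For illustration, here is a minimal C sketch of the per-block counter idea discussed above. It is not box64 code: `sketch_block_t`, its `hits` field, and the helper names are hypothetical stand-ins, and a real implementation would emit the increment as native instructions in the block prolog rather than call a C function. It uses C11 atomics with relaxed ordering, which is enough for a pure statistic.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a dynablock descriptor; only the fields
 * needed for this illustration are shown. Not the real box64 layout. */
typedef struct sketch_block {
    uintptr_t            x64_addr; /* guest address the block was built from */
    atomic_uint_fast64_t hits;     /* execution counter (assumed extra field) */
} sketch_block_t;

static inline void sketch_block_init(sketch_block_t *b, uintptr_t addr) {
    b->x64_addr = addr;
    atomic_init(&b->hits, 0);
}

/* What a prolog at each block entry point would do, expressed in C.
 * Relaxed ordering: the counter only has to be correct as a total,
 * it does not order any other memory access. */
static inline void sketch_block_enter(sketch_block_t *b) {
    atomic_fetch_add_explicit(&b->hits, 1, memory_order_relaxed);
}

static inline uint64_t sketch_block_hits(sketch_block_t *b) {
    return (uint64_t)atomic_load_explicit(&b->hits, memory_order_relaxed);
}
```

Even with relaxed ordering, the increment is a read-modify-write. On ARM64 cores without the LSE atomic instructions, a compiler typically lowers it to a load-exclusive/store-exclusive retry loop, which is the kind of extra cost point 3 refers to. A plain non-atomic increment would be cheaper but can lose counts when several threads run the same block.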

wangguidong1999 commented 3 months ago

Thank you for your quick reply.

I would like to maintain a fixed-size counter group that only records the count of recently accessed blocks. If a block's execution frequency surpasses a certain threshold, it would be recorded in a table, which is dynamically expanded. The overhead for this method involves spending a few instructions to update the counter each time a block is executed. Each block averages several hundred instructions, and the counting might cost about 10+ instructions, leading to an overhead of approximately 3-5%.
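
The scheme described above could be sketched roughly as follows, for the single-threaded case only. Everything here is an assumption made for illustration: the names, the direct-mapped layout of the fixed-size counter group, the slot count, and the threshold are hypothetical and not taken from box64.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RECENT_SLOTS  256u   /* assumed size; must be a power of two */
#define HOT_THRESHOLD 1000u  /* assumed promotion threshold */

/* One slot of the fixed-size "recently executed" counter group. */
typedef struct {
    uintptr_t addr;   /* block entry address, 0 = empty slot */
    uint32_t  count;  /* executions since this block took the slot */
} recent_slot_t;

typedef struct {
    recent_slot_t recent[RECENT_SLOTS];
    uintptr_t    *hot;               /* dynamically grown table of hot blocks */
    size_t        hot_len, hot_cap;
} hot_tracker_t;

static void tracker_init(hot_tracker_t *t) { memset(t, 0, sizeof(*t)); }
static void tracker_free(hot_tracker_t *t) { free(t->hot); tracker_init(t); }

/* Linear scan; only called when a block reaches the threshold. */
static int tracker_is_hot(const hot_tracker_t *t, uintptr_t addr) {
    for (size_t i = 0; i < t->hot_len; i++)
        if (t->hot[i] == addr) return 1;
    return 0;
}

/* Record one execution of the block at addr.
 * Returns 1 if the block was just promoted to the hot table,
 * 0 otherwise, -1 if growing the hot table failed. */
static int tracker_record(hot_tracker_t *t, uintptr_t addr) {
    recent_slot_t *s = &t->recent[(addr >> 2) & (RECENT_SLOTS - 1u)];
    if (s->addr != addr) {           /* a different block owns the slot: evict it */
        s->addr = addr;
        s->count = 0;
    }
    if (++s->count != HOT_THRESHOLD) return 0;
    if (tracker_is_hot(t, addr)) return 0;   /* promoted during an earlier residency */
    if (t->hot_len == t->hot_cap) {          /* grow by doubling */
        size_t cap = t->hot_cap ? t->hot_cap * 2 : 16;
        uintptr_t *p = realloc(t->hot, cap * sizeof(*p));
        if (!p) return -1;
        t->hot = p;
        t->hot_cap = cap;
    }
    t->hot[t->hot_len++] = addr;
    return 1;
}
```

The common path is a hash, a compare, and an increment, which is in line with the "10+ instructions" estimate; against blocks of roughly 200 to 330 instructions, 10 extra instructions works out to about 3-5%. Two blocks that map to the same slot keep evicting each other and may never reach the threshold, which is the price of keeping the counter group at a fixed size.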

I think optimizing frequently executed blocks could improve the performance of dynarec in some programs. (as long as the improvement exceeds the overhead of the counters)

Additionally, I am not familiar with the implementation of JumpTable. I would like to learn about the lookup latency, and is there any room for optimization?

ptitSeb commented 3 months ago

Does the 3-5% also include atomic access because of the multi-thread nature of things?

Well, for now, there is no "optimized" setting in the dynarec, so even if you identify a "hot block", you can mark it as dirty for regeneration, for example, but there is no secondary, more optimized generator available for now...

About the JumpTable, yes, I have to write a blog entry about that. It's fairly complex in concept, but quite optimized in execution. Not much room for improvement in its current form (and it's something like the 3rd attempt to get fast links between blocks).

wangguidong1999 commented 3 months ago

3-5% is an estimated value, which does not take into account the influence of atomic access. I just consider single-thread apps currently.

Looking forward to your blog about JumpTable. I will appreciate it a lot.

Btw, do you have any optimization methods on your to-do list? I have a strong interest in contributing to this project.

ptitSeb commented 3 months ago

> Btw, do you have any optimization methods on your to-do list? I have a strong interest in contributing to this project.

Not for now. I do have plans to introduce disk-cached dynarec blocks, which would allow for AoT compilation for example, but it's on the long-term TODO, nothing immediate...

ksco commented 3 months ago

There is room for a custom peephole optimizer for each backend we have (without breaking the precise signal processing), but it's not an easy task.

wangguidong1999 commented 2 months ago

By the way, I am wondering whether there is still room for optimizing the translation of x86 flags in Box64?