sysprog21 / rv32emu

Compact and Efficient RISC-V RV32I[MAFC] emulator
MIT License

Improve `JALR` execution with JIT-cache #471

Closed vacantron closed 3 months ago

vacantron commented 3 months ago

Currently, the "JALR" indirect jump instruction switches rv32emu from T2C back to the interpreter. This commit introduces a "JIT-cache" table lookup that redirects the jump to the JIT-ed code entry and avoids the mode change.

Several scenarios benefit from this approach, e.g. function pointer invocation and far-away function calls. The former, such as "qsort", can be sped up by a factor of two, and the latter, such as "Fibonacci" compiled from hand-written assembly, can even reach a 4.3x performance enhancement.
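
As a small illustration of the function pointer case, the sketch below sorts an array with the C library `qsort`. The comparator is passed as a function pointer, so each comparison inside `qsort` is an indirect call, which on RISC-V is typically emitted as a `JALR`. The helper names `cmp_int` and `sort_ints` are made up for this example and are not taken from the rv32emu benchmarks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparator that qsort invokes through a function pointer. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a;
    int y = *(const int *) b;
    return (x > y) - (x < y); /* avoids overflow of x - y */
}

/* Sort n integers in place; every comparison is an indirect call. */
static void sort_ints(int *v, size_t n)
{
    assert(v != NULL || n == 0);
    qsort(v, n, sizeof(int), cmp_int);
}
```

Calling `sort_ints(v, 4)` on `{5, 2, 9, 1}` leaves the array in ascending order; the number of indirect calls grows with the number of comparisons, which is why this kind of workload is sensitive to how indirect jumps are handled.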

[Figure: perf-indirect-jump-improvement]

qwe661234 commented 3 months ago

Based on my previous research, searching a code cache is not suitable for our framework because we do not translate all paths into T1C or T2C. Therefore, you will lose performance due to frequent cache misses. Maybe you can try the same approach as in T1C, that is, parsing the branch history table and including the predicted path in the execution path for T2C.
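
To make the branch-history suggestion concrete, here is a minimal sketch of recording observed indirect-jump targets and predicting the most frequent one. The type `branch_history_t`, the functions `history_record` and `history_predict`, and the capacity `HISTORY_SIZE` are all hypothetical names chosen for this example; they are not the actual rv32emu data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HISTORY_SIZE 16 /* assumed capacity, for illustration only */

/* Hypothetical per-jump history: observed targets and their hit counts. */
typedef struct {
    uint32_t target_pc[HISTORY_SIZE];
    uint32_t times[HISTORY_SIZE]; /* 0 marks an unused slot */
} branch_history_t;

/* Record one observed target: bump its counter or claim a free slot.
 * If the table is full and the target is new, it is silently dropped. */
static void history_record(branch_history_t *h, uint32_t pc)
{
    assert(h != NULL);
    size_t free_slot = HISTORY_SIZE;
    for (size_t i = 0; i < HISTORY_SIZE; i++) {
        if (h->times[i] && h->target_pc[i] == pc) {
            h->times[i]++;
            return;
        }
        if (!h->times[i] && free_slot == HISTORY_SIZE)
            free_slot = i;
    }
    if (free_slot < HISTORY_SIZE) {
        h->target_pc[free_slot] = pc;
        h->times[free_slot] = 1;
    }
}

/* Store the most frequently seen target in *pc_out.
 * Returns 1 on success, 0 if nothing has been recorded yet. */
static int history_predict(const branch_history_t *h, uint32_t *pc_out)
{
    uint32_t best = 0;
    for (size_t i = 0; i < HISTORY_SIZE; i++) {
        if (h->times[i] > best) {
            best = h->times[i];
            *pc_out = h->target_pc[i];
        }
    }
    return best != 0;
}
```

A translator following this idea would ask `history_predict` for the likely target of an indirect jump and include that block in the translated path, falling back to the interpreter when the prediction is wrong.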

vacantron commented 3 months ago

> Based on my previous research, searching a code cache is not suitable for our framework because we do not translate all paths into T1C or T2C. Therefore, you will lose performance due to frequent cache misses.

Yes, cache misses might cancel out the improvement because of the cache-lookup overhead, so how the cache lookup is implemented matters here.

I have tried several implementations, such as searching a static array and searching a hash table, and I found that hardware-like caching is the friendliest approach for our framework; its overhead is almost negligible.

I have updated the main performance analysis picture at the top to support the above statement.
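
For readers unfamiliar with the idea, one way to read "hardware-like caching" is a direct-mapped table indexed by low bits of the guest PC, with the full PC kept as a tag. The sketch below is written under that assumption and is not the code from this pull request; the names `jit_cache_t`, `jit_cache_update`, `jit_cache_lookup`, and the size `JIT_CACHE_BITS` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JIT_CACHE_BITS 10 /* assumed size: 1024 slots */
#define JIT_CACHE_SIZE (1u << JIT_CACHE_BITS)

/* One slot: the guest PC acts as the tag, entry is the host code address. */
typedef struct {
    uint32_t pc;
    void *entry; /* NULL means the slot is empty */
} jit_cache_entry_t;

typedef struct {
    jit_cache_entry_t slots[JIT_CACHE_SIZE];
} jit_cache_t;

/* With the C extension, instructions are 2-byte aligned, so bit 0 carries
 * no information and is dropped before masking. */
static inline uint32_t jit_cache_index(uint32_t pc)
{
    return (pc >> 1) & (JIT_CACHE_SIZE - 1);
}

static void jit_cache_clear(jit_cache_t *c)
{
    for (size_t i = 0; i < JIT_CACHE_SIZE; i++) {
        c->slots[i].pc = 0;
        c->slots[i].entry = NULL;
    }
}

/* Install or overwrite the slot for pc; a conflicting entry is evicted. */
static void jit_cache_update(jit_cache_t *c, uint32_t pc, void *entry)
{
    assert(entry != NULL);
    jit_cache_entry_t *s = &c->slots[jit_cache_index(pc)];
    s->pc = pc;
    s->entry = entry;
}

/* Constant-time lookup: one index computation and one tag comparison.
 * Returns NULL on a miss so the caller can fall back to the interpreter. */
static void *jit_cache_lookup(const jit_cache_t *c, uint32_t pc)
{
    const jit_cache_entry_t *s = &c->slots[jit_cache_index(pc)];
    return (s->entry && s->pc == pc) ? s->entry : NULL;
}
```

Compared with a linear scan or a chained hash table, a miss here costs the same single comparison as a hit, which matches the claim that the lookup overhead stays small even when most lookups miss. The trade-off is that two PCs mapping to the same slot evict each other.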

qwe661234 commented 3 months ago

> cache misses might cancel out the improvement because of the cache-lookup overhead, so how the cache lookup is implemented matters here.

> I have tried several implementations, such as searching a static array and searching a hash table, and I found that hardware-like caching is the friendliest approach for our framework; its overhead is almost negligible.

> I have updated the main performance analysis picture at the top to support the above statement.

I see, your strategy significantly improves the performance of benchmarks with a large number of jalr instructions, such as qsort and fib. However, it has a negative impact on other types of benchmarks such as idea, and this result matches my research.

vacantron commented 3 months ago

> However, it has a negative impact on other types of benchmarks such as idea, and this result matches my research.

No, the performance degradation of IDEA comes from the invocation of JIT-generated code. Since the second bar (orange) shows the overhead when every lookup is a cache miss, the last bar (green) only shows the performance variation caused by invoking the JIT-ed code, and whether it increases or decreases is decided by the quality of the generated code.

qwe661234 commented 3 months ago

> > However, it has a negative impact on other types of benchmarks such as idea, and this result matches my research.

> No, the performance degradation of IDEA comes from the invocation of JIT-generated code. Since the second bar (orange) shows the overhead when every lookup is a cache miss, the last bar (green) only shows the performance variation caused by invoking the JIT-ed code, and whether it increases or decreases is decided by the quality of the generated code.

Could you further explain why the performance with all cache misses is better than with your improvement? Is it because the JIT-ed code stored in the code cache for calls is T1C, so its quality is poor?

jserv commented 3 months ago

CI failure:

git submodule update --init tests/
Submodule 'riscv-arch-test' (https://github.com/riscv-non-isa/riscv-arch-test) registered for path 'tests/riscv-arch-test'
Cloning into '/home/runner/work/rv32emu/rv32emu/tests/riscv-arch-test'...
From https://github.com/riscv-non-isa/riscv-arch-test
 * branch              ed32d6767e5c35fa98dfc5aa91d7f4b199f8c639 -> FETCH_HEAD
Submodule path 'tests/riscv-arch-test': checked out 'ed32d6767e5c35fa98dfc5aa91d7f4b199f8c639'
Traceback (most recent call last):
  File "/home/runner/.local/bin/riscof", line 5, in <module>
    from riscof.cli import cli
  File "/home/runner/.local/lib/python3.10/site-packages/riscof/cli.py", line 18, in <module>
    import riscof.framework.main as framework
  File "/home/runner/.local/lib/python3.10/site-packages/riscof/framework/main.py", line 11, in <module>
    from riscv_isac.isac import preprocessing
ImportError: cannot import name 'preprocessing' from 'riscv_isac.isac' (/home/runner/.local/lib/python3.10/site-packages/riscv_isac/isac.py)
make: *** [mk/riscv-arch-test.mk:18: arch-test] Error 1

vacantron commented 3 months ago

> CI failure

This seems to be an upstream issue (https://github.com/riscv-software-src/riscof/issues/122).

jserv commented 3 months ago

riscv-software-src/riscof#122

A temporary workaround:

pip3 install git+https://github.com/riscv/riscof.git@d38859f85fe407bcacddd2efcd355ada4683aee4

jserv commented 3 months ago

It is a bit confusing to have both block cache and jit cache in the same file. Can you clarify?

vacantron commented 3 months ago

> Could you further explain why the performance with all cache misses is better than with your improvement? Is it because the JIT-ed code stored in the code cache for calls is T1C, so its quality is poor?

After investigating, I found that the performance degradation occurs when T2C executes T1C-generated code from the cache directly and bypasses the profiler.

In the previous implementation, I added all entries of the T1C-generated code to the cache (even those produced by block chaining), but the chained blocks might have been generated by parsing the branch history table and might not be on the main control flow. If such a cache entry is used by T2C, the profiler has no chance to run again and tag the right potential hotspot, and I think this is the main reason for the performance degradation of IDEA.

After restricting which T1C-generated entries are added to the cache, the performance is as shown at the top, and it is more consistent with expectations.
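
The restriction described above can be pictured as a filter applied before a block is published to the cache. The following is a sketch under assumptions: the struct `t1c_block_t`, its `via_chaining` flag, and the helpers `jit_cache_eligible` and `collect_cacheable` are hypothetical and only illustrate the policy, not the actual change in this pull request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one T1C-translated block. */
typedef struct {
    uint32_t pc;       /* guest PC of the block entry */
    void *entry;       /* host address of the generated code */
    bool via_chaining; /* true if the block was only pulled in by chaining */
} t1c_block_t;

/* A block is eligible only if it has generated code and was not reached
 * through block chaining, so chained blocks still go through the profiler. */
static bool jit_cache_eligible(const t1c_block_t *b)
{
    assert(b != NULL);
    return b->entry != NULL && !b->via_chaining;
}

/* Copy the PCs of eligible blocks into out (capacity >= n) and return
 * how many were kept. */
static size_t collect_cacheable(const t1c_block_t *blocks, size_t n,
                                uint32_t *out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (jit_cache_eligible(&blocks[i]))
            out[kept++] = blocks[i].pc;
    }
    return kept;
}
```

With this kind of filter, a block that only exists because of a predicted branch never becomes a jump target for T2C, so a later execution of that PC still reaches the profiler and can be tagged as a hotspot in its own right.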

jserv commented 3 months ago

Does the JIT cache make sense only for T2C-enabled builds? If so, it should be part of the T2C component, provided we find no regressions.

vacantron commented 3 months ago

> Does the JIT cache make sense only for T2C-enabled builds? If so, it should be part of the T2C component, provided we find no regressions.

Yes, but there is no header file for the declarations of the T2C implementation, so the related type definitions and code are currently placed in jit.h and t2c.c. Should we split them out into t2c.h?

jserv commented 3 months ago

> there is no header file for the declarations of the T2C implementation, so the related type definitions and code are currently placed in jit.h and t2c.c. Should we split them out into t2c.h?

We can consider header refactoring when integrating with frameworks like (PHP) IR or similar compiler frameworks. However, that is not necessary at this moment. Let's keep changes to a minimum.

jserv commented 3 months ago

Thank @vacantron for contributing!