Closed vacantron closed 3 months ago
Based on my previous research, searching a code cache is not suitable for our framework because we do not translate all paths into T1C or T2C. Therefore, you will see a performance loss due to frequent cache misses. Maybe you can try the same approach as in T1C, that is, parsing the branch history table and including the predicted path in the execution path for T2C.
Yes, cache misses might cancel out the improvement because of the cache-lookup overhead, so how the cache lookup is implemented is important here.
I have tried several implementations, such as searching a static array and searching a hash table, and I found that hardware-like caching is the friendliest approach for our framework; its overhead is almost negligible.
I have updated the main performance analysis picture at the top to support the above statement.
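For readers unfamiliar with the term, "hardware-like caching" can be read as a direct-mapped table indexed by a few bits of the guest address, so that a lookup costs one index computation and one tag compare. The following is only a minimal sketch under that assumption; every name and the table size are hypothetical and not taken from the rv32emu sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names and sizes; not the actual rv32emu definitions. */
#define JIT_CACHE_BITS 10
#define JIT_CACHE_SIZE (1u << JIT_CACHE_BITS)

typedef struct {
    uint32_t pc; /* guest address, used as the tag */
    void *entry; /* host address of the JIT-ed code, NULL if the slot is empty */
} jit_cache_entry_t;

static jit_cache_entry_t jit_cache[JIT_CACHE_SIZE];

/* RISC-V instructions are at least 2-byte aligned, so bit 0 is dropped. */
static inline uint32_t jit_cache_index(uint32_t pc)
{
    return (pc >> 1) & (JIT_CACHE_SIZE - 1);
}

/* Insert or overwrite; an entry mapping to the same slot is simply evicted. */
static void jit_cache_update(uint32_t pc, void *entry)
{
    assert(entry != NULL);
    jit_cache_entry_t *slot = &jit_cache[jit_cache_index(pc)];
    slot->pc = pc;
    slot->entry = entry;
}

/* Returns the cached host entry, or NULL on a miss (empty slot or tag mismatch). */
static void *jit_cache_lookup(uint32_t pc)
{
    const jit_cache_entry_t *slot = &jit_cache[jit_cache_index(pc)];
    return (slot->entry && slot->pc == pc) ? slot->entry : NULL;
}
```

Compared with a hash table, this layout has no probing or chaining, at the cost of conflict evictions when two hot targets share a slot.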
I see, your strategy significantly improves the performance of benchmarks with a large number of jalr instructions, such as qsort and fib. However, it has a negative impact on other types of benchmarks like idea, and this result is the same as in my research.
No, the performance degradation of IDEA comes from the invocation of JIT-generated code. Since the second bar (orange) shows the overhead when every lookup is a cache miss, the last bar (green) shows only the performance variation caused by invoking the JIT-ed code, and whether it goes up or down is determined by the quality of the generated code.
Could you further explain why the all-cache-miss case performs better than your improvement? The JIT-ed code stored in the code cache for call is T1C, so is its quality bad?
CI failure:
git submodule update --init tests/
Submodule 'riscv-arch-test' (https://github.com/riscv-non-isa/riscv-arch-test) registered for path 'tests/riscv-arch-test'
Cloning into '/home/runner/work/rv32emu/rv32emu/tests/riscv-arch-test'...
From https://github.com/riscv-non-isa/riscv-arch-test
* branch ed32d6767e5c35fa98dfc5aa91d7f4b199f8c639 -> FETCH_HEAD
Submodule path 'tests/riscv-arch-test': checked out 'ed32d6767e5c35fa98dfc5aa91d7f4b199f8c639'
Traceback (most recent call last):
File "/home/runner/.local/bin/riscof", line 5, in <module>
from riscof.cli import cli
File "/home/runner/.local/lib/python3.10/site-packages/riscof/cli.py", line 18, in <module>
import riscof.framework.main as framework
File "/home/runner/.local/lib/python3.10/site-packages/riscof/framework/main.py", line 11, in <module>
from riscv_isac.isac import preprocessing
ImportError: cannot import name 'preprocessing' from 'riscv_isac.isac' (/home/runner/.local/lib/python3.10/site-packages/riscv_isac/isac.py)
make: *** [mk/riscv-arch-test.mk:18: arch-test] Error 1
This seems to be an upstream issue (https://github.com/riscv-software-src/riscof/issues/122).
A temporary workaround:
pip3 install git+https://github.com/riscv/riscof.git@d38859f85fe407bcacddd2efcd355ada4683aee4
It is a bit confusing to have both block cache and jit cache in the same file. Can you clarify?
> Could you further explain why the performance of all cache misses is better than your improvement? The JIT-ed code stored in the code cache for call is T1C, so the quality is bad?
After investigating, I found that the performance degradation occurred when T2C executed the T1C cache directly and bypassed the profiler.
In the previous implementation, I added all entries of the T1C-generated code to the cache (even the parts from block chaining), but the chained blocks might have been generated by parsing the branch history table and might not be on the main control flow. If T2C uses that cache, the profiler has no chance to run again and tag the right potential hotspot, and I think this is the main reason for the performance degradation of IDEA.
After constraining which T1C-generated entries go into the cache, the performance is as shown at the top, and it is more consistent with expectations.
Does the JIT cache make sense only for T2C-enabled builds? If so, it should be part of the T2C component, provided we find no regressions.
Yes, but there is no header file for the declarations of the T2C implementation, so the related type definitions and code are currently placed in jit.h and t2c.c. Should we split them out into t2c.h?
We can consider header refactoring when integrating with frameworks like (PHP) IR or similar compiler frameworks. However, that is not necessary at this moment. Let's keep changes to a minimum.
Thank @vacantron for contributing!
Currently, the "JALR" indirect jump instruction switches rv32emu from T2C back to the interpreter. This commit introduces a "JIT-cache" table lookup that redirects execution to the JIT-ed code entry and avoids the mode change.
Several scenarios benefit from this approach, e.g. function pointer invocation and far-away function calls. The former, such as "qsort", can be sped up by a factor of two, and the latter, such as "Fibonacci", which is compiled from hand-written assembly, can even reach a 4.3x performance enhancement.
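The control flow described above can be sketched roughly as follows: compute the JALR target, consult a small table of JIT-ed entries, continue in JIT-ed code on a hit, and fall back to the interpreter on a miss. This is an illustrative sketch only; the table layout, `dispatch_jalr`, `demo_block`, and the other names are assumptions made for the example and do not reflect the actual rv32emu code. The target computation (add the immediate to rs1 and clear bit 0) follows the RISC-V definition of JALR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical and only illustrate the control flow. */
typedef void (*jit_entry_t)(void);

enum exec_mode { MODE_JIT, MODE_INTERPRETER };

#define TABLE_SIZE 256u
static struct {
    uint32_t pc;
    jit_entry_t fn;
} table[TABLE_SIZE];

static void table_put(uint32_t pc, jit_entry_t fn)
{
    uint32_t i = (pc >> 1) % TABLE_SIZE;
    table[i].pc = pc;
    table[i].fn = fn;
}

/* JALR target: rs1 + sign-extended immediate, with the lowest bit cleared. */
static uint32_t jalr_target(uint32_t rs1_val, int32_t imm)
{
    return (rs1_val + (uint32_t) imm) & ~1u;
}

/* On a hit, stay in JIT-ed code; on a miss, hand control to the interpreter. */
static enum exec_mode dispatch_jalr(uint32_t rs1_val, int32_t imm, uint32_t *pc)
{
    *pc = jalr_target(rs1_val, imm);
    uint32_t i = (*pc >> 1) % TABLE_SIZE;
    if (table[i].fn && table[i].pc == *pc) {
        table[i].fn(); /* stand-in for jumping into the JIT-ed code */
        return MODE_JIT;
    }
    return MODE_INTERPRETER;
}

/* Stand-in for a JIT-ed block so the sketch can be exercised. */
static int demo_calls;
static void demo_block(void)
{
    demo_calls++;
}
```

Registering `demo_block` for a guest address with `table_put` and then dispatching a JALR to that address returns `MODE_JIT`; any unregistered target returns `MODE_INTERPRETER`, which corresponds to the behaviour before this change.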