rsd-devel / rsd

RSD: RISC-V Out-of-Order Superscalar Processor
Apache License 2.0

Cache and BP Disable Possible Way #85

Closed YuanPol closed 6 months ago

YuanPol commented 6 months ago

Hello, I want to ask if there is an easy way to disable the I/D cache and BP. It seems some parameters of the cache and BP can be adjusted in the configuration file, but they cannot be disabled entirely. Thank you so much :-)

shioyadan commented 6 months ago

Hello,

Basically, it is difficult to disable the I-cache, D-cache, or branch predictor in a simple way, and manual modification is required.

Simply disabling the I-cache and D-cache will greatly reduce the speed of RSD, since each access will reach the main memory. It is possible to replace the cache with a simple, one-cycle-accessible memory with some manual modifications.

Disabling branch prediction is more complicated, because it is not obvious what "disabling" should mean. It is relatively easy to rewrite the predictor so that it always predicts not-taken. It is difficult to modify the fetcher to halt instruction fetching at every branch until that branch is resolved, and doing so would cause significant performance degradation.

If you can tell me what your goal is, I may be able to suggest a better alternative.

YuanPol commented 6 months ago

Hello,

Thanks for your reply. I am developing a performance simulator based on a timing database. The accurate cycle latency information obtained from Verilator will be used to build this database, and the cycle counts estimated by my simulator will finally be compared with the Verilator results. Unfortunately, because of limited time, I cannot add a cache-handling mechanism to my simulator, so the final performance results would not be comparable with the Verilator results.

Adding a memory and reconnecting some ports would be one possible way. I am writing to ask whether there is a configuration option that disables the cache simply, for example by writing some control registers. If so, that would require less effort on my part.

Thank you so much! :-)

shioyadan commented 6 months ago

In your case, a possible solution is to significantly increase the cache size. With such a setup, cache misses will not occur except for the first access to each line. In particular, benchmarks like CoreMark and Dhrystone that repeatedly run the same loop will almost always hit the cache from the second run onwards. By comparing the performance of running the loop twice with that of running it once, you should obtain results similar to a scenario where every access is a cache hit.

YuanPol commented 6 months ago

Thank you for your suggestion, but it does not work in my case because the granularity of my timing database is instruction blocks of only a few hundred instructions each. It does not depend on the real context of the full program execution. If you have any other ideas, please tell me. If not, I can close it :-)

YuanPol commented 6 months ago

Thank you for your suggestions. I will close it now :)

shioyadan commented 6 months ago

I'm sorry for the late reply.

Have you tried pipeline visualization with Konata? The logs for the pipeline visualizer contain most of the information for each cycle of the core pipeline.

If you want per-instruction-block statistics, analyzing the log may help you. From this log, you can see when each instruction was fetched and committed.

The format of the log is documented at:
https://github.com/shioyadan/Konata/blob/master/docs/kanata-log-format.md
RSD's "make kanata" will generate a log for Konata; see RSD's README. (Note that the make target is spelled "kanata".)
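As a sketch of what such log analysis could look like, the following Python fragment (an illustration, not part of RSD or Konata) extracts the cycle of each instruction's first pipeline stage and its commit cycle from a Kanata log. It handles only a subset of the commands in the format document linked above (C=, C, S, and R); the stage names and exact log contents depend on the simulated core.

```python
def parse_kanata(lines):
    """Return (first_stage, commit) dicts mapping instruction id to cycle.

    Handles a subset of the Kanata log commands:
      C= <cycle>          set the current cycle
      C  <delta>          advance the current cycle
      S  <id> <lane> <stage>  an instruction enters a pipeline stage
      R  <id> <rid> <type>    retirement; type 0 = retired, 1 = flushed
    """
    cycle = 0
    first_stage = {}  # id -> cycle of the first observed stage (fetch)
    commit = {}       # id -> cycle of retirement
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        cmd = parts[0]
        if cmd == "C=":
            cycle = int(parts[1])
        elif cmd == "C":
            cycle += int(parts[1])
        elif cmd == "S":
            # Record only the first stage an instruction enters.
            first_stage.setdefault(int(parts[1]), cycle)
        elif cmd == "R":
            if int(parts[3]) == 0:  # ignore flushed instructions
                commit[int(parts[1])] = cycle
    return first_stage, commit
```

From these two dictionaries you can see, for each committed instruction, when it was fetched and when it was committed, which is the per-cycle information mentioned above.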

By running CoreMark more than once and extracting the information from the second run, as I suggested at the beginning, you may be able to determine the number of execution cycles in each instruction block when most accesses hit the caches.

By the way, I think defining the number of "executed" cycles consumed in a fine-grained instruction block is not easy. For example, if you use the difference in commit cycles, it may not reflect the effect of instruction cache misses.
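To make the caveat concrete, here is a hypothetical helper (an illustration, not RSD code) that estimates per-block cycle counts as the difference between the commit cycles of consecutive block boundaries. A stall caused by an instruction cache miss at the start of a block is charged to whichever block's commit-cycle interval it happens to fall in, so these numbers can misattribute such stalls.

```python
def block_cycles(commit_cycles, block_size):
    """Estimate cycles per instruction block from commit cycles.

    commit_cycles: commit cycle of each instruction, in commit order.
    block_size: number of instructions per block.
    Returns the commit-cycle difference between consecutive block
    boundaries (a rough per-block cycle count).
    """
    boundaries = commit_cycles[::block_size]
    return [b - a for a, b in zip(boundaries, boundaries[1:])]
```

For example, with commit cycles [10, 12, 14, 20, 22, 30, 40] and a block size of 3, the boundary commits are at cycles 10, 20, and 40, giving estimated block costs of 10 and 20 cycles.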