It’s true that the core is not optimised for performance, but the bottleneck will not be where you expect it to be based on ASIC timing considerations. On an FPGA the BRAM is reckoned to be relatively fast, and the interconnect fabric is the slow part of the design. Consequently that simple diagram represents a model of what happens in the Chisel HDL, not the accurate timing behaviour. If you consult SiFive’s documentation for their Rocket-based ASIC, I believe it achieves 1GHz, and this might be more to your liking.
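To see why the diagram can treat the SRAM as a "flop": in Chisel, a synchronous-read memory samples its address at the clock edge and presents the read data in the following cycle, so it occupies the same pipeline slot as a register. A minimal sketch of that behaviour (not the actual Rocket source; the module name, widths, and depth are made up):

```scala
import chisel3._

// Hypothetical stand-in for a cache metadata array; names and widths are
// illustrative, not taken from rocket-chip.
class MetaArray extends Module {
  val io = IO(new Bundle {
    val addr = Input(UInt(6.W))
    val en   = Input(Bool())
    val data = Output(UInt(32.W))
  })

  // SyncReadMem typically maps to SRAM on ASIC and BRAM on FPGA. The read
  // address is sampled at the clock edge and the data appears in the next
  // cycle, which is why the diagram can draw the array in the same slot as
  // an s1 -> s2 flop.
  val mem = SyncReadMem(64, UInt(32.W))
  io.data := mem.read(io.addr, io.en)
}
```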
Thanks for the clarification. This makes me wonder whether there are fundamental pipeline differences between the ASIC version of Rocket and lowRISC's version targeting FPGAs. From the code in Freechipsproject, I didn't see much change in the part I mentioned. Perhaps I missed something?
I did not say that we changed it. I said that the Rocket design is used by SiFive and is capable of >1GHz on an ASIC, so your assertion that the architecture is wrong for 1GHz operation must be flawed when using ASIC timings. On an FPGA it only runs at 25MHz, so BRAM timing is not the critical path; interconnect delay is, and that is only tangentially connected to the block architecture.
Thanks a lot for your response. Will take a closer look at SRAM timing.
From the diagram in http://www.lowrisc.org/docs/tagged-memory-v0.1/rocket-core/, it seems to me this pipeline might not hit a high frequency target. For example, dmem.req.addr is wired directly from the output of the ALU to the DCache. In the DCache pipeline, the addr input is fed into a mux and then directly to the SRAMs (data/meta).
I thought that in order to achieve high frequency (say, 1GHz), one generally needs to dedicate a full cycle to holding the address steady for the SRAM read, due to physical constraints. If the meta and data arrays flop the input address and then produce the read result through some combinational logic, the output path that goes through the associative search to reach the s2 flops will create a timing problem. On the other hand, if the meta and data arrays flop the output data, then the time the address is held stable before the clock edge might not be long enough, given such a long combinational chain before the SRAM.
Though the DCache diagram depicts meta and data as a "flop", I assume this is just to show that their timing behaviour is the same as the other flops between stage 1 and stage 2, while essentially they are still standard SRAMs or block RAMs that require a full cycle dedicated to read/write, with no other combinational logic in that cycle. Please correct me if my understanding of the pipeline is wrong, thanks.
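The two read styles described above can be sketched in Chisel roughly as follows. This is a hedged illustration, not the Rocket source: it assumes SyncReadMem stands in for an input-flopped array and a registered asynchronous-read Mem stands in for an output-flopped one, and all names are hypothetical.

```scala
import chisel3._

// Illustrative contrast of the two SRAM read styles discussed above;
// module and signal names are hypothetical.
class ReadStyles extends Module {
  val io = IO(new Bundle {
    val addr    = Input(UInt(6.W))
    val inFlop  = Output(UInt(32.W))
    val outFlop = Output(UInt(32.W))
  })

  // Style 1: address flopped at the array input (SyncReadMem). The long
  // combinational path sits after the array: read data -> associative
  // search -> s2 flops. This is the output-path timing concern above.
  val syncMem = SyncReadMem(64, UInt(32.W))
  io.inFlop := syncMem.read(io.addr)

  // Style 2: read data flopped at the array output (asynchronous-read Mem
  // followed by a register). The long combinational path sits before the
  // array: ALU -> mux -> address pins. This is the setup-time concern above.
  val asyncMem = Mem(64, UInt(32.W))
  io.outFlop := RegNext(asyncMem(io.addr))
}
```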