Mega Project: Implement pipelined CPU architecture

MJoergen commented 4 years ago

The benefit promised by this change will be:

Increasing CPU clock frequency to 100 MHz and beyond.
Roughly halving the number of clock cycles per instruction executed.

In total, this rewrite could increase the MIPS performance by a factor of 3-4. Promises, promises, promises!!!!

Examples:

INCRB : This instruction will execute in a single clock cycle.
ADD 0, R0 : This instruction takes two words of instruction memory and therefore takes two clock cycles.
ADD R1, R0 : This instruction takes only one clock cycle.
ADD @R1, @R0 : This instruction does two reads from and one write to data memory, so a total of three clock cycles.
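The four examples above are consistent with a very simple cycle model, sketched here in Python. The function name `estimate_cycles` and the "take the larger of the two counts" rule are assumptions inferred only from these examples, not a specification of the planned CPU; an instruction that combines two instruction words with data memory accesses may well behave differently.

```python
# Hypothetical cycle model, inferred only from the four examples above.
# Assumption: instruction fetch and data memory access overlap, so the
# larger of the two counts determines the cycles for one instruction.

def estimate_cycles(instruction_words, data_reads=0, data_writes=0):
    """Return the estimated number of clock cycles for one instruction."""
    data_accesses = data_reads + data_writes
    return max(instruction_words, data_accesses)


examples = {
    "INCRB": estimate_cycles(1),
    "ADD 0, R0": estimate_cycles(2),  # two instruction words
    "ADD R1, R0": estimate_cycles(1),
    "ADD @R1, @R0": estimate_cycles(1, data_reads=2, data_writes=1),
}

for text, cycles in examples.items():
    print(f"{text:14s} {cycles} cycle(s)")
```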

Anyway, this will be a complete rewrite of the CPU, essentially starting from scratch. Things to include in the rewrite are:

Going from a von Neumann architecture to a Harvard architecture, i.e. having a separate data and instruction bus. This allows the CPU to start fetching the next instruction while accessing the memory for the current instruction. Essentially, this doubles the memory bandwidth. To implement this change we need to update the mmio_mux as well as block_ram and block_rom modules.
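To make the bandwidth argument concrete, here is a rough sketch in Python that counts bus cycles for a short instruction sequence on a single shared bus and on separate instruction and data buses. It is an idealized model under stated assumptions: one bus transfer per cycle, perfect overlap of fetch and data access within an instruction, and no stalls. The function names and the sample program are made up for illustration and say nothing about how the current CPU actually behaves.

```python
# Idealized bus-occupancy model (all names here are hypothetical).
# Each instruction is described as (instruction_words, data_accesses).

def shared_bus_cycles(program):
    """Single shared bus: every fetch and every data access needs its own cycle."""
    return sum(words + accesses for words, accesses in program)


def split_bus_cycles(program):
    """Separate instruction and data buses: fetch and data access overlap,
    so each instruction costs the larger of its two counts."""
    return sum(max(words, accesses) for words, accesses in program)


# Four instructions, each with one instruction word and one data access.
balanced = [(1, 1)] * 4

print("shared bus:", shared_bus_cycles(balanced), "cycles")
print("split bus: ", split_bus_cycles(balanced), "cycles")
```

For a perfectly balanced mix like this one the split bus needs half the cycles, which is the "doubling" mentioned above; a program that mostly works on registers gains much less from the second bus.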

I have not done anything like this before, so I'm not sure how to do it, but here is an MIT lecture on pipelined processors.

This change should not affect the emulator in any way, since the emulator is not cycle exact (at the time of writing).

I've tentatively labelled this issue V1.8, since we already have a lot of changes scheduled for V1.7. Let me hear your thoughts on this!

bernd-ulmann commented 4 years ago

Pipelining QNICE is a really great idea. I am curious how you will implement it in the end. As of now I see the main challenge in dealing with data dependencies. This is much simpler in pure RISC architectures, where only Load/Store instructions access memory, whereas in QNICE the problem can arise wherever a @Rxx... operand occurs.
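As a small illustration of the kind of dependency meant here, the following Python sketch checks two adjacent instructions for read-after-write hazards. It is only a sketch under assumptions: the `Instr` record, its fields, and the `hazards` function are hypothetical, the operand sets are filled in by hand, and memory conflicts are flagged conservatively because the address behind an indirect operand is only known once the register has been read.

```python
from dataclasses import dataclass

# Hypothetical instruction record; operand sets are written out by hand.
@dataclass(frozen=True)
class Instr:
    text: str
    reg_reads: frozenset
    reg_writes: frozenset
    mem_read: bool = False
    mem_write: bool = False


def hazards(older, newer):
    """List read-after-write hazards when `newer` directly follows `older`."""
    found = []
    raw = older.reg_writes & newer.reg_reads
    if raw:
        found.append("register RAW on " + ", ".join(sorted(raw)))
    # Conservative: without the actual addresses, any write followed by a
    # read of data memory might touch the same location.
    if older.mem_write and newer.mem_read:
        found.append("possible memory RAW")
    return found


add_reg = Instr("ADD R1, R0", frozenset({"R0", "R1"}), frozenset({"R0"}))
use_ptr = Instr("ADD @R0, R2", frozenset({"R0", "R2"}), frozenset({"R2"}),
                mem_read=True)
store = Instr("ADD R1, @R0", frozenset({"R0", "R1"}), frozenset(),
              mem_read=True, mem_write=True)
load = Instr("ADD @R3, R4", frozenset({"R3", "R4"}), frozenset({"R4"}),
             mem_read=True)

print(hazards(add_reg, use_ptr))  # R0 is written, then used as an address
print(hazards(store, load))       # memory written, then memory read
```

In a pure Load/Store design only the dedicated memory instructions would ever set `mem_read` or `mem_write`; here any instruction with an indirect operand can.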

MJoergen commented 4 years ago

I am curious too! Indeed, I don't know (yet) how to implement the pipelining, or whether it is even possible given all the data dependencies ...

sy2002 commented 4 years ago

Cool stuff 😃 I guess I am interested in being part of this, too 😁 When we start this venture, we might want to hold a larger online meeting where we throw around ideas about the architecture and so on. Plus: Today's architecture is very probably already capable of doing 100 MHz, so we should shoot for at least 400 MHz or more, which can be generated using the MMCM.

Of course, other parts of the system then also need to be optimized for higher speeds beforehand.

So I would suggest that we try to get rid of all these "hardcoded speeds" and all other "obstacles" during V1.7.

sy2002 commented 4 years ago

One more thought: When we begin to tackle this in the future, we need a CPU test that "provokes" some nasty situations and maybe even errors in the pipeline to get everything rock solid. Some very well thought-out "mean" and "blackguardly" code that at first glance looks harmless 😈 👺 👹 and that hurts really bad :hurtrealbad:

MJoergen commented 4 years ago

I agree. Even though we already have an extensive functional CPU test, we need something that is even more evil!

MJoergen commented 4 years ago

> so we should shoot for at least 400 MHz or more

Just a little note to adjust our expectations.

Vivado is a great tool to analyze timing within a design, and to find long combinatorial paths. Simply use the menu "Reports -> Timing -> Report Timing Summary", and when that is done, select "Intra-Clock Paths -> SLOW_CLOCK" in the result window and click on the numeric value after "Setup". This will give a list of the longest paths in the CPU clock domain. Then you can either select a path and press F4 to get a schematic view, or you can double-click on a path and get the detailed propagation delays through every element of the path.

One such long path in the CPU is from the Register File, through the ALU, and into the Status Register. Currently, this path takes around 9 ns, which is very close to the 10 ns limit when running at 50 MHz (and using both rising and falling clock edges as we do).

In a pipelined architecture, this long path will be split into several smaller parts by adding pipeline registers. It would seem reasonable to add a register after the Register File and another after the ALU. This will split the path into three parts. The longest part of this path is still inside the ALU, where the calculation of the X flag requires the complete ALU result. This part alone takes around 4.5 ns, i.e. half of the total longest path right now.

So based on this single observation it would seem like an upper bound for a fully pipelined processor is 100 MHz CPU clock, since that leaves 5 ns between the rising and falling clock edge.

However, if we can somehow redesign the architecture to use only the rising edge, that could potentially double the frequency, but it would probably require additional pipeline stages.

All this to say that getting the CPU to work at 100 MHz will be a great achievement in itself, and more than that will require some additional architectural redesign.
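The arithmetic behind these numbers can be sketched in a few lines of Python. This is a back-of-the-envelope estimate only: the function name `max_clock_mhz` is made up for illustration, and the model assumes the longest path is the only constraint, ignoring clock uncertainty and anything else Vivado accounts for in a real timing report.

```python
# Rough upper bound on the clock frequency from the longest path delay.
# Assumption: when both clock edges are used, a path has to settle within
# half a clock period; with rising edge only, it has the full period.

def max_clock_mhz(longest_path_ns, both_edges=True):
    """Return the highest clock frequency (in MHz) the given path allows."""
    min_period_ns = longest_path_ns * (2 if both_edges else 1)
    return 1000.0 / min_period_ns


# Path delays taken from the discussion above.
print(f"today, 9 ns path:          {max_clock_mhz(9.0):.1f} MHz")
print(f"pipelined, 4.5 ns ALU:     {max_clock_mhz(4.5):.1f} MHz")
print(f"4.5 ns, rising edge only:  {max_clock_mhz(4.5, both_edges=False):.1f} MHz")
```

The first two results sit just above 50 MHz and 100 MHz respectively, which matches the observation that 9 ns barely fits the 10 ns budget today and that 4.5 ns would barely fit a 5 ns budget.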

sy2002 commented 4 years ago

Thank you for this thorough analysis, Michael. Highly appreciated. Then let's shoot for 100 MHz in V1.8 and stay at 50 MHz in V1.7. (I did remove the 100 MHz goal from the V1.7 scope in issues #80 and #95, but we might want to keep the "get rid of the hardcoded speeds" in #98 in V1.7 as a preparation for V1.8)

sy2002 / QNICE-FPGA

Mega Project: Implement pipelined CPU architecture #94