Adder between Controller and I_mem is critical path

laforest / Octavo

Verilog FPGA Parts Library. Old Octavo soft-CPU project.

http://fpgacpu.ca/


Adder between Controller and I_mem is critical path #28

Closed laforest closed 11 years ago

laforest commented 11 years ago

When an instruction is annulled due to non-ready I/O, we subtract one from the issued PC to re-issue the annulled instruction.

Unfortunately, there is no slack between the Controller and the I_mem, so this subtraction drops Fmax from ~550 MHz to ~430 MHz. Adding two stages of PC_PIPELINE restores the speed, but then we have 8 threads running around a 10-stage control pipeline. I'm not sure what the impact of that would be.

Alternatively: rather than using 20 FFs to pipeline the PC further, what about using 1 MLAB (10 ALMs) to store the one-behind PC for re-issuing the instruction? A 2:1 mux controlled by IO_ready should be fast enough, and it would move the subtractor out of the critical path to I_mem.

But will it affect the future port to Cyclone? (See #6)
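As a rough behavioral sketch (Python, not RTL) of the two re-issue schemes described above: the first function models the current subtractor, the second the proposed 2:1 mux fed by a stored one-behind PC. The 10-bit PC width is taken from the figures in this thread; the function names are hypothetical, and the check assumes sequential execution, where the issued PC is the fetched PC plus one.

```python
PC_WIDTH = 10                      # assumption: 10-bit PC, per the figures in this thread
PC_MASK = (1 << PC_WIDTH) - 1


def next_pc_subtractor(issued_pc, io_ready):
    """Current scheme: on an annul, subtract one from the issued PC."""
    return issued_pc if io_ready else (issued_pc - 1) & PC_MASK


def next_pc_mux(issued_pc, one_behind_pc, io_ready):
    """Proposed scheme: a 2:1 mux picks the stored one-behind PC on an annul."""
    return issued_pc if io_ready else one_behind_pc


def check_equivalent():
    """Compare both schemes for every fetched PC, with I/O ready and not ready."""
    for fetched_pc in range(1 << PC_WIDTH):
        issued_pc = (fetched_pc + 1) & PC_MASK   # sequential execution assumed
        for io_ready in (True, False):
            a = next_pc_subtractor(issued_pc, io_ready)
            b = next_pc_mux(issued_pc, fetched_pc, io_ready)
            if a != b:
                return False
    return True
```

Under these assumptions the two schemes pick the same PC in every case; the mux version only trades the subtractor for extra storage.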

siupakm commented 11 years ago

Another alternative would be to give up the idea of using a solid block of memory for the PCs.

Pipeline the PC for 7 stages.
At Ctrl0, add a 2:1 mux that picks OP or JMP, controlled by the I/O ready bit (or pack this into JMP, at Ctrl1?).

At Ctrl1, add a 2:1 mux that picks D or the previous PC, also controlled by the I/O ready bit, and of course increment the PC as well.

One concern would be the routing and area for the PC pipeline, but I doubt it would add too much. Also, adding more threads is not as easy as in the original design. But Octavo is already pretty squeezed by memory sharing between 8 threads, so that shouldn't be a big concern.
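A loose behavioral model (Python, not RTL) of the idea above: the per-thread PCs circulate through a ring of registers instead of sitting in a memory block, and each PC is updated as it passes the head of the ring. The ring length of one slot per thread, the function names, and the exact priority of the muxes are assumptions for illustration; the mapping onto the 7 pipeline stages and the Ctrl0/Ctrl1 split is not modeled.

```python
from collections import deque

PC_WIDTH = 10                      # assumption: 10-bit PC
PC_MASK = (1 << PC_WIDTH) - 1
THREADS = 8


def update_pc(pc, io_ready, jump_taken, dest):
    """Mux chain: re-issue on non-ready I/O, else jump destination, else PC + 1."""
    if not io_ready:
        return pc                  # annulled: keep the previous PC
    if jump_taken:
        return dest & PC_MASK      # 'D' in the comment above
    return (pc + 1) & PC_MASK


def run(initial_pcs, events):
    """Rotate the PC ring once per cycle; events is one
    (io_ready, jump_taken, dest) tuple per cycle, in thread order."""
    ring = deque(initial_pcs)
    for io_ready, jump_taken, dest in events:
        pc = ring.popleft()        # the PC of the thread issuing this cycle
        ring.append(update_pc(pc, io_ready, jump_taken, dest))
    return list(ring)


# One full round of 8 cycles: thread 2 is annulled, thread 3 jumps to 100.
events = [(True, False, 0)] * THREADS
events[2] = (False, False, 0)
events[3] = (True, True, 100)
pcs_after_round = run([0] * THREADS, events)
```

After one full round the ring is back in thread order, so `pcs_after_round[t]` is the next PC of thread `t`.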

laforest commented 11 years ago

Stringing out the PC memory as a pipeline itself. Never thought of that. :)

The area impact would be approx. (on Stratix IV): 8x10 bits -> 80 registers -> 40 ALMs, vs. the current 10 ALMs of 1 MLAB. Routing shouldn't be a problem given the long pipeline. Also, since the Cyclone IV doesn't have MLABs, it would port well. (Cyclone V has MLABs though...)
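The arithmetic behind that estimate, spelled out as a small sketch. The 2-registers-per-ALM figure is the ratio implied by "80 registers -> 40 ALMs" above; the variable names are hypothetical.

```python
THREADS = 8
PC_BITS = 10
REGS_PER_ALM = 2        # implied by "80 registers -> 40 ALMs" above
MLAB_ALMS = 10          # the current cost: 1 MLAB

pc_registers = THREADS * PC_BITS                 # 8 x 10 bits
pipeline_alms = pc_registers // REGS_PER_ALM     # register-based PC pipeline
extra_alms = pipeline_alms - MLAB_ALMS           # added cost over the MLAB
```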

The main problem I see is that without a block of memory for the PCs, we can't load their initial values from a file as easily. It would take some indirection, like instantiating a block memory with no readers/writers and loading the PC registers from it in an initial block, just like in the Address_Translator module.

You did make me remember something else though: for a vanilla 36-bit Octavo, only 10 of the 20 bits in each MLAB word are used, so I could store "PC-1" in the other 10 bits, removing the subtractor altogether!

I'll look into that.
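A minimal behavioral sketch (Python, not RTL) of the packed-word idea: each 20-bit MLAB word holds the next PC in one half and the PC of the just-fetched instruction ("PC-1") in the other, so an annul only needs a 2:1 mux on read. The field layout (next PC in the low bits) and the function names are assumptions, not taken from the Octavo source.

```python
PC_WIDTH = 10                      # 10 of the 20 bits per MLAB word hold a PC
PC_MASK = (1 << PC_WIDTH) - 1


def write_word(fetched_pc):
    """Pack both PCs on write. Only the existing incrementer is needed:
    the 'PC-1' field is simply the PC that was just fetched."""
    next_pc = (fetched_pc + 1) & PC_MASK
    return ((fetched_pc & PC_MASK) << PC_WIDTH) | next_pc   # assumed layout


def unpack(word):
    """Return (next_pc, pc_minus_1) from a 20-bit word."""
    return word & PC_MASK, (word >> PC_WIDTH) & PC_MASK


def read_pc(word, io_ready):
    """2:1 mux on read: the next PC normally, the stored 'PC-1' on an annul."""
    next_pc, pc_minus_1 = unpack(word)
    return next_pc if io_ready else pc_minus_1
```

For example, `read_pc(write_word(41), io_ready=False)` selects the stored 41 so the annulled instruction is re-issued, while `io_ready=True` selects 42.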

laforest commented 11 years ago

Fixed in the Predication branch, but not fully tested yet.