laforest closed 11 years ago
Another alternative would be to give up the idea of using a solid block of memory for the PC.
Pipeline the PC for 7 stages.
At Ctrl0, a 2:1 mux picks OP or JMP, controlled by the I/O ready bit.
At Ctrl1, a 2:1 mux picks D or the previous PC, also controlled by the I/O ready bit. The PC is, of course, incremented there as well.
One concern would be routing and area for the PC pipeline, but I doubt it would add too much... Also, adding more threads is not as easy as in the original design. But Octavo is already pretty squeezed with memory sharing between 8 threads, so that shouldn't be a big concern.
Stringing out the PC memory as a pipeline itself. Never thought of that. :)
The area impact would be approx. (on Stratix IV): 8x10 bits -> 80 registers -> 40 ALMs, vs. the current 10 ALMs of 1 MLAB. Routing shouldn't be a problem given the long pipeline. Also, since the Cyclone IV doesn't have MLABs, it would port well. (Cyclone V has MLABs though...)
The main problem I see is that without a block of memory for the PCs, we can't load their initial values from a file as easily. It would take some indirection, like instantiating a block memory with no readers/writers and loading the PC registers from it in an initial block, just like in the Address_Translator module.
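The area estimate above can be checked with a few lines of arithmetic. This is only a rough sketch: the 2 flip-flops per ALM figure is inferred from the "80 registers -> 40 ALMs" step, and the function and constant names are made up for illustration.

```python
# Rough area comparison for the pipelined-PC idea (Stratix IV figures from the thread).
THREADS = 8        # hardware threads in Octavo
PC_WIDTH = 10      # bits per PC
FFS_PER_ALM = 2    # assumption: implied by "80 registers -> 40 ALMs"
MLAB_ALMS = 10     # one MLAB costs 10 ALMs


def pipelined_pc_alms(threads=THREADS, pc_width=PC_WIDTH, ffs_per_alm=FFS_PER_ALM):
    """ALMs needed to hold one PC register per thread, counting flip-flops only."""
    registers = threads * pc_width
    return -(-registers // ffs_per_alm)  # ceiling division


pipeline_alms = pipelined_pc_alms()
extra_alms = pipeline_alms - MLAB_ALMS
print(f"pipeline: {pipeline_alms} ALMs, MLAB: {MLAB_ALMS} ALMs, extra: {extra_alms} ALMs")
```

The count ignores the muxes and the incrementer, which exist in either design.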
You did make me remember something else though: for a vanilla 36-bit Octavo, only 10 of the 20 bits in each MLAB word are used, so I could store "PC-1" in the other 10 bits, removing the subtractor altogether!
I'll look into that.
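A minimal behavioural sketch of the "PC-1 in the spare 10 bits" idea, in Python rather than Verilog. Which half of the 20-bit word holds which value is an assumption here, as are the helper names `pack_pc_word` and `unpack_pc_word`.

```python
PC_WIDTH = 10
PC_MASK = (1 << PC_WIDTH) - 1


def pack_pc_word(pc):
    """Build a 20-bit word: PC in the low half, PC-1 in the high half (assumed layout)."""
    prev_pc = (pc - 1) & PC_MASK  # wraps modulo 2**10, like a 10-bit subtractor
    return (prev_pc << PC_WIDTH) | (pc & PC_MASK)


def unpack_pc_word(word):
    """Return (pc, prev_pc) from a packed 20-bit word."""
    return word & PC_MASK, (word >> PC_WIDTH) & PC_MASK


word = pack_pc_word(5)
pc, prev_pc = unpack_pc_word(word)
print(pc, prev_pc)
```

In hardware the "PC-1" half would simply be the pre-increment value written alongside the incremented one, so no subtraction is needed when the word is read back.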
Fixed in the Predication branch, but not fully tested yet.
When an instruction is annulled due to non-ready I/O, we subtract one from the issued PC to re-issue the annulled instruction.
Unfortunately, there is no slack between the Controller and the I_mem, dropping Fmax from ~550 to ~430. Adding two stages of PC_PIPELINE restores speed, but now we have 8 threads running around a 10-stage control pipeline. Not sure of the impact of that.
Alternately: rather than using 20 FFs to pipeline the PC more, what about using 1 MLAB (10 ALMs) to store the one-behind PC to re-issue the instruction? A 2:1 mux, controlled by IO_ready, should be fast enough and it moves the subtractor out of that critical path to I_mem?
But will it affect the future port to Cyclone? (See #6)
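To make the mux idea concrete, here is a small single-thread behavioural model in Python. It is a sketch under assumptions, not the actual Controller logic: jumps are ignored, the I/O ready bit is assumed to be known before the next fetch for the same thread, and `run_thread` is a hypothetical name.

```python
PC_WIDTH = 10
PC_MASK = (1 << PC_WIDTH) - 1


def run_thread(io_ready_trace, start_pc=0):
    """Return the sequence of PCs issued to I_mem for one thread.

    io_ready_trace holds one boolean per issued instruction: True if its I/O
    was ready, False if the instruction was annulled and must be re-issued.
    """
    pc = start_pc
    issued = []
    for io_ready in io_ready_trace:
        issued.append(pc)                # address sent to I_mem this round
        one_behind = pc                  # kept in the extra MLAB (assumed)
        incremented = (pc + 1) & PC_MASK
        # 2:1 mux controlled by IO_ready: no subtractor on this path.
        pc = incremented if io_ready else one_behind
    return issued


print(run_thread([True, False, False, True, True]))
```

With the trace above, the instruction at PC 1 is issued three times in a row before the thread moves on to PC 2.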