We can increase the Instruction Memory storage available to each thread by demultiplexing the PC to 2 or more Instruction Memories and dividing the threads among them. At the extreme, 8 such memories give each thread the maximum addressable amount. The fetched instructions can then be trivially re-multiplexed in order.
Pro:
Easy to implement.
Parameterizable within an I Mem module.
Con:
Lowers I Mem usage efficiency: with 8 memories, each one is read only 1 cycle in 8 (1/8th duty cycle).
(Could we get instruction fetch parallelism somehow?)
Lowers computational density (more BRAMs for no more computations/sec).
Thus, it may be better to increase the I Mem available to a thread by either distributing the thread's code over two or more hardware threads, or dividing the code across cores. Both would yield better density than adding memories, and the second also creates real concurrency in the thread.
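
The trade-off of the demux/remux scheme can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the actual hardware: it assumes 8 threads fetching round-robin (one fetch per cycle), that threads sharing a memory split its depth evenly, and that a thread's memory is chosen by its thread number modulo the memory count. All names (NUM_THREADS, memory_for_thread, words_per_thread, simulate) and the depth of 1024 words are hypothetical.

```python
NUM_THREADS = 8  # assumption: 8 round-robin threads, one fetch per cycle


def memory_for_thread(thread, num_mems):
    """Demux select: which instruction memory serves this thread (assumed mapping)."""
    return thread % num_mems


def words_per_thread(num_mems, depth):
    """Storage per thread, assuming threads sharing a memory split it evenly."""
    return depth * num_mems // NUM_THREADS


def simulate(num_mems, cycles, depth=1024):
    """Fetch round-robin for `cycles` cycles; return fetched words and per-memory duty cycle."""
    # Each memory word is just a label naming its memory and address.
    mems = [[f"m{m}@{a}" for a in range(depth)] for m in range(num_mems)]
    pcs = [0] * NUM_THREADS          # one PC per thread
    accesses = [0] * num_mems        # read count per memory
    fetched = []
    for cycle in range(cycles):
        thread = cycle % NUM_THREADS             # round-robin thread select
        m = memory_for_thread(thread, num_mems)  # demux the PC to one memory
        accesses[m] += 1
        fetched.append((thread, mems[m][pcs[thread]]))  # remux: one stream, in thread order
        pcs[thread] = (pcs[thread] + 1) % depth
    duty = [a / cycles for a in accesses]
    return fetched, duty


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        _, duty = simulate(n, 64)
        print(n, "memories:", words_per_thread(n, 1024), "words/thread, duty", duty[0])
```

In this model, going from 1 to 8 memories raises the per-thread storage from 128 to 1024 words, while each memory's duty cycle falls from 1.0 to 0.125. The total fetch rate stays at one instruction per cycle, which is the density cost noted above.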