kuznia-rdzeni / coreblocks

RISC-V out-of-order core for education and research purposes
https://kuznia-rdzeni.github.io/coreblocks/
BSD 3-Clause "New" or "Revised" License
37 stars 15 forks source link

Striped multiport fifo #428

Open lekcyjna123 opened 1 year ago

lekcyjna123 commented 1 year ago

Currently we have implemented a MultiportFifo, that gives very strong guarantees:

But this comes at cost of exponential complexity, and sometimes we can live with weaker guarantees:

As an example, lets take 2 port fifo with content:

Now we start to pop elements. If we pop from such queue one element we get in first cycle a or b. Let assume that we got a, on the beginning of the second cycle of popping we have:

MultiportFifo guarantees that we can pop now two elements b and either c or d. In StrippedMultiportFifo it will be enough if we get only b (so second port will be unused).

Such structure can be useful to pass elements between pipeline elements if we care about order between cycles, but don't care about full pipeline usage.

tilk commented 1 year ago

If you don't care about throughput, why not use a normal FIFO?

The thing you are suggesting will have pretty weird performance behavior, and the only way to reduce the weirdness is to increase the number of internal queues while keeping the number of ports.

Also, did you mean "striped"?

lekcyjna123 commented 1 year ago

If you don't care about throughput, why not use a normal FIFO?

Because in some cases we can optimistically assume, that nearly always all output port will be ready if at least one is ready. In such case using normal FIFO will be not useful because it will always limit to only 1 port instead. Additionally we can not use simply n different normal FIFO because we can need to keep in order data which are send in different cycles.

One example about which I thought while creating this issue: I can get an vector instruction, which need to be translated to 24 simpler ones. Original instruction do widening addition data in 8 registers simultaneously. So it have to be translated to:

The thing you are suggesting will have pretty weird performance behavior, and the only way to reduce the weirdness is to increase the number of internal queues while keeping the number of ports.

I don't understand why performance behaviour will be weird? In optimistic case whole stripe will be read at once. If there are some stalls and the stripe can not be read at once, then we have to split it to parts and first we process as much data as possible and in next cycle rest.

Also, did you mean "striped"?

Fixed


EDIT: I solved my issue with vector registers without striped queue, but I think that this can be useful in future, because patter seems generic.