Striped multiport fifo - Githubissues

lekcyjna123 commented 1 year ago

Currently we have implemented a MultiportFifo, that gives very strong guarantees:

all elements are in time order
always maximum subset of methods is usable

But this comes at cost of exponential complexity, and sometimes we can live with weaker guarantees:

all elements are in time order
always maximum subset of methods which operates on elements from the same stripe are ready.

As an example, lets take 2 port fifo with content:

cycle 1: a, b
cycle 2: c, d

Now we start to pop elements. If we pop from such queue one element we get in first cycle a or b. Let assume that we got a, on the beginning of the second cycle of popping we have:

cycle 1: b
cycle 2: c, d

MultiportFifo guarantees that we can pop now two elements b and either c or d. In StrippedMultiportFifo it will be enough if we get only b (so second port will be unused).

Such structure can be useful to pass elements between pipeline elements if we care about order between cycles, but don't care about full pipeline usage.

tilk commented 1 year ago

If you don't care about throughput, why not use a normal FIFO?

The thing you are suggesting will have pretty weird performance behavior, and the only way to reduce the weirdness is to increase the number of internal queues while keeping the number of ports.

Also, did you mean "striped"?

lekcyjna123 commented 1 year ago

If you don't care about throughput, why not use a normal FIFO?

Because in some cases we can optimistically assume, that nearly always all output port will be ready if at least one is ready. In such case using normal FIFO will be not useful because it will always limit to only 1 port instead. Additionally we can not use simply n different normal FIFO because we can need to keep in order data which are send in different cycles.

One example about which I thought while creating this issue: I can get an vector instruction, which need to be translated to 24 simpler ones. Original instruction do widening addition data in 8 registers simultaneously. So it have to be translated to:

widen rs1
widen rs2
add rs1 and rs2 So for each register pair this instructions have to be ordered (I have to first widen and after that add data), but job can be parallelised over all 8 destination registers. So I get 3 batches each of 8 instructions. I don't care about order of execution inside a batch but order between batches is important, so each batch can be a stripe in fifo.

The thing you are suggesting will have pretty weird performance behavior, and the only way to reduce the weirdness is to increase the number of internal queues while keeping the number of ports.

I don't understand why performance behaviour will be weird? In optimistic case whole stripe will be read at once. If there are some stalls and the stripe can not be read at once, then we have to split it to parts and first we process as much data as possible and in next cycle rest.

Also, did you mean "striped"?

Fixed

EDIT: I solved my issue with vector registers without striped queue, but I think that this can be useful in future, because patter seems generic.

kuznia-rdzeni / coreblocks

Striped multiport fifo #428