lf-lang / interpret

A time-predictable multicore processor

CpuInterfaceRV is stateful #19

Open petervdonovan opened 1 year ago

petervdonovan commented 1 year ago

Comment lifted from discussion in the pret-noc paper:

Currently I believe that "blocking" means that the hardware continues as usual, except for the fact that future usages of the NoC will silently fail because the NoC is stuck in this "blocking" state. Why do we do this?

The blocking state seems useless to me. I am not sure about the other states, but if we could get rid of them then that could make concurrent accesses to the NoC by different threads on the same core less complicated.

erlingrj commented 1 year ago

Could you elaborate a little, Peter? Are you referring to the case where the sending FIFO is full?

petervdonovan commented 1 year ago

Well, there are a few slightly different issues that are all related to the stateful behavior of the CPU interface.

For reference, this is an example of the use case that I have in mind (copied from here):

[def (do-send)
  {
    [def (others) [memregion t0 t1 noc-base-address words 2 to 16 by 1]]
    [def (scattering) [cyclify a0 a1 others times 5]]
    [def (zipped) [zip
      t4 of [rng-seeded t2 t3 bits 5 length 16 seeded 7]
      and t6 of scattering]]
    [[nonzero-team pariterate-at-rate] {rand,addr} of zipped doing 4 per 42 cycles
      {
        sw [fst rand,addr] [snd rand,addr]
      }]
  }]

The idea here is that in order to fully utilize the available TDM slots when performing a scatter operation, we might want to send to all the cores at once.

In detail, this code zips a stream of repeating MMIO addresses together with a stream of random numbers. All 8 threads on the sending core concurrently iterate over different parts of this zipped stream, writing the random numbers to the addresses. Each thread does only 4 iterations every 42 thread cycles, so this doesn't really maximize the utilization of the TDM slots. But if the number of cores were greater, if the number of threads were smaller, or if the source of numbers were something faster than an RNG (such as just copying from an array), then this method of writing to all the other cores (here, cores 2 to 16) at once would be necessary to use the resource efficiently.
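
In plain C, the pattern is roughly the following. This is only a sketch: the RNG, the MMIO map, and the names noc_tx_addr and next_random are placeholders I am making up here, and the 4-per-42-cycles pacing is left as a comment.

#include <stdint.h>

#define NUM_DESTS 15                 /* destination cores 2 to 16 */
#define STREAM_LEN 16                /* matches the "length 16" stream above */

extern uint32_t next_random(uint32_t *state);          /* placeholder RNG */
extern volatile uint32_t *noc_tx_addr(unsigned core);  /* placeholder MMIO map */

/* Each of the 8 hardware threads runs this with a distinct tid (0..7),
 * striding over the zipped stream so the threads cover disjoint parts. */
void scatter_thread(unsigned tid)
{
    uint32_t rng = 7u + tid;                     /* "seeded 7", per thread */
    for (unsigned i = tid; i < STREAM_LEN; i += 8) {
        unsigned dest = 2 + (i % NUM_DESTS);     /* cycle over cores 2..16 */
        *noc_tx_addr(dest) = next_random(&rng);  /* one store = one send */
        /* pacing omitted: at most 4 such stores per 42 thread cycles */
    }
}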

For this to work efficiently, we would need the MMIO address itself to specify the destination core, rather than specifying the destination in a separate write. More generally, we would not want the interface to be a state machine, because then threads trying to send messages to different destination cores would interfere with each other. Having the address and the data in one store instruction makes sending a word on the NoC an atomic operation.
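
As a concrete sketch of what that interface could look like (the base address, bit layout, and macro names here are assumptions for illustration, not the actual InterPRET memory map):

#include <stdint.h>

/* Hypothetical MMIO layout: the destination core ID is encoded in low-order
 * address bits, so a single store carries both the data and the destination. */
#define NOC_BASE 0xE0000000u                /* assumed base address */
#define NOC_TX_ADDR(dest) \
    ((volatile uint32_t *)(NOC_BASE + ((uint32_t)(dest) << 2)))

/* Sending a word is one sw instruction: atomic with respect to the other
 * threads, with no shared send state to set up beforehand. */
static inline void noc_send(unsigned dest, uint32_t word)
{
    *NOC_TX_ADDR(dest) = word;
}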

We can also reduce interference between different threads sending to different destination cores by removing the FIFO, which has "entries" containing both data and a destination core, and just writing directly to the split buffers. The problem with the FIFO is that because it imposes an order on the words that can be sent, it is a source of timing interference between send operations performed by different threads that have different destination cores. If the entry at the head of the FIFO is waiting for its TDM slot, the whole FIFO has to wait, even as the other TDM slots come and go.
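
Here is a rough model of the split-buffer idea, again with invented names: one shallow buffer per destination core, each drained independently in its own TDM slot, so a word waiting for one core's slot never delays a word bound for another core.

#include <stdbool.h>
#include <stdint.h>

#define NUM_CORES 16

/* Hypothetical per-destination send buffers replacing the single FIFO.
 * Entries impose no order on each other, so there is no head-of-line
 * blocking between different destinations. */
struct split_tx {
    uint32_t data[NUM_CORES];
    bool     full[NUM_CORES];  /* set by the store, cleared by the NI when
                                  the destination's TDM slot drains it */
};

/* A send to destination d stalls or fails only if buffer d itself is still
 * occupied, regardless of how backed up the other destinations are. */
static bool try_send(struct split_tx *tx, unsigned dest, uint32_t word)
{
    if (tx->full[dest])
        return false;
    tx->data[dest] = word;
    tx->full[dest] = true;
    return true;
}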

I have already prototyped an implementation of these ideas that works with 16 cores. I can prepare PRs if we agree that this is wise.

schoeberl commented 1 year ago

Fully agree. I also had this idea in mind: make the destination part of the write address. This would make the send operation atomic. We can do this.

On the interference between different destinations on a single FIFO: I've added a split FIFO to the NI to avoid head-of-line blocking. This should be the default configuration. I would really like to have this in here, but I spent about 2 hours today fighting with PGP signing and uploading to Maven and did not succeed. In the meantime, see the actual source of S4NOC for the NI changes. I also have a paper submitted to DSD that I can share.

Cheers, Martin

petervdonovan commented 1 year ago

I've added a split FIFO to the NI to avoid head-of-line blocking.

If the split FIFOs never fill up, then the single FIFO that comes before the split FIFOs will not do anything, and if they do fill up, then we will have the same timing interference that I mentioned, right? I am concerned that the only case in which the single FIFO is useful is the case when something has already gone wrong.

schoeberl commented 1 year ago

I think the split FIFOs are basically needed to avoid the so-called head-of-line blocking: one packet cannot get out in time because another, waiting for a later slot, is sitting in front of it. Therefore, these split FIFOs only need to be shallow; for equal traffic, just a single word per TDM round. However, this all depends on the traffic pattern. One can make up traffic patterns where there are more packets on one channel than on another.
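To put assumed numbers on that: with 16 cores and one slot per destination in the schedule, a TDM round is 15 slots. If each channel receives at most one word per round, a word waits at most one full round (15 slots) before its own slot arrives, so a depth of one word per destination suffices; only traffic that puts several words on the same channel within one round would need deeper split FIFOs.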

petervdonovan commented 1 year ago

I think the split FIFOs are basically needed to avoid the so-called head-of-line blocking: one packet cannot get out in time because another, waiting for a later slot, is sitting in front of it. Therefore, these split FIFOs only need to be shallow; for equal traffic, just a single word per TDM round. However, this all depends on the traffic pattern. One can make up traffic patterns where there are more packets on one channel than on another.

I agree with all of this, but I do not see how it answers my question.

I guess one answer would be that we want to support both the use case where we are scattering to multiple destination cores, and the use case where we are sending to just one. The split FIFOs are good for the former of these, and the combined FIFO is good for the latter, so the hybrid approach covers both use cases.

But we can think about how important the latter use case is. If we believe that data must be consumed at exactly the same rate as it is produced, then the latter use case is not affected by the depth of the FIFO, as long as it has a depth of at least one or two. A deeper FIFO only helps when we want to send a message longer than one word but no longer than the FIFO's capacity; that allows for asynchrony between the sender and the receiver, which might turn out to be really useful. I'm not sure.