3x3 Arch broadcast vs smart line buffers

There are two options for implementing the 3x3 arch. The first is a series of broadcasts over groups of 3x3 PEs. This keeps the ifmap program simple and doesn't require any special timing adjustment registers. The disadvantage is that it introduces dead cycles in each ifmap line where the computed psum out is garbage, because it corresponds to a 3x3 kernel position that doesn't exist in a convolution sweep. These dead cycles hurt utilization, and their output has to be ignored by the psum memory. The alternative is to implement a timing adjustment scheme in the smart line buffers that (when coupled with wider SRAM rows in ifmap memory) guarantees no drop in utilization.
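To get a feel for how much the dead cycles cost, here is a rough sketch. It assumes (this is not stated above) that the broadcast scheme streams one ifmap column per cycle at stride 1, so a line of width W costs W cycles while only W - 2 of the 3x3 positions are real. The function names are made up for illustration.

```python
def dead_cycles_per_line(line_width: int, kernel_width: int = 3) -> int:
    """Cycles per ifmap line whose psum is garbage (assumed model, stride 1)."""
    if line_width < kernel_width:
        raise ValueError("line is narrower than the kernel")
    # Only kernel positions that fit entirely inside the line are valid.
    valid_positions = line_width - kernel_width + 1
    return line_width - valid_positions


def sweep_utilization(line_width: int, kernel_width: int = 3) -> float:
    """Fraction of cycles in one line that produce a usable psum."""
    dead = dead_cycles_per_line(line_width, kernel_width)
    return (line_width - dead) / line_width


if __name__ == "__main__":
    for width in (8, 32, 224):
        print(width, dead_cycles_per_line(width), round(sweep_utilization(width), 4))
```

Under this model the loss shrinks as lines get wider, which fits the description of the broadcast cost as a small drop in utilization.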

The advantage of the broadcast is that it reduces the number of unique accesses per unrolled kernel group. This simplifies the memory hierarchy and makes descriptor generation incredibly easy. You don't have to do it this way, though. You can instead distribute the ifmap pixels across the banks so that every unique access required in a given cycle can be served in that cycle. The downside is that this distribution greatly complicates descriptor generation, because the required access expressions cease to be affine. In fact, I'm not even sure that, with the banking distribution I shoddily put together, you can generalize any of the accesses to a bank with a single descriptor. This is very similar to the age-old banking algo question: you want to minimize storage/data duplication without making the access pattern so complicated that you need another memory to store the mapping of where stuff is on-chip (in the worst case, that memory could be as big as the storage needed for the original data). The problem of distributing stuff successfully will complicate data movement regardless of which level of memory you are at. You can avoid it with the broadcast trick and pay with a small drop in utilization.
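As a toy illustration of the banking tradeoff, the sketch below uses a hypothetical 9-bank mapping, `bank = (row % 3) * 3 + (col % 3)`. This is not the distribution referred to above, just the simplest one that puts all nine pixels of any 3x3 window in different banks. The in-bank address then needs floor division, and the address a single bank sees while the window slides along a row does not advance with a constant stride.

```python
from itertools import product

K = 3  # kernel height and width


def bank_of(row: int, col: int) -> int:
    """Hypothetical mapping: pixels K apart in either axis share a bank."""
    return (row % K) * K + (col % K)


def addr_in_bank(row: int, col: int, width: int) -> int:
    """Address inside the bank; note the floor divisions (not affine in row, col)."""
    cols_per_bank = -(-width // K)  # ceil(width / K)
    return (row // K) * cols_per_bank + (col // K)


def is_conflict_free(height: int, width: int) -> bool:
    """True if every KxK window touches K*K distinct banks."""
    for r0 in range(height - K + 1):
        for c0 in range(width - K + 1):
            banks = {bank_of(r0 + dr, c0 + dc)
                     for dr, dc in product(range(K), repeat=2)}
            if len(banks) != K * K:
                return False
    return True


def bank_addr_trace(bank: int, row0: int, width: int) -> list:
    """Addresses one bank serves as the window slides along row0, one per cycle."""
    trace = []
    for c0 in range(width - K + 1):
        for dr, dc in product(range(K), repeat=2):
            r, c = row0 + dr, c0 + dc
            if bank_of(r, c) == bank:
                trace.append(addr_in_bank(r, c, width))
    return trace


if __name__ == "__main__":
    print(is_conflict_free(9, 9))
    print(bank_addr_trace(bank=0, row0=0, width=9))
```

In this toy mapping, bank 0 holds on one address for several cycles and then steps, so a single base-plus-stride expression in the window position would not describe it. Whether a real descriptor format could absorb that with a repeat count is a separate question that this sketch does not answer.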

neu-ece-esl / cnn_processor_model #3