Weights are sparse, and that's a big factor in dealing with most models. To take advantage of it you can compress and decompress weights on the fly, make the PEs smart enough to skip computations that would produce zeros, and make the memory system skip over zeros while still keeping the output valid. You can also reorganize the order of filters within the weight tile so that each ifmap value you load gets used across as many filters with non-zero weights at that position as possible, which reduces the energy cost of ifmap accesses. Very sparse tiles can be offloaded onto more intelligent PEs. A sketch of the compression and zero-skipping idea follows below.

The disadvantage is that this complicates writing psums into the psum memories: if you want to shuffle the filter order back to where it should be, the write order changes (and that assumes you have enough on-chip storage to hold the entire ofmap). If you don't, you can do something wackier: keep the shuffled filter order throughout the entire network. If you shuffle filters (Fn, Fm) to (Fm, Fn), you produce output channels Cn, Cm in the "wrong" order Cm, Cn, which is the "wrong" input-channel order for the next layer. That's fine as long as the next layer always correlates each channel with the right filter, i.e. its filters have their input channels permuted to match (and those filters may themselves be shuffled again). This persistent shuffling without reordering propagates through the entire network. So the question becomes: what is the most appropriate filter shuffle that minimizes the ifmap access cost due to sparsity across the whole network, given that a shuffle in an earlier layer creates downstream dependencies that every later layer has to account for? The second sketch below shows why the persistent shuffle leaves the final output unchanged.
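A minimal sketch of the first two tricks, with illustrative names that are not tied to any particular accelerator's format: the weights of a filter tile are stored compressed as (index, value) pairs with the zeros dropped, and the inner MAC loop only walks the surviving entries, so no cycles or ifmap reads are spent on zero weights.

```python
import numpy as np

def compress(weights, tol=0.0):
    """Keep only the non-zero weights as parallel index/value arrays."""
    idx = np.flatnonzero(np.abs(weights) > tol)
    return idx.astype(np.int32), weights.flat[idx]

def sparse_dot(idx, vals, ifmap_flat):
    """Zero-skipping MAC: accumulate only where the weight is non-zero."""
    acc = 0.0
    for i, w in zip(idx, vals):
        acc += w * ifmap_flat[i]   # every ifmap read here is guaranteed useful work
    return acc

rng = np.random.default_rng(0)
w = rng.standard_normal(64)
w[rng.random(64) < 0.8] = 0.0      # roughly 80% sparse filter tile
x = rng.standard_normal(64)

idx, vals = compress(w)
assert np.isclose(sparse_dot(idx, vals, x), np.dot(w, x))
print(f"kept {len(vals)}/{w.size} weights; result matches the dense dot product")
```

The same bookkeeping also tells you, per ifmap position, how many filters actually need that value, which is the quantity the filter-reordering idea above tries to maximize per load.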
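And a minimal sketch of the persistent shuffle, using assumed shapes and 1x1 convolutions for brevity (the argument is the same for larger kernels, since the filter index is just the output-channel index): permuting layer L's filters is exactly compensated by permuting the input-channel dimension of layer L+1's filters, so the downstream output is unchanged and no psum reordering is needed.

```python
import numpy as np

rng = np.random.default_rng(0)

C0, C1, C2 = 4, 6, 5              # channel counts: input, layer-1 output, layer-2 output
H, W = 8, 8                       # spatial size
x  = rng.standard_normal((C0, H, W))
w1 = rng.standard_normal((C1, C0))   # layer 1: C1 filters over C0 input channels
w2 = rng.standard_normal((C2, C1))   # layer 2: C2 filters over C1 input channels

def conv1x1(w, x):
    # output channel f = sum_c w[f, c] * x[c, :, :]
    return np.einsum('fc,chw->fhw', w, x)

# Reference: no shuffling anywhere.
ref = conv1x1(w2, conv1x1(w1, x))

# Shuffle layer 1's filter order (e.g. to group sparse filters together) ...
perm = rng.permutation(C1)
w1_shuffled = w1[perm]
# ... and compensate by permuting layer 2's input-channel dimension to match.
w2_compensated = w2[:, perm]

out = conv1x1(w2_compensated, conv1x1(w1_shuffled, x))

# Identical up to floating-point summation order: the shuffle can persist
# through the network as long as every later layer's channel order tracks it.
print("outputs match:", np.allclose(ref, out))
```

This is why the optimization turns into a whole-network problem: each layer's shuffle is free to target its own sparsity pattern, but it fixes the channel order that the next layer's (possibly also shuffled) filters have to match.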