Perhaps a blocking strategy, like the one used for matrix multiplication, would be
advantageous: when reading from the history buffer, we can load the history for N nodes
(where N is chosen so that the chunk just fits in the L3 cache), increment the coupling
vector by the contributions of those nodes, then continue with the next chunk of N nodes.
(thx to @i-Zaak for the idea)
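
To make the idea concrete, here is a rough NumPy sketch of what the chunked accumulation could look like. It is not the project's actual code: the function names, the `(horizon, n_nodes)` layout of `history`, the integer `idelays` matrix and the `block_size` parameter are all assumptions made for illustration, and whether it actually helps would need to be measured.

```python
import numpy as np

def coupling_reference(history, weights, idelays, step):
    """Unblocked version: gather every delayed state at once, then reduce."""
    horizon, n_nodes = history.shape
    rows = (step - idelays) % horizon              # (n_nodes, n_nodes) time indices
    delayed = history[rows, np.arange(n_nodes)]    # delayed[i, j] = state of j as seen by i
    return (weights * delayed).sum(axis=1)

def coupling_blocked(history, weights, idelays, step, block_size):
    """Blocked version: process source nodes in chunks of block_size (the N above)."""
    horizon, n_nodes = history.shape
    coupling = np.zeros(n_nodes)
    for start in range(0, n_nodes, block_size):
        stop = min(start + block_size, n_nodes)
        # Copy the history of this chunk of nodes into one contiguous array,
        # so the gathers below only touch this (hopefully cache-resident) chunk.
        chunk = np.ascontiguousarray(history[:, start:stop])   # (horizon, b)
        rows = (step - idelays[:, start:stop]) % horizon       # (n_nodes, b)
        cols = np.arange(stop - start)                         # (b,)
        delayed = chunk[rows, cols]                            # (n_nodes, b)
        # Increment the coupling vector by the contributions of these nodes.
        coupling += (weights[:, start:stop] * delayed).sum(axis=1)
    return coupling
```

In this sketch `block_size` would be picked so that `horizon * block_size * history.itemsize` stays just under the L3 cache size; the result should match the unblocked version up to floating-point summation order.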