thomasrolinger closed 2 years ago
Thus far, even with the SoA approach, I can't seem to get performance gains with block swapping. While it does seem we are increasing the number of local reads, the communication diagnostics are not reflecting as large a change.
I'm not sure whether this coarse-grained data remapping will provide enough of a benefit.
Another avenue is fine-grained data remapping: for a given `A[i]`, put it on the locale that accesses it the most. This requires a more invasive inspector, similar to what we do for replication (i.e., counting how many times each locale accesses a given element). It also requires an indirection array to apply the remapping, not only in the kernel but anywhere else the array is accessed in the program.
If we go down this path, we need to do it by hand to see if there are any gains.
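To make the idea concrete, here is a minimal Python sketch of what such a fine-grained inspector and placement step could look like. All names here (`num_locales`, `owning_locale`, `B`, `placement`) are assumptions made up for illustration, not identifiers from the actual implementation:

```python
from collections import defaultdict

# Hypothetical sketch of a fine-grained inspector for accesses of the
# form A[B[i]]. The numbers and names are illustrative only.

num_locales = 4
B = [0, 3, 3, 7, 7, 7, 2, 5]            # indices into A issued by the forall

def owning_locale(i):
    # Locale that executes iteration i, assuming a block-distributed
    # iteration space of len(B) iterations across num_locales locales.
    return i * num_locales // len(B)

# Inspector: for each touched element of A, count how many times each
# locale accesses it.
counts = defaultdict(lambda: [0] * num_locales)
for i, j in enumerate(B):
    counts[j][owning_locale(i)] += 1

# Placement: put each element on the locale that accesses it the most.
# The indirection array (old index -> new index) that every access A[j]
# must then go through would be built from this placement.
placement = {j: max(range(num_locales), key=lambda loc: c[loc])
             for j, c in counts.items()}
```

The indirection lookup is the cost that makes this more invasive than block swapping: every access to the array, anywhere in the program, has to go through it.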
There are a couple of key things we have determined by looking into this optimization:

- We cannot remap individual elements with `BlockDist` because it breaks some of the distribution's internal workings. Perhaps down the road we could figure out a way to address this. The `BlockCyclic` distribution allows for this, but it is too slow to consider using (it also does not work with data replication, since it does not support `localSubdomain()`). However, we can still use `BlockDist` if we swap whole blocks, as that does not break anything internally.
- If `A` in `A[B[i]]` is the array that is being iterated over by the `forall`, then we cannot apply data remapping via block swapping; more details on this below. We can step around this for many of our applications by using a struct-of-arrays approach rather than an array-of-structs.

### Understanding the Implications of Swapping Blocks When the Array is Iterated Over by the `forall` Loop

A big concern is when the array we are remapping is iterated over by the `forall`. This complicates the remapping because, by swapping blocks, we also "carry over" that change to where the `forall` iterations execute. If we saw that a given block had 0 local accesses and decided to swap it with another block, we would still have 0 local accesses, because the source of the accesses is entirely determined by where the `forall` iterations run.

Right now, this happens specifically for PageRank and Moldyn.
Let's consider an example below from PageRank. Here are the access counts for 4 locales:
Each row represents access counts issued from a locale/block to the different blocks/locales. Initially, block `i` is on locale `i`. From this initial map, we see that Locale 2 does not issue any accesses to Block 2. This appears problematic because Block 2 is on Locale 2, so we have no local accesses. Furthermore, we see that Locale 1 issues relatively few accesses to its local block, Block 1. It would seem like a good idea to swap Blocks 1 and 2. However, the array that these counts are for is iterated over by the `forall` loop.

By moving Block 1 to Locale 2, any accesses "from" Block 1's original locale (Locale 1) will now come from Locale 2, and vice versa for the move of Block 2 to Locale 1. This first step results in the following map, which is computed by swapping the rows.
Now the next step is to swap the columns, resulting in the final map:
This row/column swapping can be done in either order (i.e., we can swap the columns first and then the rows, or vice versa). But there is no discernible net win here in terms of reducing remote accesses.
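The "no net win" has a simple explanation: when the array drives the `forall`, a block swap permutes the rows and columns of the counts matrix by the same permutation, so the diagonal entries (the local accesses) are only shuffled around, never grown. A quick Python sketch with a made-up counts matrix (the numbers are illustrative, not the real PageRank counts):

```python
# Hypothetical 4x4 access-count matrix: counts[r][c] = accesses issued
# from Locale r's block to Block c. Numbers are made up for illustration.
counts = [
    [50,  5,  0, 10],
    [ 8,  3, 40,  2],
    [ 0, 30,  0,  7],
    [12,  1,  6, 45],
]

def swap_blocks(m, a, b):
    # When the array is iterated over by the forall, swapping blocks a and b
    # moves both where iterations run (rows) and where data lives (columns),
    # so rows AND columns are permuted together.
    m = [row[:] for row in m]
    m[a], m[b] = m[b], m[a]              # rows: accesses come from the other locale
    for row in m:
        row[a], row[b] = row[b], row[a]  # columns: data lives on the other locale
    return m

def local_accesses(m):
    return sum(m[i][i] for i in range(len(m)))

swapped = swap_blocks(counts, 1, 2)
print(local_accesses(counts), local_accesses(swapped))  # prints: 98 98
```

The diagonal of the swapped matrix is a permutation of the original diagonal, so the total number of local accesses cannot change, no matter which blocks we swap.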
### Understanding the Implications of Swapping Blocks When the Array is NOT Iterated Over by the `forall` Loop

When the array we are optimizing is NOT iterated over by the `forall` loop, the problem is simpler. In this case, we only need to swap columns, and we can work towards reducing remote accesses. The diagonal of the counts matrix represents the local accesses, which we want to be as large as possible. We can swap columns and use something like the Hungarian algorithm to maximize the trace of the matrix.
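For small locale counts, the best column permutation can even be found by brute force. The sketch below (again with made-up counts) does that, standing in for the Hungarian algorithm, which solves the same assignment problem in polynomial time:

```python
from itertools import permutations

# Hypothetical counts matrix: counts[r][c] = accesses from Locale r to
# Block c. Only columns move here, so we pick the column permutation
# (an assignment of blocks to locales) that maximizes the trace.
counts = [
    [ 5, 40,  0, 10],
    [ 8,  3, 45,  2],
    [50,  1,  0,  7],
    [12,  1,  6, 30],
]

def best_block_placement(m):
    # Brute force over all column permutations; the Hungarian algorithm
    # would find the same optimum without enumerating n! candidates.
    n = len(m)
    return max(permutations(range(n)),
               key=lambda p: sum(m[i][p[i]] for i in range(n)))

placement = best_block_placement(counts)
# placement[i] = which block should live on locale i
local = sum(counts[i][placement[i]] for i in range(len(counts)))
```

On this example the initial trace is 38 local accesses, and the best placement raises it to 165, which is the kind of win that is impossible when the array also drives the `forall`.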
### Conjugate Gradient with Other Matrices?
The problem with NAS-CG is that the sparsity pattern of the matrix generated by the benchmark is uniformly random, so there is no chance of block swapping having any effect. The HPCG benchmark uses 27-point stencils, so that doesn't help either. I looked at other real-world matrices and nothing stands out. So I would say that NAS-CG does not have any potential wins here. But it will be a good use-case to show that we will not apply the optimization when there are no obvious gains.