thomasrolinger closed 2 years ago
Thus far, even with the SoA approach, I can't seem to get performance gains with block swapping. While it does seem we are increasing the number of local reads, the communication diagnostics are not reflecting as large a change.
I'm not sure whether this coarse-grained data remapping will provide enough of a benefit.
Another avenue is fine-grained data remapping: for a given `A[i]`, put it on the locale that accesses it the most. This requires a more invasive inspector, similar to what we do for replication (i.e., counting how many times each locale accesses a given element). It also requires an indirection array to apply the remapping, not only in the kernel but anywhere else the array is accessed in the program.
If we go down this path, we need to do it by hand to see if there are any gains.
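To make the idea concrete, here is a minimal Python sketch of what such a fine-grained inspector and placement step could look like. All names here (`num_locales`, `owning_locale`, `B`, `placement`) are assumptions made up for illustration, not identifiers from the actual implementation:

```python
from collections import defaultdict

# Hypothetical sketch of a fine-grained inspector for accesses of the
# form A[B[i]]. The numbers and names are illustrative only.

num_locales = 4
B = [0, 3, 3, 7, 7, 7, 2, 5]            # indices into A issued by the forall

def owning_locale(i):
    # Locale that executes iteration i, assuming a block-distributed
    # iteration space of len(B) iterations across num_locales locales.
    return i * num_locales // len(B)

# Inspector: for each touched element of A, count how many times each
# locale accesses it.
counts = defaultdict(lambda: [0] * num_locales)
for i, j in enumerate(B):
    counts[j][owning_locale(i)] += 1

# Placement: put each element on the locale that accesses it the most.
# The indirection array (old index -> new index) that every access A[j]
# must then go through would be built from this placement.
placement = {j: max(range(num_locales), key=lambda loc: c[loc])
             for j, c in counts.items()}
```

The indirection lookup is the cost that makes this more invasive than block swapping: every access to the array, anywhere in the program, has to go through it.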
There are a couple of key things we have determined by looking into this optimization:

- We cannot remap individual elements with `BlockDist` because it breaks some of the distribution's internal workings. Perhaps down the road we could figure out a way to address this. The `BlockCyclic` distribution allows for this, but it is too slow to consider using (it also does not work with data replication, since it does not support `localSubdomain()`). However, we can still use `BlockDist` if we swap whole blocks, as that does not break anything internally.
- If `A` in `A[B[i]]` is the array that is being iterated over by the `forall`, then we cannot apply data remapping via block swapping; more details on this below. We can step around this for many of our applications by using a struct-of-arrays approach rather than an array-of-structs.

### Understanding the Implications of Swapping Blocks When the Array is Iterated Over by the `forall` Loop

A big concern is when the array we are remapping is iterated over by the `forall`. This complicates the remapping because, by swapping blocks, we also "carry over" that change to where the `forall` iterations execute. If we saw that a given block had 0 local accesses and decided to swap it with another block, we would still have 0 local accesses, because the source of the accesses is entirely determined by where the `forall` iterations run.

Right now, this happens specifically for PageRank and Moldyn.
Let's consider an example below from PageRank. Here are the access counts for 4 locales:
Each row represents access counts issued from a locale/block to the different blocks/locales. Initially, block `i` is on locale `i`. From this initial map, we see that Locale 2 does not issue any accesses to Block 2. This appears problematic because Block 2 is on Locale 2, so we have no local accesses. Furthermore, we see that Locale 1 issues relatively few accesses to its local block, Block 1. It would seem like a good idea to swap Blocks 1 and 2. However, the array that these counts are for is iterated over by the `forall` loop.

By moving Block 1 to Locale 2, any accesses "from" Block 1's original locale (Locale 1) will now come from Locale 2, and vice versa for the move of Block 2 to Locale 1. This first step results in the following map, which is computed by swapping the rows.
Now the next step is to swap the columns, resulting in the final map:
This row/column swapping can be done in either order (i.e., we can swap the columns first and then the rows, or vice versa). But there is no discernible net win here in terms of reducing remote accesses.
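The "no net win" has a simple explanation: when the array drives the `forall`, a block swap permutes the rows and columns of the counts matrix by the same permutation, so the diagonal entries (the local accesses) are only shuffled around, never grown. A quick Python sketch with a made-up counts matrix (the numbers are illustrative, not the real PageRank counts):

```python
# Hypothetical 4x4 access-count matrix: counts[r][c] = accesses issued
# from Locale r's block to Block c. Numbers are made up for illustration.
counts = [
    [50,  5,  0, 10],
    [ 8,  3, 40,  2],
    [ 0, 30,  0,  7],
    [12,  1,  6, 45],
]

def swap_blocks(m, a, b):
    # When the array is iterated over by the forall, swapping blocks a and b
    # moves both where iterations run (rows) and where data lives (columns),
    # so rows AND columns are permuted together.
    m = [row[:] for row in m]
    m[a], m[b] = m[b], m[a]              # rows: accesses come from the other locale
    for row in m:
        row[a], row[b] = row[b], row[a]  # columns: data lives on the other locale
    return m

def local_accesses(m):
    return sum(m[i][i] for i in range(len(m)))

swapped = swap_blocks(counts, 1, 2)
print(local_accesses(counts), local_accesses(swapped))  # prints: 98 98
```

The diagonal of the swapped matrix is a permutation of the original diagonal, so the total number of local accesses cannot change, no matter which blocks we swap.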
### Understanding the Implications of Swapping Blocks When the Array is NOT Iterated Over by the `forall` Loop

When the array we are optimizing is NOT iterated over by the `forall` loop, the problem is simpler. In this case, we only need to swap columns, and we can work towards reducing remote accesses. The diagonal of the counts matrix represents the local accesses, which we want to be as large as possible. We can swap columns and use something like the Hungarian algorithm to maximize the trace of the matrix.
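For small locale counts, the best column permutation can even be found by brute force. The sketch below (again with made-up counts) does that, standing in for the Hungarian algorithm, which solves the same assignment problem in polynomial time:

```python
from itertools import permutations

# Hypothetical counts matrix: counts[r][c] = accesses from Locale r to
# Block c. Only columns move here, so we pick the column permutation
# (an assignment of blocks to locales) that maximizes the trace.
counts = [
    [ 5, 40,  0, 10],
    [ 8,  3, 45,  2],
    [50,  1,  0,  7],
    [12,  1,  6, 30],
]

def best_block_placement(m):
    # Brute force over all column permutations; the Hungarian algorithm
    # would find the same optimum without enumerating n! candidates.
    n = len(m)
    return max(permutations(range(n)),
               key=lambda p: sum(m[i][p[i]] for i in range(n)))

placement = best_block_placement(counts)
# placement[i] = which block should live on locale i
local = sum(counts[i][placement[i]] for i in range(len(counts)))
```

On this example the initial trace is 38 local accesses, and the best placement raises it to 165, which is the kind of win that is impossible when the array also drives the `forall`.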
### Conjugate Gradient with Other Matrices?
The problem with NAS-CG is that the sparsity pattern of the matrix generated by the benchmark is uniformly random, so there is no chance of block swapping having any effect. The HPCG benchmark uses 27-point stencils, so that doesn't help either. I looked at other real-world matrices and nothing stands out. So I would say that NAS-CG does not have any potential wins here. But it will be a good use-case to show that we will not apply the optimization when there are no obvious gains.