thomasrolinger / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Design/plan data remapping optimization #24

Closed thomasrolinger closed 2 years ago

thomasrolinger commented 2 years ago

There are a couple of key things we have determined by looking into this optimization:

  1. We are limited to swapping blocks. We cannot oversubscribe the locales with blocks when using BlockDist, because doing so breaks some of the distribution's internal workings; perhaps down the road we could figure out a way to address this. The BlockCyclic distribution does allow oversubscription, but it is too slow to consider using (it also does not work with data replication because it does not support localSubdomain()). However, we can still use BlockDist if we swap blocks, as that does not break anything internally.
  2. If the array A in A[B[i]] is the array being iterated over by the forall, then we cannot apply data remapping via block swapping. More details on this below. For many of our applications, we can work around this by using a struct-of-arrays approach rather than an array-of-structs.
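To make the struct-of-arrays (SoA) workaround concrete, here is a small illustrative sketch in Python (not Chapel; all names and data are hypothetical). The point is that with SoA, each field is its own array, so a heavily accessed field can be block-swapped independently of the array the forall iterates over:

```python
# Illustrative sketch: array-of-structs (AoS) vs. struct-of-arrays (SoA).
# Hypothetical field names ("pos", "vel") and data, for illustration only.

# AoS: one array of records; remapping must move whole records, so every
# field is forced to follow the same layout.
aos = [{"pos": i * 1.0, "vel": i * 0.1} for i in range(8)]

# SoA: one standalone array per field; each can be remapped (block-swapped)
# independently of the others.
soa = {
    "pos": [i * 1.0 for i in range(8)],
    "vel": [i * 0.1 for i in range(8)],
}

# Remapping only the heavily accessed field is just a permutation of that
# one array: here we swap the two halves ("blocks") of "pos".
block_swap = [4, 5, 6, 7, 0, 1, 2, 3]
soa["pos"] = [soa["pos"][j] for j in block_swap]

# soa["pos"][0] is now the element that originally lived in the second block,
# while soa["vel"] (and any array driving the forall) is untouched.
```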

Understanding the Implications of Swapping Blocks When the Array is Iterated Over by the forall Loop

A big concern arises when the array we are remapping is iterated over by the forall. This complicates the remapping because by swapping blocks, we are also "carrying over" that change to where the forall iterations execute. If we see that a given block has 0 local accesses and swap it with another block, we will still have 0 local accesses, because the source of the accesses is entirely determined by the forall iterations.

Right now, this happens specifically for PageRank and Moldyn.

Let's consider an example below from PageRank. Here are the access counts for 4 locales:

| | Block 0/Locale 0 | Block 1/Locale 1 | Block 2/Locale 2 | Block 3/Locale 3 |
| --- | --- | --- | --- | --- |
| Locale 0/Block 0 | 135734 | 134711 | 135734 | 133687 |
| Locale 1/Block 1 | 134711 | 4092 | 133688 | 3068 |
| Locale 2/Block 2 | 135734 | 133688 | 0 | 1536 |
| Locale 3/Block 3 | 133687 | 3068 | 1536 | 3068 |

Each row represents access counts issued from a locale/block to the different blocks/locales. Initially, block i is on locale i. From this initial map, we see that Locale 2 does not issue any accesses to Block 2. This appears problematic because Block 2 is on Locale 2, so we have no local accesses. Furthermore, we see that Locale 1 issues relatively few accesses to its local block, Block 1. It would seem like a good idea to swap Blocks 1 and 2. However, the array that these counts are for is iterated over by the forall loop.

By moving Block 1 to Locale 2, any accesses "from" Block 1's original locale (Locale 1) will now come from Locale 2, and vice versa for the move of Block 2 to Locale 1. This first step results in the following map, which is computed by swapping the rows.

| | Block 0/Locale 0 | Block 1/Locale 1 | Block 2/Locale 2 | Block 3/Locale 3 |
| --- | --- | --- | --- | --- |
| Locale 0/Block 0 | 135734 | 134711 | 135734 | 133687 |
| Locale 1/Block 2 | 135734 | 133688 | 0 | 1536 |
| Locale 2/Block 1 | 134711 | 4092 | 133688 | 3068 |
| Locale 3/Block 3 | 133687 | 3068 | 1536 | 3068 |

Now the next step is to swap the columns, resulting in the final map:

| | Block 0/Locale 0 | Block 2/Locale 1 | Block 1/Locale 2 | Block 3/Locale 3 |
| --- | --- | --- | --- | --- |
| Locale 0/Block 0 | 135734 | 135734 | 134711 | 133687 |
| Locale 1/Block 2 | 135734 | 0 | 133688 | 1536 |
| Locale 2/Block 1 | 134711 | 133688 | 4092 | 3068 |
| Locale 3/Block 3 | 133687 | 1536 | 3068 | 3068 |

This row/column swapping can be done in either order (i.e., we can swap the columns first and then the rows, or the other way around). But no net win is discernible here in terms of reducing remote accesses: because the rows and columns are permuted by the same permutation, the diagonal sum of the matrix (the total number of local accesses) is exactly what it was before the swap.
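This invariance can be checked directly. The sketch below (Python, purely illustrative) applies every possible block permutation to the PageRank counts from the tables above and confirms that the total number of local accesses never changes when rows and columns move together:

```python
# Sketch: when the remapped array also drives the forall, swapping blocks
# permutes the rows AND the columns of the counts matrix by the same
# permutation. The diagonal sum (total local accesses) is invariant under
# such a symmetric permutation, so no swap can help.
from itertools import permutations

# Access counts from the PageRank example above.
counts = [
    [135734, 134711, 135734, 133687],
    [134711,   4092, 133688,   3068],
    [135734, 133688,      0,   1536],
    [133687,   3068,   1536,   3068],
]

def local_accesses(perm):
    # perm[i] = block placed on locale i. Rows and columns move together,
    # so locale i's local count is the original diagonal entry for perm[i].
    return sum(counts[perm[i]][perm[i]] for i in range(len(perm)))

identity = (0, 1, 2, 3)
assert all(local_accesses(p) == local_accesses(identity)
           for p in permutations(range(4)))
print(local_accesses(identity))  # 142894 local accesses, no matter the swap
```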

Understanding the Implications of Swapping Blocks When the Array is NOT Iterated Over by the forall Loop

When the array we are optimizing is NOT iterated over by the forall loop, the problem is simpler. In this case, we only need to swap columns, and we can work towards reducing remote accesses.

The diagonal of the counts matrix represents the local accesses. We want these to be as large as possible. We can swap columns and use something like the Hungarian algorithm to maximize the trace of the matrix.
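As a sketch of this assignment step (Python, illustrative only), the snippet below brute-forces the column permutation that maximizes the trace; a real implementation could use the Hungarian algorithm instead, but for a handful of locales exhaustive search shows the idea. It reuses the PageRank counts above purely as example input:

```python
# Sketch: when only columns move (the array is NOT iterated by the forall),
# choosing which block lives on which locale is an assignment problem:
# maximize the trace of the counts matrix. Brute force stands in for the
# Hungarian algorithm here; the counts are example numbers only.
from itertools import permutations

counts = [
    [135734, 134711, 135734, 133687],
    [134711,   4092, 133688,   3068],
    [135734, 133688,      0,   1536],
    [133687,   3068,   1536,   3068],
]

def local_accesses(assignment):
    # assignment[i] = block placed on locale i. Only the columns are
    # permuted, because the forall iterations (the rows) stay put.
    return sum(counts[i][assignment[i]] for i in range(len(assignment)))

best = max(permutations(range(len(counts))), key=local_accesses)
print(best, local_accesses(best))  # (3, 2, 1, 0) with 534750 local accesses
```

Here, unlike the forall-iterated case, the assignment lifts the total local accesses well above the original diagonal sum of 142894.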

Conjugate Gradient with Other Matrices?

The problem with NAS-CG is that the sparsity pattern of the matrix generated by the benchmark is uniformly random, so there is no chance of block swapping having any effect. The HPCG benchmark uses 27-point stencils, so that does not help either. I looked at other real-world matrices and nothing stands out. So I would say that NAS-CG does not have any potential wins here. But it will be a good use case for showing that we will not apply the optimization when there are no obvious gains.

thomasrolinger commented 2 years ago

Thus far, even with the SoA approach, I can't seem to get performance gains with block swapping. While we do seem to be increasing the number of local reads, the communication diagnostics are not reflecting as large a change.

I'm not sure whether this coarse-grained data remapping will provide enough of a benefit.

Another avenue is fine-grained data remapping: for a given A[i], put it on the locale that accesses it the most. This requires a more invasive inspector, similar to what we do for replication (i.e., counting how many times a given element is accessed from each locale). It will also require an indirection array to apply the remapping, and that would apply not only in the kernel but anywhere else in the program.
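The overall shape of that idea can be sketched as follows (Python, illustrative only; the locale counts, array contents, and helper names are all hypothetical, not from the actual inspector/executor implementation):

```python
# Sketch of fine-grained remapping: place each element of A on the locale
# that accesses it most, then route every access through an indirection
# array. All data and names here are hypothetical.

num_locales = 2
# access_counts[loc][i] = how often locale `loc` accesses element i of A
access_counts = [
    [9, 1, 8, 0],   # locale 0
    [2, 7, 3, 5],   # locale 1
]
A = [10.0, 20.0, 30.0, 40.0]

# 1. Inspector: pick an owner locale per element (argmax over locales).
owner = [max(range(num_locales), key=lambda loc: access_counts[loc][i])
         for i in range(len(A))]

# 2. Build the remapped layout (elements grouped by owner) plus an
#    indirection array mapping old index -> new index.
order = sorted(range(len(A)), key=lambda i: owner[i])
indirection = [0] * len(A)
for new_i, old_i in enumerate(order):
    indirection[old_i] = new_i
A_remapped = [A[old_i] for old_i in order]

# 3. Executor: every access A[B[i]] becomes A_remapped[indirection[B[i]]],
#    in the kernel and everywhere else A is indexed.
B = [2, 0, 3, 1]
gathered = [A_remapped[indirection[j]] for j in B]
# gathered matches [A[j] for j in B]; only the physical layout changed.
```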

If we go down this path, we should first do it by hand to see whether there are any gains.