npadmana / DistributedFFT


Improve communication for doFFT_X() #4

Closed (ronawho closed this issue 4 years ago)

ronawho commented 5 years ago

doFFT_X() has a lot of communication here:

https://github.com/npadmana/DistributedFFT/blob/5ba696e5bc364e350518808d22af79611d2c76ee/src/DistributedFFT.chpl#L205-L212

Our bulk transfer implementation has a fair number of GETs to acquire metadata, which we might be able to improve by using:


The current block bulk transfer code also turns something like:

myplane = arr[{xRange,j..j,zRange}];

into something roughly like:

coforall loc in owningLocales(arr[{xRange, j..j, zRange}]) do on loc {
  myplane[computeChunk()] = arr[{xRange, j..j, zRange}].localSection();
}

(Basically, it takes something that looks like it should be a single GET and turns it into a coforall+on plus PUTs.) I wonder if all that remote task creation ends up overwhelming nodes with tasks.


We've also been wondering if we can turn the PUT into a non-blocking operation so that we can overlap the PUT and the next iteration's GET.
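
To make the overlap idea concrete, here's a minimal double-buffering sketch at the user level. This is my illustration, not the DistributedFFT code: processPlane is a stand-in for the per-plane FFT work, the sizes are made up, and the declarations use the BlockDist syntax of that era.

use BlockDist;

config const n = 8;
const Space = {1..n, 1..n, 1..n};
const D = Space dmapped Block(boundingBox=Space);   // Block-distributed 3D array
var arr: [D] real;

const xRange = 1..n, zRange = 1..n;
// Two local plane buffers so the prefetch and the compute/PUT can proceed independently.
var planeA, planeB: [{xRange, 1..1, zRange}] real;

// Stand-in for the per-plane FFT work; not the DistributedFFT routine.
proc processPlane(ref plane) {
  plane += 1.0;
}

planeA = arr[{xRange, 1..1, zRange}];                    // GET the first plane up front
for j in 1..n {
  sync {
    if j < n then
      begin planeB = arr[{xRange, j+1..j+1, zRange}];    // prefetch next plane (GET)
    processPlane(planeA);
    arr[{xRange, j..j, zRange}] = planeA;                // write this plane back (PUT)
  }
  planeA <=> planeB;   // next iteration works on the prefetched plane
}

The sync block waits for the prefetch task, so each iteration operates on a plane that was fetched while the previous one was being processed and written back, which is roughly the overlap we're after.
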

ronawho commented 5 years ago

Here's a diff against Chapel that turns the coforall+on plus PUTs into just GETs:

diff --git a/modules/dists/BlockDist.chpl b/modules/dists/BlockDist.chpl
index 00dc22a9..206e49d4 100644
--- a/modules/dists/BlockDist.chpl
+++ b/modules/dists/BlockDist.chpl
@@ -1489,7 +1489,6 @@ where !disableBlockDistBulkTransfer {
     writeln("In BlockDist.doiBulkTransferToKnown(DefaultRectangular)");

   coforall j in dom.dist.activeTargetLocales(srcDom) {
-    on dom.dist.targetLocales(j) {
       const Src = if _privatization then chpl_getPrivatizedCopy(this.type, pid) else this;
       const inters = Src.dom.locDoms(j).myBlock[srcDom];

@@ -1500,7 +1499,6 @@ where !disableBlockDistBulkTransfer {

       const elemActual = Src.locArr[j].myElems._value;
       chpl__bulkTransferArray(Dest, destChunk, elemActual, inters);
-    }
   }

   return true;
@@ -1513,7 +1511,6 @@ where !disableBlockDistBulkTransfer {
     writeln("In BlockArr.doiBulkTransferFromKnown(DefaultRectangular)");

   coforall j in dom.dist.activeTargetLocales(destDom) {
-    on dom.dist.targetLocales(j) {
       // Grab privatized copy of 'this' to avoid extra GETs
       const Dest = if _privatization then chpl_getPrivatizedCopy(this.type, pid) else this;
       const inters = Dest.dom.locDoms(j).myBlock[destDom];
@@ -1526,7 +1523,6 @@ where !disableBlockDistBulkTransfer {

       const elemActual = Dest.locArr[j].myElems._value;
       chpl__bulkTransferArray(elemActual, inters, Src, srcChunk);
-    }
   }

   return true;

This takes an 8-node size-D run from 66s to 60s. I tried making the coforall a serial loop, figuring that all cores are already running this code, but for some reason that has a ~2x perf hit. It'd be interesting to see if we can make chpl__bulkTransferArray non-blocking and evaluate the performance of that.
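
For reference, at the user level the effect of the diff above is roughly the sketch below: stay on the initiating locale and pull each locale's piece of the slice directly, one bulk GET per owning locale, with no remote task creation. This is my illustration rather than the BlockDist internals, written against the standard targetLocales()/localSubdomain() queries with made-up sizes:

use BlockDist;

config const n = 8;
const Space = {1..n, 1..n, 1..n};
const D = Space dmapped Block(boundingBox=Space);
var arr: [D] real;

const j = 1;
const xRange = 1..n, zRange = 1..n;
var myplane: [{xRange, j..j, zRange}] real;

// One bulk GET per owning locale; the loop stays on the initiating locale.
for loc in arr.targetLocales() {
  const chunk = arr.localSubdomain(loc)[{xRange, j..j, zRange}];
  if chunk.size > 0 then
    myplane[chunk] = arr[chunk];   // direct GET of this locale's piece of the slice
}
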

ronawho commented 5 years ago

The serial loop ran slower because all locales are doing GETs from other locales in lockstep (i.e. everybody is getting from locale 0, then 1, then 2, etc.). We can do some offsetting to improve that with something like:

diff --git a/modules/dists/BlockDist.chpl b/modules/dists/BlockDist.chpl
index 19c548bf89..28a59cd29e 100644
--- a/modules/dists/BlockDist.chpl
+++ b/modules/dists/BlockDist.chpl
@@ -1489,19 +1489,20 @@ where !disableBlockDistBulkTransfer {
   if debugBlockDistBulkTransfer then
     writeln("In BlockDist.doiBulkTransferToKnown(DefaultRectangular)");

-  coforall j in dom.dist.activeTargetLocales(srcDom) {
-    on dom.dist.targetLocales(j) {
+  var activeLocs = dom.dist.activeTargetLocales(srcDom);
+
+  for j in activeLocs.domain {
+      const j_offset = ((j + here.id) % activeLocs.domain.size) + activeLocs.domain.low;
       const Src = if _privatization then chpl_getPrivatizedCopy(this.type, pid) else this;
-      const inters = Src.dom.locDoms(j).myBlock[srcDom];
+      const inters = Src.dom.locDoms(activeLocs[j_offset]).myBlock[srcDom];

       const destChunk = bulkCommTranslateDomain(inters, srcDom, destDom);

       if debugBlockDistBulkTransfer then
         writeln("  A[",destChunk,"] = B[",inters,"]");

-      const elemActual = Src.locArr[j].myElems._value;
+      const elemActual = Src.locArr[activeLocs[j_offset]].myElems._value;
       chpl__bulkTransferArray(Dest, destChunk, elemActual, inters);
-    }
   }

   return true;

With that I see timings equivalent to what's on master with the coforall+on pattern.
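
In isolation, the staggering idea is just a rotated visiting order per locale, something like the sketch below (my illustration; only the built-in Locales, numLocales, and here.id are used, and the transfer itself is elided):

// Each locale walks the peer list starting at its own id, so at any given moment
// the GETs are spread across different source locales instead of every locale
// hitting locale 0, then 1, then 2 in lockstep.
coforall loc in Locales do on loc {
  const nLocs = numLocales;
  for off in 0..nLocs-1 {
    const src = (here.id + off) % nLocs;   // rotated visiting order per locale
    // ... GET this locale's chunk from Locales[src] here ...
  }
}
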

npadmana commented 5 years ago

I'm going to hold off on making these changes on my version until you tell me that we're ready for them.

ronawho commented 5 years ago

Agreed, I don't think there's anything for you to do here -- just letting you know what we've been looking into.

ronawho commented 4 years ago

Closing -- this is way out of date, and comm has since been optimized.