Closed npadmana closed 5 years ago
@ronawho -- FYI.
Nice!
We should be able to auto-transition by changing `if(doPlaneYZ)` to `if(xRange.size >= here.maxTaskPar)`.
The code growth is a little unfortunate, but I think there's a huge benefit from giving FFTW bigger chunks to work on. I'm guessing we'll need to do something similar for the X transpose too, and end up with a pencil-master hybrid after all.
Unfortunately, a simple auto-transition doesn’t work. For instance, at 64 nodes for E, there are 32 YZ planes per node, which is less than the total number of cores. But you still win by doing this using planes. That’s why I went to a manual switch.
I agree that we could likely win a little more by doing something similar with x.
Ah I see. Maybe the check would need to be 1/2 or 1/4 maxTaskPar. I'll do some runs to figure out if we can find a reasonable threshold.
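To make the threshold discussion concrete, here is a minimal plain-Python sketch of the switching rule (not the project's Chapel code). The function name `use_plane_fft`, its parameters, and the 44-tasks-per-node figure are hypothetical and for illustration only. The thread says only that 32 planes per node is fewer than the number of cores.

```python
def use_plane_fft(planes_per_node: int, max_task_par: int, divisor: int = 1) -> bool:
    """Return True to transform whole YZ planes, False to fall back to pencils.

    Mirrors a check of the form `xRange.size >= here.maxTaskPar / divisor`,
    using integer division as the Chapel expression would.
    """
    return planes_per_node >= max_task_par // divisor


# Case E on 64 nodes: 32 YZ planes per node (from the thread).
# ASSUMPTION: 44 tasks per node is a made-up value, chosen only to be > 32.
planes, tasks = 32, 44

for divisor in (1, 2, 4):
    choice = "planes" if use_plane_fft(planes, tasks, divisor) else "pencils"
    print(f"threshold maxTaskPar/{divisor}: {choice}")
```

Under these assumed numbers the plain `maxTaskPar` check picks pencils, even though planes were reported to win in that configuration, while the halved and quartered thresholds both pick planes. Where the right cutoff sits is an empirical question that only timing runs can answer.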
Runs with auto-switching at `here.maxTaskPar/2` (closes the gap with master pretty well, especially at larger problem sizes):
nodes | ref MPI | master | pencilFFT-2 | FFT-2-hybrid-yz |
---|---|---|---|---|
1 | 262.2s | 157.7s | 143.5s | 132.3s |
2 | 156.5s | 84.8s | 92.8s | 89.2s |
4 | 93.8s | 60.8s | 70.1s | 68.5s |
8 | 51.0s | 36.5s | 40.6s | 40.6s |
16 | 25.4s | 20.9s | 24.8s | 22.6s |
32 | 14.1s | 14.5s | 13.4s | 12.3s |
64 | 10.4s | 9.6s | 7.0s | 7.0s |
128 | 5.2s | 7.4s | 3.8s | 3.8s |
256 | 2.8s | 6.1s | 2.1s | 2.1s |
512 | 1.6s | 5.6s | 1.3s | 1.3s |

nodes | ref MPI | master | pencilFFT-2 | FFT-2-hybrid-yz |
---|---|---|---|---|
8 | 510.7s | 290.8s | 474.9s | 397.3s |
16 | 240.1s | 163.1s | 245.5s | 213.6s |
32 | 115.8s | 99.3s | 144.0s | 120.6s |
64 | 59.0s | 63.6s | 97.0s | 73.5s |
128 | 51.8s | 42.5s | 50.7s | 50.7s |
256 | 25.7s | 29.9s | 28.2s | 28.1s |
512 | 12.6s | 25.0s | 14.2s | 14.2s |

nodes | ref MPI | master | pencilFFT-2 | FFT-2-hybrid-yz |
---|---|---|---|---|
64 | 974.0s | OOM | 623.2s | 539.0s |
128 | 457.0s | 316.9s | 436.3s | 333.8s |
256 | 266.7s | 193.2s | 243.8s | 242.2s |
512 | 133.7s | 138.1s | 123.1s | 122.8s |
Closing in favor of #27, which has all of this functionality.
With the YZ transforms, there are two possible strategies. We can do each YZ plane as a 2D FFT (this is the strategy adopted in `master`). This reduces the number of FFT calls and allows FFTW to use a more efficient transform. However, in the limit where the number of YZ planes is much smaller than the number of cores, it limits the available parallelism. In this case, a better strategy is to do each line in YZ separately.

`pencilFFT-2-hybrid-YZ` introduces a switch, `--doPlaneYZ`, to select between the two strategies. The YZ-plane strategy might be better except in cases where we're pretty underutilized.

Tracking this on a branch, `pencilFFT-2-hybrid-YZ`. Times are for E on 64 nodes on swan (incomplete, couldn't get 64 nodes instantly).

`--doPlaneYZ`
Case E, 32 nodes, swan --
--doPlaneYZ=false
--doPlaneYZ=true
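
The two YZ strategies described above compute the same result and differ only in how the work is divided. The plain-Python sketch below illustrates this with a naive DFT standing in for FFTW. The function names (`dft`, `plane_fft`, `pencil_fft`) and the tiny 3x4 plane are hypothetical, and none of this is the branch's actual Chapel code.

```python
import cmath


def dft(x):
    """Naive 1D DFT of a sequence (stand-in for a 1D FFTW call)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def plane_fft(plane):
    """Plane strategy: one 2D transform per YZ plane (a single work unit)."""
    ny, nz = len(plane), len(plane[0])
    return [[sum(plane[y][z] * cmath.exp(-2j * cmath.pi * (ky * y / ny + kz * z / nz))
                 for y in range(ny) for z in range(nz))
             for kz in range(nz)]
            for ky in range(ny)]


def pencil_fft(plane):
    """Pencil strategy: independent 1D transforms along Z, then along Y."""
    # ny independent lines along Z; each could run as its own task.
    along_z = [dft(row) for row in plane]
    # nz independent lines along Y (columns of the intermediate result).
    along_y = [dft(list(col)) for col in zip(*along_z)]
    # Transpose back to [y][z] layout.
    return [list(row) for row in zip(*along_y)]


# Small made-up plane: 3 points in Y, 4 in Z.
plane = [[complex(3 * y + z, y - z) for z in range(4)] for y in range(3)]
a, b = plane_fft(plane), pencil_fft(plane)
max_diff = max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(4))
print("max difference between strategies:", max_diff)
```

The plane version exposes one work unit per plane, while the pencil version exposes `ny` and then `nz` independent lines per plane. That is why pencils can keep more cores busy when a node holds few planes, at the cost of many more, smaller transform calls.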