npadmana / DistributedFFT


Explore a hybrid YZ strategy #24

Closed npadmana closed 5 years ago

npadmana commented 5 years ago

With the YZ transforms, there are two possible strategies. We can do each YZ plane as a 2D FFT (this is the strategy adopted in master). This reduces the number of FFT calls and allows FFTW to use a more efficient transform. However, in the limit that the number of YZ planes << number of cores, it limits the available parallelism. In this case, a better strategy is to do each line in YZ separately.
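The two strategies compute the same result and differ only in how the work is chunked. A minimal sketch of that equivalence, assuming a naive pure-Python DFT in place of FFTW (the function names here are hypothetical and are not taken from the repository):

```python
import cmath

def dft_1d(xs):
    """Naive O(n^2) discrete Fourier transform of one line."""
    n = len(xs)
    return [sum(xs[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def dft_plane_by_lines(plane):
    """Line strategy: transform every z line, then every y line, separately."""
    rows = [dft_1d(row) for row in plane]              # 1D transforms along z
    cols = [dft_1d(list(col)) for col in zip(*rows)]   # 1D transforms along y
    return [list(row) for row in zip(*cols)]           # transpose back to [y][z]

def dft_plane_direct(plane):
    """Plane strategy: one call covering the whole YZ plane (2D DFT definition)."""
    ny, nz = len(plane), len(plane[0])
    return [[sum(plane[y][z]
                 * cmath.exp(-2j * cmath.pi * (ky * y / ny + kz * z / nz))
                 for y in range(ny) for z in range(nz))
             for kz in range(nz)]
            for ky in range(ny)]
```

In the line version each 1D transform is an independent task, so there are many more units of work to spread across cores. In the plane version there is one unit of work per plane, which is what lets a real FFT library plan a more efficient transform.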

The pencilFFT-2-hybrid-YZ branch introduces a --doPlaneYZ flag to switch between the two strategies. The plane strategy is probably the better choice except when the cores are badly underutilized.


Tracking this on the pencilFFT-2-hybrid-YZ branch. Times are for size E on 64 nodes on swan (incomplete; I couldn't get 64 nodes right away).

| commit | time | time for each yz | comment |
|---|---|---|---|
| 6ef4c6220ec23dd | 115s | 1.1s | pencilFFT-2 base, with individual timing |
| 5e46bb746f27a2979 | | 0.31s | run with --doPlaneYZ |

Case E, 32 nodes, swan:

| commit | time | time for each yz | comment |
|---|---|---|---|
| 5e46bb746f27a2979 | 172.81s | 1.34s | --doPlaneYZ=false |
| 5e46bb746f27a2979 | 153.2s | 0.57s | --doPlaneYZ=true |

npadmana commented 5 years ago

@ronawho -- FYI.

ronawho commented 5 years ago

Nice!

We should be able to auto transition by changing if(doPlaneYZ) to if(xRange.size >= here.maxTaskPar)

ronawho commented 5 years ago

The code growth is a little unfortunate, but I think there's huge benefit from giving fftw bigger chunks to work on. I'm guessing we'll need to do something similar for the X transpose too, and end up with a pencil-master hybrid after all.

npadmana commented 5 years ago

Unfortunately, a simple auto-transition doesn’t work. For instance, at 64 nodes for E, there are 32 YZ planes per node, which is less than the total number of cores. But you still win by doing this using planes. That’s why I went to a manual switch.

I agree that we could likely win a little more by doing something similar with x.



ronawho commented 5 years ago

Ah I see. Maybe the check would need to be 1/2 or 1/4 maxTaskPar. I'll do some runs to figure out if we can find a reasonable threshold.
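A minimal sketch of the thresholded check being proposed, written in Python rather than Chapel. The function name and the `divisor` parameter are hypothetical; `num_planes` stands in for `xRange.size` and `max_task_par` for `here.maxTaskPar`:

```python
def use_plane_strategy(num_planes, max_task_par, divisor=1):
    """Return True to transform whole YZ planes, False to go line by line.

    divisor=1 is the plain `xRange.size >= here.maxTaskPar` check;
    divisor=2 or 4 lowers the threshold to 1/2 or 1/4 of maxTaskPar.
    """
    return num_planes >= max_task_par // divisor
```

As an illustration, take 32 planes per node (the size E, 64 node case above) and a made-up core count of 48: `use_plane_strategy(32, 48)` picks lines, while `use_plane_strategy(32, 48, divisor=2)` picks planes, which is the behavior the manual switch showed to be faster. The right divisor is an empirical question.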

ronawho commented 5 years ago

Runs with auto-switching at here.maxTaskPar/2 (closes the gap with master pretty well, especially at larger problem sizes):

Size D:

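| nodes | ref MPI | master | pencilFFT-2 | FFT-2-hybrid-yz |
|---|---|---|---|---|
| 1 | 262.2s | 157.7s | 143.5s | 132.3s |
| 2 | 156.5s | 84.8s | 92.8s | 89.2s |
| 4 | 93.8s | 60.8s | 70.1s | 68.5s |
| 8 | 51.0s | 36.5s | 40.6s | 40.6s |
| 16 | 25.4s | 20.9s | 24.8s | 22.6s |
| 32 | 14.1s | 14.5s | 13.4s | 12.3s |
| 64 | 10.4s | 9.6s | 7.0s | 7.0s |
| 128 | 5.2s | 7.4s | 3.8s | 3.8s |
| 256 | 2.8s | 6.1s | 2.1s | 2.1s |
| 512 | 1.6s | 5.6s | 1.3s | 1.3s |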

Size E:

| nodes | ref MPI | master | pencilFFT-2 | FFT-2-hybrid-yz |
|---|---|---|---|---|
| 8 | 510.7s | 290.8s | 474.9s | 397.3s |
| 16 | 240.1s | 163.1s | 245.5s | 213.6s |
| 32 | 115.8s | 99.3s | 144.0s | 120.6s |
| 64 | 59.0s | 63.6s | 97.0s | 73.5s |
| 128 | 51.8s | 42.5s | 50.7s | 50.7s |
| 256 | 25.7s | 29.9s | 28.2s | 28.1s |
| 512 | 12.6s | 25.0s | 14.2s | 14.2s |

Size F:

| nodes | ref MPI | master | pencilFFT-2 | FFT-2-hybrid-yz |
|---|---|---|---|---|
| 64 | 974.0s | OOM | 623.2s | 539.0s |
| 128 | 457.0s | 316.9s | 436.3s | 333.8s |
| 256 | 266.7s | 193.2s | 243.8s | 242.2s |
| 512 | 133.7s | 138.1s | 123.1s | 122.8s |
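
To put numbers on how much the hybrid recovers relative to pencilFFT-2, here is a small sketch that computes the percentage reduction in run time for the Size E rows. The timings are copied from the Size E table; the helper name is hypothetical:

```python
# Size E timings in seconds: nodes -> (pencilFFT-2, FFT-2-hybrid-yz)
SIZE_E = {
    8: (474.9, 397.3),
    16: (245.5, 213.6),
    32: (144.0, 120.6),
    64: (97.0, 73.5),
    128: (50.7, 50.7),
    256: (28.2, 28.1),
    512: (14.2, 14.2),
}

def percent_reduction(before, after):
    """Percentage of run time saved going from `before` to `after`."""
    return 100.0 * (before - after) / before

for nodes, (pencil, hybrid) in SIZE_E.items():
    print(f"{nodes:4d} nodes: {percent_reduction(pencil, hybrid):5.1f}% less time")
```

The saving is largest at the lower node counts and drops to roughly zero from 128 nodes up, where the two columns are essentially identical. That would be consistent with the auto-switch falling back to the line strategy once each node holds few planes, though the thread does not state this explicitly.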
npadmana commented 5 years ago

Closing in favor of #27, which has all of this functionality.