npadmana commented 5 years ago

With the YZ transforms, there are two possible strategies. We can do each YZ plane as a 2D FFT (this is the strategy adopted in master). This reduces the number of FFT calls and allows FFTW to use a more efficient transform. However, in the limit that the number of YZ planes << number of cores, it limits the available parallelism. In this case, a better strategy is to do each line in YZ separately.
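The trade-off between the two strategies can be illustrated with a small pure-Python sketch. This is not the Chapel code on the branch: the naive `dft` stands in for an FFTW call, and every function name here is hypothetical. It only shows that both strategies produce the same result while exposing very different numbers of independent work units.

```python
import cmath

def dft(line):
    """Naive O(n^2) 1D DFT; a stand-in for one FFTW line transform."""
    n = len(line)
    return [sum(line[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def transform_plane(plane):
    """Transform one whole YZ plane (indexed [y][z]) as a single unit of work."""
    rows = [dft(row) for row in plane]             # along Z
    cols = [dft(list(col)) for col in zip(*rows)]  # along Y
    return [list(row) for row in zip(*cols)]       # back to [y][z]

def yz_by_planes(slab):
    """Plane strategy: one independent task per YZ plane."""
    return [transform_plane(p) for p in slab], len(slab)

def yz_by_lines(slab):
    """Line strategy: every line is its own task, in two passes."""
    nx, ny, nz = len(slab), len(slab[0]), len(slab[0][0])
    # Pass 1: nx * ny independent Z lines.
    zpass = [[dft(slab[x][y]) for y in range(ny)] for x in range(nx)]
    # Pass 2: nx * nz independent Y lines.
    out = [[[0j] * nz for _ in range(ny)] for _ in range(nx)]
    for x in range(nx):
        for z in range(nz):
            col = dft([zpass[x][y][z] for y in range(ny)])
            for y in range(ny):
                out[x][y][z] = col[y]
    return out, (nx * ny, nx * nz)

# Tiny local slab: 2 planes of 3 x 4 points.
slab = [[[float(x * 12 + y * 4 + z) for z in range(4)] for y in range(3)]
        for x in range(2)]
by_planes, plane_units = yz_by_planes(slab)
by_lines, line_units = yz_by_lines(slab)
max_diff = max(abs(by_planes[x][y][z] - by_lines[x][y][z])
               for x in range(2) for y in range(3) for z in range(4))
```

Here the plane strategy has only 2 units of work to hand out, while the line strategy has 6 and then 8, which is the extra parallelism that matters when planes are scarce relative to cores.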

pencilFFT-2-hybrid-YZ introduces a switch `--doPlaneYZ` to select between the two strategies. The YZ-plane strategy might be better except in cases where the cores are badly underutilized.

Tracking this on the branch pencilFFT-2-hybrid-YZ. Times are for size E on 64 nodes on swan (incomplete; I couldn't get 64 nodes right away).

| commit | time | time for each yz | comment |
|---|---|---|---|
| 6ef4c6220ec23dd | 115s | 1.1s | pencilFFT-2 base, with individual timing |
| 5e46bb746f27a2979 | | 0.31s | run with `--doPlaneYZ` |

Case E, 32 nodes, swan --

| commit | time | time for each yz | comment |
|---|---|---|---|
| 5e46bb746f27a2979 | 172.81s | 1.34s | `--doPlaneYZ=false` |
| 5e46bb746f27a2979 | 153.2s | 0.57s | `--doPlaneYZ=true` |
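For a sense of scale, the 32-node numbers above can be turned into relative gains. This is plain arithmetic on the two rows of the table, written as a short Python snippet; the variable names are only illustrative.

```python
# Timings copied from the 32-node, size E table (seconds).
total_lines, total_planes = 172.81, 153.2   # --doPlaneYZ=false / true
yz_lines, yz_planes = 1.34, 0.57            # time for each yz

total_saving = (total_lines - total_planes) / total_lines
yz_speedup = yz_lines / yz_planes

print(f"total time reduced by {total_saving:.1%}")   # about 11.3%
print(f"each YZ step is {yz_speedup:.2f}x faster")   # about 2.35x
```

The YZ step itself speeds up by more than a factor of two, but the total run time drops by only about a tenth, so the YZ transforms are evidently just one part of the overall cost.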

npadmana commented 5 years ago

@ronawho -- FYI.

ronawho commented 5 years ago

Nice!

We should be able to auto transition by changing `if(doPlaneYZ)` to `if(xRange.size >= here.maxTaskPar)`

ronawho commented 5 years ago

The code growth is a little unfortunate, but I think there's a huge benefit from giving FFTW bigger chunks to work on. I'm guessing we'll need to do something similar for the X transpose too, and end up with a pencil-master hybrid after all.

npadmana commented 5 years ago

Unfortunately, a simple auto-transition doesn’t work. For instance, at 64 nodes for E, there are 32 YZ planes per node, which is less than the total number of cores. But you still win by doing this using planes. That’s why I went to a manual switch.

I agree that we could likely win a little more by doing something similar with x.

On Sat, Sep 28, 2019 at 6:58 AM Elliot Ronaghan notifications@github.com wrote:

> Nice!
>
> We should be able to auto transition by changing `if(doPlaneYZ)` to `if(xRange.size >= here.maxTaskPar)`


ronawho commented 5 years ago

Ah, I see. Maybe the check would need to be 1/2 or 1/4 of `maxTaskPar`. I'll do some runs to figure out if we can find a reasonable threshold.
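A rough sketch of what such a thresholded check might look like, written in Python for illustration rather than in Chapel. `n_local_planes` plays the role of `xRange.size` and `max_task_par` mirrors `here.maxTaskPar`; the function name, the `fraction` parameter, and the core count of 44 are hypothetical, chosen only so that the 32-planes-per-node case from this thread can be exercised.

```python
def use_plane_yz(n_local_planes, max_task_par, fraction=0.5):
    """Return True to transform whole YZ planes, False to go line by line.

    fraction=1.0 corresponds to the plain `xRange.size >= here.maxTaskPar`
    check; smaller values keep the plane strategy with fewer planes.
    """
    return n_local_planes >= fraction * max_task_par

# 32 planes per node is the 64-node, size E case mentioned above;
# 44 tasks per node is a made-up value for illustration only.
plain_check = use_plane_yz(32, 44, fraction=1.0)    # falls back to lines
half_check = use_plane_yz(32, 44, fraction=0.5)     # keeps planes
quarter_check = use_plane_yz(32, 44, fraction=0.25) # keeps planes
```

With the plain check this case falls back to lines even though planes were measured to be faster, while either reduced threshold keeps it on the plane path. Which fraction is actually right is an empirical question for the runs.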

ronawho commented 5 years ago

Runs with auto-switching at `here.maxTaskPar/2` (closes the gap with master pretty well, especially at larger problem sizes):

Size D:

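| nodes | ref MPI | master | pencilFFT-2 | FFT-2-hybrid-yz |
|---|---|---|---|---|
| 1 | 262.2s | 157.7s | 143.5s | 132.3s |
| 2 | 156.5s | 84.8s | 92.8s | 89.2s |
| 4 | 93.8s | 60.8s | 70.1s | 68.5s |
| 8 | 51.0s | 36.5s | 40.6s | 40.6s |
| 16 | 25.4s | 20.9s | 24.8s | 22.6s |
| 32 | 14.1s | 14.5s | 13.4s | 12.3s |
| 64 | 10.4s | 9.6s | 7.0s | 7.0s |
| 128 | 5.2s | 7.4s | 3.8s | 3.8s |
| 256 | 2.8s | 6.1s | 2.1s | 2.1s |
| 512 | 1.6s | 5.6s | 1.3s | 1.3s |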

Size E:

| nodes | ref MPI | master | pencilFFT-2 | FFT-2-hybrid-yz |
|---|---|---|---|---|
| 8 | 510.7s | 290.8s | 474.9s | 397.3s |
| 16 | 240.1s | 163.1s | 245.5s | 213.6s |
| 32 | 115.8s | 99.3s | 144.0s | 120.6s |
| 64 | 59.0s | 63.6s | 97.0s | 73.5s |
| 128 | 51.8s | 42.5s | 50.7s | 50.7s |
| 256 | 25.7s | 29.9s | 28.2s | 28.1s |
| 512 | 12.6s | 25.0s | 14.2s | 14.2s |

Size F:

| nodes | ref MPI | master | pencilFFT-2 | FFT-2-hybrid-yz |
|---|---|---|---|---|
| 64 | 974.0s | OOM | 623.2s | 539.0s |
| 128 | 457.0s | 316.9s | 436.3s | 333.8s |
| 256 | 266.7s | 193.2s | 243.8s | 242.2s |
| 512 | 133.7s | 138.1s | 123.1s | 122.8s |

npadmana commented 5 years ago

Closing in favor of #27, which has all of this functionality.

npadmana / DistributedFFT

Explore a hybrid YZ strategy #24
