npadmana commented 5 years ago

@ronawho -- could we run the full timing suite for pencilFFT-2 for D,E,F to update figures on the abstract?

ronawho commented 5 years ago

Yup, already working on it.

ronawho commented 5 years ago

Results for ref vs. master(bf40d24) vs. pencilFFT-1(e9047e7) vs. pencilFFT-2(834b180):

Size D:

nodes	ref MPI	master	pencilFFT-1	pencilFFT-2
1	262.2s	157.7s	159.1s	143.5s
2	156.5s	84.8s	97.9s	92.8s
4	93.8s	60.8s	74.2s	70.1s
8	51.0s	36.5s	45.6s	40.6s
16	25.4s	20.9s	24.6s	24.8s
32	14.1s	14.5s	14.2s	13.4s
64	10.4s	9.6s	7.5s	7.0s
128	5.2s	7.4s	4.3s	3.8s
256	2.8s	6.1s	2.4s	2.1s
512	1.6s	5.6s	1.4s	1.3s

pencilFFT-2 looks great for size D. It always beats the reference, and only loses to master by a little (<20%) for 2-16 nodes.
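
For anyone wanting to check the "<20%" figure, here is a minimal sketch that recomputes the relative gap between pencilFFT-2 and master from the size D numbers above. The timings are copied from the table; the variable names and the `relative_gap` helper are made up for illustration and are not part of the benchmark scripts.

```python
# Size D timings in seconds, copied from the table above.
nodes = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
master = [157.7, 84.8, 60.8, 36.5, 20.9, 14.5, 9.6, 7.4, 6.1, 5.6]
pencil2 = [143.5, 92.8, 70.1, 40.6, 24.8, 13.4, 7.0, 3.8, 2.1, 1.3]


def relative_gap(candidate, baseline):
    """Fractional slowdown of candidate vs. baseline (positive means slower)."""
    return candidate / baseline - 1.0


# Map node count -> gap of pencilFFT-2 relative to master.
gaps = {n: relative_gap(p, m) for n, m, p in zip(nodes, master, pencil2)}

for n, gap in gaps.items():
    print(f"{n:4d} nodes: pencilFFT-2 is {gap:+.1%} relative to master")
```

Positive values mean pencilFFT-2 is slower than master at that node count; the same helper can be pointed at the `ref MPI` column to compare against the reference instead.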

Size E:

nodes	ref MPI	master	pencilFFT-1	pencilFFT-2
8	510.7s	290.8s	482.3s	474.9s
16	240.1s	163.1s	257.0s	245.5s
32	115.8s	99.3s	140.3s	144.0s
64	59.0s	63.6s	93.2s	97.0s
128	51.8s	42.5s	49.6s	50.7s
256	25.7s	29.9s	26.6s	28.2s
512	12.6s	25.0s	13.6s	14.2s

Unfortunately, pencilFFT-2 is not as great for size E. It's behind the reference for 16-64 nodes (and slightly behind again at 256 and 512 nodes), and loses to master at <= 128 nodes.

Size F:

nodes	ref MPI	master	pencilFFT-1	pencilFFT-2
64	974.0s	OOM	OOM	623.2s
128	457.0s	316.9s	427.5s	436.3s
256	266.7s	193.2s	225.1s	243.8s
512	133.7s	138.1s	121.1s	123.1s

pencilFFT-2 successfully runs for size F with 64 nodes and always beats the reference, but is pretty far behind master for 128/256 nodes. It is also slower than pencilFFT-1 at 128-512 nodes.

I will generate graphs for these results later tonight

I'm not sure when I'll have time to dig into the size E/F results

npadmana commented 5 years ago

Thanks! These are very interesting....

The OOM for pencilFFT-1 is not surprising, since in this limit, it looks a lot like master (in that it needs a number of local planes).

Will you have access to the machine on Monday to run a possible last round of timings?

ronawho commented 5 years ago

Yup, I should have some access Sat/Sun and Monday looks pretty open

ronawho commented 5 years ago

I was curious about how much slower each section is in pencilFFT-2 compared to master. Here are some possibly interesting timings for size E on 64 nodes on swan (I just printed out the total yz and x times per iteration):

master

yz=0.30, x=1.74: Checksum(1) =  5.121601045346e+02 + 5.117395998266e+02i
yz=0.32, x=1.77: Checksum(2) =  5.120905403678e+02 + 5.118614716182e+02i
yz=0.35, x=1.74: Checksum(3) =  5.120623229306e+02 + 5.119074203747e+02i
yz=0.36, x=1.76: Checksum(4) =  5.120438418997e+02 + 5.119345900733e+02i
yz=0.39, x=1.77: Checksum(5) =  5.120311521872e+02 + 5.119551325550e+02i
yz=0.43, x=1.77: Checksum(6) =  5.120226088809e+02 + 5.119720179919e+02i
yz=0.42, x=1.76: Checksum(7) =  5.120169296534e+02 + 5.119861371665e+02i
yz=0.42, x=1.76: Checksum(8) =  5.120131225172e+02 + 5.119979364402e+02i
yz=0.44, x=1.75: Checksum(9) =  5.120104767108e+02 + 5.120077674092e+02i
yz=0.37, x=1.77: Checksum(10) =  5.120085127969e+02 + 5.120159443121e+02i
yz=0.45, x=1.74: Checksum(11) =  5.120069224127e+02 + 5.120227453670e+02i
yz=0.45, x=1.78: Checksum(12) =  5.120055158164e+02 + 5.120284096041e+02i
yz=0.44, x=1.75: Checksum(13) =  5.120041820159e+02 + 5.120331373793e+02i
yz=0.44, x=1.75: Checksum(14) =  5.120028605402e+02 + 5.120370938679e+02i
yz=0.47, x=1.75: Checksum(15) =  5.120015223011e+02 + 5.120404138831e+02i
yz=0.48, x=1.74: Checksum(16) =  5.120001570022e+02 + 5.120432068837e+02i
yz=0.52, x=1.78: Checksum(17) =  5.119987650555e+02 + 5.120455615860e+02i
yz=1.68, x=1.79: Checksum(18) =  5.119973525091e+02 + 5.120475499442e+02i
yz=1.62, x=1.77: Checksum(19) =  5.119959279472e+02 + 5.120492304629e+02i
yz=1.53, x=1.79: Checksum(20) =  5.119945006558e+02 + 5.120506508902e+02i
yz=1.58, x=1.76: Checksum(21) =  5.119930795911e+02 + 5.120518503782e+02i
yz=1.63, x=1.83: Checksum(22) =  5.119916728462e+02 + 5.120528612016e+02i
yz=1.52, x=1.77: Checksum(23) =  5.119902874185e+02 + 5.120537101195e+02i
yz=1.61, x=1.76: Checksum(24) =  5.119889291565e+02 + 5.120544194514e+02i
yz=1.63, x=1.77: Checksum(25) =  5.119876028049e+02 + 5.120550079284e+02i

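x always takes ~1.75 seconds, but curiously the yz transform gets slower over time: it creeps from 0.30s to 0.52s over the first 17 iterations, then jumps to ~1.6s from iteration 18 onward.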

pencilFFT-2

yz=1.10, x=2.97: Checksum(1) = 5.121601045346e+02 + 5.117395998266e+02i
yz=1.11, x=2.96: Checksum(2) = 5.120905403678e+02 + 5.118614716182e+02i
yz=1.10, x=3.01: Checksum(3) = 5.120623229306e+02 + 5.119074203747e+02i
yz=1.10, x=2.98: Checksum(4) = 5.120438418997e+02 + 5.119345900733e+02i
yz=1.10, x=3.01: Checksum(5) = 5.120311521872e+02 + 5.119551325550e+02i
yz=1.10, x=3.00: Checksum(6) = 5.120226088809e+02 + 5.119720179919e+02i
yz=1.10, x=3.00: Checksum(7) = 5.120169296534e+02 + 5.119861371665e+02i
yz=1.10, x=3.04: Checksum(8) = 5.120131225172e+02 + 5.119979364402e+02i
yz=1.40, x=2.95: Checksum(9) = 5.120104767108e+02 + 5.120077674092e+02i
yz=1.60, x=3.02: Checksum(10) = 5.120085127969e+02 + 5.120159443121e+02i
yz=1.58, x=3.05: Checksum(11) = 5.120069224127e+02 + 5.120227453670e+02i
yz=1.57, x=3.04: Checksum(12) = 5.120055158164e+02 + 5.120284096041e+02i
yz=1.53, x=2.96: Checksum(13) = 5.120041820159e+02 + 5.120331373793e+02i
yz=1.51, x=3.00: Checksum(14) = 5.120028605402e+02 + 5.120370938679e+02i
yz=1.54, x=3.02: Checksum(15) = 5.120015223011e+02 + 5.120404138831e+02i
yz=1.51, x=3.02: Checksum(16) = 5.120001570022e+02 + 5.120432068837e+02i
yz=1.51, x=3.02: Checksum(17) = 5.119987650555e+02 + 5.120455615860e+02i
yz=1.68, x=2.92: Checksum(18) = 5.119973525091e+02 + 5.120475499442e+02i
yz=1.68, x=3.04: Checksum(19) = 5.119959279472e+02 + 5.120492304629e+02i
yz=1.59, x=3.03: Checksum(20) = 5.119945006558e+02 + 5.120506508902e+02i
yz=1.64, x=3.01: Checksum(21) = 5.119930795911e+02 + 5.120518503782e+02i
yz=1.66, x=3.04: Checksum(22) = 5.119916728462e+02 + 5.120528612016e+02i
yz=1.57, x=3.04: Checksum(23) = 5.119902874185e+02 + 5.120537101195e+02i
yz=1.63, x=3.00: Checksum(24) = 5.119889291565e+02 + 5.120544194514e+02i
yz=1.60, x=2.96: Checksum(25) = 5.119876028049e+02 + 5.120550079284e+02i

x always takes ~3 seconds (1.25 seconds slower than master). yz starts off at 1.1s compared to 0.3s on master, but both grow to ~1.6s.
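
In case it helps with comparing runs, here is a rough sketch of how the per-iteration lines could be summarized automatically. It assumes the `yz=..., x=...: Checksum(N) = ...` print format pasted above; the regex, `parse_timings`, and `summarize` are hypothetical helpers, not anything that exists in the repo.

```python
import re
from statistics import mean

# Matches entries such as "yz=0.30, x=1.74: Checksum(1) = ..."
# (format assumed from the output pasted in this thread).
LINE_RE = re.compile(
    r"yz=(?P<yz>\d+\.\d+),\s*x=(?P<x>\d+\.\d+):\s*Checksum\((?P<it>\d+)\)"
)


def parse_timings(log_text):
    """Return a list of (iteration, yz_seconds, x_seconds) tuples."""
    return [
        (int(m["it"]), float(m["yz"]), float(m["x"]))
        for m in LINE_RE.finditer(log_text)
    ]


def summarize(rows):
    """Report where yz starts, where it ends, its peak, and the mean x time."""
    yz = [r[1] for r in rows]
    x = [r[2] for r in rows]
    return {
        "yz_first": yz[0],
        "yz_last": yz[-1],
        "yz_max": max(yz),
        "x_mean": mean(x),
    }


# Three lines taken from the master output above, as a small sample.
sample = """\
yz=0.30, x=1.74: Checksum(1) =  5.121601045346e+02 + 5.117395998266e+02i
yz=0.52, x=1.78: Checksum(17) =  5.119987650555e+02 + 5.120455615860e+02i
yz=1.68, x=1.79: Checksum(18) =  5.119973525091e+02 + 5.120475499442e+02i
"""

rows = parse_timings(sample)
print(summarize(rows))
```

Because it uses `finditer` rather than splitting on newlines, the same function should also cope with output where several iterations end up on one line.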

I'm not sure I understand why the yz transform seems to slow down, but maybe that's obvious to you?

I wonder if we should be changing our parallelism strategy for the YZ section when we're on a lower number of nodes

npadmana commented 5 years ago

Hm. I can't reproduce this behavior for pencilFFT-2. On swan, with 64 nodes and the E problem size, I see yz taking ~1.1 seconds, while x takes ~3 seconds (I ran it a few different times, and never got this slowdown).

But your larger point is a good one -- we might want two different YZ strategies. Let me try this....

ronawho commented 5 years ago

I've been doing full runs on Crystal with PrgEnv-intel, but I've been using PrgEnv-gnu on Swan for faster compiles while experimenting. It looks like the slower YZ is only occurring for PrgEnv-gnu. I see a very stable YZ=1.1, X=~2.8 for PrgEnv-intel.

Curious, but a curiosity for another day -- I'll switch to using intel on Swan for now.

npadmana commented 5 years ago

I'm closing this, since #25 captures the remaining mystery.

npadmana / DistributedFFT

Rerun full timing suite... #21
