npadmana / DistributedFFT


Rerun full timing suite... #21

Closed npadmana closed 5 years ago

npadmana commented 5 years ago

@ronawho -- could we run the full timing suite for pencilFFT-2 for D,E,F to update figures on the abstract?

ronawho commented 5 years ago

Yup, already working on it.

ronawho commented 5 years ago

Results for ref vs. master(bf40d24) vs. pencilFFT-1(e9047e7) vs. pencilFFT-2(834b180):

Size D:

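| nodes | ref MPI | master | pencilFFT-1 | pencilFFT-2 |
|-------|---------|--------|-------------|-------------|
| 1     | 262.2s  | 157.7s | 159.1s      | 143.5s      |
| 2     | 156.5s  | 84.8s  | 97.9s       | 92.8s       |
| 4     | 93.8s   | 60.8s  | 74.2s       | 70.1s       |
| 8     | 51.0s   | 36.5s  | 45.6s       | 40.6s       |
| 16    | 25.4s   | 20.9s  | 24.6s       | 24.8s       |
| 32    | 14.1s   | 14.5s  | 14.2s       | 13.4s       |
| 64    | 10.4s   | 9.6s   | 7.5s        | 7.0s        |
| 128   | 5.2s    | 7.4s   | 4.3s        | 3.8s        |
| 256   | 2.8s    | 6.1s   | 2.4s        | 2.1s        |
| 512   | 1.6s    | 5.6s   | 1.4s        | 1.3s        |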

pencilFFT-2 looks great for size D. It always beats the reference, and only loses to master by a little (<20%) for 2-16 nodes.
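As a quick check of the "<20%" figure, here is a minimal sketch that computes how far pencilFFT-2 trails master for size D at 2-16 nodes. The timings are copied from the table above; the dictionary and function names are illustrative only and are not part of the timing suite.

```python
# Size D timings in seconds for 2-16 nodes, copied from the table above.
master = {2: 84.8, 4: 60.8, 8: 36.5, 16: 20.9}
pencil2 = {2: 92.8, 4: 70.1, 8: 40.6, 16: 24.8}


def slowdown_pct(candidate, baseline):
    """Percent by which candidate is slower than baseline (positive = slower)."""
    return 100.0 * (candidate - baseline) / baseline


# Hypothetical helper output: node count -> percent slowdown vs. master.
slowdowns = {n: slowdown_pct(pencil2[n], master[n]) for n in master}

for n, pct in slowdowns.items():
    print(f"{n:>3} nodes: pencilFFT-2 is {pct:.1f}% slower than master")
```

The largest gap is at 16 nodes, and it stays under 20%.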

Size E:

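| nodes | ref MPI | master | pencilFFT-1 | pencilFFT-2 |
|-------|---------|--------|-------------|-------------|
| 8     | 510.7s  | 290.8s | 482.3s      | 474.9s      |
| 16    | 240.1s  | 163.1s | 257.0s      | 245.5s      |
| 32    | 115.8s  | 99.3s  | 140.3s      | 144.0s      |
| 64    | 59.0s   | 63.6s  | 93.2s       | 97.0s       |
| 128   | 51.8s   | 42.5s  | 49.6s       | 50.7s       |
| 256   | 25.7s   | 29.9s  | 26.6s       | 28.2s       |
| 512   | 12.6s   | 25.0s  | 13.6s       | 14.2s       |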

Unfortunately, pencilFFT-2 is not as great for size E. It's behind the reference for 16-64 nodes (and slightly behind, by roughly 10-13%, at 256 and 512 nodes), and loses to master at 128 nodes and below.
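To make the size E comparison easy to recheck, the following sketch lists the node counts at which pencilFFT-2 is slower than the reference and slower than master. The numbers are copied from the size E table; the variable names are assumptions made for illustration.

```python
# Size E timings in seconds, copied from the table above.
nodes = [8, 16, 32, 64, 128, 256, 512]
ref = [510.7, 240.1, 115.8, 59.0, 51.8, 25.7, 12.6]
master = [290.8, 163.1, 99.3, 63.6, 42.5, 29.9, 25.0]
pencil2 = [474.9, 245.5, 144.0, 97.0, 50.7, 28.2, 14.2]

# Node counts where pencilFFT-2 takes longer than each baseline.
behind_ref = [n for n, r, p in zip(nodes, ref, pencil2) if p > r]
behind_master = [n for n, m, p in zip(nodes, master, pencil2) if p > m]

print("slower than ref at:   ", behind_ref)
print("slower than master at:", behind_master)
```

Besides 16-64 nodes, this also flags 256 and 512 nodes against the reference, though the gap there is small compared with the 32 and 64 node cases.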

Size F:

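| nodes | ref MPI | master | pencilFFT-1 | pencilFFT-2 |
|-------|---------|--------|-------------|-------------|
| 64    | 974.0s  | OOM    | OOM         | 623.2s      |
| 128   | 457.0s  | 316.9s | 427.5s      | 436.3s      |
| 256   | 266.7s  | 193.2s | 225.1s      | 243.8s      |
| 512   | 133.7s  | 138.1s | 121.1s      | 123.1s      |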

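pencilFFT-2 successfully runs for size F with 64 nodes (where master and pencilFFT-1 run out of memory) and always beats the reference, but is pretty far behind master for 128/256 nodes (roughly 38% and 26% slower). It is also slower than pencilFFT-1 at every node count where pencilFFT-1 runs.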


I will generate graphs for these results later tonight

I'm not sure when I'll have time to dig into the size E/F results

npadmana commented 5 years ago

Thanks! These are very interesting....

The OOM for pencilFFT-1 is not surprising, since in this limit, it looks a lot like master (in that it needs a number of local planes).

Will you have access to the machine on Monday to run a possible last round of timings?

ronawho commented 5 years ago

Yup, I should have some access Sat/Sun and Monday looks pretty open

ronawho commented 5 years ago

I was curious how much slower each section is in pencilFFT-2 compared to master. Here are some possibly interesting timings for size E on 64 nodes on swan (I just printed out the total yz and x times per iteration):

master

yz=0.30, x=1.74: Checksum(1) =  5.121601045346e+02 + 5.117395998266e+02i
yz=0.32, x=1.77: Checksum(2) =  5.120905403678e+02 + 5.118614716182e+02i
yz=0.35, x=1.74: Checksum(3) =  5.120623229306e+02 + 5.119074203747e+02i
yz=0.36, x=1.76: Checksum(4) =  5.120438418997e+02 + 5.119345900733e+02i
yz=0.39, x=1.77: Checksum(5) =  5.120311521872e+02 + 5.119551325550e+02i
yz=0.43, x=1.77: Checksum(6) =  5.120226088809e+02 + 5.119720179919e+02i
yz=0.42, x=1.76: Checksum(7) =  5.120169296534e+02 + 5.119861371665e+02i
yz=0.42, x=1.76: Checksum(8) =  5.120131225172e+02 + 5.119979364402e+02i
yz=0.44, x=1.75: Checksum(9) =  5.120104767108e+02 + 5.120077674092e+02i
yz=0.37, x=1.77: Checksum(10) =  5.120085127969e+02 + 5.120159443121e+02i
yz=0.45, x=1.74: Checksum(11) =  5.120069224127e+02 + 5.120227453670e+02i
yz=0.45, x=1.78: Checksum(12) =  5.120055158164e+02 + 5.120284096041e+02i
yz=0.44, x=1.75: Checksum(13) =  5.120041820159e+02 + 5.120331373793e+02i
yz=0.44, x=1.75: Checksum(14) =  5.120028605402e+02 + 5.120370938679e+02i
yz=0.47, x=1.75: Checksum(15) =  5.120015223011e+02 + 5.120404138831e+02i
yz=0.48, x=1.74: Checksum(16) =  5.120001570022e+02 + 5.120432068837e+02i
yz=0.52, x=1.78: Checksum(17) =  5.119987650555e+02 + 5.120455615860e+02i
yz=1.68, x=1.79: Checksum(18) =  5.119973525091e+02 + 5.120475499442e+02i
yz=1.62, x=1.77: Checksum(19) =  5.119959279472e+02 + 5.120492304629e+02i
yz=1.53, x=1.79: Checksum(20) =  5.119945006558e+02 + 5.120506508902e+02i
yz=1.58, x=1.76: Checksum(21) =  5.119930795911e+02 + 5.120518503782e+02i
yz=1.63, x=1.83: Checksum(22) =  5.119916728462e+02 + 5.120528612016e+02i
yz=1.52, x=1.77: Checksum(23) =  5.119902874185e+02 + 5.120537101195e+02i
yz=1.61, x=1.76: Checksum(24) =  5.119889291565e+02 + 5.120544194514e+02i
yz=1.63, x=1.77: Checksum(25) =  5.119876028049e+02 + 5.120550079284e+02i

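x always takes ~1.75 seconds, but curiously the yz transform gets slower over time: it creeps from 0.30s to about 0.5s over the first 17 iterations, then jumps to ~1.6s from iteration 18 onward.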

pencilFFT-2

yz=1.10, x=2.97: Checksum(1) = 5.121601045346e+02 + 5.117395998266e+02i
yz=1.11, x=2.96: Checksum(2) = 5.120905403678e+02 + 5.118614716182e+02i
yz=1.10, x=3.01: Checksum(3) = 5.120623229306e+02 + 5.119074203747e+02i
yz=1.10, x=2.98: Checksum(4) = 5.120438418997e+02 + 5.119345900733e+02i
yz=1.10, x=3.01: Checksum(5) = 5.120311521872e+02 + 5.119551325550e+02i
yz=1.10, x=3.00: Checksum(6) = 5.120226088809e+02 + 5.119720179919e+02i
yz=1.10, x=3.00: Checksum(7) = 5.120169296534e+02 + 5.119861371665e+02i
yz=1.10, x=3.04: Checksum(8) = 5.120131225172e+02 + 5.119979364402e+02i
yz=1.40, x=2.95: Checksum(9) = 5.120104767108e+02 + 5.120077674092e+02i
yz=1.60, x=3.02: Checksum(10) = 5.120085127969e+02 + 5.120159443121e+02i
yz=1.58, x=3.05: Checksum(11) = 5.120069224127e+02 + 5.120227453670e+02i
yz=1.57, x=3.04: Checksum(12) = 5.120055158164e+02 + 5.120284096041e+02i
yz=1.53, x=2.96: Checksum(13) = 5.120041820159e+02 + 5.120331373793e+02i
yz=1.51, x=3.00: Checksum(14) = 5.120028605402e+02 + 5.120370938679e+02i
yz=1.54, x=3.02: Checksum(15) = 5.120015223011e+02 + 5.120404138831e+02i
yz=1.51, x=3.02: Checksum(16) = 5.120001570022e+02 + 5.120432068837e+02i
yz=1.51, x=3.02: Checksum(17) = 5.119987650555e+02 + 5.120455615860e+02i
yz=1.68, x=2.92: Checksum(18) = 5.119973525091e+02 + 5.120475499442e+02i
yz=1.68, x=3.04: Checksum(19) = 5.119959279472e+02 + 5.120492304629e+02i
yz=1.59, x=3.03: Checksum(20) = 5.119945006558e+02 + 5.120506508902e+02i
yz=1.64, x=3.01: Checksum(21) = 5.119930795911e+02 + 5.120518503782e+02i
yz=1.66, x=3.04: Checksum(22) = 5.119916728462e+02 + 5.120528612016e+02i
yz=1.57, x=3.04: Checksum(23) = 5.119902874185e+02 + 5.120537101195e+02i
yz=1.63, x=3.00: Checksum(24) = 5.119889291565e+02 + 5.120544194514e+02i
yz=1.60, x=2.96: Checksum(25) = 5.119876028049e+02 + 5.120550079284e+02i

x always takes ~3 seconds (about 1.25 seconds slower than master). yz starts off at 1.1s, compared to 0.3s on master, but both grow to ~1.6s.
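For anyone wanting to spot the yz slowdown without eyeballing the logs, here is a rough sketch that parses lines in the `yz=..., x=...: Checksum(n)` format printed above and reports where yz jumps the most. It assumes that exact line format; the sample is a four-line excerpt from the master log, and the regex and function names are made up for this example.

```python
import re

# A few lines excerpted from the master log above (iterations 1, 17, 18, 25).
LOG = """\
yz=0.30, x=1.74: Checksum(1) =  5.121601045346e+02 + 5.117395998266e+02i
yz=0.52, x=1.78: Checksum(17) =  5.119987650555e+02 + 5.120455615860e+02i
yz=1.68, x=1.79: Checksum(18) =  5.119973525091e+02 + 5.120475499442e+02i
yz=1.63, x=1.77: Checksum(25) =  5.119876028049e+02 + 5.120550079284e+02i
"""

# Assumed line format: "yz=<secs>, x=<secs>: Checksum(<iteration>) = ..."
LINE = re.compile(r"yz=([\d.]+), x=([\d.]+): Checksum\((\d+)\)")


def parse(log):
    """Return (iteration, yz_seconds, x_seconds) for each matching line."""
    return [(int(m.group(3)), float(m.group(1)), float(m.group(2)))
            for m in LINE.finditer(log)]


rows = parse(LOG)

# Largest increase in yz time between consecutive parsed lines,
# paired with the iteration at which the increase shows up.
jump, jump_iter = max(
    (cur[1] - prev[1], cur[0]) for prev, cur in zip(rows, rows[1:])
)
mean_x = sum(x for _, _, x in rows) / len(rows)

print(f"largest yz jump: +{jump:.2f}s at iteration {jump_iter}")
print(f"mean x time: {mean_x:.2f}s")
```

Feeding in the full 25-line logs instead of the excerpt should work the same way, since `parse` only relies on the per-line format.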


I'm not sure I understand why the yz transform seems to slow down, but maybe that's obvious to you?

I wonder if we should be changing our parallelism strategy for the YZ section when we're on a lower number of nodes

npadmana commented 5 years ago

Hm. I can't reproduce this behavior for pencilFFT-2. On swan, with 64 nodes and the E problem size, I see yz taking ~1.1 seconds, while x takes ~3 seconds (I ran it a few different times, and never got this slowdown).

But your larger point is a good one -- we might want two different YZ strategies. Let me try this....

ronawho commented 5 years ago

I've been doing full runs on Crystal with PrgEnv-intel, but I've been using PrgEnv-gnu on Swan for faster compiles while experimenting. It looks like the slower YZ is only occurring for PrgEnv-gnu. I see a very stable YZ=1.1, X=~2.8 for PrgEnv-intel.

Curious, but a curiosity for another day -- I'll switch to using intel on Swan for now.

npadmana commented 5 years ago

I'm closing this, since #25 captures the remaining mystery.