Closed by npadmana 5 years ago
Yup, already working on it.
Results for ref vs. master (bf40d24) vs. pencilFFT-1 (e9047e7) vs. pencilFFT-2 (834b180):
nodes | ref MPI | master | pencilFFT-1 | pencilFFT-2 |
---|---|---|---|---|
1 | 262.2s | 157.7s | 159.1s | 143.5s |
2 | 156.5s | 84.8s | 97.9s | 92.8s |
4 | 93.8s | 60.8s | 74.2s | 70.1s |
8 | 51.0s | 36.5s | 45.6s | 40.6s |
16 | 25.4s | 20.9s | 24.6s | 24.8s |
32 | 14.1s | 14.5s | 14.2s | 13.4s |
64 | 10.4s | 9.6s | 7.5s | 7.0s |
128 | 5.2s | 7.4s | 4.3s | 3.8s |
256 | 2.8s | 6.1s | 2.4s | 2.1s |
512 | 1.6s | 5.6s | 1.4s | 1.3s |
pencilFFT-2 looks great for size D. It always beats the reference, and only loses to master by a little (<20%) for 2-16 nodes.
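For anyone who wants to check the "<20%" figure, here is a minimal sketch that recomputes the gap from the size D table above. The timings are copied from the table; the names `SIZE_D`, `pct_slower`, and `gaps_vs_master` are made up for illustration and are not part of the benchmark code.

```python
# Size D timings in seconds, copied from the table above.
# Each row: nodes -> (ref MPI, master, pencilFFT-1, pencilFFT-2)
SIZE_D = {
    1: (262.2, 157.7, 159.1, 143.5),
    2: (156.5, 84.8, 97.9, 92.8),
    4: (93.8, 60.8, 74.2, 70.1),
    8: (51.0, 36.5, 45.6, 40.6),
    16: (25.4, 20.9, 24.6, 24.8),
    32: (14.1, 14.5, 14.2, 13.4),
    64: (10.4, 9.6, 7.5, 7.0),
    128: (5.2, 7.4, 4.3, 3.8),
    256: (2.8, 6.1, 2.4, 2.1),
    512: (1.6, 5.6, 1.4, 1.3),
}


def pct_slower(t, baseline):
    """Percent by which time t exceeds baseline (negative means faster)."""
    return 100.0 * (t - baseline) / baseline


def gaps_vs_master(table):
    """Map node count -> percent gap of pencilFFT-2 relative to master."""
    return {n: pct_slower(row[3], row[1]) for n, row in table.items()}


if __name__ == "__main__":
    for nodes, gap in gaps_vs_master(SIZE_D).items():
        print(f"{nodes:4d} nodes: pencilFFT-2 vs master {gap:+6.1f}%")
```

A positive percentage means pencilFFT-2 is slower than master at that node count; a negative one means it is faster.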
nodes | ref MPI | master | pencilFFT-1 | pencilFFT-2 |
---|---|---|---|---|
8 | 510.7s | 290.8s | 482.3s | 474.9s |
16 | 240.1s | 163.1s | 257.0s | 245.5s |
32 | 115.8s | 99.3s | 140.3s | 144.0s |
64 | 59.0s | 63.6s | 93.2s | 97.0s |
128 | 51.8s | 42.5s | 49.6s | 50.7s |
256 | 25.7s | 29.9s | 26.6s | 28.2s |
512 | 12.6s | 25.0s | 13.6s | 14.2s |
Unfortunately, pencilFFT-2 is not as great for size E. It's behind the reference for 16-64 nodes (and slightly behind again at 256 and 512 nodes), and loses to master at <= 128 nodes.
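A quick way to see where pencilFFT-2 falls behind each baseline for size E is to filter the table rows directly. This is only a sketch over the numbers in the table above; `SIZE_E` and `behind` are hypothetical names used for illustration.

```python
# Size E timings in seconds, copied from the table above.
SIZE_E = {
    8: {"ref": 510.7, "master": 290.8, "pencilFFT-1": 482.3, "pencilFFT-2": 474.9},
    16: {"ref": 240.1, "master": 163.1, "pencilFFT-1": 257.0, "pencilFFT-2": 245.5},
    32: {"ref": 115.8, "master": 99.3, "pencilFFT-1": 140.3, "pencilFFT-2": 144.0},
    64: {"ref": 59.0, "master": 63.6, "pencilFFT-1": 93.2, "pencilFFT-2": 97.0},
    128: {"ref": 51.8, "master": 42.5, "pencilFFT-1": 49.6, "pencilFFT-2": 50.7},
    256: {"ref": 25.7, "master": 29.9, "pencilFFT-1": 26.6, "pencilFFT-2": 28.2},
    512: {"ref": 12.6, "master": 25.0, "pencilFFT-1": 13.6, "pencilFFT-2": 14.2},
}


def behind(table, variant, baseline):
    """Node counts (ascending) at which `variant` is slower than `baseline`."""
    return [n for n, row in sorted(table.items()) if row[variant] > row[baseline]]


if __name__ == "__main__":
    print("behind ref:   ", behind(SIZE_E, "pencilFFT-2", "ref"))
    print("behind master:", behind(SIZE_E, "pencilFFT-2", "master"))
```

The same helper works for the other size tables if their rows are entered in the same shape.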
nodes | ref MPI | master | pencilFFT-1 | pencilFFT-2 |
---|---|---|---|---|
64 | 974.0s | OOM | OOM | 623.2s |
128 | 457.0s | 316.9s | 427.5s | 436.3s |
256 | 266.7s | 193.2s | 225.1s | 243.8s |
512 | 133.7s | 138.1s | 121.1s | 123.1s |
pencilFFT-2 successfully runs for size F with 64 nodes and always beats the reference, but is pretty far behind master for 128/256 nodes. It is also slightly slower than pencilFFT-1 wherever both run (128-512 nodes).
I will generate graphs for these results later tonight.
I'm not sure when I'll have time to dig into the size E/F results.
Thanks! These are very interesting....
The OOM for pencilFFT-1 is not surprising, since in this limit, it looks a lot like master (in that it needs a number of local planes).
Will you have access to the machine on Monday to run a possible last round of timings?
Yup, I should have some access Sat/Sun and Monday looks pretty open
I was curious about how much slower each section is in pencilFFT-2 than in master. Here are some possibly interesting timings comparing size E on 64 nodes between master and pencilFFT-2 on swan (I just printed out the total yz and x times per iteration):
yz=0.30, x=1.74: Checksum(1) = 5.121601045346e+02 + 5.117395998266e+02i
yz=0.32, x=1.77: Checksum(2) = 5.120905403678e+02 + 5.118614716182e+02i
yz=0.35, x=1.74: Checksum(3) = 5.120623229306e+02 + 5.119074203747e+02i
yz=0.36, x=1.76: Checksum(4) = 5.120438418997e+02 + 5.119345900733e+02i
yz=0.39, x=1.77: Checksum(5) = 5.120311521872e+02 + 5.119551325550e+02i
yz=0.43, x=1.77: Checksum(6) = 5.120226088809e+02 + 5.119720179919e+02i
yz=0.42, x=1.76: Checksum(7) = 5.120169296534e+02 + 5.119861371665e+02i
yz=0.42, x=1.76: Checksum(8) = 5.120131225172e+02 + 5.119979364402e+02i
yz=0.44, x=1.75: Checksum(9) = 5.120104767108e+02 + 5.120077674092e+02i
yz=0.37, x=1.77: Checksum(10) = 5.120085127969e+02 + 5.120159443121e+02i
yz=0.45, x=1.74: Checksum(11) = 5.120069224127e+02 + 5.120227453670e+02i
yz=0.45, x=1.78: Checksum(12) = 5.120055158164e+02 + 5.120284096041e+02i
yz=0.44, x=1.75: Checksum(13) = 5.120041820159e+02 + 5.120331373793e+02i
yz=0.44, x=1.75: Checksum(14) = 5.120028605402e+02 + 5.120370938679e+02i
yz=0.47, x=1.75: Checksum(15) = 5.120015223011e+02 + 5.120404138831e+02i
yz=0.48, x=1.74: Checksum(16) = 5.120001570022e+02 + 5.120432068837e+02i
yz=0.52, x=1.78: Checksum(17) = 5.119987650555e+02 + 5.120455615860e+02i
yz=1.68, x=1.79: Checksum(18) = 5.119973525091e+02 + 5.120475499442e+02i
yz=1.62, x=1.77: Checksum(19) = 5.119959279472e+02 + 5.120492304629e+02i
yz=1.53, x=1.79: Checksum(20) = 5.119945006558e+02 + 5.120506508902e+02i
yz=1.58, x=1.76: Checksum(21) = 5.119930795911e+02 + 5.120518503782e+02i
yz=1.63, x=1.83: Checksum(22) = 5.119916728462e+02 + 5.120528612016e+02i
yz=1.52, x=1.77: Checksum(23) = 5.119902874185e+02 + 5.120537101195e+02i
yz=1.61, x=1.76: Checksum(24) = 5.119889291565e+02 + 5.120544194514e+02i
yz=1.63, x=1.77: Checksum(25) = 5.119876028049e+02 + 5.120550079284e+02i
On master, x always takes ~1.75 seconds, but curiously the yz transform gets slower over time.
yz=1.10, x=2.97: Checksum(1) = 5.121601045346e+02 + 5.117395998266e+02i
yz=1.11, x=2.96: Checksum(2) = 5.120905403678e+02 + 5.118614716182e+02i
yz=1.10, x=3.01: Checksum(3) = 5.120623229306e+02 + 5.119074203747e+02i
yz=1.10, x=2.98: Checksum(4) = 5.120438418997e+02 + 5.119345900733e+02i
yz=1.10, x=3.01: Checksum(5) = 5.120311521872e+02 + 5.119551325550e+02i
yz=1.10, x=3.00: Checksum(6) = 5.120226088809e+02 + 5.119720179919e+02i
yz=1.10, x=3.00: Checksum(7) = 5.120169296534e+02 + 5.119861371665e+02i
yz=1.10, x=3.04: Checksum(8) = 5.120131225172e+02 + 5.119979364402e+02i
yz=1.40, x=2.95: Checksum(9) = 5.120104767108e+02 + 5.120077674092e+02i
yz=1.60, x=3.02: Checksum(10) = 5.120085127969e+02 + 5.120159443121e+02i
yz=1.58, x=3.05: Checksum(11) = 5.120069224127e+02 + 5.120227453670e+02i
yz=1.57, x=3.04: Checksum(12) = 5.120055158164e+02 + 5.120284096041e+02i
yz=1.53, x=2.96: Checksum(13) = 5.120041820159e+02 + 5.120331373793e+02i
yz=1.51, x=3.00: Checksum(14) = 5.120028605402e+02 + 5.120370938679e+02i
yz=1.54, x=3.02: Checksum(15) = 5.120015223011e+02 + 5.120404138831e+02i
yz=1.51, x=3.02: Checksum(16) = 5.120001570022e+02 + 5.120432068837e+02i
yz=1.51, x=3.02: Checksum(17) = 5.119987650555e+02 + 5.120455615860e+02i
yz=1.68, x=2.92: Checksum(18) = 5.119973525091e+02 + 5.120475499442e+02i
yz=1.68, x=3.04: Checksum(19) = 5.119959279472e+02 + 5.120492304629e+02i
yz=1.59, x=3.03: Checksum(20) = 5.119945006558e+02 + 5.120506508902e+02i
yz=1.64, x=3.01: Checksum(21) = 5.119930795911e+02 + 5.120518503782e+02i
yz=1.66, x=3.04: Checksum(22) = 5.119916728462e+02 + 5.120528612016e+02i
yz=1.57, x=3.04: Checksum(23) = 5.119902874185e+02 + 5.120537101195e+02i
yz=1.63, x=3.00: Checksum(24) = 5.119889291565e+02 + 5.120544194514e+02i
yz=1.60, x=2.96: Checksum(25) = 5.119876028049e+02 + 5.120550079284e+02i
On pencilFFT-2, x always takes ~3 seconds (about 1.25 seconds slower than master). yz starts off at 1.1s, compared to 0.3s on master, but both grow to ~1.6s.
I'm not sure I understand why the yz transform seems to slow down, but maybe that's obvious to you?
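To pin down where the yz slowdown kicks in without eyeballing the logs, something like the following could parse the `yz=..., x=...: Checksum(n)` lines and report the first iteration where yz jumps. This is only a sketch: the regex assumes the exact print format shown above, and the 1.25x threshold and the names `parse_timings` and `first_jump` are assumptions for illustration, not anything in the benchmark itself.

```python
import re

# Matches the timing prefix of lines like:
#   yz=0.52, x=1.78: Checksum(17) = ...
LINE_RE = re.compile(
    r"yz=(?P<yz>\d+\.\d+), x=(?P<x>\d+\.\d+): Checksum\((?P<it>\d+)\)"
)

# Excerpt of the master log above (iterations 16-19).
SAMPLE = """\
yz=0.48, x=1.74: Checksum(16) = 5.120001570022e+02 + 5.120432068837e+02i
yz=0.52, x=1.78: Checksum(17) = 5.119987650555e+02 + 5.120455615860e+02i
yz=1.68, x=1.79: Checksum(18) = 5.119973525091e+02 + 5.120475499442e+02i
yz=1.62, x=1.77: Checksum(19) = 5.119959279472e+02 + 5.120492304629e+02i
"""


def parse_timings(text):
    """Return a list of (iteration, yz_seconds, x_seconds) found in text."""
    return [
        (int(m["it"]), float(m["yz"]), float(m["x"]))
        for m in LINE_RE.finditer(text)
    ]


def first_jump(timings, factor=1.25):
    """First iteration whose yz time exceeds `factor` times the previous
    iteration's yz time, or None if there is no such jump.
    The default factor is an arbitrary illustrative threshold."""
    for (_, prev_yz, _), (it, yz, _) in zip(timings, timings[1:]):
        if yz > factor * prev_yz:
            return it
    return None


if __name__ == "__main__":
    timings = parse_timings(SAMPLE)
    print("first yz jump at iteration:", first_jump(timings))
```

Because it uses `finditer`, the parser does not care whether the log has one entry per line or several entries run together on one line.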
I wonder if we should be changing our parallelism strategy for the YZ section when we're on a lower number of nodes
Hm. I can't reproduce this behavior for pencilFFT-2. On swan, with 64 nodes and the E problem size, I see yz taking ~1.1 seconds, while x takes ~3 seconds (I ran it a few different times, and never got this slowdown).
But your larger point is a good one -- we might want two different YZ strategies. Let me try this....
I've been doing full runs on Crystal with PrgEnv-intel, but I've been using PrgEnv-gnu on Swan for faster compiles while experimenting. It looks like the slower YZ is only occurring for PrgEnv-gnu. I see a very stable YZ=1.1, X=~2.8 for PrgEnv-intel.
Curious, but a curiosity for another day -- I'll switch to using intel on Swan for now.
I'm closing this, since #25 captures the remaining mystery.
@ronawho -- could we run the full timing suite for pencilFFT-2 for D, E, F to update the figures in the abstract?