Open npadmana opened 4 years ago
@ronawho -- just flagging this for you.
Just so that I don't lose track of these numbers, here are the timings for pfft on swan on the sk40 partition (running with 32 cores per node rather than all 40). These are raw numbers, but the best performance appears to come from a processor grid of <nnodes>x<ncpus>: at the same total core count, the N x 32 layouts have the lowest exec_forw and exec_back times.
Running Ngrid=1024, processor grid=2 32, cpus=64, loops=20
tune_forw = 7.69e+00; tune_back = 6.27e+00, exec_forw = 2.11e+00, exec_back = 2.80e+00, error = 3.13e-13
Running Ngrid=1024, processor grid=4 32, cpus=128, loops=20
tune_forw = 6.69e+00; tune_back = 4.69e+00, exec_forw = 1.24e+00, exec_back = 1.59e+00, error = 2.42e-13
Running Ngrid=1024, processor grid=8 32, cpus=256, loops=20
tune_forw = 3.47e+00; tune_back = 2.94e+00, exec_forw = 7.29e-01, exec_back = 9.17e-01, error = 2.37e-13
Running Ngrid=1024, processor grid=32 8, cpus=256, loops=20
tune_forw = 3.29e+00; tune_back = 2.82e+00, exec_forw = 8.08e-01, exec_back = 9.55e-01, error = 2.37e-13
Running Ngrid=1024, processor grid=16 16, cpus=256, loops=20
tune_forw = 3.52e+00; tune_back = 2.78e+00, exec_forw = 8.25e-01, exec_back = 9.71e-01, error = 3.13e-13
Running Ngrid=1024, processor grid=16 32, cpus=512, loops=20
tune_forw = 2.56e+00; tune_back = 1.96e+00, exec_forw = 4.52e-01, exec_back = 5.63e-01, error = 2.80e-13
Running Ngrid=1024, processor grid=32 16, cpus=512, loops=20
tune_forw = 2.48e+00; tune_back = 1.57e+00, exec_forw = 5.21e-01, exec_back = 5.97e-01, error = 2.46e-13
Running Ngrid=1024, processor grid=32 32, cpus=1024, loops=20
tune_forw = 1.17e+00; tune_back = 1.09e+00, exec_forw = 2.28e-01, exec_back = 2.55e-01, error = 2.35e-13
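To make the scaling easier to read off, here is a rough Python sketch (not part of pfft or of the scripts that produced these runs) that parses the log lines above and computes strong-scaling speedup and parallel efficiency for the N x 32 grids, taking the 64-core run as the baseline. The function names and the choice of baseline are my own assumptions for illustration.

```python
import re

# Raw pfft output copied verbatim from the runs above.
PFFT_LOG = """\
Running Ngrid=1024, processor grid=2 32, cpus=64, loops=20
tune_forw = 7.69e+00; tune_back = 6.27e+00, exec_forw = 2.11e+00, exec_back = 2.80e+00, error = 3.13e-13
Running Ngrid=1024, processor grid=4 32, cpus=128, loops=20
tune_forw = 6.69e+00; tune_back = 4.69e+00, exec_forw = 1.24e+00, exec_back = 1.59e+00, error = 2.42e-13
Running Ngrid=1024, processor grid=8 32, cpus=256, loops=20
tune_forw = 3.47e+00; tune_back = 2.94e+00, exec_forw = 7.29e-01, exec_back = 9.17e-01, error = 2.37e-13
Running Ngrid=1024, processor grid=32 8, cpus=256, loops=20
tune_forw = 3.29e+00; tune_back = 2.82e+00, exec_forw = 8.08e-01, exec_back = 9.55e-01, error = 2.37e-13
Running Ngrid=1024, processor grid=16 16, cpus=256, loops=20
tune_forw = 3.52e+00; tune_back = 2.78e+00, exec_forw = 8.25e-01, exec_back = 9.71e-01, error = 3.13e-13
Running Ngrid=1024, processor grid=16 32, cpus=512, loops=20
tune_forw = 2.56e+00; tune_back = 1.96e+00, exec_forw = 4.52e-01, exec_back = 5.63e-01, error = 2.80e-13
Running Ngrid=1024, processor grid=32 16, cpus=512, loops=20
tune_forw = 2.48e+00; tune_back = 1.57e+00, exec_forw = 5.21e-01, exec_back = 5.97e-01, error = 2.46e-13
Running Ngrid=1024, processor grid=32 32, cpus=1024, loops=20
tune_forw = 1.17e+00; tune_back = 1.09e+00, exec_forw = 2.28e-01, exec_back = 2.55e-01, error = 2.35e-13
"""

HEADER = re.compile(r"processor grid=(\d+) (\d+), cpus=(\d+)")
TIMING = re.compile(r"exec_forw = ([0-9.e+-]+), exec_back = ([0-9.e+-]+)")


def parse_pfft(log):
    """Pair each 'Running ...' header with the timing line that follows it."""
    runs, header = [], None
    for line in log.splitlines():
        h = HEADER.search(line)
        if h:
            header = tuple(int(g) for g in h.groups())
            continue
        t = TIMING.search(line)
        if t and header is not None:
            n0, n1, cpus = header
            runs.append({"grid": (n0, n1), "cpus": cpus,
                         "exec_forw": float(t.group(1)),
                         "exec_back": float(t.group(2))})
            header = None
    return runs


def strong_scaling(runs, key="exec_forw"):
    """Return (cpus, speedup, efficiency) relative to the smallest run."""
    runs = sorted(runs, key=lambda r: r["cpus"])
    base = runs[0]
    rows = []
    for r in runs:
        speedup = base[key] / r[key]
        ideal = r["cpus"] / base["cpus"]
        rows.append((r["cpus"], speedup, speedup / ideal))
    return rows


pfft_runs = parse_pfft(PFFT_LOG)
# Keep only the <nnodes> x 32 layouts so each core count appears once.
nx32 = [r for r in pfft_runs if r["grid"][1] == 32]
pfft_scaling = strong_scaling(nx32)

for cpus, speedup, eff in pfft_scaling:
    print(f"cpus={cpus:5d}  speedup={speedup:5.2f}  efficiency={eff:4.2f}")
```

Passing `key="exec_back"` to `strong_scaling` gives the same table for the backward transform.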
And here are the Chapel timings on the bw44 partition on swan (note that this is a different partition from the pfft runs above):
numLocales= 2 Ng=1024 diff=2.99e-16 planTime= 3.7 runTime(F)= 1.4 runTime(B)= 1.4
numLocales= 4 Ng=1024 diff=2.99e-16 planTime= 2.8 runTime(F)= 0.93 runTime(B)= 0.93
numLocales= 8 Ng=1024 diff=1.11e-16 planTime= 2.1 runTime(F)= 0.59 runTime(B)= 0.58
numLocales=16 Ng=1024 diff=2.78e-17 planTime= 1.6 runTime(F)= 0.33 runTime(B)= 0.33
numLocales=32 Ng=1024 diff=1.09e-16 planTime= 1.3 runTime(F)= 0.19 runTime(B)= 0.19
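The same kind of summary for the Chapel numbers, again as a rough Python sketch rather than anything from the benchmark itself: it parses the lines above and reports forward-FFT speedup and efficiency relative to the 2-locale run. The regular expression and function name are assumptions made for this example, and because these runs used bw44 while the pfft runs used sk40, the efficiencies are best compared within each code rather than across them.

```python
import re

# Chapel output copied verbatim from the runs above.
CHAPEL_LOG = """\
numLocales= 2 Ng=1024 diff=2.99e-16 planTime= 3.7 runTime(F)= 1.4 runTime(B)= 1.4
numLocales= 4 Ng=1024 diff=2.99e-16 planTime= 2.8 runTime(F)= 0.93 runTime(B)= 0.93
numLocales= 8 Ng=1024 diff=1.11e-16 planTime= 2.1 runTime(F)= 0.59 runTime(B)= 0.58
numLocales=16 Ng=1024 diff=2.78e-17 planTime= 1.6 runTime(F)= 0.33 runTime(B)= 0.33
numLocales=32 Ng=1024 diff=1.09e-16 planTime= 1.3 runTime(F)= 0.19 runTime(B)= 0.19
"""

LINE = re.compile(
    r"numLocales=\s*(\d+)\s+Ng=(\d+).*?"
    r"runTime\(F\)=\s*([0-9.]+)\s+runTime\(B\)=\s*([0-9.]+)"
)


def parse_chapel(log):
    """Extract (numLocales, Ng, forward time, backward time) from each line."""
    rows = []
    for line in log.splitlines():
        m = LINE.search(line)
        if m:
            rows.append({"locales": int(m.group(1)), "ng": int(m.group(2)),
                         "fwd": float(m.group(3)), "bwd": float(m.group(4))})
    return rows


chapel_runs = parse_chapel(CHAPEL_LOG)
chapel_base = chapel_runs[0]

chapel_scaling = []
for r in chapel_runs:
    speedup = chapel_base["fwd"] / r["fwd"]
    ideal = r["locales"] / chapel_base["locales"]
    chapel_scaling.append((r["locales"], speedup, speedup / ideal))

for locales, speedup, eff in chapel_scaling:
    print(f"numLocales={locales:2d}  speedup={speedup:5.2f}  efficiency={eff:4.2f}")
```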
Both for CHIUW and for an eventual chplUltra paper (hopefully soon), it would be good to have a direct comparison of the performance of different distributed FFT implementations.
We should do basic timing runs for all of these -- just back and forth FFTs, ideally exploring both the strong and weak scaling of these systems.
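As a starting point for planning those runs, here is a small Python sketch of how the two sweeps could be laid out. It assumes a 3D Ng^3 grid (my reading of the runs above, not something stated explicitly), so holding points per locale fixed means Ng doubles each time the locale count goes up by 8x. The function names and the example sizes are hypothetical. Note that fixing points per locale does not fix the work per locale exactly, since FFT cost grows as N log N.

```python
def strong_scaling_plan(ng, locale_counts):
    """Strong scaling: same total grid size on every locale count."""
    return [(p, ng) for p in locale_counts]


def weak_scaling_plan(ng0, p0, steps):
    """Weak scaling for an assumed 3D grid: keep Ng**3 / locales constant.

    Each step multiplies the locale count by 8 and doubles Ng, so every
    locale always holds ng0**3 / p0 grid points.
    """
    return [(p0 * 8**k, ng0 * 2**k) for k in range(steps)]


def strong_efficiency(t_base, p_base, t, p):
    """Ideal strong scaling halves the time when locales double."""
    return (t_base * p_base) / (t * p)


def weak_efficiency(t_base, t):
    """Ideal weak scaling keeps the time constant as the problem grows."""
    return t_base / t


strong_plan = strong_scaling_plan(1024, [2, 4, 8, 16, 32])
weak_plan = weak_scaling_plan(512, 1, 3)

for p, ng in strong_plan:
    print(f"strong: numLocales={p:3d} Ng={ng}")
for p, ng in weak_plan:
    print(f"weak:   numLocales={p:3d} Ng={ng} points/locale={ng**3 // p}")
```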