Compare FFT timing runs

npadmana commented 4 years ago

For both CHIUW and an eventual chplUltra paper (hopefully soon), it would be good to have a direct performance comparison of different distributed FFT implementations:

[ ] Chapel
[ ] PFFT (https://github.com/mpip/pfft) (2D decomposition)
[ ] FFTW (1D decomposition, MPI only)
[ ] FFTW (1D decomposition, MPI+OpenMP -- I expect this not to scale well)
[ ] P3DFFT (https://www.p3dfft.net/) (this just has real-to-complex and complex-to-real transforms, so requires closing #66 and the related topic in #36)

We should do basic timing runs for all of these -- just forward and backward FFTs -- ideally exploring both the strong and weak scaling of each implementation.
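To make the strong/weak distinction concrete, here is a rough sketch of how the grid sizes for the two kinds of runs might be chosen. It assumes a 3D transform with Ng points per side, so that a weak-scaling run keeps Ng^3 / nodes fixed; the function names and the base configuration (Ng=1024 on 2 nodes) are hypothetical choices for illustration, not something settled in this issue.

```python
def weak_scaling_grid(base_ng: int, base_nodes: int, nodes: int) -> int:
    """Points per side that keep points-per-node constant on a 3D grid.

    Assumption: the grid is cubic, so total points scale as ng**3.
    """
    scale = (nodes / base_nodes) ** (1.0 / 3.0)
    return round(base_ng * scale)


def run_plan(base_ng: int, base_nodes: int, node_counts: list[int]) -> list[dict]:
    """One row per node count: fixed size for strong scaling, grown size for weak."""
    plan = []
    for nodes in node_counts:
        plan.append({
            "nodes": nodes,
            "strong_ng": base_ng,  # strong scaling: problem size stays fixed
            "weak_ng": weak_scaling_grid(base_ng, base_nodes, nodes),
        })
    return plan


if __name__ == "__main__":
    for row in run_plan(1024, 2, [2, 4, 8, 16, 32]):
        print(row)
```

Two caveats: constant points per node is constant memory, not constant work, since FFT cost grows as N log N; and real runs would probably round the weak-scaling sizes to values with small prime factors rather than use them as computed.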

npadmana commented 4 years ago

@ronawho -- just flagging this for you.

npadmana commented 4 years ago

Just so that I don't lose track of these numbers, here are the timings for PFFT on swan, on the sk40 partition (running with 32 cores per node instead of the full 40). These are raw numbers, but it looks like the best performance is obtained with a processor grid of <nnodes>x<ncpus>.

Running Ngrid=1024, processor grid=2 32, cpus=64, loops=20
tune_forw = 7.69e+00; tune_back = 6.27e+00, exec_forw = 2.11e+00, exec_back = 2.80e+00, error = 3.13e-13
Running Ngrid=1024, processor grid=4 32, cpus=128, loops=20
tune_forw = 6.69e+00; tune_back = 4.69e+00, exec_forw = 1.24e+00, exec_back = 1.59e+00, error = 2.42e-13
Running Ngrid=1024, processor grid=8 32, cpus=256, loops=20
tune_forw = 3.47e+00; tune_back = 2.94e+00, exec_forw = 7.29e-01, exec_back = 9.17e-01, error = 2.37e-13
Running Ngrid=1024, processor grid=32 8, cpus=256, loops=20
tune_forw = 3.29e+00; tune_back = 2.82e+00, exec_forw = 8.08e-01, exec_back = 9.55e-01, error = 2.37e-13
Running Ngrid=1024, processor grid=16 16, cpus=256, loops=20
tune_forw = 3.52e+00; tune_back = 2.78e+00, exec_forw = 8.25e-01, exec_back = 9.71e-01, error = 3.13e-13
Running Ngrid=1024, processor grid=16 32, cpus=512, loops=20
tune_forw = 2.56e+00; tune_back = 1.96e+00, exec_forw = 4.52e-01, exec_back = 5.63e-01, error = 2.80e-13
Running Ngrid=1024, processor grid=32 16, cpus=512, loops=20
tune_forw = 2.48e+00; tune_back = 1.57e+00, exec_forw = 5.21e-01, exec_back = 5.97e-01, error = 2.46e-13
Running Ngrid=1024, processor grid=32 32, cpus=1024, loops=20
tune_forw = 1.17e+00; tune_back = 1.09e+00, exec_forw = 2.28e-01, exec_back = 2.55e-01, error = 2.35e-13
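For reference, a small sketch of how the processor-grid comparison could be checked mechanically. It parses log text in the format shown above and picks, for each total CPU count, the grid with the smallest exec_forw + exec_back. Only the 256- and 512-CPU entries are embedded here (the counts that were run with more than one grid shape); the regular expressions and function names are my own assumptions about the format, not part of the PFFT test program.

```python
import re

# Subset of the log above: the CPU counts that were run with several grid shapes.
PFFT_LOG = """\
Running Ngrid=1024, processor grid=8 32, cpus=256, loops=20
tune_forw = 3.47e+00; tune_back = 2.94e+00, exec_forw = 7.29e-01, exec_back = 9.17e-01, error = 2.37e-13
Running Ngrid=1024, processor grid=32 8, cpus=256, loops=20
tune_forw = 3.29e+00; tune_back = 2.82e+00, exec_forw = 8.08e-01, exec_back = 9.55e-01, error = 2.37e-13
Running Ngrid=1024, processor grid=16 16, cpus=256, loops=20
tune_forw = 3.52e+00; tune_back = 2.78e+00, exec_forw = 8.25e-01, exec_back = 9.71e-01, error = 3.13e-13
Running Ngrid=1024, processor grid=16 32, cpus=512, loops=20
tune_forw = 2.56e+00; tune_back = 1.96e+00, exec_forw = 4.52e-01, exec_back = 5.63e-01, error = 2.80e-13
Running Ngrid=1024, processor grid=32 16, cpus=512, loops=20
tune_forw = 2.48e+00; tune_back = 1.57e+00, exec_forw = 5.21e-01, exec_back = 5.97e-01, error = 2.46e-13
"""

HEADER = re.compile(r"processor grid=(\d+) (\d+), cpus=(\d+)")
TIMING = re.compile(r"exec_forw = ([\d.e+-]+), exec_back = ([\d.e+-]+)")


def parse_runs(log: str) -> list[dict]:
    """Pair each 'Running ...' header with the timing line that follows it."""
    runs = []
    pending = None
    for line in log.splitlines():
        header = HEADER.search(line)
        if header:
            pending = tuple(int(x) for x in header.groups())
            continue
        timing = TIMING.search(line)
        if timing and pending is not None:
            p0, p1, cpus = pending
            forw, back = float(timing.group(1)), float(timing.group(2))
            runs.append({"grid": (p0, p1), "cpus": cpus, "round_trip": forw + back})
            pending = None
    return runs


def best_grid_per_cpu_count(runs: list[dict]) -> dict:
    """For each CPU count, keep the run with the smallest forward+backward time."""
    best = {}
    for run in runs:
        current = best.get(run["cpus"])
        if current is None or run["round_trip"] < current["round_trip"]:
            best[run["cpus"]] = run
    return best


if __name__ == "__main__":
    for cpus, run in sorted(best_grid_per_cpu_count(parse_runs(PFFT_LOG)).items()):
        print(cpus, run["grid"], f"{run['round_trip']:.3f}")
```

On these entries the grid with 32 as the second dimension wins at both CPU counts, which matches the observation above, though the margins are modest and come from single runs.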

npadmana commented 4 years ago

And here are the Chapel timings on the bw44 partition on swan:

numLocales= 2 Ng=1024 diff=2.99e-16 planTime=     3.7 runTime(F)=     1.4 runTime(B)=     1.4
numLocales= 4 Ng=1024 diff=2.99e-16 planTime=     2.8 runTime(F)=    0.93 runTime(B)=    0.93
numLocales= 8 Ng=1024 diff=1.11e-16 planTime=     2.1 runTime(F)=    0.59 runTime(B)=    0.58
numLocales=16 Ng=1024 diff=2.78e-17 planTime=     1.6 runTime(F)=    0.33 runTime(B)=    0.33
numLocales=32 Ng=1024 diff=1.09e-16 planTime=     1.3 runTime(F)=    0.19 runTime(B)=    0.19
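As a quick sanity check on the strong scaling, the sketch below turns the Chapel numbers above into speedup and parallel efficiency relative to the 2-locale run. The timings are copied from the output as printed (two significant figures), so the derived values are approximate; the function and variable names are illustrative only.

```python
# (numLocales, runTime(F), runTime(B)) copied from the Chapel output above.
CHAPEL_RUNS = [
    (2, 1.4, 1.4),
    (4, 0.93, 0.93),
    (8, 0.59, 0.58),
    (16, 0.33, 0.33),
    (32, 0.19, 0.19),
]


def strong_scaling(runs):
    """Return (locales, round_trip, speedup, efficiency) relative to the first run.

    speedup    = reference round-trip time / this round-trip time
    efficiency = speedup / (locales / reference locales); 1.0 is ideal scaling
    """
    ref_locales, ref_forw, ref_back = runs[0]
    ref_time = ref_forw + ref_back
    rows = []
    for locales, forw, back in runs:
        round_trip = forw + back
        speedup = ref_time / round_trip
        ideal = locales / ref_locales
        rows.append((locales, round_trip, speedup, speedup / ideal))
    return rows


if __name__ == "__main__":
    print("locales  round_trip  speedup  efficiency")
    for locales, round_trip, speedup, eff in strong_scaling(CHAPEL_RUNS):
        print(f"{locales:7d}  {round_trip:10.2f}  {speedup:7.2f}  {eff:10.2f}")
```

Going from 2 to 32 locales this gives a speedup of roughly 7.4 against an ideal of 16. These numbers are from the bw44 partition, so they are not directly comparable to the PFFT runs on sk40.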

npadmana / DistributedFFT

Compare FFT timing runs #70