Closed by rodarima 4 years ago
Hypothesis:
(gdb) ea
Thr # Function Source
1 1 _nanos6_loader_main() loader/main-wrapper.c:150
28 6 sim_step() src/sim.c:492
29 3 MFT_solve() src/solver.c:480
The fft_alloc_* functions have no effect.
pthread_cond_wait, so they are sleeping.
Correlation found, thanks to Kevin: disabling OmpSs-2 by removing the -fompss-2 flag produces results similar to the test.
Further testing reveals that as soon as tasks are enabled with OmpSs-2, either with mcc --ompss-2 or with clang -fompss-2, the FFT doesn't scale.
Testing on MN4 with Intel MPI works fine with mcc. I couldn't test with openmpi+mcc as it is missing, and I don't have all the build tools required to build it myself. In any case, the problem doesn't seem to be related only to the clang compiler.
Affinity is messing up the computation time:
xeon07% mpirun -n 4 ./bad
sched_getaffinity = 10000000000000000000000000000000000000000000000000000000
sched_getaffinity = 00000000000000100000000000000000000000000000000000000000
sched_getaffinity = 00000000000001000000000000000000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000000000000010
np=4 n=4096 mean=0.112391 std=0.00107304 sem=0.000195909 csr=1FA0
mpirun -n 4 ./bad 18.94s user 1.74s system 395% cpu 5.230 total
xeon07% mpirun -n 4 ./bad
sched_getaffinity = 00000000000000000000000000000000000000000100000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000000000000001
sched_getaffinity = 10000000000000000000000000000000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000000000000001
np=4 n=4096 mean=0.239254 std=0.0090374 sem=0.00165 csr=1FA0
mpirun -n 4 ./bad 34.15s user 1.45s system 300% cpu 11.858 total
Running with mpirun --bind-to core -n 8 ./fft fixes the affinity problem, and results in similar computation times when running with and without -fompss-2.
Without tasks:
nproc=1 N=4096 runs=10 mean=0.177037 std=0.0205196
nproc=2 N=4096 runs=10 mean=0.230461 std=0.0128447
nproc=4 N=4096 runs=10 mean=0.135661 std=0.0108876
nproc=8 N=4096 runs=10 mean=0.077069 std=0.00609838
nproc=16 N=4096 runs=10 mean=0.0442516 std=0.00145305
With tasks:
nproc=1 N=4096 runs=10 mean=0.177632 std=0.0206584
nproc=2 N=4096 runs=10 mean=0.230245 std=0.00825839
nproc=4 N=4096 runs=10 mean=0.137096 std=0.010929
nproc=8 N=4096 runs=10 mean=0.0700888 std=0.000814367
nproc=16 N=4096 runs=10 mean=0.0431439 std=0.000877571
Inconsistent scaling persists with 8 and 16 processes. The MXCSR value differs between processes; for example, with 2 processes: MXCSR = 1F80 and MXCSR = 1120.
First results with 2 to 16 processes and 8192x8192 points
xeon07% mpirun --bind-to core -n 2 ./cpic conf/simd.conf
Using MPI with 2 processors
No output path specified, output will not be saved
sched_getaffinity = 00000000000000000000000000000100000000000000000000000000
sched_getaffinity = 00000000000000000000000000001000000000000000000000000000
begin sim_pre_step
np=2 ny=8192 csr=1FA0 fft1=8.604317e-01 fft2=8.260788e-01 total=1.871545e+00
end sim_pre_step
2.682995e+01 init-time
Simulation runs now
np=2 ny=8192 csr=1FA0 fft1=8.640430e-01 fft2=8.292589e-01 total=1.878219e+00
np=2 ny=8192 csr=1FA0 fft1=8.672197e-01 fft2=8.400842e-01 total=1.893966e+00
np=2 ny=8192 csr=1FA0 fft1=8.669346e-01 fft2=8.374721e-01 total=1.891243e+00
np=2 ny=8192 csr=1FA0 fft1=8.614438e-01 fft2=8.272335e-01 total=1.873965e+00
np=2 ny=8192 csr=1FA0 fft1=8.671607e-01 fft2=8.368296e-01 total=1.890961e+00
np=2 ny=8192 csr=1FA0 fft1=8.641553e-01 fft2=8.306946e-01 total=1.879722e+00
np=2 ny=8192 csr=1FA0 fft1=8.671189e-01 fft2=8.367603e-01 total=1.890160e+00
np=2 ny=8192 csr=1FA0 fft1=8.628957e-01 fft2=8.258258e-01 total=1.873778e+00
np=2 ny=8192 csr=1FA0 fft1=8.623579e-01 fft2=8.264706e-01 total=1.873419e+00
np=2 ny=8192 csr=1FA0 fft1=8.678141e-01 fft2=8.372426e-01 total=1.891370e+00
Simulation ends
mpirun --bind-to core -n 2 ./cpic conf/simd.conf 102.85s user 6.89s system 200% cpu 54.795 total
xeon07% mpirun --bind-to core -n 4 ./cpic conf/simd.conf
Using MPI with 4 processors
No output path specified, output will not be saved
sched_getaffinity = 00000000000000000000000000000000000000000001000000000000
sched_getaffinity = 00000000000000000000000000000000000000000010000000000000
sched_getaffinity = 00000000000000000000000000000100000000000000000000000000
sched_getaffinity = 00000000000000000000000000001000000000000000000000000000
begin sim_pre_step
np=4 ny=8192 csr=1FA0 fft1=4.968215e-01 fft2=4.366530e-01 total=1.025943e+00
end sim_pre_step
1.305764e+01 init-time
Simulation runs now
np=4 ny=8192 csr=1FA0 fft1=4.888453e-01 fft2=4.234810e-01 total=1.005508e+00
np=4 ny=8192 csr=1FA0 fft1=4.778294e-01 fft2=4.272914e-01 total=9.985807e-01
np=4 ny=8192 csr=1FA0 fft1=4.761114e-01 fft2=4.492657e-01 total=1.017983e+00
np=4 ny=8192 csr=1FA0 fft1=4.790637e-01 fft2=4.263159e-01 total=9.990394e-01
np=4 ny=8192 csr=1FA0 fft1=4.762635e-01 fft2=4.232742e-01 total=9.916762e-01
np=4 ny=8192 csr=1FA0 fft1=4.783264e-01 fft2=4.262032e-01 total=9.997419e-01
np=4 ny=8192 csr=1FA0 fft1=4.766619e-01 fft2=4.240529e-01 total=9.929669e-01
np=4 ny=8192 csr=1FA0 fft1=4.761279e-01 fft2=4.242295e-01 total=9.937934e-01
np=4 ny=8192 csr=1FA0 fft1=4.756417e-01 fft2=4.246880e-01 total=9.925117e-01
np=4 ny=8192 csr=1FA0 fft1=4.787048e-01 fft2=4.260658e-01 total=9.984763e-01
Simulation ends
mpirun --bind-to core -n 4 ./cpic conf/simd.conf 104.00s user 6.91s system 399% cpu 27.768 total
xeon07% mpirun --bind-to core -n 8 ./cpic conf/simd.conf
Using MPI with 8 processors
No output path specified, output will not be saved
sched_getaffinity = 00000000000000010000000000000000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000001000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000010000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000010000000000
sched_getaffinity = 00000000000000000000000000000100000000000000000000000000
sched_getaffinity = 00000000000000000000000000000010000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000100000000000
sched_getaffinity = 00000000000000000000000000001000000000000000000000000000
begin sim_pre_step
np=8 ny=8192 csr=1FA0 fft1=2.879827e-01 fft2=2.553473e-01 total=5.943086e-01
end sim_pre_step
7.186683e+00 init-time
Simulation runs now
np=8 ny=8192 csr=1FA0 fft1=2.859903e-01 fft2=2.556312e-01 total=6.012210e-01
np=8 ny=8192 csr=1FA0 fft1=2.851353e-01 fft2=2.544299e-01 total=5.975852e-01
np=8 ny=8192 csr=1FA0 fft1=2.854144e-01 fft2=2.559104e-01 total=6.019039e-01
np=8 ny=8192 csr=1FA0 fft1=2.916227e-01 fft2=2.611693e-01 total=6.062794e-01
np=8 ny=8192 csr=1FA0 fft1=2.895605e-01 fft2=2.614378e-01 total=6.013092e-01
np=8 ny=8192 csr=1FA0 fft1=2.887576e-01 fft2=2.613469e-01 total=6.011911e-01
np=8 ny=8192 csr=1FA0 fft1=2.901283e-01 fft2=2.623049e-01 total=6.029805e-01
np=8 ny=8192 csr=1FA0 fft1=2.896093e-01 fft2=2.622759e-01 total=6.026710e-01
np=8 ny=8192 csr=1FA0 fft1=2.923282e-01 fft2=2.630958e-01 total=6.064110e-01
np=8 ny=8192 csr=1FA0 fft1=2.900353e-01 fft2=2.628494e-01 total=6.043966e-01
Simulation ends
mpirun --bind-to core -n 8 ./cpic conf/simd.conf 119.37s user 7.39s system 795% cpu 15.939 total
xeon07% mpirun --bind-to core -n 16 ./cpic conf/simd.conf
Using MPI with 16 processors
No output path specified, output will not be saved
sched_getaffinity = 00000100000000000000000000000000000000000000000000000000
sched_getaffinity = 00000010000000000000000000000000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000001000000000000000000000000
sched_getaffinity = 00000001000000000000000000000000000000000000000000000000
sched_getaffinity = 00001000000000000000000000000000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000000010000000
sched_getaffinity = 00000000000000000000000000001000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000000001000000
sched_getaffinity = 00000000000000000000000000000000000000000001000000000000
sched_getaffinity = 00000000000000000000000000000100000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000010000000000
sched_getaffinity = 00000000000000000000000000000010000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000000100000000
sched_getaffinity = 00000000000000000000000000000000000000000000001000000000
sched_getaffinity = 00000000000000000000000000000000000000000010000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000100000000000
begin sim_pre_step
np=16 ny=8192 csr=1FA0 fft1=1.903229e-01 fft2=3.527895e-01 total=5.723643e-01
end sim_pre_step
5.714358e+00 init-time
Simulation runs now
np=16 ny=8192 csr=1FA0 fft1=1.879377e-01 fft2=3.536457e-01 total=5.720859e-01
np=16 ny=8192 csr=1FA0 fft1=1.918415e-01 fft2=3.497640e-01 total=5.699190e-01
np=16 ny=8192 csr=1FA0 fft1=1.905309e-01 fft2=3.525956e-01 total=5.712501e-01
np=16 ny=8192 csr=1FA0 fft1=1.912817e-01 fft2=3.489942e-01 total=5.702936e-01
np=16 ny=8192 csr=1FA0 fft1=1.910022e-01 fft2=3.506647e-01 total=5.722780e-01
np=16 ny=8192 csr=1FA0 fft1=1.873093e-01 fft2=3.507568e-01 total=5.670942e-01
np=16 ny=8192 csr=1FA0 fft1=1.917762e-01 fft2=3.498396e-01 total=5.704971e-01
np=16 ny=8192 csr=1FA0 fft1=1.915747e-01 fft2=3.527666e-01 total=5.745011e-01
np=16 ny=8192 csr=1FA0 fft1=1.882995e-01 fft2=3.483427e-01 total=5.667194e-01
np=16 ny=8192 csr=1FA0 fft1=1.880960e-01 fft2=3.530721e-01 total=5.712555e-01
Simulation ends
mpirun --bind-to core -n 16 ./cpic conf/simd.conf 194.86s user 9.76s system 1583% cpu 12.920 total
xeon07%
Same for 4096x4096: the times are very similar to those in our example test.
xeon07% mpirun --bind-to core -n 2 ./cpic conf/simd.conf
Using MPI with 2 processors
No output path specified, output will not be saved
sched_getaffinity = 00000000000000000000000000000100000000000000000000000000
sched_getaffinity = 00000000000000000000000000001000000000000000000000000000
begin sim_pre_step
np=2 ny=4096 csr=1FA0 fft1=2.094326e-01 fft2=1.940707e-01 total=4.498289e-01
end sim_pre_step
7.157842e+00 init-time
Simulation runs now
np=2 ny=4096 csr=1FA0 fft1=2.110798e-01 fft2=1.950352e-01 total=4.531002e-01
np=2 ny=4096 csr=1FA0 fft1=2.092094e-01 fft2=1.935608e-01 total=4.488594e-01
np=2 ny=4096 csr=1FA0 fft1=2.090223e-01 fft2=1.942547e-01 total=4.493191e-01
np=2 ny=4096 csr=1FA0 fft1=2.093447e-01 fft2=1.939311e-01 total=4.492227e-01
np=2 ny=4096 csr=1FA0 fft1=2.094303e-01 fft2=1.941198e-01 total=4.495360e-01
np=2 ny=4096 csr=1FA0 fft1=2.094308e-01 fft2=1.942945e-01 total=4.501178e-01
np=2 ny=4096 csr=1FA0 fft1=2.096134e-01 fft2=1.939866e-01 total=4.495821e-01
np=2 ny=4096 csr=1FA0 fft1=2.113290e-01 fft2=1.949964e-01 total=4.532499e-01
np=2 ny=4096 csr=1FA0 fft1=2.109454e-01 fft2=1.949930e-01 total=4.525338e-01
np=2 ny=4096 csr=1FA0 fft1=2.110181e-01 fft2=1.953466e-01 total=4.529222e-01
Simulation ends
mpirun --bind-to core -n 2 ./cpic conf/simd.conf 26.24s user 1.87s system 199% cpu 14.101 total
xeon07% mpirun --bind-to core -n 4 ./cpic conf/simd.conf
Using MPI with 4 processors
No output path specified, output will not be saved
sched_getaffinity = 00000000000000010000000000000000000000000000000000000000
sched_getaffinity = 01000000000000000000000000000000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000010000000000000
sched_getaffinity = 00000000000000000000000000001000000000000000000000000000
begin sim_pre_step
np=4 ny=4096 csr=1FA0 fft1=1.067506e-01 fft2=1.023684e-01 total=2.322819e-01
end sim_pre_step
3.720917e+00 init-time
Simulation runs now
np=4 ny=4096 csr=1FA0 fft1=1.067433e-01 fft2=1.023943e-01 total=2.331056e-01
np=4 ny=4096 csr=1FA0 fft1=1.054682e-01 fft2=1.025266e-01 total=2.312117e-01
np=4 ny=4096 csr=1FA0 fft1=1.038099e-01 fft2=1.022114e-01 total=2.291973e-01
np=4 ny=4096 csr=1FA0 fft1=1.044789e-01 fft2=1.026022e-01 total=2.300590e-01
np=4 ny=4096 csr=1FA0 fft1=1.038262e-01 fft2=1.025734e-01 total=2.292794e-01
np=4 ny=4096 csr=1FA0 fft1=1.057274e-01 fft2=1.013818e-01 total=2.310127e-01
np=4 ny=4096 csr=1FA0 fft1=1.065817e-01 fft2=1.036623e-01 total=2.333994e-01
np=4 ny=4096 csr=1FA0 fft1=1.046149e-01 fft2=1.034002e-01 total=2.309800e-01
np=4 ny=4096 csr=1FA0 fft1=1.042921e-01 fft2=1.021169e-01 total=2.294701e-01
np=4 ny=4096 csr=1FA0 fft1=1.039226e-01 fft2=1.026842e-01 total=2.295370e-01
Simulation ends
mpirun --bind-to core -n 4 ./cpic conf/simd.conf 27.80s user 1.33s system 394% cpu 7.378 total
xeon07% mpirun --bind-to core -n 8 ./cpic conf/simd.conf
Using MPI with 8 processors
No output path specified, output will not be saved
sched_getaffinity = 00000000000000000000000000000000000000000001000000000000
sched_getaffinity = 00000000000000000100000000000000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000010000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000100000000000
sched_getaffinity = 00000000000000000000000000000100000000000000000000000000
sched_getaffinity = 00000000000000000000000000000001000000000000000000000000
sched_getaffinity = 00000000000000000000000000001000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000010000000000000000000000000
begin sim_pre_step
np=8 ny=4096 csr=1FA0 fft1=6.362851e-02 fft2=5.908889e-02 total=1.355695e-01
end sim_pre_step
2.483188e+00 init-time
Simulation runs now
np=8 ny=4096 csr=1FA0 fft1=6.346041e-02 fft2=5.876337e-02 total=1.347894e-01
np=8 ny=4096 csr=1FA0 fft1=6.184621e-02 fft2=5.904841e-02 total=1.335275e-01
np=8 ny=4096 csr=1FA0 fft1=6.265542e-02 fft2=5.890199e-02 total=1.341035e-01
np=8 ny=4096 csr=1FA0 fft1=6.186986e-02 fft2=5.902857e-02 total=1.335352e-01
np=8 ny=4096 csr=1FA0 fft1=6.166394e-02 fft2=5.767361e-02 total=1.318993e-01
np=8 ny=4096 csr=1FA0 fft1=6.186392e-02 fft2=5.891801e-02 total=1.333808e-01
np=8 ny=4096 csr=1FA0 fft1=6.235275e-02 fft2=5.970215e-02 total=1.347301e-01
np=8 ny=4096 csr=1FA0 fft1=6.189505e-02 fft2=5.983124e-02 total=1.343845e-01
np=8 ny=4096 csr=1FA0 fft1=6.190101e-02 fft2=5.926370e-02 total=1.338398e-01
np=8 ny=4096 csr=1FA0 fft1=6.180064e-02 fft2=5.909344e-02 total=1.335539e-01
Simulation ends
mpirun --bind-to core -n 8 ./cpic conf/simd.conf 35.03s user 1.79s system 778% cpu 4.732 total
xeon07% mpirun --bind-to core -n 16 ./cpic conf/simd.conf
Using MPI with 16 processors
No output path specified, output will not be saved
sched_getaffinity = 00000000000000000000000000000100000000000000000000000000
sched_getaffinity = 00000000000000001000000000000000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000001000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000001000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000000100000000
sched_getaffinity = 00000000000000000000000000001000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000000010000000
sched_getaffinity = 00000000000000000100000000000000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000000001000000
sched_getaffinity = 00001000000000000000000000000000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000001000000000
sched_getaffinity = 00000000000000000000000000000010000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000010000000000000000000000
sched_getaffinity = 00000000000000000000000000000000001000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000100000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000010000000000000
begin sim_pre_step
np=16 ny=4096 csr=1FA0 fft1=4.061115e-02 fft2=3.844893e-02 total=8.648549e-02
end sim_pre_step
2.069708e+00 init-time
Simulation runs now
np=16 ny=4096 csr=1FA0 fft1=4.054738e-02 fft2=3.868207e-02 total=8.682283e-02
np=16 ny=4096 csr=1FA0 fft1=4.016260e-02 fft2=3.850049e-02 total=8.596047e-02
np=16 ny=4096 csr=1FA0 fft1=4.024059e-02 fft2=3.838167e-02 total=8.585370e-02
np=16 ny=4096 csr=1FA0 fft1=4.008793e-02 fft2=3.837994e-02 total=8.564743e-02
np=16 ny=4096 csr=1FA0 fft1=4.018023e-02 fft2=3.823824e-02 total=8.559563e-02
np=16 ny=4096 csr=1FA0 fft1=4.042583e-02 fft2=3.815106e-02 total=8.584279e-02
np=16 ny=4096 csr=1FA0 fft1=4.043940e-02 fft2=3.839743e-02 total=8.613534e-02
np=16 ny=4096 csr=1FA0 fft1=4.032684e-02 fft2=3.829999e-02 total=8.587248e-02
np=16 ny=4096 csr=1FA0 fft1=4.038058e-02 fft2=3.856666e-02 total=8.619559e-02
np=16 ny=4096 csr=1FA0 fft1=4.037662e-02 fft2=3.869482e-02 total=8.648246e-02
Simulation ends
mpirun --bind-to core -n 16 ./cpic conf/simd.conf 50.48s user 4.52s system 1530% cpu 3.594 total
Changes to the MXCSR register made with the feenable family are only set in the main thread. No significant difference is seen in the computation time for the forward FFT with 4096x4096 points, but the result is not clear for the reverse FFT. The enabled flags are:
feenableexcept(
    FE_INVALID |
    FE_DIVBYZERO |
    FE_OVERFLOW |
    FE_UNDERFLOW);
Without exceptions:
np=16 ny=4096 csr0=1FA0 fft1=4.146671e-02 fft2=3.778906e-02 total=8.734963e-02
np=16 ny=4096 csr0=1FA0 fft1=4.126434e-02 fft2=3.781021e-02 total=8.664617e-02
np=16 ny=4096 csr0=1FA0 fft1=4.146285e-02 fft2=3.784594e-02 total=8.671834e-02
np=16 ny=4096 csr0=1FA0 fft1=4.121179e-02 fft2=3.691792e-02 total=8.561996e-02
np=16 ny=4096 csr0=1FA0 fft1=4.083115e-02 fft2=3.747544e-02 total=8.599655e-02
np=16 ny=4096 csr0=1FA0 fft1=4.131030e-02 fft2=3.761803e-02 total=8.674066e-02
np=16 ny=4096 csr0=1FA0 fft1=4.163414e-02 fft2=3.773465e-02 total=8.699309e-02
np=16 ny=4096 csr0=1FA0 fft1=4.097892e-02 fft2=3.723917e-02 total=8.585914e-02
np=16 ny=4096 csr0=1FA0 fft1=4.153574e-02 fft2=3.750067e-02 total=8.673299e-02
np=16 ny=4096 csr0=1FA0 fft1=4.121890e-02 fft2=3.752832e-02 total=8.643588e-02
With exceptions enabled in some threads:
np=16 ny=4096 csr0=1120 fft1=4.212377e-02 fft2=5.838111e-02 total=1.083331e-01
np=16 ny=4096 csr0=1FA0 fft1=4.193584e-02 fft2=5.811156e-02 total=1.072809e-01
np=16 ny=4096 csr0=1FA0 fft1=4.214818e-02 fft2=5.877057e-02 total=1.080332e-01
np=16 ny=4096 csr0=1FA0 fft1=4.194330e-02 fft2=5.846063e-02 total=1.076383e-01
np=16 ny=4096 csr0=1120 fft1=4.212949e-02 fft2=5.843620e-02 total=1.076885e-01
np=16 ny=4096 csr0=1FA0 fft1=4.199488e-02 fft2=5.845302e-02 total=1.075642e-01
np=16 ny=4096 csr0=1120 fft1=4.192160e-02 fft2=5.841398e-02 total=1.074510e-01
np=16 ny=4096 csr0=1FA0 fft1=4.201088e-02 fft2=5.842290e-02 total=1.075703e-01
np=16 ny=4096 csr0=1FA0 fft1=4.211500e-02 fft2=5.817451e-02 total=1.073887e-01
np=16 ny=4096 csr0=1FA0 fft1=4.180829e-02 fft2=5.890636e-02 total=1.078599e-01
Another run:
np=16 ny=4096 csr0=1FA0 fft1=4.000103e-02 fft2=3.884999e-02 total=8.603951e-02
np=16 ny=4096 csr0=1120 fft1=4.000985e-02 fft2=3.828755e-02 total=8.554118e-02
np=16 ny=4096 csr0=1FA0 fft1=4.002022e-02 fft2=3.831978e-02 total=8.548526e-02
np=16 ny=4096 csr0=1120 fft1=4.005703e-02 fft2=3.841551e-02 total=8.565850e-02
np=16 ny=4096 csr0=1120 fft1=3.994538e-02 fft2=3.839478e-02 total=8.552366e-02
np=16 ny=4096 csr0=1FA0 fft1=4.000821e-02 fft2=3.823638e-02 total=8.537884e-02
np=16 ny=4096 csr0=1120 fft1=3.976115e-02 fft2=3.889000e-02 total=8.576684e-02
np=16 ny=4096 csr0=1120 fft1=3.994265e-02 fft2=3.827638e-02 total=8.534332e-02
np=16 ny=4096 csr0=1120 fft1=4.014546e-02 fft2=3.845588e-02 total=8.580696e-02
np=16 ny=4096 csr0=1FA0 fft1=3.987036e-02 fft2=3.871251e-02 total=8.585445e-02
With exceptions enabled in all processes:
np=16 ny=4096 csr0=1120 fft1=4.131929e-02 fft2=3.748747e-02 total=8.680457e-02
np=16 ny=4096 csr0=1120 fft1=4.117152e-02 fft2=3.721671e-02 total=8.545117e-02
np=16 ny=4096 csr0=1120 fft1=4.146759e-02 fft2=3.723733e-02 total=8.581670e-02
np=16 ny=4096 csr0=1120 fft1=4.116673e-02 fft2=3.797847e-02 total=8.636267e-02
np=16 ny=4096 csr0=1120 fft1=4.133027e-02 fft2=3.742653e-02 total=8.581166e-02
np=16 ny=4096 csr0=1120 fft1=4.135661e-02 fft2=3.713870e-02 total=8.580216e-02
np=16 ny=4096 csr0=1120 fft1=4.079427e-02 fft2=3.713612e-02 total=8.499984e-02
np=16 ny=4096 csr0=1120 fft1=4.102695e-02 fft2=3.716617e-02 total=8.545520e-02
np=16 ny=4096 csr0=1120 fft1=4.100230e-02 fft2=3.788876e-02 total=8.601275e-02
np=16 ny=4096 csr0=1120 fft1=4.094814e-02 fft2=3.734853e-02 total=8.544561e-02
When testing FFTW in a dummy experiment, the scaling is not too bad with 4096x4096 points.
But in the simulation, as the number of processes increases, the FFT execution time doesn't decrease proportionally.