rodarima / cpic

Particle in Cell simulation of plasma in C
GNU General Public License v3.0

Inconsistent scaling of FFTW #16

Closed rodarima closed 4 years ago

rodarima commented 4 years ago

When testing FFTW in a dummy experiment, the scaling is not too bad with 4096x4096 points:

np=1 n=4096 mean=0.148971 std=0.00015471 sem=6.91884e-05
np=2 n=4096 mean=0.205583 std=0.000514282 sem=0.000229994
np=4 n=4096 mean=0.104661 std=0.00162727 sem=0.000727735
np=8 n=4096 mean=0.064036 std=0.00103701 sem=0.000463764
np=16 n=4096 mean=0.0428545 std=0.000838381 sem=0.000374935
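
To put numbers on "scaling", the means above can be turned into speedup and parallel efficiency relative to the np=1 run. This is only a minimal sketch: the function names are made up for illustration and are not part of cpic.

```c
#include <stdio.h>

/* Speedup of a run relative to the single-process time t1. */
double speedup(double t1, double tp)
{
    return t1 / tp;
}

/* Parallel efficiency: speedup divided by the number of processes. */
double efficiency(double t1, double tp, int np)
{
    return speedup(t1, tp) / np;
}

/* Print one row per measurement. np[] and mean[] hold n entries and
 * mean[0] is taken as the single-process reference. */
void print_scaling(const int *np, const double *mean, int n)
{
    for (int i = 0; i < n; i++)
        printf("np=%d speedup=%.2f efficiency=%.2f\n", np[i],
               speedup(mean[0], mean[i]),
               efficiency(mean[0], mean[i], np[i]));
}
```

With the means listed above, np=16 gives a speedup of roughly 3.5 (efficiency around 0.22), while np=2 is slower than np=1.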

But in the simulation, the FFT execution time doesn't decrease proportionally as the number of processes increases.

All times are in seconds and cover only the first iteration. fftf = forward FFT, fftr = reverse FFT.
np=1 Solver fftf/fftr/comp/total: 1.490792e-01 / 1.433561e-01 / 7.316658e-01 / 1.303466e+01
np=2 Solver fftf/fftr/comp/total: 2.164974e-01 / 2.000124e-01 / 6.416426e-01 / 7.061025e+00
np=4 Solver fftf/fftr/comp/total: 1.755040e-01 / 1.683762e-01 / 5.648114e-01 / 9.163356e+00
np=8 Solver fftf/fftr/comp/total: 1.804677e-01 / 1.837146e-01 / 5.759790e-01 / 1.688759e+01
np=16 Solver fftf/fftr/comp/total: 2.637597e-01 / 2.404443e-01 / 6.811108e-01 / 3.733283e+01
rodarima commented 4 years ago

Hypothesis:

Correlation found, thanks to Kevin: disabling OmpSs-2 by removing the -fompss-2 flag produces results similar to the test.

rodarima commented 4 years ago

Further testing reveals that as soon as tasks are enabled with OmpSs-2, either with mcc --ompss-2 or with clang -fompss-2, the FFT doesn't scale.

Testing on MN4 with Intel MPI works fine with mcc. I couldn't test with OpenMPI + mcc because it is missing, and I don't have all the build tools required to build it myself. In any case, the problem doesn't seem to be related only to the clang compiler.

rodarima commented 4 years ago

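Affinity is distorting the computation time. In the second run below, two processes report the same affinity mask and the mean time roughly doubles: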

xeon07% mpirun -n 4 ./bad
sched_getaffinity = 10000000000000000000000000000000000000000000000000000000
sched_getaffinity = 00000000000000100000000000000000000000000000000000000000
sched_getaffinity = 00000000000001000000000000000000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000000000000010
np=4 n=4096 mean=0.112391 std=0.00107304 sem=0.000195909 csr=1FA0
mpirun -n 4 ./bad  18.94s user 1.74s system 395% cpu 5.230 total
xeon07% mpirun -n 4 ./bad
sched_getaffinity = 00000000000000000000000000000000000000000100000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000000000000001
sched_getaffinity = 10000000000000000000000000000000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000000000000001
np=4 n=4096 mean=0.239254 std=0.0090374 sem=0.00165 csr=1FA0
mpirun -n 4 ./bad  34.15s user 1.45s system 300% cpu 11.858 total
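
The second run above shows two processes with the same mask, which means they share one CPU. A small helper can flag that condition from the printed masks. This is a sketch only: `count_shared_cpus` is a hypothetical name, and it assumes each mask is a string with one '0' or '1' character per CPU, as in the log.

```c
#include <stddef.h>
#include <string.h>

/* Count CPUs that appear in more than one affinity mask.
 * Each mask is a string of '0'/'1' characters, one per CPU;
 * masks may have different lengths. */
int count_shared_cpus(const char *const *masks, size_t nmasks)
{
    size_t maxlen = 0;
    for (size_t i = 0; i < nmasks; i++) {
        size_t len = strlen(masks[i]);
        if (len > maxlen)
            maxlen = len;
    }

    int shared = 0;
    for (size_t cpu = 0; cpu < maxlen; cpu++) {
        int owners = 0;
        for (size_t i = 0; i < nmasks; i++)
            if (cpu < strlen(masks[i]) && masks[i][cpu] == '1')
                owners++;
        if (owners > 1)
            shared++;
    }
    return shared;
}
```

Applied to the four masks of the first run it returns 0; for the second run it returns 1, because two processes are pinned to the same CPU.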
rodarima commented 4 years ago

Running with mpirun --bind-to core -n 8 ./fft fixes the affinity problem and gives similar computation times with and without -fompss-2.

Without tasks:
nproc=1 N=4096 runs=10 mean=0.177037 std=0.0205196
nproc=2 N=4096 runs=10 mean=0.230461 std=0.0128447
nproc=4 N=4096 runs=10 mean=0.135661 std=0.0108876
nproc=8 N=4096 runs=10 mean=0.077069 std=0.00609838
nproc=16 N=4096 runs=10 mean=0.0442516 std=0.00145305

With tasks:
nproc=1 N=4096 runs=10 mean=0.177632 std=0.0206584
nproc=2 N=4096 runs=10 mean=0.230245 std=0.00825839
nproc=4 N=4096 runs=10 mean=0.137096 std=0.010929
nproc=8 N=4096 runs=10 mean=0.0700888 std=0.000814367
nproc=16 N=4096 runs=10 mean=0.0431439 std=0.000877571
rodarima commented 4 years ago

Inconsistent scaling persists with 8 and 16 processes. The MXCSR value is not the same everywhere; example with 2 processes: MXCSR = 1F80, MXCSR = 1120
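
For reference, the MXCSR values can be decoded by hand: bits 0-5 are the sticky exception flags and bits 7-12 are the matching mask bits (1 = exception masked), following the layout in the Intel manuals. The sketch below uses made-up helper names and only does bit arithmetic, so it does not touch the real register.

```c
#include <stdio.h>

#define MXCSR_FLAG_BITS  0x3Fu  /* invalid, denormal, divbyzero, overflow, underflow, inexact */
#define MXCSR_MASK_SHIFT 7      /* mask bits sit 7 positions above the flag bits */

/* Exceptions whose mask bit is clear, i.e. that would trap. */
unsigned mxcsr_unmasked(unsigned csr)
{
    return ~(csr >> MXCSR_MASK_SHIFT) & MXCSR_FLAG_BITS;
}

/* Sticky flags of exceptions that have already been raised. */
unsigned mxcsr_raised(unsigned csr)
{
    return csr & MXCSR_FLAG_BITS;
}

/* Print a human-readable summary of one MXCSR value. */
void mxcsr_describe(unsigned csr)
{
    static const char *names[] = {
        "invalid", "denormal", "divbyzero", "overflow", "underflow", "inexact"
    };
    printf("MXCSR=%04X unmasked:", csr);
    for (int i = 0; i < 6; i++)
        if (mxcsr_unmasked(csr) & (1u << i))
            printf(" %s", names[i]);
    printf(" raised:");
    for (int i = 0; i < 6; i++)
        if (mxcsr_raised(csr) & (1u << i))
            printf(" %s", names[i]);
    printf("\n");
}
```

Read this way, 1F80 has every exception masked, while 1120 has invalid, divide-by-zero, overflow and underflow unmasked and the inexact flag already set.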

rodarima commented 4 years ago

First results with 2 to 16 processes and 8192x8192 points:

xeon07% mpirun --bind-to core -n 2 ./cpic conf/simd.conf
Using MPI with 2 processors
No output path specified, output will not be saved
sched_getaffinity = 00000000000000000000000000000100000000000000000000000000
sched_getaffinity = 00000000000000000000000000001000000000000000000000000000
begin sim_pre_step
np=2 ny=8192 csr=1FA0 fft1=8.604317e-01 fft2=8.260788e-01 total=1.871545e+00
end sim_pre_step
2.682995e+01 init-time
Simulation runs now
np=2 ny=8192 csr=1FA0 fft1=8.640430e-01 fft2=8.292589e-01 total=1.878219e+00
np=2 ny=8192 csr=1FA0 fft1=8.672197e-01 fft2=8.400842e-01 total=1.893966e+00
np=2 ny=8192 csr=1FA0 fft1=8.669346e-01 fft2=8.374721e-01 total=1.891243e+00
np=2 ny=8192 csr=1FA0 fft1=8.614438e-01 fft2=8.272335e-01 total=1.873965e+00
np=2 ny=8192 csr=1FA0 fft1=8.671607e-01 fft2=8.368296e-01 total=1.890961e+00
np=2 ny=8192 csr=1FA0 fft1=8.641553e-01 fft2=8.306946e-01 total=1.879722e+00
np=2 ny=8192 csr=1FA0 fft1=8.671189e-01 fft2=8.367603e-01 total=1.890160e+00
np=2 ny=8192 csr=1FA0 fft1=8.628957e-01 fft2=8.258258e-01 total=1.873778e+00
np=2 ny=8192 csr=1FA0 fft1=8.623579e-01 fft2=8.264706e-01 total=1.873419e+00
np=2 ny=8192 csr=1FA0 fft1=8.678141e-01 fft2=8.372426e-01 total=1.891370e+00
Simulation ends
mpirun --bind-to core -n 2 ./cpic conf/simd.conf  102.85s user 6.89s system 200% cpu 54.795 total
xeon07% mpirun --bind-to core -n 4 ./cpic conf/simd.conf
Using MPI with 4 processors
No output path specified, output will not be saved
sched_getaffinity = 00000000000000000000000000000000000000000001000000000000
sched_getaffinity = 00000000000000000000000000000000000000000010000000000000
sched_getaffinity = 00000000000000000000000000000100000000000000000000000000
sched_getaffinity = 00000000000000000000000000001000000000000000000000000000
begin sim_pre_step
np=4 ny=8192 csr=1FA0 fft1=4.968215e-01 fft2=4.366530e-01 total=1.025943e+00
end sim_pre_step
1.305764e+01 init-time
Simulation runs now
np=4 ny=8192 csr=1FA0 fft1=4.888453e-01 fft2=4.234810e-01 total=1.005508e+00
np=4 ny=8192 csr=1FA0 fft1=4.778294e-01 fft2=4.272914e-01 total=9.985807e-01
np=4 ny=8192 csr=1FA0 fft1=4.761114e-01 fft2=4.492657e-01 total=1.017983e+00
np=4 ny=8192 csr=1FA0 fft1=4.790637e-01 fft2=4.263159e-01 total=9.990394e-01
np=4 ny=8192 csr=1FA0 fft1=4.762635e-01 fft2=4.232742e-01 total=9.916762e-01
np=4 ny=8192 csr=1FA0 fft1=4.783264e-01 fft2=4.262032e-01 total=9.997419e-01
np=4 ny=8192 csr=1FA0 fft1=4.766619e-01 fft2=4.240529e-01 total=9.929669e-01
np=4 ny=8192 csr=1FA0 fft1=4.761279e-01 fft2=4.242295e-01 total=9.937934e-01
np=4 ny=8192 csr=1FA0 fft1=4.756417e-01 fft2=4.246880e-01 total=9.925117e-01
np=4 ny=8192 csr=1FA0 fft1=4.787048e-01 fft2=4.260658e-01 total=9.984763e-01
Simulation ends
mpirun --bind-to core -n 4 ./cpic conf/simd.conf  104.00s user 6.91s system 399% cpu 27.768 total
xeon07% mpirun --bind-to core -n 8 ./cpic conf/simd.conf
Using MPI with 8 processors
No output path specified, output will not be saved
sched_getaffinity = 00000000000000010000000000000000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000001000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000010000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000010000000000
sched_getaffinity = 00000000000000000000000000000100000000000000000000000000
sched_getaffinity = 00000000000000000000000000000010000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000100000000000
sched_getaffinity = 00000000000000000000000000001000000000000000000000000000
begin sim_pre_step
np=8 ny=8192 csr=1FA0 fft1=2.879827e-01 fft2=2.553473e-01 total=5.943086e-01
end sim_pre_step
7.186683e+00 init-time
Simulation runs now
np=8 ny=8192 csr=1FA0 fft1=2.859903e-01 fft2=2.556312e-01 total=6.012210e-01
np=8 ny=8192 csr=1FA0 fft1=2.851353e-01 fft2=2.544299e-01 total=5.975852e-01
np=8 ny=8192 csr=1FA0 fft1=2.854144e-01 fft2=2.559104e-01 total=6.019039e-01
np=8 ny=8192 csr=1FA0 fft1=2.916227e-01 fft2=2.611693e-01 total=6.062794e-01
np=8 ny=8192 csr=1FA0 fft1=2.895605e-01 fft2=2.614378e-01 total=6.013092e-01
np=8 ny=8192 csr=1FA0 fft1=2.887576e-01 fft2=2.613469e-01 total=6.011911e-01
np=8 ny=8192 csr=1FA0 fft1=2.901283e-01 fft2=2.623049e-01 total=6.029805e-01
np=8 ny=8192 csr=1FA0 fft1=2.896093e-01 fft2=2.622759e-01 total=6.026710e-01
np=8 ny=8192 csr=1FA0 fft1=2.923282e-01 fft2=2.630958e-01 total=6.064110e-01
np=8 ny=8192 csr=1FA0 fft1=2.900353e-01 fft2=2.628494e-01 total=6.043966e-01
Simulation ends
mpirun --bind-to core -n 8 ./cpic conf/simd.conf  119.37s user 7.39s system 795% cpu 15.939 total
xeon07% mpirun --bind-to core -n 16 ./cpic conf/simd.conf
Using MPI with 16 processors
No output path specified, output will not be saved
sched_getaffinity = 00000100000000000000000000000000000000000000000000000000
sched_getaffinity = 00000010000000000000000000000000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000001000000000000000000000000
sched_getaffinity = 00000001000000000000000000000000000000000000000000000000
sched_getaffinity = 00001000000000000000000000000000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000000010000000
sched_getaffinity = 00000000000000000000000000001000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000000001000000
sched_getaffinity = 00000000000000000000000000000000000000000001000000000000
sched_getaffinity = 00000000000000000000000000000100000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000010000000000
sched_getaffinity = 00000000000000000000000000000010000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000000100000000
sched_getaffinity = 00000000000000000000000000000000000000000000001000000000
sched_getaffinity = 00000000000000000000000000000000000000000010000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000100000000000
begin sim_pre_step
np=16 ny=8192 csr=1FA0 fft1=1.903229e-01 fft2=3.527895e-01 total=5.723643e-01
end sim_pre_step
5.714358e+00 init-time
Simulation runs now
np=16 ny=8192 csr=1FA0 fft1=1.879377e-01 fft2=3.536457e-01 total=5.720859e-01
np=16 ny=8192 csr=1FA0 fft1=1.918415e-01 fft2=3.497640e-01 total=5.699190e-01
np=16 ny=8192 csr=1FA0 fft1=1.905309e-01 fft2=3.525956e-01 total=5.712501e-01
np=16 ny=8192 csr=1FA0 fft1=1.912817e-01 fft2=3.489942e-01 total=5.702936e-01
np=16 ny=8192 csr=1FA0 fft1=1.910022e-01 fft2=3.506647e-01 total=5.722780e-01
np=16 ny=8192 csr=1FA0 fft1=1.873093e-01 fft2=3.507568e-01 total=5.670942e-01
np=16 ny=8192 csr=1FA0 fft1=1.917762e-01 fft2=3.498396e-01 total=5.704971e-01
np=16 ny=8192 csr=1FA0 fft1=1.915747e-01 fft2=3.527666e-01 total=5.745011e-01
np=16 ny=8192 csr=1FA0 fft1=1.882995e-01 fft2=3.483427e-01 total=5.667194e-01
np=16 ny=8192 csr=1FA0 fft1=1.880960e-01 fft2=3.530721e-01 total=5.712555e-01
Simulation ends
mpirun --bind-to core -n 16 ./cpic conf/simd.conf  194.86s user 9.76s system 1583% cpu 12.920 total
rodarima commented 4 years ago

Same for 4096x4096: the times are very similar to those in our example test.

xeon07% mpirun --bind-to core -n 2 ./cpic conf/simd.conf
Using MPI with 2 processors
No output path specified, output will not be saved
sched_getaffinity = 00000000000000000000000000000100000000000000000000000000
sched_getaffinity = 00000000000000000000000000001000000000000000000000000000
begin sim_pre_step
np=2 ny=4096 csr=1FA0 fft1=2.094326e-01 fft2=1.940707e-01 total=4.498289e-01
end sim_pre_step
7.157842e+00 init-time
Simulation runs now
np=2 ny=4096 csr=1FA0 fft1=2.110798e-01 fft2=1.950352e-01 total=4.531002e-01
np=2 ny=4096 csr=1FA0 fft1=2.092094e-01 fft2=1.935608e-01 total=4.488594e-01
np=2 ny=4096 csr=1FA0 fft1=2.090223e-01 fft2=1.942547e-01 total=4.493191e-01
np=2 ny=4096 csr=1FA0 fft1=2.093447e-01 fft2=1.939311e-01 total=4.492227e-01
np=2 ny=4096 csr=1FA0 fft1=2.094303e-01 fft2=1.941198e-01 total=4.495360e-01
np=2 ny=4096 csr=1FA0 fft1=2.094308e-01 fft2=1.942945e-01 total=4.501178e-01
np=2 ny=4096 csr=1FA0 fft1=2.096134e-01 fft2=1.939866e-01 total=4.495821e-01
np=2 ny=4096 csr=1FA0 fft1=2.113290e-01 fft2=1.949964e-01 total=4.532499e-01
np=2 ny=4096 csr=1FA0 fft1=2.109454e-01 fft2=1.949930e-01 total=4.525338e-01
np=2 ny=4096 csr=1FA0 fft1=2.110181e-01 fft2=1.953466e-01 total=4.529222e-01
Simulation ends
mpirun --bind-to core -n 2 ./cpic conf/simd.conf  26.24s user 1.87s system 199% cpu 14.101 total
xeon07% mpirun --bind-to core -n 4 ./cpic conf/simd.conf
Using MPI with 4 processors
No output path specified, output will not be saved
sched_getaffinity = 00000000000000010000000000000000000000000000000000000000
sched_getaffinity = 01000000000000000000000000000000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000010000000000000
sched_getaffinity = 00000000000000000000000000001000000000000000000000000000
begin sim_pre_step
np=4 ny=4096 csr=1FA0 fft1=1.067506e-01 fft2=1.023684e-01 total=2.322819e-01
end sim_pre_step
3.720917e+00 init-time
Simulation runs now
np=4 ny=4096 csr=1FA0 fft1=1.067433e-01 fft2=1.023943e-01 total=2.331056e-01
np=4 ny=4096 csr=1FA0 fft1=1.054682e-01 fft2=1.025266e-01 total=2.312117e-01
np=4 ny=4096 csr=1FA0 fft1=1.038099e-01 fft2=1.022114e-01 total=2.291973e-01
np=4 ny=4096 csr=1FA0 fft1=1.044789e-01 fft2=1.026022e-01 total=2.300590e-01
np=4 ny=4096 csr=1FA0 fft1=1.038262e-01 fft2=1.025734e-01 total=2.292794e-01
np=4 ny=4096 csr=1FA0 fft1=1.057274e-01 fft2=1.013818e-01 total=2.310127e-01
np=4 ny=4096 csr=1FA0 fft1=1.065817e-01 fft2=1.036623e-01 total=2.333994e-01
np=4 ny=4096 csr=1FA0 fft1=1.046149e-01 fft2=1.034002e-01 total=2.309800e-01
np=4 ny=4096 csr=1FA0 fft1=1.042921e-01 fft2=1.021169e-01 total=2.294701e-01
np=4 ny=4096 csr=1FA0 fft1=1.039226e-01 fft2=1.026842e-01 total=2.295370e-01
Simulation ends
mpirun --bind-to core -n 4 ./cpic conf/simd.conf  27.80s user 1.33s system 394% cpu 7.378 total
xeon07% mpirun --bind-to core -n 8 ./cpic conf/simd.conf
Using MPI with 8 processors
No output path specified, output will not be saved
sched_getaffinity = 00000000000000000000000000000000000000000001000000000000
sched_getaffinity = 00000000000000000100000000000000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000010000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000100000000000
sched_getaffinity = 00000000000000000000000000000100000000000000000000000000
sched_getaffinity = 00000000000000000000000000000001000000000000000000000000
sched_getaffinity = 00000000000000000000000000001000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000010000000000000000000000000
begin sim_pre_step
np=8 ny=4096 csr=1FA0 fft1=6.362851e-02 fft2=5.908889e-02 total=1.355695e-01
end sim_pre_step
2.483188e+00 init-time
Simulation runs now
np=8 ny=4096 csr=1FA0 fft1=6.346041e-02 fft2=5.876337e-02 total=1.347894e-01
np=8 ny=4096 csr=1FA0 fft1=6.184621e-02 fft2=5.904841e-02 total=1.335275e-01
np=8 ny=4096 csr=1FA0 fft1=6.265542e-02 fft2=5.890199e-02 total=1.341035e-01
np=8 ny=4096 csr=1FA0 fft1=6.186986e-02 fft2=5.902857e-02 total=1.335352e-01
np=8 ny=4096 csr=1FA0 fft1=6.166394e-02 fft2=5.767361e-02 total=1.318993e-01
np=8 ny=4096 csr=1FA0 fft1=6.186392e-02 fft2=5.891801e-02 total=1.333808e-01
np=8 ny=4096 csr=1FA0 fft1=6.235275e-02 fft2=5.970215e-02 total=1.347301e-01
np=8 ny=4096 csr=1FA0 fft1=6.189505e-02 fft2=5.983124e-02 total=1.343845e-01
np=8 ny=4096 csr=1FA0 fft1=6.190101e-02 fft2=5.926370e-02 total=1.338398e-01
np=8 ny=4096 csr=1FA0 fft1=6.180064e-02 fft2=5.909344e-02 total=1.335539e-01
Simulation ends
mpirun --bind-to core -n 8 ./cpic conf/simd.conf  35.03s user 1.79s system 778% cpu 4.732 total
xeon07% mpirun --bind-to core -n 16 ./cpic conf/simd.conf
Using MPI with 16 processors
No output path specified, output will not be saved
sched_getaffinity = 00000000000000000000000000000100000000000000000000000000
sched_getaffinity = 00000000000000001000000000000000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000001000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000001000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000000100000000
sched_getaffinity = 00000000000000000000000000001000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000000010000000
sched_getaffinity = 00000000000000000100000000000000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000000001000000
sched_getaffinity = 00001000000000000000000000000000000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000000001000000000
sched_getaffinity = 00000000000000000000000000000010000000000000000000000000
sched_getaffinity = 00000000000000000000000000000000010000000000000000000000
sched_getaffinity = 00000000000000000000000000000000001000000000000000000000
sched_getaffinity = 00000000000000000000000000000000000100000000000000000000
sched_getaffinity = 00000000000000000000000000000000000000000010000000000000
begin sim_pre_step
np=16 ny=4096 csr=1FA0 fft1=4.061115e-02 fft2=3.844893e-02 total=8.648549e-02
end sim_pre_step
2.069708e+00 init-time
Simulation runs now
np=16 ny=4096 csr=1FA0 fft1=4.054738e-02 fft2=3.868207e-02 total=8.682283e-02
np=16 ny=4096 csr=1FA0 fft1=4.016260e-02 fft2=3.850049e-02 total=8.596047e-02
np=16 ny=4096 csr=1FA0 fft1=4.024059e-02 fft2=3.838167e-02 total=8.585370e-02
np=16 ny=4096 csr=1FA0 fft1=4.008793e-02 fft2=3.837994e-02 total=8.564743e-02
np=16 ny=4096 csr=1FA0 fft1=4.018023e-02 fft2=3.823824e-02 total=8.559563e-02
np=16 ny=4096 csr=1FA0 fft1=4.042583e-02 fft2=3.815106e-02 total=8.584279e-02
np=16 ny=4096 csr=1FA0 fft1=4.043940e-02 fft2=3.839743e-02 total=8.613534e-02
np=16 ny=4096 csr=1FA0 fft1=4.032684e-02 fft2=3.829999e-02 total=8.587248e-02
np=16 ny=4096 csr=1FA0 fft1=4.038058e-02 fft2=3.856666e-02 total=8.619559e-02
np=16 ny=4096 csr=1FA0 fft1=4.037662e-02 fft2=3.869482e-02 total=8.648246e-02
Simulation ends
mpirun --bind-to core -n 16 ./cpic conf/simd.conf  50.48s user 4.52s system 1530% cpu 3.594 total
rodarima commented 4 years ago

Changes to the MXCSR register made with the feenableexcept family only take effect in the main thread. With 4096x4096 points there is no significant difference in computation time for the forward FFT, but the effect on the reverse FFT is not clear. The enabled flags are:

        feenableexcept(
                        FE_INVALID      |   
                        FE_DIVBYZERO    |   
                        FE_OVERFLOW     |   
                        FE_UNDERFLOW);
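
As a sanity check on the csr values in the logs, the effect of unmasking these four exceptions can be worked out with plain bit arithmetic. The sketch below is an illustration, not the glibc implementation: the constants mirror the x86 exception bit positions (which I assume match the FE_* values on x86-64 glibc), and `mxcsr_after_enable` is a hypothetical helper.

```c
/* x86 exception bits, in MXCSR flag order (assumed to match FE_* on x86-64). */
#define X86_INVALID   0x01u
#define X86_DIVBYZERO 0x04u
#define X86_OVERFLOW  0x08u
#define X86_UNDERFLOW 0x10u

/* MXCSR value expected after unmasking `excepts`: the mask bit of each
 * requested exception (7 positions above its flag bit) is cleared. */
unsigned mxcsr_after_enable(unsigned csr, unsigned excepts)
{
    return csr & ~((excepts & 0x3Fu) << 7);
}
```

Starting from 1FA0 and unmasking the four exceptions gives 1120, which matches the two csr0 values that appear in the runs.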

Without exceptions:

np=16 ny=4096 csr0=1FA0 fft1=4.146671e-02 fft2=3.778906e-02 total=8.734963e-02
np=16 ny=4096 csr0=1FA0 fft1=4.126434e-02 fft2=3.781021e-02 total=8.664617e-02
np=16 ny=4096 csr0=1FA0 fft1=4.146285e-02 fft2=3.784594e-02 total=8.671834e-02
np=16 ny=4096 csr0=1FA0 fft1=4.121179e-02 fft2=3.691792e-02 total=8.561996e-02
np=16 ny=4096 csr0=1FA0 fft1=4.083115e-02 fft2=3.747544e-02 total=8.599655e-02
np=16 ny=4096 csr0=1FA0 fft1=4.131030e-02 fft2=3.761803e-02 total=8.674066e-02
np=16 ny=4096 csr0=1FA0 fft1=4.163414e-02 fft2=3.773465e-02 total=8.699309e-02
np=16 ny=4096 csr0=1FA0 fft1=4.097892e-02 fft2=3.723917e-02 total=8.585914e-02
np=16 ny=4096 csr0=1FA0 fft1=4.153574e-02 fft2=3.750067e-02 total=8.673299e-02
np=16 ny=4096 csr0=1FA0 fft1=4.121890e-02 fft2=3.752832e-02 total=8.643588e-02

With exceptions enabled in some threads:

np=16 ny=4096 csr0=1120 fft1=4.212377e-02 fft2=5.838111e-02 total=1.083331e-01
np=16 ny=4096 csr0=1FA0 fft1=4.193584e-02 fft2=5.811156e-02 total=1.072809e-01
np=16 ny=4096 csr0=1FA0 fft1=4.214818e-02 fft2=5.877057e-02 total=1.080332e-01
np=16 ny=4096 csr0=1FA0 fft1=4.194330e-02 fft2=5.846063e-02 total=1.076383e-01
np=16 ny=4096 csr0=1120 fft1=4.212949e-02 fft2=5.843620e-02 total=1.076885e-01
np=16 ny=4096 csr0=1FA0 fft1=4.199488e-02 fft2=5.845302e-02 total=1.075642e-01
np=16 ny=4096 csr0=1120 fft1=4.192160e-02 fft2=5.841398e-02 total=1.074510e-01
np=16 ny=4096 csr0=1FA0 fft1=4.201088e-02 fft2=5.842290e-02 total=1.075703e-01
np=16 ny=4096 csr0=1FA0 fft1=4.211500e-02 fft2=5.817451e-02 total=1.073887e-01
np=16 ny=4096 csr0=1FA0 fft1=4.180829e-02 fft2=5.890636e-02 total=1.078599e-01

Another run:

np=16 ny=4096 csr0=1FA0 fft1=4.000103e-02 fft2=3.884999e-02 total=8.603951e-02
np=16 ny=4096 csr0=1120 fft1=4.000985e-02 fft2=3.828755e-02 total=8.554118e-02
np=16 ny=4096 csr0=1FA0 fft1=4.002022e-02 fft2=3.831978e-02 total=8.548526e-02
np=16 ny=4096 csr0=1120 fft1=4.005703e-02 fft2=3.841551e-02 total=8.565850e-02
np=16 ny=4096 csr0=1120 fft1=3.994538e-02 fft2=3.839478e-02 total=8.552366e-02
np=16 ny=4096 csr0=1FA0 fft1=4.000821e-02 fft2=3.823638e-02 total=8.537884e-02
np=16 ny=4096 csr0=1120 fft1=3.976115e-02 fft2=3.889000e-02 total=8.576684e-02
np=16 ny=4096 csr0=1120 fft1=3.994265e-02 fft2=3.827638e-02 total=8.534332e-02
np=16 ny=4096 csr0=1120 fft1=4.014546e-02 fft2=3.845588e-02 total=8.580696e-02
np=16 ny=4096 csr0=1FA0 fft1=3.987036e-02 fft2=3.871251e-02 total=8.585445e-02

With exceptions enabled in all processes:

np=16 ny=4096 csr0=1120 fft1=4.131929e-02 fft2=3.748747e-02 total=8.680457e-02
np=16 ny=4096 csr0=1120 fft1=4.117152e-02 fft2=3.721671e-02 total=8.545117e-02
np=16 ny=4096 csr0=1120 fft1=4.146759e-02 fft2=3.723733e-02 total=8.581670e-02
np=16 ny=4096 csr0=1120 fft1=4.116673e-02 fft2=3.797847e-02 total=8.636267e-02
np=16 ny=4096 csr0=1120 fft1=4.133027e-02 fft2=3.742653e-02 total=8.581166e-02
np=16 ny=4096 csr0=1120 fft1=4.135661e-02 fft2=3.713870e-02 total=8.580216e-02
np=16 ny=4096 csr0=1120 fft1=4.079427e-02 fft2=3.713612e-02 total=8.499984e-02
np=16 ny=4096 csr0=1120 fft1=4.102695e-02 fft2=3.716617e-02 total=8.545520e-02
np=16 ny=4096 csr0=1120 fft1=4.100230e-02 fft2=3.788876e-02 total=8.601275e-02
np=16 ny=4096 csr0=1120 fft1=4.094814e-02 fft2=3.734853e-02 total=8.544561e-02