mabarnes / moment_kinetics


Distributed memory bug affecting 2D simulations #137

Open mrhardman opened 9 months ago

Using commit 3b3af85351bbb9e94185dea3030fb81eb7956f2a of https://github.com/mabarnes/moment_kinetics/tree/generalised-chodura-diagnostic, I observe behaviour on an HPC cluster that suggests a bug in the distributed-memory MPI parallelisation. Input files for the tests described below are in the attached .zip file.

For the simulation wall-bc_cheb_2D1V_constant_source_small, the solver finds NaNs and stops early (last output at time step 33000):

$ tail wall-bc_cheb_2D1V_constant_source_small/slurm-4552646.out
Found NaN, stopping simulation
Found NaN, stopping simulation
Found NaN, stopping simulation
Found NaN, stopping simulation
Found NaN, stopping simulation
Found NaN, stopping simulation
finished time step 33000    11:44:56
writing distribution functions at step 33000  11:44:56
finished file io         11:44:57
finished runs/wall-bc_cheb_2D1V_constant_source_small.toml Thu 28 Sep 11:45:01 BST 2023

Restricting the simulation to a single shared-memory region appears to let it run without fault, reaching time step 400000:

$ tail wall-bc_cheb_2D1V_constant_source_local_small/slurm-4552596.out
finished time step 397000   14:34:18
writing distribution functions at step 397000  14:34:18
finished time step 398000   14:34:45
writing distribution functions at step 398000  14:34:45
finished time step 399000   14:35:12
writing distribution functions at step 399000  14:35:12
finished time step 400000   14:35:39
writing distribution functions at step 400000  14:35:39
finished file io         14:35:59
finished runs/wall-bc_cheb_2D1V_constant_source_local_small.toml Thu 28 Sep 14:36:25 BST 2023

Parallelising over the z dimension also appears to be safe:

$ tail wall-bc_cheb_2D1V_constant_source_smallz/slurm-4552856.out
finished time step 397000   14:49:57
writing distribution functions at step 397000  14:49:57
finished time step 398000   14:50:08
writing distribution functions at step 398000  14:50:08
finished time step 399000   14:50:18
writing distribution functions at step 399000  14:50:18
finished time step 400000   14:50:29
writing distribution functions at step 400000  14:50:29
finished file io         14:50:31
finished runs/wall-bc_cheb_2D1V_constant_source_smallz.toml Thu 28 Sep 14:50:46 BST 2023

Parallelising over r, by contrast, leads to the same NaN failure (last output at time step 46000):

$ tail wall-bc_cheb_2D1V_constant_source_smallr/slurm-4552853.out
Found NaN, stopping simulation
Found NaN, stopping simulation
Found NaN, stopping simulation
Found NaN, stopping simulation
Found NaN, stopping simulation
Found NaN, stopping simulation
finished time step 46000    13:50:29
writing distribution functions at step 46000  13:50:29
finished file io         13:50:30
finished runs/wall-bc_cheb_2D1V_constant_source_smallr.toml Thu 28 Sep 13:50:33 BST 2023

Setting rhostar = 0 in the r-parallelised case allows the simulation to run successfully:

$ tail wall-bc_cheb_2D1V_constant_source_smallr_rhostar0/slurm-4557971.out
finished time step 397000   10:34:55
writing distribution functions at step 397000  10:34:55
finished time step 398000   10:35:09
writing distribution functions at step 398000  10:35:09
finished time step 399000   10:35:23
writing distribution functions at step 399000  10:35:23
finished time step 400000   10:35:37
writing distribution functions at step 400000  10:35:37
finished file io         10:35:39
finished runs/wall-bc_cheb_2D1V_constant_source_smallr_rhostar0.toml Fri 29 Sep 10:35:49 BST 2023
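
For anyone repeating these checks, the comparison above can be automated rather than read off `tail` by hand. The following is only a minimal sketch, not part of moment_kinetics: the function names `summarise_log` and `summarise_runs` are hypothetical, and it assumes the log lines have exactly the form shown in the outputs above and that each run directory contains a slurm-*.out file.

```python
import glob
import re

# Matches lines such as "finished time step 33000    11:44:56".
STEP_RE = re.compile(r"finished time step\s+(\d+)")
NAN_MESSAGE = "Found NaN, stopping simulation"


def summarise_log(text):
    """Return (last_step, nan_count) for the text of one slurm output file.

    last_step is None if no "finished time step" line is present.
    """
    steps = [int(m.group(1)) for m in STEP_RE.finditer(text)]
    nan_count = text.count(NAN_MESSAGE)
    return (max(steps) if steps else None, nan_count)


def summarise_runs(pattern="*/slurm-*.out"):
    """Print one summary line per log file matching the glob pattern.

    The default pattern is an assumption about the directory layout.
    """
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            last_step, nan_count = summarise_log(f.read())
        status = "NaN" if nan_count else "ok"
        print(f"{path}: last step {last_step}, {status}")
```

Calling `summarise_runs()` from the directory containing the run folders would then print one line per run, which makes it easier to see at a glance which decompositions fail and at which step.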

debugging.zip