Open · mrhardman opened this issue 9 months ago
Using commit 3b3af85351bbb9e94185dea3030fb81eb7956f2a of https://github.com/mabarnes/moment_kinetics/tree/generalised-chodura-diagnostic, I observe behaviour on an HPC cluster that suggests a bug in the distributed-memory MPI implementation. Input files for the tests described below are in the attached .zip file.
For simulation wall-bc_cheb_2D1V_constant_source_small, the solver appears to crash with NaNs in the output:
$ tail wall-bc_cheb_2D1V_constant_source_small/slurm-4552646.out
Found NaN, stopping simulation
Found NaN, stopping simulation
Found NaN, stopping simulation
Found NaN, stopping simulation
Found NaN, stopping simulation
Found NaN, stopping simulation
finished time step 33000 11:44:56
writing distribution functions at step 33000 11:44:56
finished file io 11:44:57
finished runs/wall-bc_cheb_2D1V_constant_source_small.toml
Thu 28 Sep 11:45:01 BST 2023
Making the simulation local to a single shared-memory region appears to let the simulation run without fault:
$ tail wall-bc_cheb_2D1V_constant_source_local_small/slurm-4552596.out
finished time step 397000 14:34:18
writing distribution functions at step 397000 14:34:18
finished time step 398000 14:34:45
writing distribution functions at step 398000 14:34:45
finished time step 399000 14:35:12
writing distribution functions at step 399000 14:35:12
finished time step 400000 14:35:39
writing distribution functions at step 400000 14:35:39
finished file io 14:35:59
finished runs/wall-bc_cheb_2D1V_constant_source_local_small.toml
Thu 28 Sep 14:36:25 BST 2023
Parallelising over the z dimension also appears to be safe:
$ tail wall-bc_cheb_2D1V_constant_source_smallz/slurm-4552856.out
finished time step 397000 14:49:57
writing distribution functions at step 397000 14:49:57
finished time step 398000 14:50:08
writing distribution functions at step 398000 14:50:08
finished time step 399000 14:50:18
writing distribution functions at step 399000 14:50:18
finished time step 400000 14:50:29
writing distribution functions at step 400000 14:50:29
finished file io 14:50:31
finished runs/wall-bc_cheb_2D1V_constant_source_smallz.toml
Thu 28 Sep 14:50:46 BST 2023
Parallelising over the r dimension, however, leads to issues:
$ tail wall-bc_cheb_2D1V_constant_source_smallr/slurm-4552853.out
Found NaN, stopping simulation
Found NaN, stopping simulation
Found NaN, stopping simulation
Found NaN, stopping simulation
Found NaN, stopping simulation
Found NaN, stopping simulation
finished time step 46000 13:50:29
writing distribution functions at step 46000 13:50:29
finished file io 13:50:30
finished runs/wall-bc_cheb_2D1V_constant_source_smallr.toml
Thu 28 Sep 13:50:33 BST 2023
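For reference, the only intended difference between the smallz and smallr cases is which dimension is split across distributed-memory MPI blocks. A minimal sketch of the relevant input options is below, assuming the flat TOML format where a dimension is distributed when its nelement_local is smaller than its nelement; the key names and grid sizes here are illustrative assumptions, and the attached input files in debugging.zip are authoritative.

# Sketch only: illustrative values, not the actual inputs from debugging.zip.
# A dimension is split over distributed MPI blocks when nelement_local < nelement.
r_ngrid = 5
r_nelement = 8
r_nelement_local = 2    # r split over 4 distributed blocks (the failing smallr-style case)
z_ngrid = 5
z_nelement = 8
z_nelement_local = 8    # z kept within a single shared-memory region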
Setting rhostar = 0 allows the simulation to run successfully:
$ tail wall-bc_cheb_2D1V_constant_source_smallr_rhostar0/slurm-4557971.out
finished time step 397000 10:34:55
writing distribution functions at step 397000 10:34:55
finished time step 398000 10:35:09
writing distribution functions at step 398000 10:35:09
finished time step 399000 10:35:23
writing distribution functions at step 399000 10:35:23
finished time step 400000 10:35:37
writing distribution functions at step 400000 10:35:37
finished file io 10:35:39
finished runs/wall-bc_cheb_2D1V_constant_source_smallr_rhostar0.toml
Fri 29 Sep 10:35:49 BST 2023
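If rhostar is the usual normalised gyroradius multiplying the ExB drift terms, then setting it to zero presumably removes the terms that advect the distribution function across r, which would be consistent with the bug only appearing when r is distributed. A hypothetical sketch of the workaround in the input file (the key's placement is an assumption; the attached rhostar0 input file in debugging.zip has the actual syntax):

# Sketch only: assumed key placement; see the attached
# wall-bc_cheb_2D1V_constant_source_smallr_rhostar0 input for the real file.
rhostar = 0.0    # zero the finite-rho* drift terms that couple neighbouring r points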
debugging.zip