Closed: mrhardman closed this issue 3 weeks ago.
This error also prevents me from running
`julia> include("moment_kinetics/test/fokker_planck_tests.jl")`
without seeing the same error. I guess that some MPI command must now be called before we can start a test, but I haven't found it yet.
However, this error does not appear to affect `test_rosenbluth_potentials_direct_integration()` from `test_scripts/fkpl_direct_integration_test.jl`, which appears to give normal output.
@johnomotani Are you able to reproduce the bug? This bug should make the Fokker-Planck tests fail if tested in isolation on your machine, without using the rest of the testing framework.
I confirm that this bug appears with both Julia 1.9.3 and Julia 1.10.2 using different installations, on both master and mms_bugfixes_and_docs.
I have tracked the problem to the `ignore_MPI` flag which now features in `define_coordinate()`. It looks like it takes the value `false`, but my tests always assumed that the coordinates could be constructed before setting up any shared-memory functions. The above error is removed if `ignore_MPI=true` is passed to `define_coordinate()`, but I am not clear whether or not this will break my shared-memory functionality with the Gauss-Legendre grids.
@johnomotani Can we set `ignore_MPI` to default to `true`? Why do we need this new flag? Unhappily, I see that it appears in many modules, so perhaps I have to refactor all my tests?
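For a standalone test that only needs the grid objects, the workaround described above would look something like the following pseudocode sketch. Only the `ignore_MPI=true` keyword comes from this thread; the other arguments and the return values of `define_coordinate()` are placeholders, not its real signature:

```julia
# Pseudocode sketch only: `input` stands in for whatever coordinate-input
# struct this version of define_coordinate() actually expects, and the
# return values are assumed, not checked against the real API.
using moment_kinetics

# Skip the shared-memory MPI setup so an isolated test script can build
# its coordinate objects before (or without) initialising MPI.
coord, spectral = define_coordinate(input; ignore_MPI=true)
```

Whether this is safe for tests that later exercise the shared-memory Gauss-Legendre functionality is exactly the open question above.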
Now working on fixes in https://github.com/mabarnes/moment_kinetics/tree/fix_collision_operator_scripts.
The primary change seems to be the introduction of shared-memory arrays into the coordinate struct. This is a departure from the MPI structure of the code as it was when I last developed it: previously, all MPI dummy arrays were stored in the `scratch_dummy` struct from `time_advance.jl`. Overall, this change seems to have introduced many unpleasant `if ignore_MPI` branches around the source, which might be justified by whatever problem motivated the change. However, the error is avoided by reordering the initialisation of the `coord` and `spectral` structs relative to the initialisation of the distributed-memory MPI.
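The reordering amounts to something like the sketch below. The function name `setup_distributed_memory_MPI` is a placeholder for whatever call actually initialises the distributed-memory MPI in moment_kinetics, not a verified API:

```julia
# Sketch of the initialisation-order fix; names are placeholders.

# Broken order: coord/spectral built first, so their shared-memory arrays
# are allocated before the MPI communicators exist, triggering the error.
#   coord, spectral = define_coordinate(input)
#   setup_distributed_memory_MPI(...)

# Fixed order: set up distributed-memory MPI first, then build the structs.
setup_distributed_memory_MPI(...)           # communicators now exist
coord, spectral = define_coordinate(input)  # shared-memory allocation is safe
```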
Further problems appear to exist in `test_scripts/2D_FEM_assembly_test.jl` (NaNs and Infs when using MPI). Running
`julia> run_assembly_test(ngrid=3,nelement_list=[4],plot_scan=false)`
with 1 and 2 cores gives identical results, but with 4 cores gives NaNs. Is this related to the previous issues with the anyv() parallelisation? Happily,
mpirun -n 16 julia -O3 --project -Jmoment_kinetics.so -e 'include("moment_kinetics/test/fokker_planck_tests.jl")'
gives passing tests, suggesting that the issue is in the script, not the underlying functions.
Further investigation of this bug confirms that, unfortunately, the problem is in the underlying source.
Changing `ngrid = 9` to `ngrid = 3` here https://github.com/mabarnes/moment_kinetics/blob/869d52e606897890445adf95f11660eab9435d14/moment_kinetics/test/fokker_planck_tests.jl#L216 leads to tests which fail on 4 cores. Inspecting the failures (which are expected at the reduced resolution), we see Inf values which do not appear when the same experiment is run on 1 core.
Is it possible that the anyv region is still broken when the number of cores is larger than can be split nicely? @johnomotani
More notes: running `run_assembly_test(ngrid=5,nelement_list=[16],plot_scan=false)` on 1, 2, and 4 cores gives the same results, but on 6 and 8 cores we find different (large, or NaN/Inf) errors. If we reduce the number of elements, we can use fewer cores before noticing errors. I think this is a smoking gun for a shared-memory MPI fault.
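A quick way to check the core-count dependence is to repeat the same run at increasing core counts and compare the reported errors. A command sketch, reusing the flags from the `mpirun` invocation earlier in this thread and assuming that including `test_scripts/2D_FEM_assembly_test.jl` defines `run_assembly_test` (as the REPL usage above suggests):

```shell
# Compare the same assembly test at several core counts; the results
# should agree if the shared-memory parallelisation is correct.
for n in 1 2 4 6 8; do
  echo "=== $n cores ==="
  mpirun -n $n julia -O3 --project \
    -e 'include("test_scripts/2D_FEM_assembly_test.jl");
        run_assembly_test(ngrid=5, nelement_list=[16], plot_scan=false)'
done
```

With the behaviour reported above, the output should diverge (large or NaN/Inf errors) somewhere between 4 and 6 cores.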
Yes, there was a bug in the anyv parallelisation - the array allocation needed some updates after the optimisations in #176. Should be fixed now by #222.
EDIT: I think I now understand the cause of this initial error here, but another error in the Fokker-Planck parallelisation was discovered in the course of debugging this, see the final comment.
Trying to re-run the Fokker-Planck test scripts with plotting capability for the purpose of regenerating data in the ExCALIBUR report, I hit the following MPI error. @johnomotani Do you recall what has changed to cause this error?