mabarnes / moment_kinetics


test_scripts MPI communicator bug #221

Closed mrhardman closed 3 weeks ago

mrhardman commented 1 month ago

EDIT: I think I now understand the cause of the initial error reported here, but another error in the Fokker-Planck parallelisation was discovered in the course of debugging it; see the final comment.

While trying to re-run the Fokker-Planck test scripts with plotting enabled, in order to regenerate data for the ExCALIBUR report, I hit the following MPI error. @johnomotani Do you recall what has changed to cause this error?

julia> include("test_scripts/2D_FEM_assembly_test.jl")

julia> run_assembly_test()
made inputs
vpa: ngrid: 5 nelement: 16 Lvpa: 12.0
vperp: ngrid: 5 nelement: 8 Lvperp: 6.0
ERROR: MPIError(5): MPI_ERR_COMM: invalid communicator
Stacktrace:
  [1] MPI_Comm_rank
    @ ~/excalibur/moment_kinetics_collisions/.julia/packages/MPI/z2owj/src/api/generated_api.jl:954 [inlined]
  [2] Comm_rank
    @ ~/excalibur/moment_kinetics_collisions/.julia/packages/MPI/z2owj/src/comm.jl:61 [inlined]
  [3] allocate_shared(T::Type, dims::Tuple{Int64}; comm::Nothing, maybe_debug::Bool)
    @ moment_kinetics.communication ~/excalibur/moment_kinetics_collisions/moment_kinetics/src/communication.jl:571
  [4] allocate_shared
    @ ~/excalibur/moment_kinetics_collisions/moment_kinetics/src/communication.jl:567 [inlined]
  [5] #allocate_shared_float#3
    @ ~/excalibur/moment_kinetics_collisions/moment_kinetics/src/array_allocation.jl:55 [inlined]
  [6] allocate_shared_float
    @ ~/excalibur/moment_kinetics_collisions/moment_kinetics/src/array_allocation.jl:54 [inlined]
  [7]
    @ moment_kinetics.coordinates ~/excalibur/moment_kinetics_collisions/moment_kinetics/src/coordinates.jl:155
  [8] define_coordinate (repeats 2 times)
    @ ~/excalibur/moment_kinetics_collisions/moment_kinetics/src/coordinates.jl:119 [inlined]
  [9]
    @ Main ~/excalibur/moment_kinetics_collisions/test_scripts/2D_FEM_assembly_test.jl:115
 [10] test_weak_form_collisions
    @ ~/excalibur/moment_kinetics_collisions/test_scripts/2D_FEM_assembly_test.jl:74 [inlined]
 [11] run_assembly_test(; ngrid::Int64, nelement_list::Vector{…}, plot_scan::Bool, plot_test_output::Bool, use_Maxwellian_Rosenbluth_coefficients::Bool, use_Maxwellian_field_particle_distribution::Bool, test_dense_construction::Bool, test_parallelism::Bool, test_numerical_conserving_terms::Bool, algebraic_solve_for_d2Gdvperp2::Bool, test_self_operator::Bool, Lvpa::Float64, Lvperp::Float64)
    @ Main ~/excalibur/moment_kinetics_collisions/test_scripts/2D_FEM_assembly_test.jl:455
 [12] run_assembly_test()
    @ Main ~/excalibur/moment_kinetics_collisions/test_scripts/2D_FEM_assembly_test.jl:383
 [13] top-level scope
    @ REPL[8]:1
Some type information was truncated. Use `show(err)` to see complete types.
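
For reference, the same error can be reproduced in isolation by querying the rank of a null communicator, which is presumably what happens when allocate_shared is reached before the shared-memory communicators have been set up. A minimal sketch using only MPI.jl (this illustrates the failure class, not the exact code path inside moment_kinetics):

using MPI
MPI.Init()
# Calling Comm_rank on a communicator that was never set up raises the
# same error as in the stack trace above.
MPI.Comm_rank(MPI.COMM_NULL)   # MPIError(5): MPI_ERR_COMM: invalid communicator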
mrhardman commented 1 month ago

This error also appears when I run

julia> include("moment_kinetics/test/fokker_planck_tests.jl")

so those tests cannot be run either. I guess that some MPI call must now be made before we can start a test, but I haven't found it yet.
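
If so, something along these lines would be needed at the top of the script before any coordinates are constructed (a sketch; I have not yet identified the moment_kinetics call that builds the communicators):

using MPI
MPI.Init()   # plain MPI initialisation
# ...plus, presumably, the moment_kinetics.communication setup that creates
# the block communicator that allocate_shared expects (not yet identified).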

mrhardman commented 1 month ago

However, this error does not affect test_rosenbluth_potentials_direct_integration() from test_scripts/fkpl_direct_integration_test.jl, which gives normal output.

@johnomotani Are you able to reproduce the bug? This bug should make the Fokker-Planck tests fail if tested in isolation on your machine, without using the rest of the testing framework.

mrhardman commented 1 month ago

I confirm that this bug appears with both Julia 1.9.3 and Julia 1.10.2, using different installations, on both master and mms_bugfixes_and_docs.

mrhardman commented 1 month ago

I have tracked the problem to the ignore_MPI flag which now features in define_coordinate(). It takes the value false by default, but my tests always assumed that the coordinates could be constructed before setting up any shared-memory machinery. The error above goes away if ignore_MPI=true is passed to define_coordinate(), but I am not clear whether or not this will break my shared-memory functionality with the Gauss-Legendre grids.

@johnomotani Can we set ignore_MPI to default to true? Why do we need this new flag? Unhappily, I see it appears in many modules, so perhaps I have to refactor all my tests?
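
For concreteness, the workaround looks something like this (a sketch: vpa_input and vperp_input are placeholder names for the coordinate input structs, and I assume define_coordinate returns the coord and spectral objects, as in the stack trace above):

# construct the coordinates serially, skipping shared-memory allocation
vpa, vpa_spectral = define_coordinate(vpa_input; ignore_MPI=true)
vperp, vperp_spectral = define_coordinate(vperp_input; ignore_MPI=true)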

mrhardman commented 1 month ago

Now working on fixes in https://github.com/mabarnes/moment_kinetics/tree/fix_collision_operator_scripts.

The primary change seems to be the introduction of shared-memory arrays into the coordinate struct. This is a departure from the MPI structure of the code as it stood when I last developed it: previously, all MPI dummy arrays were stored in the scratch_dummy struct from time_advance.jl. Overall, the change has introduced many unpleasant if ignore_MPI branches throughout the source, which may be justified by whatever problem motivated the development. However, the error is avoided by reordering the initialisation so that the distributed-memory MPI is set up before the coord and spectral structs are constructed.
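
Schematically, the fix in that branch is just a reordering (setup_distributed_memory_MPI is a placeholder here for whichever communication-setup call has to run first):

# old order: define_coordinate() first, MPI setup second -> invalid communicator
# new order: MPI setup first, coordinates second
setup_distributed_memory_MPI()                          # placeholder name
vpa, vpa_spectral = define_coordinate(vpa_input)        # shared arrays now safe
vperp, vperp_spectral = define_coordinate(vperp_input)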

mrhardman commented 1 month ago

Further problems appear to exist in test_scripts/2D_FEM_assembly_test.jl (NaNs and Infs when using MPI).

Running

julia> run_assembly_test(ngrid=3,nelement_list=[4],plot_scan=false)

with 1 and 2 cores gives identical results, but with 4 cores gives NaNs. Is this related to the previous issues with the anyv() parallelisation? Happily,

mpirun -n 16 julia -O3 --project -Jmoment_kinetics.so -e 'include("moment_kinetics/test/fokker_planck_tests.jl")'

gives passing tests, suggesting that the issue is in the script, not the underlying functions.

mrhardman commented 1 month ago

Further investigation of this bug confirms that, unfortunately, the problem is in the underlying source.

Changing ngrid = 9 to ngrid = 3 here https://github.com/mabarnes/moment_kinetics/blob/869d52e606897890445adf95f11660eab9435d14/moment_kinetics/test/fokker_planck_tests.jl#L216 leads to tests which fail on 4 cores. Inspecting the failures (which are expected at this reduced resolution), we see Inf values which do not appear when the same experiment is run on 1 core.

Is it possible that the anyv region is still broken when the number of cores is larger than can be split nicely over the velocity grid? @johnomotani

More notes: running run_assembly_test(ngrid=5,nelement_list=[16],plot_scan=false) on 1, 2, and 4 cores gives the same results, but on 6 and 8 cores we find different (large, or NaN/Inf) errors. If we reduce the number of elements, errors appear at even smaller core counts. I think this is a smoking gun of a shared-memory MPI fault.
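
The grid sizes involved make these core counts plausible: assuming the usual count of nelement*(ngrid-1)+1 points per coordinate (element boundaries shared between adjacent elements), the point totals in the ngrid=5 case are

# points per velocity coordinate for ngrid=5, nelement=16 (vpa) and 8 (vperp)
n_vpa   = 16 * (5 - 1) + 1   # = 65
n_vperp =  8 * (5 - 1) + 1   # = 33
# whatever split the anyv region uses, 6 or 8 cores leave uneven remainders
# over these totals, which is where the NaN/Inf errors first appear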

johnomotani commented 4 weeks ago

Yes, there was a bug in the anyv parallelisation - the array allocation needed some updates after the optimisations in #176. Should be fixed now by #222.
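
For anyone tracking down similar failures later: the class of fix is to pass the correct sub-block communicator explicitly when allocating shared arrays for the anyv region. A sketch only (comm_anyv_subblock is a placeholder name; see #222 for the actual change):

# allocate with the anyv sub-block communicator rather than the default
# block communicator, so every rank in the region agrees on the shared window
buffer = allocate_shared_float(nvpa, nvperp; comm=comm_anyv_subblock[])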