Open aidanheerdegen opened 3 years ago
Hey @marshallward, I am just merging this work you did into our FMS fork. Can you confirm that this is OK? I assume the code is fine, since it is what was accepted into upstream FMS, but I guess we had to branch before it was merged into master.
Not really in a good position to test it out, but it looks OK to me. If it's not breaking your runs then I suspect it's fine to merge.
This is the work of @marshallward
This patch contains three new features for FMS: Support for MPI datatypes, an MPI_Alltoallw interface, and modifications to mpp_global_field to use these changes for select operations.
These changes were primarily made to improve stability of large (>4000 rank) MPI jobs under OpenMPI at NCI.
There are differences in the performance of mpp_global_field, occasionally even very large differences, but there is no consistency across various MPI libraries. One method will be faster in one library, and slower in another, even across MPI versions. Generally, the MPI_Alltoallw method showed improved performance on our system, but this is not a universal result. We therefore introduce a flag to control this feature.
The inclusion of MPI_Type support may also be seen as an opportunity to introduce other new MPI features for other operations, e.g. halo exchange.
Detailed changes are summarised below.
MPI data transfer type ("MPI_Type") support has been added to FMS. This is done with the following features:

* A `mpp_type` derived type has been added, which manages the type details and hides the MPI internals from the model developer. Types are managed inside of an internal linked list, `datatypes`.

  Note: The name `mpp_type` is very similar to the preprocessor variable `MPP_TYPE_` and should possibly be renamed to something else, e.g. `mpp_datatype`.

* `mpp_type_create` and `mpp_type_free` are used to create and release these types within the MPI library. These append and remove mpp_types from the internal linked list, and include reference counters to manage duplicates.

* A `mpp_byte` type is created as a module-level variable for default operations.

  NOTE: As the first element of the list, it also inadvertently provides access to the rest of `datatypes`, which is private, but there are probably some ways to address this.

An MPI_Alltoallw wrapper, using MPI_Types, has been added to the mpp_alltoall interface.
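The reference-counted type list described above can be pictured with a small stand-alone sketch. This is not the FMS implementation: the names `type_node`, `type_list`, `type_create` and `type_free` are hypothetical, a single integer `id` stands in for whatever parameters actually define a type, and no MPI calls are made. It only illustrates the idea that creating a duplicate returns the existing entry with a higher count, and that an entry is unlinked once its count reaches zero.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an mpp_type entry; field names are illustrative. */
typedef struct type_node {
    int id;                 /* stand-in for the parameters that define the type */
    int refcount;           /* number of outstanding create calls */
    struct type_node *next;
} type_node;

/* Hypothetical stand-in for the internal list of types. */
typedef struct {
    type_node *head;
    int length;
} type_list;

/* Return the existing node for `id` (bumping its count) or append a new one. */
type_node *type_create(type_list *list, int id)
{
    type_node **link = &list->head;
    while (*link != NULL) {
        if ((*link)->id == id) {
            (*link)->refcount++;
            return *link;
        }
        link = &(*link)->next;
    }
    type_node *node = malloc(sizeof *node);
    assert(node != NULL);
    node->id = id;
    node->refcount = 1;
    node->next = NULL;
    *link = node;           /* append at the tail */
    list->length++;
    return node;
}

/* Drop one reference; unlink and release the node when none remain. */
void type_free(type_list *list, type_node *node)
{
    assert(node != NULL && node->refcount > 0);
    if (--node->refcount > 0)
        return;
    type_node **link = &list->head;
    while (*link != NULL && *link != node)
        link = &(*link)->next;
    if (*link == node) {
        *link = node->next;
        list->length--;
        free(node);
    }
}
```

In a real wrapper, the point where a node is first appended and the point where it is finally unlinked are where the MPI library would be asked to commit and release the underlying datatype.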
An implementation of mpp_global_field using MPI_Alltoallw and mpp_types has been added. In addition to replacing the point-to-point operations with a collective, it also eliminates the need to use the internal MPP stack.
Since MPI_Alltoallw requires that the input field be contiguous, it is only enabled for data domains (i.e. compute + halo). This limitation can be overcome, either by copying or by more careful attention to layout, but it can be addressed in a future patch.
This method is enabled in the `mpp_domains_nml` namelist group, by setting the `use_alltoallw` flag to True.

Provisional interfaces to SHMEM and serial ("nocomm") builds have been added, although they are as yet untested and primarily meant as placeholders for now.
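As an illustration of the contiguity restriction noted above, and of the "copying" workaround it mentions, the sketch below packs the compute region of a halo-padded 2D array into a contiguous buffer. It is only a sketch under stated assumptions: the function name `pack_compute_region`, the row-major layout, and the uniform halo width are all hypothetical choices made for the example, and FMS itself is Fortran with column-major arrays.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy the interior (compute) region of a halo-padded 2D array into a
 * contiguous buffer and return the number of elements copied.
 * Assumptions of this sketch: row-major storage, nx_data elements per row,
 * ny_data rows, and the same halo width on every side.
 */
size_t pack_compute_region(const double *data, size_t nx_data, size_t ny_data,
                           size_t halo, double *packed)
{
    assert(nx_data > 2 * halo && ny_data > 2 * halo);
    size_t nx = nx_data - 2 * halo;   /* compute-region row length */
    size_t ny = ny_data - 2 * halo;   /* compute-region row count */
    size_t n = 0;
    for (size_t j = 0; j < ny; j++) {
        /* Each compute row starts `halo` elements into a padded row, so
           consecutive compute rows are separated by 2*halo halo elements. */
        const double *row = data + (j + halo) * nx_data + halo;
        for (size_t i = 0; i < nx; i++)
            packed[n++] = row[i];
    }
    return n;
}
```

The full data-domain array needs no such step because it is already one contiguous block, whereas the compute region alone is a strided subset of it.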
This patch also includes the following changes to support this work.
In `get_peset`, the method used to generate MPI subcommunicators has been changed; specifically, `MPI_Comm_create` has been replaced with `MPI_Comm_create_group`. The former is blocking over all ranks, while the latter is only blocking over ranks in the subgroup.

This was done to accommodate IO domains of a single rank, usually due to masking, which would result in no communication and cause a model hang.

It seems that more recent changes in FMS related to handling single-rank communicators were made to prevent this particular scenario, but I still think that it's more correct to use `MPI_Comm_create_group` and have left the change.

This is an MPI 3.0 feature, so this might be an issue for older MPI libraries.
Logical interfaces have been added to mpp_alltoall and mpp_alltoallv.
Single-rank PE checks in mpp_alltoall were removed to prevent model hangs with the subcommunicators.
NULL_PE checks have been added to the original point-to-point implementation of mpp_global_field, although these may not be required anymore due to changes in the subcommunicator implementation.
This work was by Nic Hannah, and may actually be part of an existing pull request. (TODO: Check this!)
Timer events have been added to mpp_type_create and mpp_type_free, although they are not yet initialized anywhere.
The diagnostic field count was increased from 150 to 250, to support the current needs of researchers.