wzamazon closed this issue 1 year ago
I did some debugging and found the root cause.
It was broken by the following commit:
commit 6a406fb3c9ea74aa371bdff82e8eb01bf9820293
Author: Aurelien Bouteiller <bouteill@icl.utk.edu>
Date: Mon Feb 8 22:39:09 2021 -0500
Import ULFM Fault Tolerance
The historical repositories contain the full history and
attribution and are available from
https://bitbucket.org/icldistcomp/ulfm2/src/ulfm/
and prior
https://github.com/ICLDisco/ulfm-legacy
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Signed-off-by: Josh Hursey <jjhursey@open-mpi.org>
Signed-off-by: Thomas Herault <herault@icl.utk.edu>
Signed-off-by: Wesley Bland <wbland@icl.utk.edu>
Signed-off-by: Nuria Losada <nlosada@icl.utk.edu>
Signed-off-by: Nathan T. Weeks <weeks@iastate.edu>
This commit adds fault-tolerant MPI support.
Specifically, the following code snippet in the commit broke Fortran's MPI_GROUP_EMPTY:
/*
* Allocate a new group structure
@@ -324,6 +337,16 @@ int ompi_group_init(void)
return OMPI_ERROR;
}
+#if OPAL_ENABLE_FT_MPI
+ /* Setup global list of failed processes */
+ ompi_group_all_failed_procs = OBJ_NEW(ompi_group_t);
+ ompi_group_all_failed_procs->grp_proc_count = 0;
+ ompi_group_all_failed_procs->grp_my_rank = MPI_UNDEFINED;
+ ompi_group_all_failed_procs->grp_proc_pointers = NULL;
+ ompi_group_all_failed_procs->grp_flags |= OMPI_GROUP_DENSE;
+ ompi_group_all_failed_procs->grp_flags |= OMPI_GROUP_INTRINSIC;
+#endif
+
/* add MPI_GROUP_NULL to table */
OBJ_CONSTRUCT(&ompi_mpi_group_null, ompi_group_t);
ompi_mpi_group_null.group.grp_proc_count = 0;
ompi_mpi_group_null.group.grp_my_rank = MPI_PROC_NULL;
ompi_mpi_group_null.group.grp_proc_pointers = NULL;
ompi_mpi_group_null.group.grp_flags |= OMPI_GROUP_DENSE;
ompi_mpi_group_null.group.grp_flags |= OMPI_GROUP_INTRINSIC;
/* add MPI_GROUP_EMPTY to table */
OBJ_CONSTRUCT(&ompi_mpi_group_empty, ompi_group_t);
ompi_mpi_group_empty.group.grp_proc_count = 0;
ompi_mpi_group_empty.group.grp_my_rank = MPI_UNDEFINED;
ompi_mpi_group_empty.group.grp_proc_pointers = NULL;
ompi_mpi_group_empty.group.grp_flags |= OMPI_GROUP_DENSE;
ompi_mpi_group_empty.group.grp_flags |= OMPI_GROUP_INTRINSIC;
It broke Fortran's MPI_GROUP_EMPTY because MPI_GROUP_EMPTY in Fortran is an integer whose value is 1, which corresponds to the 2nd element in ompi_group_f_to_c_table, and the function MPI_Group_f2c is used to convert a Fortran index to a C pointer. Therefore, for Fortran's MPI_GROUP_EMPTY to work, the C object ompi_mpi_group_empty must be the 2nd element in ompi_group_f_to_c_table.
However, this ordering was broken by 6a406fb3c9ea74aa371bdff82e8eb01bf9820293, which introduced a new group at the beginning of ompi_group_f_to_c_table, causing Fortran's MPI_GROUP_EMPTY to point to a different group in the table.
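To make the index shift concrete, here is a minimal toy model in C. It is not Open MPI code: handle_table_t, table_add, table_lookup, and the string names are hypothetical stand-ins for the real pointer array and group objects. The only assumption carried over from the analysis above is that slots are handed out in registration order and that Fortran's MPI_GROUP_EMPTY is the integer 1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a Fortran-to-C handle table (hypothetical, not the
 * Open MPI implementation): slots are assigned in registration order,
 * starting at index 0. */
#define TABLE_CAPACITY 8

typedef struct {
    const char *slots[TABLE_CAPACITY];
    int size;
} handle_table_t;

/* Register an object; return the index it was assigned, or -1 if full. */
static int table_add(handle_table_t *table, const char *name)
{
    assert(table != NULL);
    if (table->size >= TABLE_CAPACITY) {
        return -1;
    }
    table->slots[table->size] = name;
    return table->size++;
}

/* Analogue of an f2c lookup: map an integer handle to the stored object. */
static const char *table_lookup(const handle_table_t *table, int index)
{
    if (index < 0 || index >= table->size) {
        return NULL;
    }
    return table->slots[index];
}

/* Fortran's MPI_GROUP_EMPTY is the integer 1, per the analysis above. */
enum { FORTRAN_GROUP_EMPTY = 1 };

/* Original registration order: null group first, then the empty group. */
static const char *lookup_with_original_order(void)
{
    handle_table_t table = { {0}, 0 };
    table_add(&table, "group_null");
    table_add(&table, "group_empty");
    return table_lookup(&table, FORTRAN_GROUP_EMPTY);
}

/* Order after the ULFM import: the failed-procs group is registered first,
 * so every later object moves down by one slot. */
static const char *lookup_with_extra_group_first(void)
{
    handle_table_t table = { {0}, 0 };
    table_add(&table, "all_failed_procs");
    table_add(&table, "group_null");
    table_add(&table, "group_empty");
    return table_lookup(&table, FORTRAN_GROUP_EMPTY);
}

/* True when the looked-up object is the empty group. */
static int resolves_to_empty_group(const char *resolved)
{
    return resolved != NULL && strcmp(resolved, "group_empty") == 0;
}
```

In this model, resolves_to_empty_group(lookup_with_original_order()) holds, while the same check on lookup_with_extra_group_first() fails because slot 1 now holds the object that was registered second.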
Opened https://github.com/open-mpi/ompi/pull/11807 to address the issue; it moves the initialization of ompi_group_all_failed_procs to after the empty group.
The PR has been merged and backported.
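The effect of that reordering can be sketched as a small ordering check in C. This is an illustration only, not the code from the PR: slot_of, predefined_slots_are_stable, and the two order arrays are hypothetical names. It assumes the null group is expected at slot 0 (it is registered first in the snippet above) and the empty group at slot 1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the slot an object receives when objects are registered in the
 * given order (slot == position in the array), or -1 if it is absent. */
static int slot_of(const char *const order[], int count, const char *name)
{
    assert(order != NULL);
    for (int i = 0; i < count; i++) {
        if (strcmp(order[i], name) == 0) {
            return i;
        }
    }
    return -1;
}

/* Order introduced by the ULFM import: extra group registered first. */
static const char *const broken_order[] = {
    "all_failed_procs", "group_null", "group_empty"
};

/* Order after the fix: extra group registered after the empty group. */
static const char *const fixed_order[] = {
    "group_null", "group_empty", "all_failed_procs"
};

enum { ORDER_COUNT = 3 };

/* The predefined groups keep the slots the Fortran constants expect
 * (assumed here: null group at 0, empty group at 1). */
static int predefined_slots_are_stable(const char *const order[], int count)
{
    return slot_of(order, count, "group_null") == 0
        && slot_of(order, count, "group_empty") == 1;
}
```

Calling predefined_slots_are_stable(fixed_order, ORDER_COUNT) returns 1, and the same call with broken_order returns 0. A check of this shape, run once after initialization, is one way such a regression could be caught early.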
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
main branch
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
built from source by mtt
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
22fe51cb7a961b6060fc5c48e659237cbe162566 ../3rd-party/openpmix (v1.1.3-3872-g22fe51cb)
ece4f3c45a07a069e5b8f9c5e641613dfcaeffc3 ../3rd-party/prrte (psrvr-v2.0.0rc1-4638-gece4f3c45a)
c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 ../config/oac (heads/main)
Please describe the system on which you are running
Details of the problem
As can be seen in the results of mtt's intel test suite, all Fortran tests that use MPI_GROUP_EMPTY are failing. For example, the MPI_Group_union1_f test, which is run by the following command:
Failed with the following log:
Basically, the test was trying to do a union between a user-created group and MPI_GROUP_EMPTY, but found MPI_GROUP_EMPTY to be invalid.