goutham-kuncham opened this issue 5 months ago
could you rerun the test with
export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=^uct
and see if the test passes?
I see you are not using oversubscribe, do you have enough slots for the spawns in your host file?
If that simple error case is not what's happening, I'll try to replicate.
> could you rerun the test with
> export OMPI_MCA_pml=ob1
> export OMPI_MCA_btl=^uct
> and see if the test passes?
@hppritcha I am getting the same error even after exporting the above mentioned env variables.
> I see you are not using oversubscribe, do you have enough slots for the spawns in your host file?
> If that simple error case is not what's happening, I'll try to replicate.
@abouteiller Yes, I do have enough slots for spawn. Below is my hostfile.
$ cat hosts
mi100-05 slots=64
I also tested with --map-by node:OVERSUBSCRIBE, but that didn't resolve the issue. Below is the command that I used:
mpirun -n 4 -hostfile hosts --with-ft ulfm --map-by node:OVERSUBSCRIBE ./ulmf-commspawn-bug
The reproducer code works fine if I skip --with-ft ulfm, but I want to use ULFM for other tasks, so I cannot drop the flag for my code:
$ mpirun -n 4 -hostfile hosts ./ulmf-commspawn-bug
Child
Child
Parent
Parent
Parent
Parent
I don't think we're looking at the right issue here: the root cause is not related to spawn or to dynamic process management, but to the selection of the collective algorithm for the iagree operation.
@goutham-kuncham Does this patch fix your problem?
diff --git a/ompi/mca/coll/base/coll_base_comm_select.c b/ompi/mca/coll/base/coll_base_comm_select.c
index e67aab62c7..f4a15bc9d8 100644
--- a/ompi/mca/coll/base/coll_base_comm_select.c
+++ b/ompi/mca/coll/base/coll_base_comm_select.c
@@ -327,8 +327,8 @@ int mca_coll_base_comm_select(ompi_communicator_t * comm)
CHECK_NULL(which_func, comm, scatter_init) ||
CHECK_NULL(which_func, comm, scatterv_init) ||
#if OPAL_ENABLE_FT_MPI
- CHECK_NULL(which_func, comm, agree) ||
- CHECK_NULL(which_func, comm, iagree) ||
+ ((OMPI_COMM_IS_INTRA(comm)) && CHECK_NULL(which_func, comm, agree)) ||
+ ((OMPI_COMM_IS_INTRA(comm)) && CHECK_NULL(which_func, comm, iagree)) ||
#endif /* OPAL_ENABLE_FT_MPI */
CHECK_NULL(which_func, comm, reduce_local) ) {
/* TODO -- Once the topology flags are set before coll_select then
In ompi/mca/coll/ftagree/coll_ftagree_module.c, at line 130, we do not set an iera_inter function; we do not have an implementation for that function.
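For context, the operation being selected here is the ULFM agreement. A minimal user-level sketch of the non-blocking variant is below (assuming an Open MPI build with --with-ft ulfm so the MPIX_ fault-tolerance extensions are available; on an inter-communicator such as the one produced by MPI_Comm_spawn there is currently no ftagree implementation to back it):

```c
/* Sketch: the ULFM non-blocking agreement ("iagree") that the coll framework
 * needs a backing implementation for. Assumes Open MPI configured with
 * --with-ft ulfm so the MPIX_ fault-tolerance extensions are available. */
#include <mpi.h>
#include <mpi-ext.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int flag = 1;          /* value agreed (bitwise AND) across processes */
    MPI_Request req;

    MPI_Init(&argc, &argv);

    /* Works on an intra-communicator; the missing piece discussed above is
     * the equivalent for inter-communicators. */
    MPIX_Comm_iagree(MPI_COMM_WORLD, &flag, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    printf("agreed flag = %d\n", flag);

    MPI_Finalize();
    return 0;
}
```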
Sorry for the delayed response.
@bosilca Thanks for the patch. I pulled the latest main branch and applied it. The sample reproducer now spawns the ranks properly, but sometimes I encounter the following PMIX_ERROR:
[a100-08.cluster:3085211] PMIX ERROR: PMIX_ERROR in file prted/pmix/pmix_server_dyn.c at line 1095
[Parent rank 0] on Node a100-08.cluster.
[Spawned Rank 1] on Node a100-08.cluster.
[Spawned Rank 3] on Node a100-08.cluster.
[Spawned Rank 0] on Node a100-08.cluster.
[Spawned Rank 2] on Node a100-08.cluster.
Segmentation fault (core dumped)
Additionally, when I use MPI_Intercomm_merge and perform an MPI_Allreduce on the resulting intra-communicator (a minimal sketch of that pattern follows after the backtrace below), I get the following error:
[a100-03.cluster:1692801] PMIX ERROR: PMIX_ERROR in file prted/pmix/pmix_server_dyn.c at line 1095
[Parent rank 1] on Node a100-03.cluster.
[Parent rank 0] on Node a100-03.cluster.
[Spawned Rank 0] on Node a100-03.cluster.
[Spawned Rank 2] on Node a100-03.cluster.
[Spawned Rank 1] on Node a100-04.cluster.
[Spawned Rank 3] on Node a100-04.cluster.
Allreduce result is 6
[a100-03:1692814:0:1692814] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xc07368)
[a100-03:1692826:0:1692826] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1eca868)
==== backtrace (tid:1692826) ====
0 0x0000000000012ce0 __funlockfile() :0
1 0x000000000029df77 mca_pml_ob1_recv_frag_callback_ack() ???:0
2 0x00000000000e0d24 mca_btl_uct_am_handler() ???:0
3 0x000000000004ea1c uct_dc_mlx5_ep_check() ???:0
4 0x00000000000df692 mca_btl_uct_tl_progress.isra.0.part.1() btl_uct_component.c:0
5 0x00000000000dfa85 mca_btl_uct_component_progress() btl_uct_component.c:0
6 0x000000000002526c opal_progress() ???:0
7 0x0000000000090bed ompi_request_default_wait_all() ???:0
8 0x000000000007f4e8 ompi_dpm_dyn_finalize() ???:0
9 0x00000000000634a3 ompi_comm_finalize() comm_init.c:0
10 0x000000000002f4fa opal_finalize_cleanup_domain() ???:0
11 0x0000000000025bff opal_finalize() ???:0
12 0x00000000000965eb ompi_rte_finalize() ???:0
13 0x0000000000099094 ompi_mpi_instance_finalize_common() instance.c:0
14 0x000000000009a5d5 ompi_mpi_instance_finalize() ???:0
15 0x00000000000927cf ompi_mpi_finalize() ???:0
16 0x0000000000400f73 main() ???:0
17 0x000000000003acf3 __libc_start_main() ???:0
18 0x0000000000400bbe _start() ???:0
=================================
free(): invalid pointer
Aborted (core dumped)
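For reference, a minimal sketch of the merge + allreduce step mentioned above (illustrative only, not the actual test code; `intercomm` would be the inter-communicator from MPI_Comm_spawn / MPI_Comm_get_parent, and summing 1 per rank is just one plausible way to arrive at the "Allreduce result is 6" line with 6 merged processes):

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch of the MPI_Intercomm_merge + MPI_Allreduce pattern described above.
 * `intercomm` connects the parent and spawned groups; `is_child` is nonzero
 * on the spawned side so that parents get the lower ranks after the merge. */
static void merge_and_allreduce(MPI_Comm intercomm, int is_child)
{
    MPI_Comm merged;
    int rank, one = 1, total;

    MPI_Intercomm_merge(intercomm, is_child, &merged);
    MPI_Comm_rank(merged, &rank);

    /* Each process contributes 1, so the result equals the merged group size
     * (6 would correspond to 2 parents + 4 spawned ranks, as in the log). */
    MPI_Allreduce(&one, &total, 1, MPI_INT, MPI_SUM, merged);
    if (0 == rank) {
        printf("Allreduce result is %d\n", total);
    }

    MPI_Comm_free(&merged);
}
```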
I also attempted to run the code with Open MPI v5.0.5, but I couldn't apply the provided patch there because the mca_coll_base_comm_select function differs from the main branch. Running on v5.0.5 results in the following error:
[a100-04:1104797:0:1104797] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x141)
[a100-04:1104798:0:1104798] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x138e0)
==== backtrace (tid:1104798) ====
0 0x0000000000012ce0 __funlockfile() :0
1 0x00000000001776b3 mca_coll_han_comm_create() ???:0
2 0x0000000000164da7 mca_coll_han_bcast_intra() ???:0
3 0x0000000000074792 ompi_dpm_connect_accept() ???:0
4 0x000000000007d962 ompi_dpm_dyn_init() ???:0
5 0x00000000000905e2 ompi_mpi_init() ???:0
6 0x00000000000c1ebe MPI_Init() ???:0
7 0x0000000000400b2c main() ???:0
8 0x000000000003acf3 __libc_start_main() ???:0
9 0x00000000004009ae _start() ???:0
=================================
@abouteiller Yeah, I also noticed that the implementation of the mca_coll_ftagree_iera_inter function is missing.
Any guidance on resolving these issues would be greatly appreciated!
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
Please describe the system on which you are running:
Rocky Linux 8.5
x86_64 (AMD EPYC 7713 64-Core Processor)
InfiniBand (but I am able to reproduce the error intra-node as well)
Details of the problem
I am encountering an error when attempting to spawn new ranks using the MPI_Comm_spawn API with User-Level Failure Mitigation (ULFM) enabled.
Reproducer Code:
Steps to reproduce
Expected output
In this code we launch 4 ranks via the launcher and try to spawn 2 more ranks using the MPI_Comm_spawn API.
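The attached reproducer itself is not shown here; a minimal sketch of the kind of program described (parent ranks spawning 2 children via MPI_Comm_spawn, with the binary name and print format assumed from the command line and logs above) might look like:

```c
/* Minimal sketch, not the original reproducer: every process launched by
 * mpirun acts as a parent and collectively spawns 2 children running the
 * same binary (name assumed from the command line above). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, intercomm;
    int rank, len;
    char node[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(node, &len);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* Parent side: spawn 2 more ranks of this same executable. */
        MPI_Comm_spawn("./ulmf-commspawn-bug", MPI_ARGV_NULL, 2,
                       MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                       &intercomm, MPI_ERRCODES_IGNORE);
        printf("[Parent rank %d] on Node %s.\n", rank, node);
    } else {
        /* Child side: the parent inter-communicator comes from the runtime. */
        intercomm = parent;
        printf("[Spawned Rank %d] on Node %s.\n", rank, node);
    }

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```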
Current output / Error
Note: I am unable to reproduce this issue with earlier OMPI versions (e.g., the v5.0.2 tag).