open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Error when using MPI_Comm_spawn with ULFM enabled #12585

Open goutham-kuncham opened 5 months ago

goutham-kuncham commented 5 months ago

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

Branch: main
Hash: 42c744e00eba2da1f904d2b94f33d2769e744867

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From git clone https://github.com/open-mpi/ompi.git

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

 213956cf00ff164230de06b9887a9412bf1e1dad 3rd-party/openpmix (v1.1.3-4027-g213956cf)
 1d867e84981077bffda9ad9d44ff415a3f6d91c4 3rd-party/prrte (psrvr-v2.0.0rc1-4783-g1d867e8498)
 dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (heads/main)

Please describe the system on which you are running


Details of the problem

I am encountering an error when attempting to spawn new ranks using the MPI_Comm_spawn API with User-Level Failure Mitigation (ULFM) enabled.

Reproducer Code:

/*
> mpicc ulmf-commspawn-bug.c -o ulmf-commspawn-bug
> mpirun -n <start size> -hostfile hosts --with-ft ulfm ulmf-commspawn-bug
*/
#include "mpi.h"
#include <stdio.h>
#include <signal.h>
#include <stdlib.h>

int main( int argc, char *argv[] )
{
    int isChild = 0;
    int rank, size;
    MPI_Comm parentcomm, intercomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int NUM_SPAWNS = 2;
    int errcodes[NUM_SPAWNS];

    MPI_Comm_get_parent(&parentcomm);
    if (parentcomm == MPI_COMM_NULL)
    {
        /* No parent communicator: this process was started by the launcher,
         * so it acts as a parent and spawns the children. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, NUM_SPAWNS, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm, errcodes);
        printf("Parent\n");
    }
    else
    {
        /* A parent communicator exists: this process was spawned. */
        MPI_Comm_get_parent(&intercomm);
        printf("Child\n");
    }

    fflush(stdout);
    MPI_Finalize();
    return 0;
}

Steps to reproduce

  1. Configure using the command below:
    ./configure --prefix=$PWD/build --with-ft=ulfm
  2. Compile the attached reproducer code using the command below:
    mpicc ulmf-commspawn-bug.c -o ulmf-commspawn-bug
  3. Run the program using the command below:
    mpirun -n 4 -hostfile hosts --with-ft ulfm ./ulmf-commspawn-bug

    Expected output

    In this code we launch 4 ranks with the launcher and spawn 2 more using the MPI_Comm_spawn API, so the expected output is:

    Parent
    Parent
    Parent
    Parent
    Child
    Child 

Current output / Error

--------------------------------------------------------------------------
Although some coll components are available on your system, none of
them said that they could be used for iagree on a new communicator.

This is extremely unusual -- either the "basic", "libnbc" or "self" components
should be able to be chosen for any communicator.  As such, this
likely means that something else is wrong (although you should double
check that the "basic", "libnbc" and "self" coll components are available on
your system -- check the output of the "ompi_info" command).
A coll module failed to finalize properly when a communicator that was
using it was destroyed.

This is somewhat unusual: the module itself may be at fault, or this
may be a symptom of another issue (e.g., a memory problem).
--------------------------------------------------------------------------
[a100-04:00000] *** An error occurred in MPI_Comm_spawn
[a100-04:00000] *** reported by process [3177512961,1]
[a100-04:00000] *** on communicator MPI_COMM_WORLD
[a100-04:00000] *** MPI_ERR_INTERN: internal error
[a100-04:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[a100-04:00000] ***    and MPI will try to terminate your MPI job as well)
[a100-04:00000] *** An error occurred in MPI_Init
[a100-04:00000] *** reported by process [3177512962,1]
[a100-04:00000] *** on a NULL communicator
[a100-04:00000] *** Unknown error
[a100-04:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[a100-04:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------

Note: I am unable to reproduce this issue with earlier Open MPI versions (e.g., the v5.0.2 tag).

hppritcha commented 5 months ago

could you rerun the test with

export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=^uct

and see if the test passes?
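
Equivalently, the same settings can be passed on the mpirun command line; for example, adapting the run command from above (a sketch, adjust as needed):

mpirun --mca pml ob1 --mca btl ^uct -n 4 -hostfile hosts --with-ft ulfm ./ulmf-commspawn-bug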

abouteiller commented 5 months ago

I see you are not using oversubscribe; do you have enough slots for the spawns in your hostfile?

If that simple error case is not what's happening, I'll try to replicate.
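
For this reproducer that means at least 6 slots (4 initial ranks + 2 spawned), e.g. a hostfile line like the following, with <hostname> as a placeholder:

<hostname> slots=8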

goutham-kuncham commented 5 months ago

could you rerun the test with

export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=^uct

and see if the test passes?

@hppritcha I am getting the same error even after exporting the above-mentioned environment variables.

I see you are not using oversubscribe, do you have enough slots for the spawns in your host file?

If that simple error case is not what's happening, I'll try to replicate.

@abouteiller Yes, I do have enough slots for spawn. Below is my hostfile.

$ cat hosts
mi100-05 slots=64

I also tested with --map-by node:OVERSUBSCRIBE but that didn't resolve this issue. Below is the command that I used.

mpirun -n 4 -hostfile hosts --with-ft ulfm --map-by node:OVERSUBSCRIBE ./ulmf-commspawn-bug

Additional observations:

The reproducer code works fine if I drop --with-ft ulfm, but I need ULFM for other tasks, so I cannot skip the flag.

$ mpirun -n 4 -hostfile hosts ./ulmf-commspawn-bug
Child
Child
Parent
Parent
Parent
Parent
bosilca commented 5 months ago

I don't think we're looking at the right issue here: the root cause is not related to spawn or dynamic process management, but to the selection of the collective algorithm for the iagree operation.
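
As a quick sanity check, the coll components that were actually built can be listed with ompi_info, as the help text above suggests, e.g.:

ompi_info | grep coll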

bosilca commented 5 months ago

@goutham-kuncham Does this patch fix your problem?

diff --git a/ompi/mca/coll/base/coll_base_comm_select.c b/ompi/mca/coll/base/coll_base_comm_select.c
index e67aab62c7..f4a15bc9d8 100644
--- a/ompi/mca/coll/base/coll_base_comm_select.c
+++ b/ompi/mca/coll/base/coll_base_comm_select.c
@@ -327,8 +327,8 @@ int mca_coll_base_comm_select(ompi_communicator_t * comm)
         CHECK_NULL(which_func, comm, scatter_init) ||
         CHECK_NULL(which_func, comm, scatterv_init) ||
 #if OPAL_ENABLE_FT_MPI
-        CHECK_NULL(which_func, comm, agree) ||
-        CHECK_NULL(which_func, comm, iagree) ||
+        ((OMPI_COMM_IS_INTRA(comm)) && CHECK_NULL(which_func, comm, agree)) ||
+        ((OMPI_COMM_IS_INTRA(comm)) && CHECK_NULL(which_func, comm, iagree)) ||
 #endif  /* OPAL_ENABLE_FT_MPI */
         CHECK_NULL(which_func, comm, reduce_local) ) {
         /* TODO -- Once the topology flags are set before coll_select then
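
For context, CHECK_NULL in this selection routine only verifies that the selected modules installed a non-NULL function pointer for each required collective. Roughly, the patched condition amounts to the sketch below (illustrative only, not the literal macro expansion):

/* Sketch: with the patch, agree/iagree are only required on intra-communicators,
 * since ftagree does not provide an inter-communicator iagree implementation. */
if (OMPI_COMM_IS_INTRA(comm) &&
    (NULL == comm->c_coll->coll_agree || NULL == comm->c_coll->coll_iagree)) {
    /* selection fails for this communicator */
}
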
abouteiller commented 5 months ago

In ompi/mca/coll/ftagree/coll_ftagree_module.c, line 130, we do not set an iera_inter function; we do not have an implementation for that function.

goutham-kuncham commented 3 months ago

Sorry for the delay in response.

@bosilca Thanks for the patch. I pulled the latest main branch and applied it. The sample reproducer code now spawns the ranks properly, but sometimes I encounter the following PMIX_ERROR:

[a100-08.cluster:3085211] PMIX ERROR: PMIX_ERROR in file prted/pmix/pmix_server_dyn.c at line 1095
[Parent rank 0] on Node a100-08.cluster.
[Spawned Rank 1] on Node a100-08.cluster.
[Spawned Rank 3] on Node a100-08.cluster.
[Spawned Rank 0] on Node a100-08.cluster.
[Spawned Rank 2] on Node a100-08.cluster.
Segmentation fault (core dumped)

Additionally, when I try to use MPI_Intercomm_merge and perform MPI_Allreduce on the resultant intra-communicator, I get the following error:

[a100-03.cluster:1692801] PMIX ERROR: PMIX_ERROR in file prted/pmix/pmix_server_dyn.c at line 1095
[Parent rank 1] on Node a100-03.cluster.
[Parent rank 0] on Node a100-03.cluster.
[Spawned Rank 0] on Node a100-03.cluster.
[Spawned Rank 2] on Node a100-03.cluster.
[Spawned Rank 1] on Node a100-04.cluster.
[Spawned Rank 3] on Node a100-04.cluster.
Allreduce result is 6
[a100-03:1692814:0:1692814] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xc07368)
[a100-03:1692826:0:1692826] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1eca868)

==== backtrace (tid:1692826) ====
 0 0x0000000000012ce0 __funlockfile()  :0
 1 0x000000000029df77 mca_pml_ob1_recv_frag_callback_ack()  ???:0
 2 0x00000000000e0d24 mca_btl_uct_am_handler()  ???:0
 3 0x000000000004ea1c uct_dc_mlx5_ep_check()  ???:0
 4 0x00000000000df692 mca_btl_uct_tl_progress.isra.0.part.1()  btl_uct_component.c:0
 5 0x00000000000dfa85 mca_btl_uct_component_progress()  btl_uct_component.c:0
 6 0x000000000002526c opal_progress()  ???:0
 7 0x0000000000090bed ompi_request_default_wait_all()  ???:0
 8 0x000000000007f4e8 ompi_dpm_dyn_finalize()  ???:0
 9 0x00000000000634a3 ompi_comm_finalize()  comm_init.c:0
10 0x000000000002f4fa opal_finalize_cleanup_domain()  ???:0
11 0x0000000000025bff opal_finalize()  ???:0
12 0x00000000000965eb ompi_rte_finalize()  ???:0
13 0x0000000000099094 ompi_mpi_instance_finalize_common()  instance.c:0
14 0x000000000009a5d5 ompi_mpi_instance_finalize()  ???:0
15 0x00000000000927cf ompi_mpi_finalize()  ???:0
16 0x0000000000400f73 main()  ???:0
17 0x000000000003acf3 __libc_start_main()  ???:0
18 0x0000000000400bbe _start()  ???:0
=================================
free(): invalid pointer
Aborted (core dumped)
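
For clarity, the MPI_Intercomm_merge / MPI_Allreduce test described above is roughly the following addition to the reproducer, after intercomm is obtained in both the parent and child branches (a minimal sketch; variable names are illustrative, and each rank contributes 1, which matches the "Allreduce result is 6" line for 6 total processes):

/* Merge the parent/child intercommunicator into one intracommunicator
 * and run an allreduce over it. */
MPI_Comm merged;
int high = (parentcomm == MPI_COMM_NULL) ? 0 : 1;   /* parents low, spawned ranks high */
MPI_Intercomm_merge(intercomm, high, &merged);
int one = 1, total = 0;
MPI_Allreduce(&one, &total, 1, MPI_INT, MPI_SUM, merged);
printf("Allreduce result is %d\n", total);
MPI_Comm_free(&merged);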

I also attempted to run the code with ompi v5.0.5, but I couldn't apply the provided patch to v5.0.5 because the mca_coll_base_comm_select function differs from the main branch; running it there results in the following error:

[a100-04:1104797:0:1104797] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x141)
[a100-04:1104798:0:1104798] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x138e0)
==== backtrace (tid:1104798) ====
 0 0x0000000000012ce0 __funlockfile()  :0
 1 0x00000000001776b3 mca_coll_han_comm_create()  ???:0
 2 0x0000000000164da7 mca_coll_han_bcast_intra()  ???:0
 3 0x0000000000074792 ompi_dpm_connect_accept()  ???:0
 4 0x000000000007d962 ompi_dpm_dyn_init()  ???:0
 5 0x00000000000905e2 ompi_mpi_init()  ???:0
 6 0x00000000000c1ebe MPI_Init()  ???:0
 7 0x0000000000400b2c main()  ???:0
 8 0x000000000003acf3 __libc_start_main()  ???:0
 9 0x00000000004009ae _start()  ???:0
=================================

@abouteiller Yeah, I also noticed that the implementation of the mca_coll_ftagree_iera_inter function is missing.

Any guidance on resolving these issues would be greatly appreciated!