open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Error when using MPI_Comm_spawn with ULFM enabled #12585

Status: Open. Opened by goutham-kuncham 1 month ago.

goutham-kuncham commented 1 month ago

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

Branch: main
Hash: 42c744e00eba2da1f904d2b94f33d2769e744867

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From git clone https://github.com/open-mpi/ompi.git

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

 213956cf00ff164230de06b9887a9412bf1e1dad 3rd-party/openpmix (v1.1.3-4027-g213956cf)
 1d867e84981077bffda9ad9d44ff415a3f6d91c4 3rd-party/prrte (psrvr-v2.0.0rc1-4783-g1d867e8498)
 dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (heads/main)

Please describe the system on which you are running


Details of the problem

I am encountering an error when attempting to spawn new ranks using the MPI_Comm_spawn API with User-Level Failure Mitigation (ULFM) enabled.

Reproducer Code:

/*
> mpicc ulmf-commspawn-bug.c -o ulmf-commspawn-bug
> mpirun -n <start size> -hostfile hosts --with-ft ulfm ulmf-commspawn-bug
*/
#include "mpi.h"
#include <stdio.h>
#include <signal.h>
#include <stdlib.h>

int main( int argc, char *argv[] )
{
    int rank, size;
    MPI_Comm parentcomm, intercomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int NUM_SPAWNS = 2;
    int errcodes[NUM_SPAWNS];

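    /* Children created by MPI_Comm_spawn see a non-NULL parent
       communicator; the originally launched ranks see MPI_COMM_NULL. */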
    MPI_Comm_get_parent(&parentcomm);
    if (parentcomm == MPI_COMM_NULL)
    {
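        /* Rank 0 is the root of the spawn: launch NUM_SPAWNS more copies of this binary. */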
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, NUM_SPAWNS, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm, errcodes );
        printf("Parent\n");
    }
    else
    {
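        /* In the children: obtain the inter-communicator back to the parent job. */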
        MPI_Comm_get_parent(&intercomm);
        printf("Child\n");
    }

    fflush(stdout);
    MPI_Finalize();
    return 0;
}
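
As a side note, MPI_Comm_spawn also reports per-child status through the errcodes array. A minimal fragment that could be placed right after the spawn call (illustrative, not part of the original reproducer; it is only reached if the error handler is first switched to MPI_ERRORS_RETURN via MPI_Comm_set_errhandler, since the default MPI_ERRORS_ARE_FATAL aborts before the call returns):

/* Illustrative addition: report any child that failed to start. */
for (int i = 0; i < NUM_SPAWNS; i++) {
    if (errcodes[i] != MPI_SUCCESS) {
        fprintf(stderr, "spawn of child %d failed with error code %d\n", i, errcodes[i]);
    }
}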

Steps to reproduce

  1. Configure using the command below:
    ./configure --prefix=$PWD/build --with-ft=ulfm
  2. Compile the attached reproducer using the command below:
    mpicc ulmf-commspawn-bug.c -o ulmf-commspawn-bug
  3. Run the program using the command below:
    mpirun -n 4 -hostfile hosts --with-ft ulfm ./ulmf-commspawn-bug

Expected output

This run launches 4 ranks via mpirun, and the parents then spawn 2 more ranks with the MPI_Comm_spawn API, so the expected output (lines may interleave in any order) is:

Parent
Parent
Parent
Parent
Child
Child

Current output / Error

--------------------------------------------------------------------------
Although some coll components are available on your system, none of
them said that they could be used for iagree on a new communicator.

This is extremely unusual -- either the "basic", "libnbc" or "self" components
should be able to be chosen for any communicator.  As such, this
likely means that something else is wrong (although you should double
check that the "basic", "libnbc" and "self" coll components are available on
your system -- check the output of the "ompi_info" command).
A coll module failed to finalize properly when a communicator that was
using it was destroyed.

This is somewhat unusual: the module itself may be at fault, or this
may be a symptom of another issue (e.g., a memory problem).
--------------------------------------------------------------------------
[a100-04:00000] *** An error occurred in MPI_Comm_spawn
[a100-04:00000] *** reported by process [3177512961,1]
[a100-04:00000] *** on communicator MPI_COMM_WORLD
[a100-04:00000] *** MPI_ERR_INTERN: internal error
[a100-04:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[a100-04:00000] ***    and MPI will try to terminate your MPI job as well)
[a100-04:00000] *** An error occurred in MPI_Init
[a100-04:00000] *** reported by process [3177512962,1]
[a100-04:00000] *** on a NULL communicator
[a100-04:00000] *** Unknown error
[a100-04:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[a100-04:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------

Note: I am unable to reproduce this issue with earlier Open MPI versions (e.g., the v5.0.2 tag).

hppritcha commented 1 month ago

could you rerun the test with

export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=^uct

and see if the test passes?
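
For reference, the same MCA selection can be passed on the mpirun command line instead of through the environment (the two forms are equivalent), e.g.:

mpirun --mca pml ob1 --mca btl ^uct -n 4 -hostfile hosts --with-ft ulfm ./ulmf-commspawn-bug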

abouteiller commented 1 month ago

I see you are not using oversubscription; do you have enough slots for the spawns in your hostfile?

If that simple error case is not what's happening, I'll try to replicate.

goutham-kuncham commented 1 month ago

> could you rerun the test with
>
> export OMPI_MCA_pml=ob1
> export OMPI_MCA_btl=^uct
>
> and see if the test passes?

@hppritcha I am getting the same error even after exporting the above mentioned env variables.

> I see you are not using oversubscription; do you have enough slots for the spawns in your hostfile?
>
> If that simple error case is not what's happening, I'll try to replicate.

@abouteiller Yes, I do have enough slots for spawn. Below is my hostfile.

$ cat hosts
mi100-05 slots=64

I also tested with --map-by node:OVERSUBSCRIBE, but that didn't resolve the issue. Below is the command I used.

mpirun -n 4 -hostfile hosts --with-ft ulfm --map-by node:OVERSUBSCRIBE ./ulmf-commspawn-bug

Additional observations:

The reproducer works fine if I omit --with-ft ulfm, but I need ULFM for other tasks, so I cannot drop the flag.

$ mpirun -n 4 -hostfile hosts ./ulmf-commspawn-bug
Child
Child
Parent
Parent
Parent
Parent

bosilca commented 1 month ago

I don't think we're looking at the right issue here: the root cause is not related to spawn or to dynamic process management, but to the selection of the collective algorithm for the iagree operation.
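
As the help message suggests, the available coll components can be double-checked with ompi_info (assuming the ompi_info from this build is in PATH):

$ ompi_info | grep coll

A build configured with --with-ft=ulfm should list the ftagree component alongside basic, libnbc, and self.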

bosilca commented 1 month ago

@goutham-kuncham Is this patch fixing your problem ?

diff --git a/ompi/mca/coll/base/coll_base_comm_select.c b/ompi/mca/coll/base/coll_base_comm_select.c
index e67aab62c7..f4a15bc9d8 100644
--- a/ompi/mca/coll/base/coll_base_comm_select.c
+++ b/ompi/mca/coll/base/coll_base_comm_select.c
@@ -327,8 +327,8 @@ int mca_coll_base_comm_select(ompi_communicator_t * comm)
         CHECK_NULL(which_func, comm, scatter_init) ||
         CHECK_NULL(which_func, comm, scatterv_init) ||
 #if OPAL_ENABLE_FT_MPI
-        CHECK_NULL(which_func, comm, agree) ||
-        CHECK_NULL(which_func, comm, iagree) ||
+        ((OMPI_COMM_IS_INTRA(comm)) && CHECK_NULL(which_func, comm, agree)) ||
+        ((OMPI_COMM_IS_INTRA(comm)) && CHECK_NULL(which_func, comm, iagree)) ||
 #endif  /* OPAL_ENABLE_FT_MPI */
         CHECK_NULL(which_func, comm, reduce_local) ) {
         /* TODO -- Once the topology flags are set before coll_select then
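
If I read the patch correctly, it restricts the agree/iagree NULL checks to intra-communicators, so module selection no longer fails on the parent/child inter-communicator created by the spawn, where no iagree implementation is installed.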
abouteiller commented 1 month ago

In ompi/mca/coll/ftagree/coll_ftagree_module.c, line 130, we do not set an iera_inter function; we do not have an implementation for that function. The iagree pointer therefore stays NULL on inter-communicators, which is exactly what the selection check patched above trips over.
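
For context, the communicator returned by MPI_Comm_spawn (and by MPI_Comm_get_parent in the children) is an inter-communicator, which is why this missing inter-communicator path is hit. A minimal, illustrative check that could be added to the reproducer to confirm this:

/* Illustrative: confirm that the spawn communicator is an inter-communicator. */
int is_inter = 0;
MPI_Comm_test_inter(intercomm, &is_inter);
printf("intercomm is an %s-communicator\n", is_inter ? "inter" : "intra");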