open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org
Other
2.16k stars 858 forks source link

Is it possible for MPI_Comm_spawn-ed children to outlive their parents in OpenMPI 5? #12867

Closed Petter-Programs closed 2 weeks ago

Petter-Programs commented 2 weeks ago

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

5.1.0 (but also tested on 5.0.5)

> git log --pretty=format:'%H' -n 1
f31a9be313a0a6be7db2117050b1b608c115c4d6

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone https://github.com/open-mpi/ompi.git

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

e32e0179bc6bd1637f92690511ce6091719fa046 3rd-party/openpmix (v1.1.3-4036-ge32e0179) 8d0a25b8c9b868bc66de8df3b8abd7f477f07a0f 3rd-party/prrte (psrvr-v2.0.0rc1-4799-g8d0a25b8c9) dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (dfff675)

Please describe the system on which you are running


Details of the problem

In OpenMPI5, children spawned by MPI_Comm_spawn are terminated once the last process belonging to the parent communicator (original MPI_COMM_WORLD) returns. This is not the behavior in OpenMPI4, where the children can live on past this point.

I am wondering if this is intended behavior, and if there are any way to configure this not to happen. Here's a code snippet to reproduce:

#include <mpi.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

#define ROOT 0
#define MAX_PROCS 2
#define EXTRA_CHILD_TIME 10

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    char const child[] = "child";
    char const parent[] = "parent";

    MPI_Comm inter_comm;
    MPI_Comm parent_comm;

    int error_codes[MAX_PROCS];

    int i_am_parent;

    MPI_Comm_get_parent(&parent_comm);

    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (parent_comm == MPI_COMM_NULL)
    {
        i_am_parent = 1;

        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, MAX_PROCS, MPI_INFO_NULL, ROOT, MPI_COMM_WORLD, &inter_comm, error_codes);

        MPI_Comm_free(&inter_comm);

        if (my_rank == ROOT)
        {
            printf("Root parent terminating...\n");
        }
        else
        {
            sleep(5);
            printf("Other parent exiting\n");
        }
    }

    else
    {
        i_am_parent = 0;

        MPI_Comm_free(&parent_comm);

        for (int i = 0; i < EXTRA_CHILD_TIME; i++)
        {
            if (my_rank == ROOT)
            {
                printf("Child sleeping for %d more seconds...\n", EXTRA_CHILD_TIME - i);
            }
            sleep(1);
        }

        if (my_rank == ROOT)
        {
            printf("Child terminating...\n");
        }
    }

    MPI_Finalize();

    printf("Rank %d (%s) passed MPI finalize\n", my_rank, i_am_parent ? parent : child);

In OpenMPI 4.1.5, the output is along the lines of

Child sleeping for 10 more seconds...
Root parent terminating...
Child sleeping for 9 more seconds...
Child sleeping for 8 more seconds...
Child sleeping for 7 more seconds...
Child sleeping for 6 more seconds...
Other parent exiting
Child sleeping for 5 more seconds...
Rank 0 (parent) passed MPI finalize
Rank 1 (parent) passed MPI finalize
Child sleeping for 4 more seconds...
Child sleeping for 3 more seconds...
Child sleeping for 2 more seconds...
Child sleeping for 1 more seconds...
Child terminating...
Rank 1 (child) passed MPI finalize
Rank 0 (child) passed MPI finalize

However, in OpenMPI 5.0.5, 5.1.0, the program fully stops after the parents terminate, without any error message or sign of abnormal termination.

Root parent terminating...
Child sleeping for 10 more seconds...
Child sleeping for 9 more seconds...
Child sleeping for 8 more seconds...
Child sleeping for 7 more seconds...
Child sleeping for 6 more seconds...
Other parent exiting
Child sleeping for 5 more seconds...
Rank 0 (parent) passed MPI finalize
Rank 1 (parent) passed MPI finalize

I ran this with

mpirun -np 2 ./only-child-procs

and then

mpirun --mca pml ob1 --mca btl ^uct -np 2 ./only-child-procs

And the results are the same in both for both versions.

I figured out that keeping one of the original processes 'hostage' for the lifetime of the children sidesteps the issue and produces the expected output in OMPI5;

    ... //same code initially
    MPI_Finalize();
    printf("Rank %d (%s) passed MPI finalize\n", my_rank, i_am_parent ? parent : child);

    if (i_am_parent && my_rank == 0)
    {
        int file_exists;

        while (1)
        {
            file_exists = system("[ -f .iamdone ]");

            if (file_exists == 0)
            {
                system("rm .iamdone");
                break;
            }

            sleep(1);
        }
    }
    else if (!i_am_parent)
    {
        system("touch .iamdone");
    }

Obviously it is not an ideal solution, but it might give you some insight as to what is going on.

rhc54 commented 2 weeks ago

It's a case of "your default is my bug", I'm afraid. We took some heat over the behavior you are requesting as it resulted in jobs staying alive (and generating costs) after the parent abnormally terminated. So we proposed to kill all child jobs in that scenario. But then people complained because the parent needed to terminate itself (e.g., computed something that indicated it should die), and the child jobs kept running until end of the allocation. Obviously, we could require them to include some communication for the parent to tell its children to die, but that creates its own complications.

So we now have a default dependency that terminates all child jobs upon termination of the parent. 🤷‍♂️

The runtime is fully capable of allowing the child jobs to continue - in fact, there is an option for telling it to do so.

#define PMIX_SPAWN_CHILD_SEP                "pmix.spchildsep"       // (bool) Treat the spawned job as independent from the parent - i.e, do not
                                                                    //        terminate the spawned job if the parent terminates.

However, I don't know if OMPI exposes that option - I don't see it in the code base, but it might get passed through if you include it in an MPI_Info. I'd have to look at it a bit - can do so once I'm done with some other things. Then it becomes a debate over the default setting again, but that's another matter.

Petter-Programs commented 2 weeks ago

Thank you for shining some light on this issue! From my perspective, it would be very useful to have some way to toggle the behavior whether or not it ends up being the default. Also, it could be worthwhile to include somewhere (at least when some level of additional verbosity is enabled) a notice to say that the child jobs were terminated for this reason.

Just to note that I had a shot at trying to pass the spawn child jobs separately option to MPI_Comm_spawn through an info object with no success (although I'm not sure what format would be the proper one, I pasted the different options I tried down below). I suspect that it may not be possible to do it this way.

MPI_Info_set(keep_alive_info, "pmix.spchildsep", "1");
MPI_Info_set(keep_alive_info, "PMIX_SPAWN_CHILD_SEP", "1");
MPI_Info_set(keep_alive_info, "PMIX_ENVAR", "pmix.spchildsep=1");
MPI_Info_set(keep_alive_info, "PMIX_ENVAR", "PMIX_SPAWN_CHILD_SEP=1");
rhc54 commented 2 weeks ago

Yeah, it looks like I missed completing the implementation in the runtime - my apologies. Shouldn't be hard to do, so hopefully will plug the gap soon. The thought about the termination report is a good one - will see what can be done.

rhc54 commented 2 weeks ago

Are you using an external build of PRRTE? Or the one internal to the OMPI distribution? I ask because I have a fix upstream, but it will take quite some time to trickle down into an OMPI release. If you are using an external build, or are able/willing to do so, then I'd appreciate if you could try the fix and verify it works for you.

rhc54 commented 2 weeks ago

Rats - missing the termination report. Will add that in a bit.

Petter-Programs commented 2 weeks ago

I was using one internal to the OMPI distribution, but I'd be happy to use an external one and test your fix!

rhc54 commented 2 weeks ago

Okay, it's been committed to the upstream PRRTE master branch if you have a chance to clone and test it.

rhc54 commented 2 weeks ago

Unfortunately, it won't be quite as simple to test as I had hoped. The internal version of PMIx is somewhat out-of-date as well and doesn't include the PMIX_SPAWN_CHILD_SEP attribute, and it looks like the OMPI comm_spawn code doesn't have a way of allowing you to just pass an arbitrary key.

You have a couple of choices. I've provided a patch for the OMPI portion here:

diff --git a/ompi/dpm/dpm.c b/ompi/dpm/dpm.c
index d6cfd5aa3c..0903de5449 100644
--- a/ompi/dpm/dpm.c
+++ b/ompi/dpm/dpm.c
@@ -1554,6 +1554,23 @@ int ompi_dpm_spawn(int count, const char *array_of_commands[],
                 OBJ_RELEASE(info_str);
             }
 #endif
+
+            /* check for 'child sep' - job-level key */
+            ompi_info_get_bool(array_of_info[i], "PMIX_SPAWN_CHILD_SEP", &local_spawn, &flag);
+            if ( flag ) {
+                info = OBJ_NEW(opal_info_item_t);
+                PMIX_INFO_LOAD(&info->info, PMIX_SPAWN_CHILD_SEP, &local_spawn, PMIX_BOOL);
+                opal_list_append(&job_info, &info->super);
+            } else {
+                checkkey = PMIx_Get_attribute_string("PMIX_SPAWN_CHILD_SEP");
+                ompi_info_get_bool(array_of_info[i], "checkkey", &local_spawn, &flag);
+                if ( flag ) {
+                    info = OBJ_NEW(opal_info_item_t);
+                    PMIX_INFO_LOAD(&info->info, PMIX_SPAWN_CHILD_SEP, &local_spawn, PMIX_BOOL);
+                    opal_list_append(&job_info, &info->super);
+                }
+            }
+
         }

         /* default value: If the user did not tell us where to look for the

In order to build it, you could point OMPI at an external PMIx build so the new attribute would be defined. If you go this route, then just turn off the internal PRRTE build as you won't need it anyway:

--with-libevent=external --with-hwloc=external --with-pmix=<path> --without-prrte

Should get it all to work. I created a PMIx-based example to validate the implementation - it's in PRRTE's example/dynamic-dep.c code. The support is available in PMIx and PRRTE master branches, as well as at the head of their respective release branches (PMIx v5.0 and PRRTE v3.0) - so you can use whichever combination you prefer.

Hope that isn't too cumbersome. There is an alternative path, should this prove too daunting. If you are using a GitHub clone of the v5.0 branch of the OMPI code, then all you would need to do is:

$ cd 3rd-party/openpmix
$ git checkout v5.0
$ git pull
$ cd ../prrte
$ git checkout v3.0
$ git pull

Then do the usual build. This will be using the head of the respective release branches.

Petter-Programs commented 2 weeks ago

I am happy to report that this change seems to work without any issues on my system :+1: I git clone'd the latest PRRTE and PMIX releases and tested (with prterun) your dynamic-dep.c code as well as the code I provided as part of this issue. With the code change you provided for OMPI, toggling the behavior works fine too. Thank you for looking into this!

rhc54 commented 2 weeks ago

👍 Thanks for the confirmation!