Closed Petter-Programs closed 2 weeks ago
It's a case of "your default is my bug", I'm afraid. We took some heat over the behavior you are requesting as it resulted in jobs staying alive (and generating costs) after the parent abnormally terminated. So we proposed to kill all child jobs in that scenario. But then people complained because the parent needed to terminate itself (e.g., computed something that indicated it should die), and the child jobs kept running until end of the allocation. Obviously, we could require them to include some communication for the parent to tell its children to die, but that creates its own complications.
So we now have a default dependency that terminates all child jobs upon termination of the parent. 🤷♂️
The runtime is fully capable of allowing the child jobs to continue - in fact, there is an option for telling it to do so.
#define PMIX_SPAWN_CHILD_SEP "pmix.spchildsep" // (bool) Treat the spawned job as independent from the parent - i.e, do not
// terminate the spawned job if the parent terminates.
However, I don't know if OMPI exposes that option - I don't see it in the code base, but it might get passed through if you include it in an MPI_Info
. I'd have to look at it a bit - can do so once I'm done with some other things. Then it becomes a debate over the default setting again, but that's another matter.
Thank you for shining some light on this issue! From my perspective, it would be very useful to have some way to toggle the behavior whether or not it ends up being the default. Also, it could be worthwhile to include somewhere (at least when some level of additional verbosity is enabled) a notice to say that the child jobs were terminated for this reason.
Just to note that I had a shot at trying to pass the spawn child jobs separately option to MPI_Comm_spawn through an info object with no success (although I'm not sure what format would be the proper one, I pasted the different options I tried down below). I suspect that it may not be possible to do it this way.
MPI_Info_set(keep_alive_info, "pmix.spchildsep", "1");
MPI_Info_set(keep_alive_info, "PMIX_SPAWN_CHILD_SEP", "1");
MPI_Info_set(keep_alive_info, "PMIX_ENVAR", "pmix.spchildsep=1");
MPI_Info_set(keep_alive_info, "PMIX_ENVAR", "PMIX_SPAWN_CHILD_SEP=1");
Yeah, it looks like I missed completing the implementation in the runtime - my apologies. Shouldn't be hard to do, so hopefully will plug the gap soon. The thought about the termination report is a good one - will see what can be done.
Are you using an external build of PRRTE? Or the one internal to the OMPI distribution? I ask because I have a fix upstream, but it will take quite some time to trickle down into an OMPI release. If you are using an external build, or are able/willing to do so, then I'd appreciate if you could try the fix and verify it works for you.
Rats - missing the termination report. Will add that in a bit.
I was using one internal to the OMPI distribution, but I'd be happy to use an external one and test your fix!
Okay, it's been committed to the upstream PRRTE master branch if you have a chance to clone and test it.
Unfortunately, it won't be quite as simple to test as I had hoped. The internal version of PMIx is somewhat out-of-date as well and doesn't include the PMIX_SPAWN_CHILD_SEP attribute, and it looks like the OMPI comm_spawn code doesn't have a way of allowing you to just pass an arbitrary key.
You have a couple of choices. I've provided a patch for the OMPI portion here:
diff --git a/ompi/dpm/dpm.c b/ompi/dpm/dpm.c
index d6cfd5aa3c..0903de5449 100644
--- a/ompi/dpm/dpm.c
+++ b/ompi/dpm/dpm.c
@@ -1554,6 +1554,23 @@ int ompi_dpm_spawn(int count, const char *array_of_commands[],
OBJ_RELEASE(info_str);
}
#endif
+
+ /* check for 'child sep' - job-level key */
+ ompi_info_get_bool(array_of_info[i], "PMIX_SPAWN_CHILD_SEP", &local_spawn, &flag);
+ if ( flag ) {
+ info = OBJ_NEW(opal_info_item_t);
+ PMIX_INFO_LOAD(&info->info, PMIX_SPAWN_CHILD_SEP, &local_spawn, PMIX_BOOL);
+ opal_list_append(&job_info, &info->super);
+ } else {
+ checkkey = PMIx_Get_attribute_string("PMIX_SPAWN_CHILD_SEP");
+ ompi_info_get_bool(array_of_info[i], "checkkey", &local_spawn, &flag);
+ if ( flag ) {
+ info = OBJ_NEW(opal_info_item_t);
+ PMIX_INFO_LOAD(&info->info, PMIX_SPAWN_CHILD_SEP, &local_spawn, PMIX_BOOL);
+ opal_list_append(&job_info, &info->super);
+ }
+ }
+
}
/* default value: If the user did not tell us where to look for the
In order to build it, you could point OMPI at an external PMIx build so the new attribute would be defined. If you go this route, then just turn off the internal PRRTE build as you won't need it anyway:
--with-libevent=external --with-hwloc=external --with-pmix=<path> --without-prrte
Should get it all to work. I created a PMIx-based example to validate the implementation - it's in PRRTE's example/dynamic-dep.c
code. The support is available in PMIx and PRRTE master branches, as well as at the head of their respective release branches (PMIx v5.0 and PRRTE v3.0) - so you can use whichever combination you prefer.
Hope that isn't too cumbersome. There is an alternative path, should this prove too daunting. If you are using a GitHub clone of the v5.0 branch of the OMPI code, then all you would need to do is:
$ cd 3rd-party/openpmix
$ git checkout v5.0
$ git pull
$ cd ../prrte
$ git checkout v3.0
$ git pull
Then do the usual build. This will be using the head of the respective release branches.
I am happy to report that this change seems to work without any issues on my system :+1: I git clone'd the latest PRRTE and PMIX releases and tested (with prterun) your dynamic-dep.c code as well as the code I provided as part of this issue. With the code change you provided for OMPI, toggling the behavior works fine too. Thank you for looking into this!
👍 Thanks for the confirmation!
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
5.1.0 (but also tested on 5.0.5)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone https://github.com/open-mpi/ompi.git
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
.e32e0179bc6bd1637f92690511ce6091719fa046 3rd-party/openpmix (v1.1.3-4036-ge32e0179) 8d0a25b8c9b868bc66de8df3b8abd7f477f07a0f 3rd-party/prrte (psrvr-v2.0.0rc1-4799-g8d0a25b8c9) dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (dfff675)
Please describe the system on which you are running
Details of the problem
In OpenMPI5, children spawned by MPI_Comm_spawn are terminated once the last process belonging to the parent communicator (original MPI_COMM_WORLD) returns. This is not the behavior in OpenMPI4, where the children can live on past this point.
I am wondering if this is intended behavior, and if there are any way to configure this not to happen. Here's a code snippet to reproduce:
In OpenMPI 4.1.5, the output is along the lines of
However, in OpenMPI 5.0.5, 5.1.0, the program fully stops after the parents terminate, without any error message or sign of abnormal termination.
I ran this with
and then
And the results are the same in both for both versions.
I figured out that keeping one of the original processes 'hostage' for the lifetime of the children sidesteps the issue and produces the expected output in OMPI5;
Obviously it is not an ideal solution, but it might give you some insight as to what is going on.