ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

Difficulties with spawning new processes on the victim's node #33

Closed abouteiller closed 5 years ago

abouteiller commented 6 years ago

Original report by George Bosilca (Bitbucket: bosilca, GitHub: bosilca).


As reported on the ULFM mailing list, using a machinefile to restrict or drive the allocation of newly spawned processes is difficult.

abouteiller commented 6 years ago

Original comment by George Bosilca (Bitbucket: bosilca, GitHub: bosilca).


This issue is rooted in OMPI: job-level constraints from the original job are forwarded to all spawnees. In this particular case, adding "-npernode 1" prevents all future processes from sharing a node, across every jobid handled by the same HNP. In a normal MPI application such behavior might be desired, but in the context of ULFM we need to be able to reuse nodes, that is, to respawn processes on a node where older processes failed.

Multiple solutions might be envisioned, but I think the cleanest one is to provide an info key that prevents inheritance of the original job's parameters. I have created an OMPI issue related to this topic: open-mpi/ompi#5376.
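
To make the inheritance problem above concrete, here is a toy model in Python. It is only a sketch, not Open MPI's mapper: the class `ToyLauncher`, its `place` method, and the `inherit` flag are hypothetical names introduced for illustration, and the model assumes that the slot of a failed process still counts against the per-node cap.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyLauncher:
    """Toy stand-in for a launcher shared by all jobs (hypothetical, not OMPI code)."""

    npernode: Optional[int] = None  # job-level cap given when the first job starts
    placed: Dict[str, int] = field(default_factory=dict)  # node -> processes counted

    def place(self, node: str, inherit: bool = True) -> bool:
        """Try to place one process on `node`.

        With inherit=True the original job's cap applies to the new process;
        with inherit=False the cap is ignored, which mimics an opt-out key.
        """
        cap = self.npernode if inherit else None
        used = self.placed.get(node, 0)
        if cap is not None and used >= cap:
            return False  # the node is considered full under the inherited cap
        self.placed[node] = used + 1
        return True


# Original job: one process per node, as with "-npernode 1".
launcher = ToyLauncher(npernode=1)
launcher.place("n0")
launcher.place("n1")

# The process on n1 fails; in this model its slot is still counted (assumption).
respawn_inherited = launcher.place("n1", inherit=True)   # blocked by the cap
respawn_opt_out = launcher.place("n1", inherit=False)    # allowed on the victim's node
```

In this model the respawn on the victim's node is rejected only when the cap is inherited, which is the behavior an opt-out info key would avoid.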

abouteiller commented 5 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


open-mpi/ompi#5376 has been imported, along with a fix for the 'oversubscribe' non-propagation issue; this should resolve the problem.