open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

MPMD MPI_APPNUM equivalent with srun #11845

Open · b-fg opened 1 year ago

b-fg commented 1 year ago

To run MPMD I use the classic form: mpirun -n 4 exe1 : -n 4 exe2 : .... I normally split the world communicator into a communicator for each executable using MPI_APPNUM (Fortran):

use mpi

integer :: mpi_err, world_comm, app_comm, app_color
integer :: mpi_world_rank, mpi_world_size, mpi_rank, mpi_size
integer(mpi_address_kind) :: color_ptr
logical :: mpi_app_num_flag

! Init MPI_COMM_WORLD communicator (shared across all applications launched with mpirun MPMD)
call mpi_init(mpi_err)
world_comm = MPI_COMM_WORLD
call mpi_comm_rank(world_comm, mpi_world_rank, mpi_err)
call mpi_comm_size(world_comm, mpi_world_size, mpi_err)

! Get the app number (color)
call MPI_Comm_get_attr(world_comm, MPI_APPNUM, color_ptr, mpi_app_num_flag, mpi_err)
app_color = int(color_ptr) ! convert the MPI_ADDRESS_KIND attribute value to a default integer

! Split world_comm and create a communicator for this app only (color must be unique for each application)
if (mpi_app_num_flag) then
   call MPI_Comm_split(world_comm, app_color, mpi_world_rank, app_comm, mpi_err)
   call MPI_Comm_rank(app_comm, mpi_rank, mpi_err)
   call MPI_Comm_size(app_comm, mpi_size, mpi_err)
else
   write(*,*) 'Fatal error in init_mpi()! Cannot find MPI_APPNUM.'
   call MPI_Abort(world_comm, -1, mpi_err)
end if

However, the MPI_APPNUM attribute is not found when running with srun (mpi_app_num_flag comes back as .false.). How can I use srun and still have access to MPI_APPNUM, i.e. srun -n 4 exe? Otherwise a different implementation will be required to run seamlessly with both commands. Thanks for the support!

rhc54 commented 1 year ago

Not sure what version of OMPI you are using, nor the exact srun cmd format you are providing, but I very much doubt that you are going to get that variable when executing under srun. Some versions of OMPI read it from the environment (given as an OMPI_xxx value) and some get it via PMIx - to the best of my knowledge, Slurm provides neither at this time.

b-fg commented 1 year ago

OMPI v4.0 and a plain srun exe1 command, so I let srun select the number of processes according to the allocated resources. Thank you for the informative response, I will keep that in mind. Please feel free to close the issue :)

bosilca commented 1 year ago

This is far from complete, because MPI_APPNUM is a required predefined attribute on MPI_COMM_WORLD. It shall not matter under which batch scheduler the application is started; the attribute must exist.

There is a predefined attribute MPI_APPNUM of MPI_COMM_WORLD. In Fortran, the attribute is an integer value. In C, the attribute is a pointer to an integer value. If a process was spawned with MPI_COMM_SPAWN_MULTIPLE, MPI_APPNUM is the command number that generated the current process. Numbering starts from zero. If a process was spawned with MPI_COMM_SPAWN, it will have MPI_APPNUM equal to zero. Additionally, if the process was not started by a spawn call, but by an implementation-specific startup mechanism that can handle multiple process specifications, MPI_APPNUM should be set to the number of the corresponding process specification.
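
For reference, a minimal C check of this attribute might look like the following (in C the attribute value comes back as a pointer to int, as the quoted text notes):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, flag, *appnum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* For predefined attributes the value is returned as a pointer to int */
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_APPNUM, &appnum, &flag);
    if (flag)
        printf("rank %d: MPI_APPNUM = %d\n", rank, *appnum);
    else
        printf("rank %d: MPI_APPNUM attribute not set\n", rank);

    MPI_Finalize();
    return 0;
}

Per the quoted requirement, flag must be true regardless of the launcher, with the value reflecting the process-specification number.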

ggouaillardet commented 1 year ago

As far as I understand, SLURM sets the PMIX_APPNUM key and Open MPI retrieves it in order to set the MPI_APPNUM attribute. That being said, SLURM always sets PMIX_APPNUM to 0, even if MPMD (aka srun --multi-prog ...) is used.
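
For illustration only, a standalone PMIx client might look that key up roughly as follows; the value-type handling below is an assumption (implementations may publish the key as PMIX_INT or PMIX_UINT32), and the plumbing inside Open MPI itself is more involved:

#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc;
    pmix_value_t *val = NULL;
    int appnum = -1;

    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
        fprintf(stderr, "PMIx_Init failed\n");
        return 1;
    }

    /* Ask the resource manager which app number it assigned to this process */
    if (PMIX_SUCCESS == PMIx_Get(&myproc, PMIX_APPNUM, NULL, 0, &val) && NULL != val) {
        if (PMIX_INT == val->type) {
            appnum = val->data.integer;
        } else if (PMIX_UINT32 == val->type) {
            appnum = (int)val->data.uint32;
        }
        PMIX_VALUE_RELEASE(val);
    }

    printf("rank %u: PMIX_APPNUM = %d\n", myproc.rank, appnum);

    PMIx_Finalize(NULL, 0);
    return 0;
}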

b-fg commented 1 year ago

Thanks for the feedback. Can you explain why this is an OMPI bug, @bosilca? Shouldn't this come directly from SLURM?

ggouaillardet commented 1 year ago

Which version of Open MPI are you running?

This is a bug if the MPI_APPNUM attribute of MPI_COMM_WORLD is not set.

I was unable to reproduce this issue: the attribute is always set to 0 under srun in my environment.

bosilca commented 1 year ago

It does not matter what SLURM provides or not: they are not an MPI implementation, so they could not care less about what the MPI standard requires. OMPI is an MPI implementation; we must provide what the standard defines or we cannot claim compatibility with a specific version. And MPI_APPNUM has been there for a very long time. So we need to find a way to make sure it is correctly defined every time.

ggouaillardet commented 1 year ago

@b-fg when using srun, do you use PMI-1, PMI-2 or PMIx?

ggouaillardet commented 1 year ago

@bosilca, this is a snippet from the SLURM PMIx plugin

from src/plugins/mpi/pmix/pmixp_client.c:

                /* TODO: always use 0 for now. This is not the general case
                 * though (see Slurm MIMD: man srun, section MULTIPLE PROGRAM
                 * CONFIGURATION)
                 */
                tmp = 0;
                PMIXP_KVP_CREATE(kvp, PMIX_APPNUM, &tmp, PMIX_INT);

That suggests they are aware the feature is not yet implemented.

I will double-check tomorrow whether MPICH sets it correctly with srun --multi-prog.

ggouaillardet commented 1 year ago

FWIW, MPICH and Open MPI behave the same: MPI_APPNUM is defined and always 0 when srun is used. (in both cases, this is incorrect when MPMD is used)

Currently, mpirun -np 1 ./a.out : -np 1 ./a.out sets MPI_APPNUM to 0 on rank 0 and 1 on rank 1. I do not think there is a way to achieve a similar result with srun unless SLURM provides extra information.

We could handle srun --multi-prog mp, where mp contains:

0 ./a.out
1 ./b.out

but that would require at least an MPI_Allgather()-like operation at MPI_Init(), and that would hurt scalability.

Bottom line, my recommendation is to do nothing besides documenting it and explaining that SLURM does not provide enough information to implement this correctly in Open MPI.
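
For anyone needing per-application communicators under srun --multi-prog today, a purely user-level sketch of that allgather-based idea could look roughly like this; the 256-byte name buffer and the use of argv[0] as the application identifier are assumptions, and it only works if each application is started from a distinct executable path:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

/* User-level fallback: derive an application "color" from the executable
 * name when MPI_APPNUM cannot distinguish the programs. This pays the
 * allgather cost mentioned above and assumes every application in the
 * MPMD job runs a distinct executable path. */
int main(int argc, char *argv[])
{
    enum { NAME_LEN = 256 };
    char name[NAME_LEN] = {0}, *all;
    int rank, size, color, i;
    MPI_Comm app_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    strncpy(name, argv[0], NAME_LEN - 1);
    all = malloc((size_t)size * NAME_LEN);
    MPI_Allgather(name, NAME_LEN, MPI_CHAR, all, NAME_LEN, MPI_CHAR, MPI_COMM_WORLD);

    /* Color = lowest world rank running the same executable: identical for
     * all ranks of one application, different across applications. */
    color = rank;
    for (i = 0; i < size; i++) {
        if (0 == strncmp(all + (size_t)i * NAME_LEN, name, NAME_LEN)) {
            color = i;
            break;
        }
    }
    free(all);

    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &app_comm);
    printf("world rank %d -> app color %d\n", rank, color);

    MPI_Comm_free(&app_comm);
    MPI_Finalize();
    return 0;
}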

rhc54 commented 1 year ago

Let me see if we can get Slurm to fix it.

b-fg commented 1 year ago

@ggouaillardet I am using OMPI v4.0.1 and I am not sure how to get the PMI version (I am testing this on a cluster). I wanted to clarify: with OMPI, I can get a value for MPI_APPNUM when running with srun, even though I cannot do MPMD as I do with mpirun. In my environment it is set to 4. On the other hand, MPI_APPNUM is not even found when using srun and IMPI, but of course that is not your problem. Thanks all for looking into this.

rhc54 commented 1 year ago

SchedMD has opened a ticket on this, but don't expect a near-term response, as they aren't seeing any customer interest in MPMD at this time.