open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Issues with using srun with executable from open-mpi main branch #10286

Closed. wzamazon closed this issue 6 months ago.

wzamazon commented 2 years ago

Thank you for taking the time to submit an issue!

Background information

Encountered an issue when using srun (from Slurm) with an executable built from the Open MPI main branch.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

main branch

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from source. The configure options were:

./configure --prefix=/fsx/ALinux2/PortaFiducia/libraries/openmpi/main/install --with-sge --without-verbs --disable-builtin-atomics --with-libfabric=/opt/amazon/efa --disable-man-pages

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

 6692c28a4daed5e99443eb724231d7300287fb2c 3rd-party/openpmix (v1.1.3-3495-g6692c28a)
 7ae2c083189db0881d2eff29d71bd507be02bad3 3rd-party/prrte (psrvr-v2.0.0rc1-4340-g7ae2c08318)

Please describe the system on which you are running


Details of the problem

I tried to use srun (from Slurm 20.11.8) with an executable built from the open-mpi main branch, and encountered an issue.

I used the following testing code:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int myrank, numprc;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &numprc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    printf("Hello from proc %d of %d\n", myrank, numprc);

    MPI_Finalize();

    return 0;
}

I used the following command to compile it:

mpicc -o test_mpi_main test_mpi.c

I used the following command to run:

srun -n 4 ./test_mpi_main

The output I got was:

[ec2-user@ip-172-31-82-82 openmpi]$ srun -n 4  ./test_mpi_main 
Hello from proc 0 of 1
Hello from proc 0 of 1
Hello from proc 0 of 1
Hello from proc 0 of 1

As can be seen, the rank and world size in the output are not correct.

Meanwhile, if I use the following command:

srun -n 4 --mpi=pmix ./test_mpi_main

I got the correct result.

[ec2-user@ip-172-31-82-82 openmpi]$ srun -n 4 --mpi=pmix ./test_mpi_main 
Hello from proc 1 of 4
Hello from proc 2 of 4
Hello from proc 3 of 4
Hello from proc 0 of 4

FYI, the behavior of Open MPI 4.1.2 differs from the main branch. If I use srun with an executable built with Open MPI 4.1.2, I get the following warning message:

--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

IMO, the behavior of v4.1.2 is better and should be retained in main.

wzamazon commented 2 years ago

I did some investigation.

The error message in v4.1.x was emitted from "orte/mca/ess/pmi/ess_pmi_module.c", when opal_pmix.init() failed and the job was directly launched.

However, re-implementing the same logic on the main branch does not seem trivial because:

  1. orte is gone; the main branch uses prrte.
  2. in prrte's ess framework, there is no pmi component, only slurm and alps.
  3. there is no check_launch_environment function that can be used to detect whether a direct launch was used (a rough sketch of that kind of check is below).
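
For illustration, the kind of check that went away boils down to something like the following. This is only a sketch with hypothetical helper names (direct_launched, handle_pmix_init_failure), not the actual orte/ess code, and it guesses at which Slurm environment variables to test:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the old check_launch_environment(): treat the
 * presence of Slurm step variables as evidence of a direct launch. */
static int direct_launched(void)
{
    return getenv("SLURM_STEP_ID") != NULL || getenv("SLURM_JOBID") != NULL;
}

/* Roughly what v4.1.x did when PMI(x) initialization failed: error out on a
 * direct launch instead of silently continuing as a singleton. */
static int handle_pmix_init_failure(void)
{
    if (direct_launched()) {
        fprintf(stderr,
                "The application appears to have been direct launched using \"srun\",\n"
                "but no usable PMI(x) support was found; aborting instead of\n"
                "continuing as a singleton.\n");
        return -1;
    }
    return 0;   /* not direct launched: a singleton is plausible */
}

int main(void)
{
    return handle_pmix_init_failure() == 0 ? 0 : 1;
}
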
wzamazon commented 2 years ago

Opened a prrte issue: https://github.com/openpmix/prrte/issues/1343. Closing this one.

wzamazon commented 2 years ago

Got the following response from prrte:

I'm afraid that is something for the OMPI folks. The way the integration is currently written assumes that the failure to connect to a PMIx server means that the process is acting as a singleton. I'm afraid there is no way to determine if the process should have found a server, so you either have to declare that any inability to find a server is an error (and not a singleton), or you have to assume you are a singleton and proceed accordingly.

Either way, it has nothing to do with PRRTE or PMIx.
awlauria commented 2 years ago

PMI support was dropped in v5.0.0/main. What is the output of srun -n 1 env when running with --mpi=pmix and without?

ggouaillardet commented 2 years ago

I think the question is: what should we do if an end user runs srun --mpi=pmi1 -n 2 a.out (2 or more tasks)?

A consequence of dropping PMI support from v5 is that we now (silently) spawn singletons, and this is very likely not what the end users would expect. If Open MPI fails to contact a PMIx server, should we issue a (warning) message if a PMI1 or PMI2 environment is detected (with two or more tasks)?
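
To make the idea concrete, here is a minimal sketch of such a warning. It assumes the launcher exports a PMI-1/PMI-2 style PMI_SIZE variable (that variable name is my assumption about what Slurm's PMI plugins would provide, not something Open MPI currently inspects):

#include <stdio.h>
#include <stdlib.h>

/* Emit a warning if the environment looks like a PMI-1/PMI-2 launch with
 * more than one task, since the singleton fallback is then almost certainly
 * not what the user wanted. */
static void warn_if_pmi_world_detected(void)
{
    const char *size = getenv("PMI_SIZE");   /* assumed PMI-1/PMI-2 world size */

    if (size != NULL && atoi(size) > 1) {
        fprintf(stderr,
                "Warning: a PMI-1/PMI-2 environment reporting %s tasks was detected,\n"
                "but this Open MPI build only supports PMIx; re-run with\n"
                "\"srun --mpi=pmix\" or each process will run as a singleton.\n",
                size);
    }
}

int main(void)
{
    warn_if_pmi_world_detected();
    return 0;
}
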

wzamazon commented 2 years ago

The output of srun -n 1 env is:

SLURM_CONF=/opt/slurm/etc/slurm.conf
SLURM_PRIO_PROCESS=0
SRUN_DEBUG=3
SLURM_UMASK=0002
SLURM_CLUSTER_NAME=parallelcluster
SLURM_SUBMIT_DIR=/home/ec2-user
SLURM_SUBMIT_HOST=ip-172-31-82-82
SLURM_JOB_NAME=env
SLURM_JOB_CPUS_PER_NODE=1
SLURM_NTASKS=1
SLURM_NPROCS=1
SLURM_JOB_ID=134
SLURM_JOBID=134
SLURM_STEP_ID=0
SLURM_STEPID=0
SLURM_NNODES=1
SLURM_JOB_NUM_NODES=1
SLURM_NODELIST=c5n-st-c5n18xlarge-1
SLURM_JOB_PARTITION=c5n
SLURM_TASKS_PER_NODE=1
SLURM_SRUN_COMM_PORT=36877
SLURM_JOB_UID=1000
SLURM_JOB_USER=ec2-user
SLURM_WORKING_CLUSTER=parallelcluster:172.31.82.82:6820:9216:109
SLURM_JOB_NODELIST=c5n-st-c5n18xlarge-1
SLURM_STEP_NODELIST=c5n-st-c5n18xlarge-1
SLURM_STEP_NUM_NODES=1
SLURM_STEP_NUM_TASKS=1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_STEP_LAUNCHER_PORT=36877
SLURM_SRUN_COMM_HOST=172.31.82.82
SLURM_TOPOLOGY_ADDR=c5n-st-c5n18xlarge-1
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_CPUS_ON_NODE=1
SLURM_CPU_BIND=quiet,mask_cpu:0x000000001
SLURM_CPU_BIND_LIST=0x000000001
SLURM_CPU_BIND_TYPE=mask_cpu:
SLURM_CPU_BIND_VERBOSE=quiet
SLURM_TASK_PID=29227
SLURM_NODEID=0
SLURM_PROCID=0
SLURM_LOCALID=0
SLURM_LAUNCH_NODE_IPADDR=172.31.82.82
SLURM_GTIDS=0
SLURM_JOB_GID=1000
SLURMD_NODENAME=c5n-st-c5n18xlarge-1
TMPDIR=/tmp

The output of srun -n 1 --mpi=pmix env is:

SLURM_CONF=/opt/slurm/etc/slurm.conf
SLURM_MPI_TYPE=pmix
SLURM_PRIO_PROCESS=0
SRUN_DEBUG=3
SLURM_UMASK=0002
SLURM_CLUSTER_NAME=parallelcluster
SLURM_SUBMIT_DIR=/home/ec2-user
SLURM_SUBMIT_HOST=ip-172-31-82-82
SLURM_JOB_NAME=env
SLURM_JOB_CPUS_PER_NODE=1
SLURM_NTASKS=1
SLURM_NPROCS=1
SLURM_JOB_ID=135
SLURM_JOBID=135
SLURM_STEP_ID=0
SLURM_STEPID=0
SLURM_NNODES=1
SLURM_JOB_NUM_NODES=1
SLURM_NODELIST=c5n-st-c5n18xlarge-1
SLURM_JOB_PARTITION=c5n
SLURM_TASKS_PER_NODE=1
SLURM_SRUN_COMM_PORT=33849
SLURM_JOB_UID=1000
SLURM_JOB_USER=ec2-user
SLURM_WORKING_CLUSTER=parallelcluster:172.31.82.82:6820:9216:109
SLURM_JOB_NODELIST=c5n-st-c5n18xlarge-1
SLURM_STEP_NODELIST=c5n-st-c5n18xlarge-1
SLURM_STEP_NUM_NODES=1
SLURM_STEP_NUM_TASKS=1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_STEP_LAUNCHER_PORT=33849
SLURM_PMIXP_ABORT_AGENT_PORT=41361
SLURM_PMIX_MAPPING_SERV=(vector,(0,1,1))
SLURM_SRUN_COMM_HOST=172.31.82.82
SLURM_TOPOLOGY_ADDR=c5n-st-c5n18xlarge-1
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_CPUS_ON_NODE=1
SLURM_CPU_BIND=quiet,mask_cpu:0x000000001
SLURM_CPU_BIND_LIST=0x000000001
SLURM_CPU_BIND_TYPE=mask_cpu:
SLURM_CPU_BIND_VERBOSE=quiet
SLURM_TASK_PID=29261
SLURM_NODEID=0
SLURM_PROCID=0
SLURM_LOCALID=0
SLURM_LAUNCH_NODE_IPADDR=172.31.82.82
SLURM_GTIDS=0
SLURM_JOB_GID=1000
SLURMD_NODENAME=c5n-st-c5n18xlarge-1
PMIX_NAMESPACE=slurm.pmix.135.0
PMIX_RANK=0
PMIX_SERVER_URI3=pmix-server.29252;tcp4://127.0.0.1:40625
PMIX_SERVER_URI2=pmix-server.29252;tcp4://127.0.0.1:40625
PMIX_SERVER_URI21=pmix-server.29252;tcp4://127.0.0.1:40625
PMIX_SECURITY_MODE=native
PMIX_PTL_MODULE=tcp,usock
PMIX_BFROP_BUFFER_TYPE=PMIX_BFROP_BUFFER_NON_DESC
PMIX_GDS_MODULE=ds21,ds12,hash
PMIX_SERVER_TMPDIR=/var/spool/slurmd/pmix.135.0/
PMIX_SYSTEM_TMPDIR=/tmp
PMIX_DSTORE_21_BASE_PATH=/var/spool/slurmd/pmix.135.0//pmix_dstor_ds21_29252
PMIX_DSTORE_ESH_BASE_PATH=/var/spool/slurmd/pmix.135.0//pmix_dstor_ds12_29252
PMIX_HOSTNAME=c5n-st-c5n18xlarge-1
PMIX_VERSION=3.1.5

The main difference is the PMIX_* environment variables.
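
Just to illustrate the difference shown in the two dumps above (checking the PMIX_NAMESPACE and PMIX_RANK variables that only appear in the --mpi=pmix output); this is an illustration, not anything Open MPI itself runs:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* With --mpi=pmix, Slurm's plugin exports PMIX_* variables that point
     * the process at a PMIx server; without it, only SLURM_* variables exist. */
    const char *ns   = getenv("PMIX_NAMESPACE");
    const char *rank = getenv("PMIX_RANK");

    if (ns != NULL && rank != NULL) {
        printf("PMIx server advertised: namespace %s, rank %s\n", ns, rank);
    } else {
        printf("No PMIx server in the environment; MPI_Init falls back to a singleton\n");
    }
    return 0;
}
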

bwbarrett commented 2 years ago

I think the question is: what should we do if an end user runs srun --mpi=pmi1 -n 2 a.out (2 or more tasks)?

A consequence of dropping PMI support from v5 is that we now (silently) spawn singletons, and this is very likely not what the end users would expect. If Open MPI fails to contact a PMIx server, should we issue a (warning) message if a PMI1 or PMI2 environment is detected (with two or more tasks)?

How do we do this without reinventing the wheel, runtime-wise?

ggouaillardet commented 2 years ago

My creativity is being tested... Challenge accepted :-)

awlauria commented 2 years ago

@wzamazon can you try with the patches linked above for main/v5.0.0?

wenduwan commented 6 months ago

Still seeing this behavior on 5.0.3:

ubuntu@ip-172-31-61-3:~$ srun -n 4 hello
Hello from proc 0 of 1
Hello from proc 0 of 1
Hello from proc 0 of 1
Hello from proc 0 of 1
ubuntu@ip-172-31-61-3:~$ srun --mpi=pmix -n 4 hello
Hello from proc 0 of 4
Hello from proc 2 of 4
Hello from proc 3 of 4
Hello from proc 1 of 4
ubuntu@ip-172-31-61-3:~$ srun --mpi=pmi -n 4 hello
srun: error: Couldn't find the specified plugin name for mpi/pmi looking at all files
srun: error: cannot find mpi plugin for mpi/pmi
srun: error: MPI: Cannot create context for mpi/pmi
srun: error: MPI: Unable to load any plugin
srun: error: Invalid MPI type 'pmi', --mpi=list for acceptable types
ubuntu@ip-172-31-61-3:~$ srun --mpi=pmi1 -n 4 hello
srun: error: Couldn't find the specified plugin name for mpi/pmi1 looking at all files
srun: error: cannot find mpi plugin for mpi/pmi1
srun: error: MPI: Cannot create context for mpi/pmi1
srun: error: MPI: Unable to load any plugin
srun: error: Invalid MPI type 'pmi1', --mpi=list for acceptable types

We should at least mention this behavior on https://docs.open-mpi.org/en/v5.0.x/launching-apps/slurm.html

rhc54 commented 6 months ago

Yeah - OMPI no longer supports the older PMI versions, and you have to give the --mpi=pmix argument or else the processes all come up as singletons.

wenduwan commented 6 months ago

Updated the docs: https://github.com/open-mpi/ompi/pull/12515

Now we explicitly ask users to direct launch with --mpi=pmix.

I tend to think this is good enough, but I also agree with @ggouaillardet that silently launching singletons with PMI is a bad surprise.

If Open MPI fails to contact a PMIx server, should we issue a (warning) message if a PMI1 or PMI2 environment is detected (with two or more tasks)?

However, if we cannot connect to a PMIx server, how can we know whether the user actually wants to launch a singleton or not? In other words, how do we find out the number of tasks?

rhc54 commented 6 months ago

However, if we cannot connect to a PMIx server, how can we know whether the user actually wants to launch a singleton or not? In other words, how do we find out the number of tasks?

Yeah, we wrestled with that for a long time. There simply is no good way to discriminate a singleton from the lack of an appropriate server. You could look at envars (both Slurm and ALPS/PALS provide an envar indicating the number of tasks), but as we've seen, that is fragile and problematic. Couldn't come up with anything reliable 🤷‍♂️
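
For completeness, that fragile envar heuristic would look something like this. It relies on the SLURM_STEP_NUM_TASKS and PMIX_NAMESPACE variables visible in the env dumps earlier in this thread, and it is a sketch of the idea only, not something Open MPI or PRRTE actually implements:

#include <stdio.h>
#include <stdlib.h>

/* Returns 1 if the launcher claims more than one task but no PMIx server was
 * advertised, i.e. the singleton fallback is probably not what was wanted.
 * Only the Slurm variable is checked here; ALPS/PALS have analogues. */
static int probably_missing_pmix_server(void)
{
    const char *ntasks = getenv("SLURM_STEP_NUM_TASKS");

    return getenv("PMIX_NAMESPACE") == NULL
        && ntasks != NULL
        && atoi(ntasks) > 1;
}

int main(void)
{
    if (probably_missing_pmix_server()) {
        fprintf(stderr,
                "Launcher reports multiple tasks but no PMIx server was found;\n"
                "a singleton fallback is probably not what the user wanted.\n");
    }
    return 0;
}
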

wenduwan commented 6 months ago

For this issue, though, I can confirm that AWS will advise customers to direct launch with --mpi=pmix, so we can resolve this for now.