I did some investigation.
The error message in v4.1.x was emitted from "orte/mca/ess/pmi/ess_pmi_module.c", when opal_pmix.init() failed and the job was directly launched.
However, re-implementing the same logic on the main branch does not seem trivial, because main has no obvious equivalent of check_launch_environment, which could be used to detect whether direct launch is being used. Opened a PRRTE issue, https://github.com/openpmix/prrte/issues/1343, so closing this one.
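For context, the shape of the v4.1.x behavior was roughly: if PMI(x) initialization fails but the environment says the process was started by a launcher, abort with an error instead of falling back to a singleton. The sketch below is hypothetical (not the actual ess_pmi_module.c code), and the launcher variable names are assumptions based on the Slurm env dump later in this thread plus the usual ALPS/PALS equivalents.

/* Hypothetical sketch only -- not the actual v4.1.x ess_pmi_module.c code. */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

/* Guess whether we were direct launched by a resource manager. */
static bool looks_direct_launched(void)
{
    return getenv("SLURM_STEP_ID") != NULL ||   /* set by srun (see env dump below) */
           getenv("ALPS_APP_PE")   != NULL ||   /* Cray ALPS (assumed name) */
           getenv("PALS_RANKID")   != NULL;     /* HPE PALS (assumed name) */
}

/* Called when PMIx/PMI init fails: return -1 to abort, 0 to proceed as a singleton. */
int handle_pmix_init_failure(int rc)
{
    if (looks_direct_launched()) {
        fprintf(stderr,
                "This process appears to have been direct launched, but no "
                "PMI(x) server could be contacted (rc=%d); aborting rather "
                "than running as a singleton.\n", rc);
        return -1;
    }
    return 0;
}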
Got the following response from PRRTE:
I'm afraid that is something for the OMPI folks. The way the integration is currently written assumes that the failure to connect to a PMIx server means that the process is acting as a singleton. I'm afraid there is no way to determine if the process should have found a server, so you either have to declare that any inability to find a server is an error (and not a singleton), or you have to assume you are a singleton and proceed accordingly.
Either way, it has nothing to do with PRRTE or PMIx.
PMI support was dropped in v5.0.0/main. What is the output of srun -n 1 env when running with --mpi=pmix and without?
I think the question is "what should we do if an end user runs srun --mpi=pmi1 -n 2 a.out (2 or more tasks)?"
A consequence of dropping PMI support from v5 is that we now (silently) spawn singletons, and this is very likely not what the end users would expect. If Open MPI fails to contact a PMIx server, should we issue a (warning) message if a PMI1 or PMI2 environment is detected (with two or more tasks)?
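One possible shape for such a warning (a sketch, not existing Open MPI code; PMI_RANK/PMI_SIZE are the classic variables exported by Slurm's pmi2 plugin, and their presence here is an assumption):

/* Sketch: after PMIx init fails, look for leftover PMI-1/PMI-2 markers
 * before silently falling back to a singleton. */
#include <stdio.h>
#include <stdlib.h>

void warn_if_pmi_environment(void)
{
    const char *pmi_rank = getenv("PMI_RANK");
    const char *pmi_size = getenv("PMI_SIZE");

    if (pmi_rank != NULL && pmi_size != NULL && atoi(pmi_size) > 1) {
        fprintf(stderr,
                "Warning: a PMI-1/PMI-2 environment with %s tasks was detected, "
                "but no PMIx server could be contacted. Each task will run as an "
                "independent singleton; if this is a Slurm job, relaunch with "
                "'srun --mpi=pmix'.\n", pmi_size);
    }
}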
The output of srun -n 1 env is:
SLURM_CONF=/opt/slurm/etc/slurm.conf
SLURM_PRIO_PROCESS=0
SRUN_DEBUG=3
SLURM_UMASK=0002
SLURM_CLUSTER_NAME=parallelcluster
SLURM_SUBMIT_DIR=/home/ec2-user
SLURM_SUBMIT_HOST=ip-172-31-82-82
SLURM_JOB_NAME=env
SLURM_JOB_CPUS_PER_NODE=1
SLURM_NTASKS=1
SLURM_NPROCS=1
SLURM_JOB_ID=134
SLURM_JOBID=134
SLURM_STEP_ID=0
SLURM_STEPID=0
SLURM_NNODES=1
SLURM_JOB_NUM_NODES=1
SLURM_NODELIST=c5n-st-c5n18xlarge-1
SLURM_JOB_PARTITION=c5n
SLURM_TASKS_PER_NODE=1
SLURM_SRUN_COMM_PORT=36877
SLURM_JOB_UID=1000
SLURM_JOB_USER=ec2-user
SLURM_WORKING_CLUSTER=parallelcluster:172.31.82.82:6820:9216:109
SLURM_JOB_NODELIST=c5n-st-c5n18xlarge-1
SLURM_STEP_NODELIST=c5n-st-c5n18xlarge-1
SLURM_STEP_NUM_NODES=1
SLURM_STEP_NUM_TASKS=1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_STEP_LAUNCHER_PORT=36877
SLURM_SRUN_COMM_HOST=172.31.82.82
SLURM_TOPOLOGY_ADDR=c5n-st-c5n18xlarge-1
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_CPUS_ON_NODE=1
SLURM_CPU_BIND=quiet,mask_cpu:0x000000001
SLURM_CPU_BIND_LIST=0x000000001
SLURM_CPU_BIND_TYPE=mask_cpu:
SLURM_CPU_BIND_VERBOSE=quiet
SLURM_TASK_PID=29227
SLURM_NODEID=0
SLURM_PROCID=0
SLURM_LOCALID=0
SLURM_LAUNCH_NODE_IPADDR=172.31.82.82
SLURM_GTIDS=0
SLURM_JOB_GID=1000
SLURMD_NODENAME=c5n-st-c5n18xlarge-1
TMPDIR=/tmp
The output of srun -n 1 --mpi=pmix env is:
SLURM_CONF=/opt/slurm/etc/slurm.conf
SLURM_MPI_TYPE=pmix
SLURM_PRIO_PROCESS=0
SRUN_DEBUG=3
SLURM_UMASK=0002
SLURM_CLUSTER_NAME=parallelcluster
SLURM_SUBMIT_DIR=/home/ec2-user
SLURM_SUBMIT_HOST=ip-172-31-82-82
SLURM_JOB_NAME=env
SLURM_JOB_CPUS_PER_NODE=1
SLURM_NTASKS=1
SLURM_NPROCS=1
SLURM_JOB_ID=135
SLURM_JOBID=135
SLURM_STEP_ID=0
SLURM_STEPID=0
SLURM_NNODES=1
SLURM_JOB_NUM_NODES=1
SLURM_NODELIST=c5n-st-c5n18xlarge-1
SLURM_JOB_PARTITION=c5n
SLURM_TASKS_PER_NODE=1
SLURM_SRUN_COMM_PORT=33849
SLURM_JOB_UID=1000
SLURM_JOB_USER=ec2-user
SLURM_WORKING_CLUSTER=parallelcluster:172.31.82.82:6820:9216:109
SLURM_JOB_NODELIST=c5n-st-c5n18xlarge-1
SLURM_STEP_NODELIST=c5n-st-c5n18xlarge-1
SLURM_STEP_NUM_NODES=1
SLURM_STEP_NUM_TASKS=1
SLURM_STEP_TASKS_PER_NODE=1
SLURM_STEP_LAUNCHER_PORT=33849
SLURM_PMIXP_ABORT_AGENT_PORT=41361
SLURM_PMIX_MAPPING_SERV=(vector,(0,1,1))
SLURM_SRUN_COMM_HOST=172.31.82.82
SLURM_TOPOLOGY_ADDR=c5n-st-c5n18xlarge-1
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_CPUS_ON_NODE=1
SLURM_CPU_BIND=quiet,mask_cpu:0x000000001
SLURM_CPU_BIND_LIST=0x000000001
SLURM_CPU_BIND_TYPE=mask_cpu:
SLURM_CPU_BIND_VERBOSE=quiet
SLURM_TASK_PID=29261
SLURM_NODEID=0
SLURM_PROCID=0
SLURM_LOCALID=0
SLURM_LAUNCH_NODE_IPADDR=172.31.82.82
SLURM_GTIDS=0
SLURM_JOB_GID=1000
SLURMD_NODENAME=c5n-st-c5n18xlarge-1
PMIX_NAMESPACE=slurm.pmix.135.0
PMIX_RANK=0
PMIX_SERVER_URI3=pmix-server.29252;tcp4://127.0.0.1:40625
PMIX_SERVER_URI2=pmix-server.29252;tcp4://127.0.0.1:40625
PMIX_SERVER_URI21=pmix-server.29252;tcp4://127.0.0.1:40625
PMIX_SECURITY_MODE=native
PMIX_PTL_MODULE=tcp,usock
PMIX_BFROP_BUFFER_TYPE=PMIX_BFROP_BUFFER_NON_DESC
PMIX_GDS_MODULE=ds21,ds12,hash
PMIX_SERVER_TMPDIR=/var/spool/slurmd/pmix.135.0/
PMIX_SYSTEM_TMPDIR=/tmp
PMIX_DSTORE_21_BASE_PATH=/var/spool/slurmd/pmix.135.0//pmix_dstor_ds21_29252
PMIX_DSTORE_ESH_BASE_PATH=/var/spool/slurmd/pmix.135.0//pmix_dstor_ds12_29252
PMIX_HOSTNAME=c5n-st-c5n18xlarge-1
PMIX_VERSION=3.1.5
The main difference is the PMIX_* environment variables.
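Those PMIX_* variables are also the simplest thing a client could key off: with --mpi=pmix the Slurm plugin exports PMIX_NAMESPACE, PMIX_RANK and PMIX_SERVER_URI*, and without it none of them appear. A minimal check based only on the two dumps above (a sketch, not Open MPI code):

#include <stdbool.h>
#include <stdlib.h>

/* True if the launcher advertised a PMIx server via the environment. */
bool pmix_server_advertised(void)
{
    return getenv("PMIX_NAMESPACE") != NULL &&
           getenv("PMIX_RANK") != NULL;
}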
I think the question is "what should we do if an end user runs srun --mpi=pmi1 -n 2 a.out (2 or more tasks)?" A consequence of dropping PMI support from v5 is that we now (silently) spawn singletons, and this is very likely not what the end users would expect. If Open MPI fails to contact a PMIx server, should we issue a (warning) message if a PMI1 or PMI2 environment is detected (with two or more tasks)?
How do we do this without reinventing the wheel, runtime-wise?
My creativity is being tested... Challenge accepted :-)
@wzamazon can you try with the patches linked above for main/v5.0.0?
Still seeing this behavior on 5.0.3
ubuntu@ip-172-31-61-3:~$ srun -n 4 hello
Hello from proc 0 of 1
Hello from proc 0 of 1
Hello from proc 0 of 1
Hello from proc 0 of 1
ubuntu@ip-172-31-61-3:~$ srun --mpi=pmix -n 4 hello
Hello from proc 0 of 4
Hello from proc 2 of 4
Hello from proc 3 of 4
Hello from proc 1 of 4
ubuntu@ip-172-31-61-3:~$ srun --mpi=pmi -n 4 hello
srun: error: Couldn't find the specified plugin name for mpi/pmi looking at all files
srun: error: cannot find mpi plugin for mpi/pmi
srun: error: MPI: Cannot create context for mpi/pmi
srun: error: MPI: Unable to load any plugin
srun: error: Invalid MPI type 'pmi', --mpi=list for acceptable types
ubuntu@ip-172-31-61-3:~$ srun --mpi=pmi1 -n 4 hello
srun: error: Couldn't find the specified plugin name for mpi/pmi1 looking at all files
srun: error: cannot find mpi plugin for mpi/pmi1
srun: error: MPI: Cannot create context for mpi/pmi1
srun: error: MPI: Unable to load any plugin
srun: error: Invalid MPI type 'pmi1', --mpi=list for acceptable types
We should at least mention this behavior on https://docs.open-mpi.org/en/v5.0.x/launching-apps/slurm.html
Yeah - OMPI no longer supports the older PMI versions, and you have to give srun the --mpi= argument or else the processes all come up as singletons.
Updated the docs in https://github.com/open-mpi/ompi/pull/12515. Now we explicitly ask users to direct launch with --mpi=pmix.
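As an aside (not part of that PR): for sites where adding --mpi=pmix to every command is a burden, Slurm can make PMIx the default, assuming Slurm itself was built with PMIx support. Cluster-wide, in slurm.conf:

MpiDefault=pmix

Per user or per shell, srun also honors the equivalent environment variable:

export SLURM_MPI_TYPE=pmix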
I tend to say that this is good enough, but I also agree with @ggouaillardet that silently launching singletons with pmi is a bad surprise.
If Open MPI fails to contact a PMIx server, should we issue a (warning) message if a PMI1 or PMI2 environment is detected (with two or more tasks)?
However, if we cannot connect to a pmi server, how can we know if the user actually wants to launch a singleton or not? In other words, how to find out the # of tasks?
However, if we cannot connect to a pmi server, how can we know if the user actually wants to launch a singleton or not? In other words, how to find out the # of tasks?
Yeah, we wrestled with that for a long time. There simply is no good way to discriminate singleton vs lack of an appropriate server. You could look at envars (both Slurm and ALPS/PALS provide an envar indicating number of tasks), but as we've seen, that is fragile and problematic. Couldn't come up with anything reliable 🤷‍♂️
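For completeness, the heuristic being dismissed here would look roughly like this (hypothetical sketch, not a proposal; the comments note why it is fragile):

#include <stdlib.h>

/* Guess the intended number of tasks from launcher environment variables
 * when no PMIx server can be found. Fragile: these variables can be
 * inherited from a batch script's environment, can describe a step that
 * intentionally runs N independent non-MPI processes, and every resource
 * manager spells them differently (ALPS/PALS have their own variants). */
int guessed_task_count(void)
{
    const char *v = getenv("SLURM_STEP_NUM_TASKS");   /* per-step count */
    if (v == NULL) v = getenv("SLURM_NTASKS");        /* whole-job count */
    return v != NULL ? atoi(v) : 1;
}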
For this issue, though, I can confirm that AWS will advise customers to direct launch with --mpi=pmix, so we can resolve this for now.
Thank you for taking the time to submit an issue!
Background information
Encountered an issue when using srun (from Slurm) with an executable built from the Open MPI main branch.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
main branch
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built from source. Configure options are:
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
Please describe the system on which you are running
Details of the problem
I tried to use srun (from Slurm 20.11.8) with the executable from the Open MPI main branch, and encountered an issue. I used the following testing code:
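(The code block itself was not preserved in this copy of the report; a minimal MPI hello world consistent with the "Hello from proc X of Y" output quoted later in the thread would look like this.)

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from proc %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}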
I used the following command to compile:
I used the following command to run:
The output I got was:
As can be seen, the rank and world size are not right in the output.
Meanwhile, if I use the following command:
I got the correct result.
FYI, the behavior with Open MPI 4.1.2 is different from the main branch. If I use srun with an executable from Open MPI 4.1.2, I get the following warning message:
IMO, the behavior of v4.1.2 is better and should be retained in main.