efweber999 closed this issue 5 months ago
A few observations:

- The `prted` daemon is automatically started by `mpirun` and/or `prte`. Those two programs discover the Slurm allocation via Slurm envars and then launch a `prted` on each of the allocated nodes.
- Bootstrap startup (where Slurm starts the `prted` daemons on every node of the allocation as nodes start up, plus starts `prte` as the controller) is in prototype at this time - no timetable for completion.
- On the `prun` failure, what is in the `hostfile`? Are there any nodes in there that were not allocated by Slurm?
- As for the `hostfile` error - thanks for pointing it out!

Thanks Ralph,
I'll look into your first few suggestions.
I had assumed that installing the latest version of OpenMPI installed the latest version of PRRTE. I did have an OpenMPI 4.x installation and ran make uninstall on that before installing 5.0.2. Is it possible that the older PRRTE was not removed?
PRRTE wasn't included in OMPI v4, it was only introduced in OMPI v5. I was only commenting based on your input:
prte (PRRTE) 3.0.3rc1
If you want to use the latest PRRTE, you'll need to download it directly as OMPI always has a time lag in its distribution. Then, you build OMPI with the --external-prrte
flag so that mpirun
uses it. Make sure that PRRTE is built against the same PMIx used to build OMPI - not a hard requirement, but usually a good idea where possible.
OK, I'll install the latest.
Regarding prte
discovering the Slurm allocation via Slurm envars; how do the involved environmental variables get set. I just searched through documentation for the various tools and I'm not seeing that. Thanks again.
When you get an allocation via salloc
, Slurm automatically populates your environment with a suite of envars containing info such as the names of the allocated nodes. prte
simply harvests those to determine what resources are available to it.
You can see them for yourself - just do srun -n 1 env | grep SLURM
to see the list.
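To make that concrete, here is a minimal POSIX-shell sketch of expanding the compressed form Slurm uses in SLURM_NODELIST. The hostnames are hypothetical, the helper name is made up, and the supported tool for this job is really `scontrol show hostnames` - this sketch only handles the simple comma form, not ranges:

```shell
# Sketch: expand a compressed Slurm nodelist of the simple form
# prefix[a,b,...] into one hostname per line. Slurm's own
# `scontrol show hostnames` also handles ranges like [1-4];
# this minimal version does not.
expand_nodelist() {
    nodelist=$1
    prefix=${nodelist%%\[*}      # text before the bracket
    list=${nodelist#*\[}         # bracket contents...
    list=${list%\]}              # ...without the closing bracket
    for suffix in $(printf '%s' "$list" | tr ',' ' '); do
        printf '%s%s\n' "$prefix" "$suffix"
    done
}

expand_nodelist 'ip-13-100-66-[218,228]'
# → ip-13-100-66-218
# → ip-13-100-66-228
```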
Thanks again Ralph. The software chain has been rebuilt using PRRTE release v3.0.5. That did seem to be part of the problem.
I hate asking you another question, especially since it's probably due to my lack of familiarity with SLURM. This seems so close to working, but isn't. I'm obviously not launching prted correctly.
shell$ srun -N 2 -n 2 prte
DVM ready
DVM ready
In another term on that same node:
shell$ ps -u ubuntu
PID TTY TIME CMD
12775 pts/0 00:00:00 srun
12780 pts/0 00:00:00 srun
12800 ? 00:00:00 prte
12810 ? 00:00:00 srun
12812 ? 00:00:00 prted
12815 ? 00:00:00 srun
12818 ? 00:00:00 prted
And in a term on the other node:
shell$ ps -u ubuntu
PID TTY TIME CMD
21891 ? 00:00:00 prte
21894 ? 00:00:00 srun
21895 ? 00:00:00 srun
21906 ? 00:00:00 prted
21907 ? 00:00:00 prted
But in either of these terms prun still fails, albeit differently than before:
shell$ prun hello_world
prun failed to initialize, likely due to no DVM being available
Using salloc instead of srun:
shell$ salloc -N 2 -n 2 prte
salloc: Granted job allocation 43
DVM ready
Another term on the same node now has no prted (I notice that only the term where salloc is run has the envars):
shell$ ps -u ubuntu
PID TTY TIME CMD
13014 pts/0 00:00:00 salloc
13018 pts/0 00:00:00 prte
13021 pts/0 00:00:00 srun
13024 pts/0 00:00:00 srun
But the term on the other node does have prted:
shell$ ps -u ubuntu
PID TTY TIME CMD
23189 ? 00:00:00 prted
23190 ? 00:00:00 prted
prun attempt on both nodes is unchanged:
shell$ prun hello_world
prun failed to initialize, likely due to no DVM being available
I've been reading the man pages and trying different options, but this is eluding me. Sorry.
No worries - it's a simple misunderstanding. I need to add material to the docs so this is easier.
The problem is that you cannot start multiple copies of prte
or else prun
will get confused. The reason lies in the architecture of the system. prte
is the Distributed Virtual Machine (DVM) controller - there can only be one instance of the controller. Once you invoke prte
, it will look at the envars to discover the allocation, and then it will launch a prted
instance on each node of the allocation. You don't have to explicitly do anything.
So what you want to do is:
$ salloc -N 2
$ prte --daemonize
$ prun <myapp>
...do whatever you want...
$ pterm (to terminate the DVM)
See if that works for you!
Something else must be incorrect in my setup. I restarted the SLURM daemons with -c to have a clean start.
shell$ salloc -N 2
salloc: Granted job allocation 50
Process started on that node:
shell$ ps -u ubuntu
PID TTY TIME CMD
13811 pts/0 00:00:00 salloc
13815 pts/0 00:00:00 bash
Checked the envars (should the "(x2)" be there?):
shell $ env | grep SLURM
SLURM_TASKS_PER_NODE=16(x2)
SLURM_SUBMIT_DIR=/home/ubuntu
SLURM_CLUSTER_NAME=cluster
SLURM_JOB_CPUS_PER_NODE=16(x2)
SLURM_JOB_PARTITION=debug
SLURM_JOB_NUM_NODES=2
SLURM_JOBID=50
SLURM_NODELIST=ip-13-100-66-[218,228]
SLURM_NNODES=2
SLURM_SUBMIT_HOST=ip-13-100-66-228
SLURM_JOB_ID=50
SLURM_CONF=/usr/local/etc/slurm.conf
SLURM_JOB_NAME=interactive
SLURM_JOB_NODELIST=ip-13-100-66-[218,228]
Launched prte:
shell$ prte --daemonize
Checked for new processes on this node:
shell$ ps -u ubuntu
PID TTY TIME CMD
13811 pts/0 00:00:00 salloc
13815 pts/0 00:00:00 bash
13842 ? 00:00:00 prte
13845 ? 00:00:00 srun
13848 ? 00:00:00 srun
No prted process. Checked for started processes on the other node:
shell$ ps -u ubuntu
PID TTY TIME CMD
29255 ? 00:00:00 prted
29256 ? 00:00:00 prted
Two prted processes started. So both nodes still fail with the "prun failed to initialize, likely due to no DVM being available" message.
Any thoughts on what to check next?
Thanks.
Oh, the (x2) is just Slurm's compact repeat notation - the value applies to each of the 2 allocated nodes, so that looks fine.
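For reference, a minimal POSIX-shell sketch (value taken from the listing above, helper name made up) that expands Slurm's compact `count(xN)` notation, on the assumption that `16(x2)` means the count 16 repeated for 2 nodes:

```shell
# Sketch: expand Slurm's compact "count(xN)" notation (e.g. the
# SLURM_TASKS_PER_NODE=16(x2) seen above) into one count per node,
# assuming (xN) is a repeat count.
expand_counts() {
    for item in $(printf '%s' "$1" | tr ',' ' '); do
        case $item in
            *"(x"*")")
                count=${item%%\(*}             # the value before "(x"
                reps=${item#*x}; reps=${reps%\)}
                i=0
                while [ "$i" -lt "$reps" ]; do
                    printf '%s\n' "$count"
                    i=$((i + 1))
                done ;;
            *)  printf '%s\n' "$item" ;;
        esac
    done
}

expand_counts '16(x2)'
# → 16
# → 16
```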
Here are a few things you could try:

- `srun --ntasks-per-node=1 --mpi=none --cpu-bind=none --ntasks=2 hostname` and see what you get
- `prte --display allocation --prtemca plm_base_verbose 5 &` and see what comes out. Note this will leave `prte` in the background
- `ls $TMPDIR` to see if there is a directory that starts with `prte.` - this is what `prun` is looking for when it tries to run.

shell$ srun --ntasks-per-node=1 --mpi=none --cpu-bind=none --ntasks=2 hostname
ip-13-100-66-218
ip-13-100-66-228
shell$ prte --display allocation --prtemca plm_base_verbose 5 &
[ip-13-100-66-228:18216] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] plm:base:receive start comm
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] plm:base:setup_vm
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] plm:base:setup_vm creating map
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] setup:vm: working unmanaged allocation
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] using default hostfile /usr/local/etc/prte-default-hostfile
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] plm:base:setup_vm only HNP in allocation
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] plm:base:setting slots for node ip-13-100-66-228 by core
DVM ready
The prte.ip-13-100-66-228.13377.1000 directory exists in the specified tmp directory, BUT $TMPDIR
is not set.
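That session directory is what prun needs to find. A sketch of the lookup (the `prte.*` pattern follows the directory named above; the helper name and fallback paths are illustrative):

```shell
# Sketch: look for a PRRTE DVM session directory the way prun would -
# under $TMPDIR if set, otherwise /tmp (or wherever the
# prte_tmpdir_base MCA parameter points, if configured). The naming
# pattern follows the prte.<host>.<pid>.<uid> directory observed above.
find_dvm_session() {
    base=${1:-${TMPDIR:-/tmp}}
    ls -d "$base"/prte.* 2>/dev/null
}

find_dvm_session /usr/local/tmp || echo "no DVM session directory found"
```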
I decided to look more closely at the PMIx installation log files for clues. The output from running configure --enable-mca-no-build=btl-uct --with-hwloc=/usr/local --with-libevent=/usr/local --with-pmix=/usr/local --with-slurm
contains:
Transports
-----------------------
Cisco usNIC: no
HPE Slingshot: no
NVIDIA: no
OmniPath: yes
Simptest: no
TCP: no
I installed UCX as I'm hoping it will default to TCP while developing on AWS EC2 instances, and then support IB when I port code to an HPC cluster. I expected TCP to be yes
, and I don't know why OmniPath is yes
. So I looked through 'configure.log'. There are numerous errors in that log, but most seem to be by design. By that I mean that the config script will try something and if it results in an error it knows that something isn't available/accessible. That makes it hard for the unfamiliar to find real errors, but I did find one. On line 2528 of configure
is "if (sizeof (($2)))". This causes a C compile error, (at least on my setup). I changed it to "if (sizeof ($2))" which compiles.
The interesting thing is that previously make
had no errors. But fixing the configure
bug exposed two make
errors. One was an obvious bug that was easy to fix. Line 429 of src/util/pmix_net.c
starts with "bool bool". So I deleted one of those type declarations.
The other error is:
../../../src/util/pmix_net.c:423:6: error: conflicting types for 'pmix_net_samenetwork'; have '_Bool(const struct sockaddr *, const struct sockaddr *, uint32_t)' {aka '_Bool(const struct sockaddr *, const struct sockaddr *, unsigned int)'}
423 | bool pmix_net_samenetwork(const struct sockaddr *addr1, const struct sockaddr *addr2,
| ^~~~~~~~~~~~~~~~~~~~
In file included from ../../../src/util/pmix_net.c:66:
/home/ubuntu/software/sandbox_pmix-5.0.2/src/util/pmix_net.h:101:18: note: previous declaration of 'pmix_net_samenetwork' with type '_Bool(const struct sockaddr_storage *, const struct sockaddr_storage *, uint32_t)' {aka '_Bool(const struct sockaddr_storage *, const struct sockaddr_storage *, unsigned int)'}
101 | PMIX_EXPORT bool pmix_net_samenetwork(const struct sockaddr_storage *addr1,
| ^~~~~~~~~~~~~~~~~~~~
I tried commenting out one of the declarations, but that caused a link error on the library file, so I'm not yet understanding the correct fix.
Thanks in advance for your help. -Gene
Well, first off the transports shown by PMIx have nothing to do with what your MPI supports. It only indicates what transports PMIx knows about and can provide support with some info. For example, OmniPath needs a security key, and so we generate one and provide it for that environment. We don't have support for the others at this time.
The plm verbose output indicates that prte
is using ssh
to launch the remote daemons instead of srun
- in other words, prte
doesn't realize it is in a Slurm allocation. That isn't necessarily a problem as ssh
will work fine, but I have no idea why it is happening.
The errors you are reporting indicate that you lack a struct sockaddr
on your machine??? Pretty weird, though possible I suppose. I've corrected them upstream.
The configure error seems very strange - we don't generate that code. It comes straight out of autoconf. Nothing we can really do about it. Interestingly, I see the line if (sizeof ($2))
used in the AC_CHECK_TYPE
generated code right above the one you cite. I'm guessing that perhaps you are hitting some kind of if-else
situation again that causes you to encounter the error when nobody else hits it. What version of autoconf are you using?
@wenduwan @lrbison This is an Amazon user - can you perhaps help him out? I'm running out of ideas and have no access to the system.
@efweber999 For HPC applications on AWS we recommend https://aws.amazon.com/hpc/parallelcluster/ which pre-installs essential applications including Intel MPI, Open MPI etc., and sets up the network for you. Would you be willing to try it out?
Thank you for the clarifications and continued help.
My access to AWS has some restrictions, and I'm not sure if using parallelcluster is enabled/allowed. I've inquired.
Some clarification after working on this yesterday. if (sizeof (($2)))
was a red herring. It generates errors, but also generates output that is more correct than when I "fix" it. Autoconf is autoconf (GNU Autoconf) 2.71.
The configure check to determine if the system is Linux with TCP fails:
if test "$pmix_found_sockaddr" = "yes" && test "$pmix_found_linux" = "yes"
$pmix_found_linux
is not set, which is understandable since it appears nowhere else in the configure file.
Running find . -print -type f -exec grep "pmix_found_linux" {} \; >foobar
in the installation directory only finds:
./src/mca/pif/linux_ipv6/configure.m4
AS_IF([test "$pmix_found_sockaddr" = "yes" && test "$pmix_found_linux" = "yes"],
./configure
if test "$pmix_found_sockaddr" = "yes" && test "$pmix_found_linux" = "yes"
So it doesn't appear to be set anywhere. I hard coded both variables to "yes" so the test passes, but things still don't work. So, perhaps I just don't understand how configure works and this is also a red herring.
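Incidentally, the `find` invocation above prints every path regardless of the grep result, because `-print` comes before `-type f`. A simpler way to list only the files that contain the string (the helper name is made up; demonstrated on a throwaway tree rather than the PMIx source):

```shell
# Sketch: list only the files containing a string. In the PMIx source
# tree this would be `grep -rl "pmix_found_linux" .`; demonstrated
# here on a small throwaway directory.
search_tree() {
    grep -rl "$1" "$2"
}

demo=$(mktemp -d)
printf 'pmix_found_linux=yes\n' > "$demo/configure"
printf 'nothing relevant\n'     > "$demo/README"
search_tree pmix_found_linux "$demo"   # prints only .../configure
rm -rf "$demo"
```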
After doing that and rebuilding all the tools again I re-ran what you suggested. I admit to taking your suggestions too literally the day before and not running salloc before prte. DOH!! So those results were misleading. Here is exactly the string of commands I ran yesterday and the output.
shell$ sudo rm /var/log/slurm/*
shell$ sudo slurmctld -c -vvvvv && sudo slurmd -c -vvvvv
shell$ ps -u root | grep slurm
16030 ? 00:00:00 slurmctld
16032 ? 00:00:00 slurmscriptd
16049 ? 00:00:00 slurmstepd
16051 ? 00:00:00 slurmd
shell$ srun --ntasks-per-node=1 --mpi=none --cpu-bind=none --ntasks=2 hostname
ip-13-100-66-228
ip-13-100-66-218
shell$ salloc -N 2
salloc: Granted job allocation 2
shell$ prte --display allocation --prtemca plm_base_verbose 5 &
[1] 16368
shell$ [ip-13-100-66-218:16368] [[INVALID],UNDEFINED] plm:slurm: available for selection
[ip-13-100-66-218:16368] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:base:receive start comm
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:slurm: LAUNCH DAEMONS CALLED
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:base:setup_vm
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:base:setup_vm creating map
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:base:setup_vm add new daemon [prte-ip-13-100-66-218-16368@0,1]
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:base:setup_vm assigning new daemon [prte-ip-13-100-66-218-16368@0,1] to node ip-13-100-66-228
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:slurm: launching on nodes ip-13-100-66-228
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:slurm: final top-level argv:
srun --ntasks-per-node=1 --kill-on-bad-exit --mpi=none --cpu-bind=none --external-launcher --nodes=1 --nodelist=ip-13-100-66-228 --ntasks=1 prted --prtemca ess "slurm" --prtemca ess_base_nspace "prte-ip-13-100-66-218-16368@0" --prtemca ess_base_vpid "1" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prte-ip-13-100-66-218-16368@0.0;tcp://13.100.66.218:35725:26" --prtemca plm_slurm_args "--external-launcher" --prtemca prte_tmpdir_base "/usr/local/tmp" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1"
srun: error: ip-13-100-66-228: task 0: Exited with exit code 213
srun: Terminating StepId=2.0
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:slurm: srun returned non-zero exit status (54528) from launching the per-node daemon
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:base:receive stop comm
[1]+ Exit 250
The tool installations are all scripted, and I'm willing to share that and/or any of my configuration files if that helps.
Thanks, Gene
I'll look at the configure code again, but that has nothing to do with this problem. I do see something troubling in your output. It appears you have some PRRTE MCA parameters set, either in the environment or perhaps in a default MCA param file??? This one in particular is bothersome:
plm_slurm_args "--external-launcher"
Is there some reason you are setting this? I suspect it is causing srun
to fail.
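If it helps the hunt, MCA parameters can come from the environment (`PRTE_MCA_*` variables) or from default parameter files. A sketch of where to look - the file paths are PRRTE's conventional locations and should be checked against your install prefix, and the function name is made up:

```shell
# Sketch: look for places an MCA param like plm_slurm_args could be
# set. File locations are PRRTE's conventional defaults (adjust for
# your --prefix).
list_prte_mca_overrides() {
    env | grep '^PRTE_MCA_' || echo "no PRTE_MCA_* vars in environment"
    for f in "$HOME/.prte/mca-params.conf" /usr/local/etc/prte-mca-params.conf; do
        if [ -f "$f" ]; then
            echo "== $f"
            grep -v '^#' "$f" || true   # show non-comment lines, if any
        fi
    done
}

list_prte_mca_overrides
```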
I'm not setting external-launcher
anywhere. I've searched through all the configuration files and install files and find nothing remotely close.
Searching for information about that option it appears to be new in this SLURM release. https://www.schedmd.com/slurm-version-23-11rc1-is-now-available/
The online srun man page says: "This is meant for use by MPI implementations that require their own launcher". The SLURM install documentation says it is passed as an argument for Intel MPI, MPICH, and MVAPICH2 when Hydra is used. So it shouldn't be applied.
I have "MpiDefault=pmix" set in my slurm.conf file. Options were: none, pmi2, cray_shasta, or pmix.
@wenduwan My system administrators just responded to my inquiry. "At this time, Parallelcluster is not approved".
Sounds like you may be out of luck, but let's try one more thing:
$ srun --ntasks-per-node=1 --kill-on-bad-exit --mpi=none --cpu-bind=none --external-launcher --nodes=1 --nodelist=ip-13-100-66-228 --ntasks=1 hostname
See if that aborts or runs (change the nodelist to whatever node you are allocated). This is the cmd that PRRTE is trying to use to start the remote daemon. Obviously, someone has Slurm adding that new --external-launcher
option because it definitely isn't coming from us. Let's see if it is breaking us.
@efweber999 Thanks for confirming.
AFAIK AWS does not officially support custom installation of slurm/pmix/prrte. They should either be installed by parallel cluster, or via EFA installer.
In your case I believe the system admin has chosen a different installation - I wonder if it's possible to try the EFA installer which includes its own prrte/pmix under /opt/amazon
.
The latest release 1.31.0 is built with openpmix 4.2.8 and prrte 3.0.3.
@rhc54 That line works fine.
I guess you might as well try running the rest of that cmd line on its own:
prted --prtemca ess "slurm" --prtemca ess_base_nspace "prte-ip-13-100-66-218-16368@0" --prtemca ess_base_vpid "1" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prte-ip-13-100-66-218-16368@0.0;tcp://13.100.66.218:35725:26" --prtemca plm_slurm_args "--external-launcher" --prtemca prte_tmpdir_base "/usr/local/tmp" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1"
and see if the prted
barfs. It won't be able to startup because prte
itself isn't running, but the fact that the srun
cmd immediately errors out implies that something in this cmd isn't happy.
It doesn't barf, it does nothing.
Ralph, Wenduo, Thanks for your help.
Environment
prte (PRRTE) 3.0.3rc1
pmix_info (PMIx) 5.0.2
OS: Ubuntu 22.04.1
Hardware: AWS EC2 instances. Just 2 instances for initial testing.
Network: UCX (only TCP is available for these instances. Will eventually move to an HPC cluster with IB)
Details of the problem
Hi, I installed PMIx, OpenMPI, SLURM and the supporting software on two AWS EC2 instances. Full installation listing is below. The munge daemon runs without issue, as do the SLURM daemons. Both the `hostname` command and an MPI `hello_world` program can be executed across both nodes (EC2 instances). When I try to launch the prte daemon to use PMIx, I get the following error:
I tried various command line options with the same result. The session directory location is specified in the config files: mpi.conf.txt prte.conf.txt I also tried setting the TMPDIR environmental variable, and got the same results.

`prte --daemonize` launches fine, and I can see the prte process running. But `prun` produces the following results:

Again, I tried multiple command line options with the same result.
I'm new to UCX, MPI, PMIx, and SLURM (though I'm a graybeard) so I'm probably missing something that I just haven't yet managed to find in the documentation. Some guidance would be greatly appreciated. The documentation says to bootstrap `prted` at node startup, but it's not supposed to run as root, so that rules out `systemd`. I can put it in .bashrc, but that's not node startup. And, does it matter if the SLURM daemons are already running, as I plan to use `systemd` to launch them?

BTW, during one test launch when I was in the wrong directory and the `hostfile` could not be found, this message was printed:

The file prte is looking for is actually `help-hostfiles.txt`, so that's a minor bug.

Thanks,
Gene
Installed Packages
libssl-dev libnuma-dev binutils-bpf libbpf-dev dbus libdbus-1-dev
Built Software (In this order)
munge-0.5.16 configure --with-crypto-lib=openssl --prefix=/usr --sysconfdir=/etc --localstatedir=/var --runstatedir=/run
hwloc-2.7.1 configure
lbnl-nhc-1.4.3 configure --prefix=/usr --sysconfdir=/etc --libexecdir=/usr/libexec
ucx-1.15.0 (tarball of deb packages)
libevent-2.1.12 configure --disable-openssl
pmix-5.0.2 configure --with-munge=/usr --with-hwloc=/usr/local --with-libevent=/usr/local --with-slurm
openmpi-5.0.2 configure --enable-mca-no-build=btl-uct --with-ucx=/usr --with-pmix=/usr/local --with-hwloc=/usr/local --with-libevent=/usr/local --with-slurm
slurm-23.11.5 configure --enable-debug --with-ucx=/usr --with-pmix=/usr/local --with-munge=/usr --with-hwloc=/usr/local