Can't launch prted - Githubissues

efweber999 commented 5 months ago

Environment

prte (PRRTE) 3.0.3rc1 pmix_info (PMIx) 5.0.2

OS: Ubuntu 22.04.1 Hardware: AWS EC2 instances. Just 2 instances for initial testing. Network: UCX (only TCP is available for these instances. Will eventually move to an HPC cluster with IB)

Details of the problem

Hi, I installed PMIx, OpenMPI, SLURM and the supporting software on two AWS EC2 instances. Full installation listing is below. The munge daemon runs without issue, as do the SLURM daemons. Both the hostname command and an MPI hello_world program can be executed across both nodes (EC2 instances).

shell$ mpirun -v --hostfile hostfile hello_world
ip-13-100-66-228, rank 0 out of 2 processors, sent the hello message.
ip-13-100-66-218, rank 1 out of 2 processors, received: Hello, there

shell$ srun -N 2 -n 2 hello_world
ip-13-100-66-228, rank 1 out of 2 processors, received: Hello, there
ip-13-100-66-218, rank 0 out of 2 processors, sent the hello message.

When I try and launch the prte daemon to use PMIx, I get the following error:

shell$ prted -v --bootstrap --leave-session-attached --prtemca ess_base_verbose 5
[ip-13-100-66-228:12351] NODE[0]: 13.100.66.228
[ip-13-100-66-228:12351] NODE[1]: 13.100.66.218
[ip-13-100-66-228:12351] PRTE ERROR: Not found in file ../../../../../../../3rd-party/prrte/src/mca/ess/env/ess_env_module.c at line 129
[ip-13-100-66-228:12351] [[INVALID],0] setting up session dir with
        tmpdir: UNDEF
        host ip-13-100-66-228

I tried various command line options with the same result. The session directory location is specified in the config files: mpi.conf.txt prte.conf.txt I also tried setting the TMPDIR environmental variable, and got the same results.

prte --daemonize launches fine, and I can see the prte process running. But prun produces the following results:

shell$ prun --hostfile hostfile hello_world
[ip-13-100-66-228:12955] PMIx_Spawn failed (-179): PMIX_ERR_JOB_FAILED_TO_MAP

Again, I tried multiple command line options with the same result.

I'm new to UCX, MPI, PMIx, and SLURM (though I'm a graybeard) so I'm probably missing something that I just haven't yet managed to find in the documentation. Some guidance would be greatly appreciated. The documentation is saying to bootstrap prted at node startup, but it's not supposed to run as root, so that rules out systemd. I can put it in the .bashrc, but that's not node startup. And, does it matter if the SLURM daemons are already running as I plan to use systemd to launch them.

BTW, during one test launch when I was in the wrong directory and the hostfile could not be found, this message was printed:

But I couldn't open the help file:
    /usr/local/share/prte/help-hostfile.txt: No such file or directory.  Sorry!

The file prte is looking for is actually help-hostfiles.txt, so that's a minor bug.

Thanks,

Gene

Installed Packages

libssl-dev libnuma-dev binutils-bpf libbpf-devs dbus libdbus-1-dev

Built Software (In this order)

munge-0.5.16 configure --with-crypto-lib=openssl --prefix=/usr --sysconfdir=/etc --localstatedir=/var --runstatedir=/run

hwloc-2.7.1 configure

lbnl-nhc-1.4.3 configure --prefix=/usr --sysconfdir=/etc --libexecdir=/usr/libexec

ucx-1.15.0 (tarball of deb packages)

libevent-2.1.12 configure --disable-openssl

pmix-5.0.2 configure --with-munge=/usr --with-hwloc=/usr/local --with-libevent=/usr/local --with-slurm

openmpi-5.0.2 configure --enable-mca-no-build=btl-uct --with-ucx=/usr --with-pmix=/usr/local --with-hwloc=/usr/local --with-libevent=/usr/local --with-slurm

slurm-23.11.5 configure --enable-debug --with-ucx=/usr --with-pmix=/usr/local --with-munge=/usr --with-hwloc=/usr/local

rhc54 commented 5 months ago

A few observations:

You never directly start the prted daemon - it is automatically started by mpirun and/or prte. Those two programs discover the Slurm allocation via Slurm envars and then launch a prted on each of the allocated nodes.
The ability to bootstrap a PRRTE DVM (i.e., establish a set of prted daemons on every node of the allocation as nodes startup plus start prte as the controller) is in prototype at this time - no timetable for completion.
On the prun failure, what is in the hostfile? Are there any nodes in there that were not allocated by Slurm?
I'll take a gander at the hostfile error - thanks for pointing it out!
Be aware that the latest PRRTE release is v3.0.5 - you are using a fairly old release candidate, and not an official final release. Might be worth updating.

efweber999 commented 5 months ago

Thanks Ralph,

I'll look into your first few suggestions.

I had assumed that installing the latest version of OpenMPI installed the the latest version of PRRTE. I did have an OpenMPI 4.x installation and ran make uninstall on that before installing 5.0.2. Is it possible that the older PRRTE was not removed?

rhc54 commented 5 months ago

PRRTE wasn't included in OMPI v4, it was only introduced in OMPI v5. I was only commenting based on your input:

prte (PRRTE) 3.0.3rc1

If you want to use the latest PRRTE, you'll need to download it directly as OMPI always has a time lag in its distribution. Then, you build OMPI with the --external-prrte flag so that mpirun uses it. Make sure that PRRTE is built against the same PMIx used to build OMPI - not a hard requirement, but usually a good idea where possible.

efweber999 commented 5 months ago

OK, I'll install the latest.

Regarding prte discovering the Slurm allocation via Slurm envars; how do the involved environmental variables get set. I just searched through documentation for the various tools and I'm not seeing that. Thanks again.

rhc54 commented 5 months ago

When you get an allocation via salloc, Slurm automatically populates your environment with a suite of envars containing info such as the names of the allocated nodes. prte simply harvests those to determine what resources are available to it.

You can see them for yourself - just do srun -n 1 env | grep SLURM to see the list.

efweber999 commented 5 months ago

Thanks again Ralph. The software chain has been rebuilt using PRRTE release is v3.0.5. That did seem to be part of the problem.

I hate asking you another question, especially since it's probably due to my lack of familiarity with SLURM. This seems so close to working, but isn't. I'm obviously not launching prted correctly.

shell$  srun -N 2 -n 2 prte
DVM ready
DVM ready

In another term on that same node:

shell$ ps -u ubuntu
    PID TTY          TIME CMD
  12775 pts/0    00:00:00 srun
  12780 pts/0    00:00:00 srun
  12800 ?        00:00:00 prte
  12810 ?        00:00:00 srun
  12812 ?        00:00:00 prted
  12815 ?        00:00:00 srun
  12818 ?        00:00:00 prted

And in a term on the other node:

shell$ ps -u ubuntu
    PID TTY          TIME CMD
  21891 ?        00:00:00 prte
  21894 ?        00:00:00 srun
  21895 ?        00:00:00 srun
  21906 ?        00:00:00 prted
  21907 ?        00:00:00 prted

But in either of these terms prun still fails, albeit differently than before:

shell$ prun hello_world
prun failed to initialize, likely due to no DVM being available

Using salloc instead of srun:

shell$ salloc -N 2 -n 2 prte
salloc: Granted job allocation 43
DVM ready

Another term on the same node now has no prted (I notice that only the term where salloc is run has the envars):

shell$ ps -u ubuntu
    PID TTY          TIME CMD
  13014 pts/0    00:00:00 salloc
  13018 pts/0    00:00:00 prte
  13021 pts/0    00:00:00 srun
  13024 pts/0    00:00:00 srun

But the term on the other node does have prted:

shell$ ps -u ubuntu
    PID TTY          TIME CMD
  23189 ?        00:00:00 prted
  23190 ?        00:00:00 prted

prun attempt on both nodes is unchanged:

shell$ prun hello_world
prun failed to initialize, likely due to no DVM being available

I've been reading the man pages and trying different options, but this is eluding me. Sorry.

rhc54 commented 5 months ago

No worries - it's a simple misunderstanding. I need to add material to the docs so this is easier.

The problem is that you cannot start multiple copies of prte or else prun will get confused. The reason lies in the architecture of the system. prte is the Distributed Virtual Machine (DVM) controller - there can only be one instance of the controller. Once you invoke prte, it will look at the envars to discover the allocation, and then it will launch a prted instance on each node of the allocation. You don't have to explicitly do anything.

So what you want to do is:

$ salloc -N 2
$ prte --daemonize
$ prun <myapp>
...do whatever  you want...
$ pterm (to terminate the DVM)

See if that works for you!

efweber999 commented 5 months ago

Something else must be incorrect in my setup. I restarted the SLURM daemons with -c to have a clean start.

shell$ salloc -N 2
salloc: Granted job allocation 50

Process started on that node:

shell$ ps -u ubuntu
    PID TTY          TIME CMD
  13811 pts/0    00:00:00 salloc
  13815 pts/0    00:00:00 bash

Checked the envars (should the "(x2)" be there?):

shell $ env | grep SLURM
SLURM_TASKS_PER_NODE=16(x2)
SLURM_SUBMIT_DIR=/home/ubuntu
SLURM_CLUSTER_NAME=cluster
SLURM_JOB_CPUS_PER_NODE=16(x2)
SLURM_JOB_PARTITION=debug
SLURM_JOB_NUM_NODES=2
SLURM_JOBID=50
SLURM_NODELIST=ip-13-100-66-[218,228]
SLURM_NNODES=2
SLURM_SUBMIT_HOST=ip-13-100-66-228
SLURM_JOB_ID=50
SLURM_CONF=/usr/local/etc/slurm.conf
SLURM_JOB_NAME=interactive
SLURM_JOB_NODELIST=ip-13-100-66-[218,228]:

Launched prte:

shell$ prte --daemonize

Checked for new processes on this node:

shell$   ps -u ubuntu
    PID TTY          TIME CMD
13811 pts/0    00:00:00 salloc
  13815 pts/0    00:00:00 bash
  13842 ?        00:00:00 prte
  13845 ?        00:00:00 srun
  13848 ?        00:00:00 srun

No prted process. Checked for stared processes on the other node:

shell$ ps -u ubuntu
    PID TTY          TIME CMD
  29255 ?        00:00:00 prted
  29256 ?        00:00:00 prted

Two prted processes started. So both nodes still fail with the prun failed to initialize, likely due to no DVM being available message.

Any thoughts on what tho check next?

Thanks.

efweber999 commented 5 months ago

Oh, x2 is probably just referring to threads.

rhc54 commented 5 months ago

Here are a few things you could try:

assuming you are allocating two nodes, execute srun --ntasks-per-node=1 --mpi=none --cpu-bind=none --ntasks=2 hostname and see what you get
run prte --display allocation --prtemca plm_base_verbose 5 & and see what comes out. Note this will leave prte in the background
check ls $TMPDIR to see if there is a directory that starts with prte.. This is what prun is looking for when it tries to run.

efweber999 commented 5 months ago

shell$  srun --ntasks-per-node=1 --mpi=none --cpu-bind=none --ntasks=2 hostname
ip-13-100-66-218
ip-13-100-66-228

shell$ prte --display allocation --prtemca plm_base_verbose 5 &
[ip-13-100-66-228:18216] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] plm:base:receive start comm
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] plm:base:setup_vm
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] plm:base:setup_vm creating map
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] setup:vm: working unmanaged allocation
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] using default hostfile /usr/local/etc/prte-default-hostfile
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] plm:base:setup_vm only HNP in allocation
[ip-13-100-66-228:18216] [prte-ip-13-100-66-228-18216@0,0] plm:base:setting slots for node ip-13-100-66-228 by core
DVM ready

The prte.ip-13-100-66-228.13377.1000 directory exists in the specified tmp directory, BUT $TMPDIR is not set.

I decided to look more closely at the PMIx installation log files for clues. The output from running configure --enable-mca-no-build=btl-uct --with-hwloc=/usr/local --with-libevent=/usr/local --with-pmix=/usr/local --with-slurm contains:

Transports
-----------------------
Cisco usNIC: no
HPE Slingshot: no
NVIDIA: no
OmniPath: yes
Simptest: no
TCP: no

I installed UCX as I'm hoping it will default to TCP while developing on AWS EC2 instances, and then support IB when I port code to an HPC cluster. I expected TCP to be yes, and I don't know why OmniPath is yes. So I looked through 'configure.log'. There are numerous errors in that log, but most seem to be by design. By that I mean that the config script will try something and if it results in an error it knows that something isn't available/accessible. That makes it hard for the unfamiliar to find real errors, but I did find one. On line 2528 of configure is "if (sizeof (($2)))". This causes a C compile error, (at least on my setup). I changed it to "if (sizeof ($2))" which compiles.

The interesting thing is that previously make had no errors. But fixing the configure bug exposed two make errors. One was an obvious bug that was easy to fix. Line 429 of src/util/pmix_net.c starts with "bool bool". So I deleted one of those type declarations.

The other error is"

../../../src/util/pmix_net.c:423:6: error: conflicting types for 'pmix_net_samenetwork'; have '_Bool(const struct sockaddr *, const struct sockaddr *, uint32_t)' {aka '_Bool(const struct sockaddr *, const struct sockaddr *, unsigned int)'}
  423 | bool pmix_net_samenetwork(const struct sockaddr *addr1, const struct sockaddr *addr2,
      |      ^~~~~~~~~~~~~~~~~~~~
In file included from ../../../src/util/pmix_net.c:66:
/home/ubuntu/software/sandbox_pmix-5.0.2/src/util/pmix_net.h:101:18: note: previous declaration of 'pmix_net_samenetwork' with type '_Bool(const struct sockaddr_storage *, const struct sockaddr_storage *, uint32_t)' {aka '_Bool(const struct sockaddr_storage *, const struct sockaddr_storage *, unsigned int)'}
  101 | PMIX_EXPORT bool pmix_net_samenetwork(const struct sockaddr_storage *addr1,
      |                  ^~~~~~~~~~~~~~~~~~~~

I tried commenting out one of the declarations, but that caused a link error on library file, so I'm not yet understanding the correct fix.

Thanks in advance for your help. -Gene

rhc54 commented 5 months ago

Well, first off the transports shown by PMIx have nothing to do with what your MPI supports. It only indicates what transports PMIx knows about and can provide support with some info. For example, OmniPath needs a security key, and so we generate one and provide it for that environment. We don't have support for the others at this time.

The plm verbose output indicates that prte is using ssh to launch the remote daemons instead of srun - in other words, prte doesn't realize it is in a Slurm allocation. That isn't necessarily a problem as ssh will work fine, but I have no idea why it is happening.

The errors you are reporting indicate that you lack a struct sockaddr on your machine??? Pretty weird, though possible I suppose. I've corrected them upstream.

The configure error seems very strange - we don't generate that code. It comes straight out of autoconf. Nothing we can really do about it. Interestingly, I see the line if (sizeof ($2)) used in the AC_CHECK_TYPE generated code right about the one you cite. I'm guessing that perhaps you are hitting some kind of if-else situation again that causes you to encounter the error when nobody else hits it. What version of autoconf are you using?

@wenduwan @lrbison This is an Amazon user - can you perhaps help him out? I'm running out of ideas and have no access to the system.

wenduwan commented 5 months ago

@efweber999 For HPC applications on AWS we recommend https://aws.amazon.com/hpc/parallelcluster/ which pre-installs essential applications including Intel MPI, Open MPI etc., and sets up the network for you. Would you be willing to try it out?

efweber999 commented 5 months ago

Thank you for the clarifications and continued help.

My access to AWS has some restrictions, and I'm not sure if using parallelcluster is enabled/allowed. I've inquired.

Some clarification after working on this yesterday. if (sizeof (($2))) was a red herring. It generates errors, but also generates output that is more correct than when I "fix" it. Autoconf is autoconf (GNU Autoconf) 2.71.

The configure check to determine if the system is Linux with TCP fails: if test "$pmix_found_sockaddr" = "yes" && test "$pmix_found_linux" = "yes" $pmix_found_linux is not set, which is understandable since it appears nowhere else in the configure file. Running find . -print -type f -exec grep "pmix_found_linux" {} \; >foobar in the installation directory only finds: ./src/mca/pif/linux_ipv6/configure.m4 AS_IF([test "$pmix_found_sockaddr" = "yes" && test "$pmix_found_linux" = "yes"], ./configure if test "$pmix_found_sockaddr" = "yes" && test "$pmix_found_linux" = "yes" So it doesn't appear to be set anywhere. I hard coded both variables to "yes" so the test passes, but things still don't work. So, perhaps I just don't understand how configure works and this is also a red herring.

After doing that and rebuilding all the tools again I re-ran what you suggested. I admit to taking your suggestions too literally the day before and not running salloc before prte. DOH!! So those results were misleading. Here is exactly the string of commands I ran yesterday and the output.

shell$ sudo rm /var/log/slurm/* 

shell$ sudo slurmctld -c -vvvvv && sudo slurmd -c -vvvvv

shell$ ps -u root | grep slurm
  16030 ?        00:00:00 slurmctld
  16032 ?        00:00:00 slurmscriptd
  16049 ?        00:00:00 slurmstepd
  16051 ?        00:00:00 slurmd

shell$ srun --ntasks-per-node=1 --mpi=none --cpu-bind=none --ntasks=2 hostname
ip-13-100-66-228
ip-13-100-66-218

shell$ salloc -N 2
salloc: Granted job allocation 2

shell$ prte --display allocation --prtemca plm_base_verbose 5 &
[1] 16368
shell$ [ip-13-100-66-218:16368] [[INVALID],UNDEFINED] plm:slurm: available for selection
[ip-13-100-66-218:16368] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:base:receive start comm
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:slurm: LAUNCH DAEMONS CALLED
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:base:setup_vm
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:base:setup_vm creating map
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:base:setup_vm add new daemon [prte-ip-13-100-66-218-16368@0,1]
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:base:setup_vm assigning new daemon [prte-ip-13-100-66-218-16368@0,1] to node ip-13-100-66-228
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:slurm: launching on nodes ip-13-100-66-228
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:slurm: final top-level argv:
        srun --ntasks-per-node=1 --kill-on-bad-exit --mpi=none --cpu-bind=none --external-launcher --nodes=1 --nodelist=ip-13-100-66-228 --ntasks=1 prted --prtemca ess "slurm" --prtemca ess_base_nspace "prte-ip-13-100-66-218-16368@0" --prtemca ess_base_vpid "1" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prte-ip-13-100-66-218-16368@0.0;tcp://13.100.66.218:35725:26" --prtemca plm_slurm_args "--external-launcher" --prtemca prte_tmpdir_base "/usr/local/tmp" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1"
srun: error: ip-13-100-66-228: task 0: Exited with exit code 213
srun: Terminating StepId=2.0
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:slurm: srun returned non-zero exit status (54528) from launching the per-node daemon
[ip-13-100-66-218:16368] [prte-ip-13-100-66-218-16368@0,0] plm:base:receive stop comm

[1]+  Exit 250

The tool installations are all scripted, and I'm willing to share that and/or any of my configuration files if that helps.

Thanks, Gene

rhc54 commented 5 months ago

I'll look at the configure code again, but that has nothing to do with this problem. I do see something troubling in your output. It appears you have some PRRTE MCA parameters set, either in the environment or perhaps in a default MCA param file??? This one in particular is bothersome:

plm_slurm_args "--external-launcher"

Is there some reason you are setting this? I suspect it is causing srun to fail.

efweber999 commented 5 months ago

I'm not setting external-launcher anywhere. I've searched through all the the configuration files and install files and find nothing remotely close.

Searching for information about that option it appears to be new in this SLURM release. https://www.schedmd.com/slurm-version-23-11rc1-is-now-available/

The online srun man page says: "This is meant for use MPI implementations that require their own launcher" The SLURM install documentation says it is passed as an argument for Intel MPI, MPICH, and MVAPICH2 when Hydra is used. So it shouldn't be applied.

I have "MpiDefault=pmix" set in my slurm.conf file. Options were: none, pmi2, cray_shasta, or pmix.

efweber999 commented 5 months ago

@wenduwan My system administrators just responded to my inquiry. "At this time, Parallelcluster is not approved".

rhc54 commented 5 months ago

Sounds like you may be out of luck, but let's try one more thing:

$ srun --ntasks-per-node=1 --kill-on-bad-exit --mpi=none --cpu-bind=none --external-launcher --nodes=1 --nodelist=ip-13-100-66-228 --ntasks=1 hostname

See if that aborts or runs (change the nodelist to whatever node you are allocated). This is the cmd that PRRTE is trying to use to start the remote daemon. Obviously, someone has Slurm adding that new --external-launcher option because it definitely isn't coming from us. Let's see if it is breaking us.

wenduwan commented 5 months ago

@efweber999 Thanks for confirming.

AFAIK AWS does not officially support custom installation of slurm/pmix/prrte. They should either be installed by parallel cluster, or via EFA installer.

In your case I believe the system admin has chosen a different installation - I wonder if it's possible to try the EFA installer which includes its own prrte/pmix under /opt/amazon.

The latest release 1.31.0 is built with openpmix 4.2.8 and prrte 3.0.3.

efweber999 commented 5 months ago

@rhc54 That line works fine.

rhc54 commented 5 months ago

I guess you might as well try running the rest of that cmd line on its own:

prted --prtemca ess "slurm" --prtemca ess_base_nspace "prte-ip-13-100-66-218-16368@0" --prtemca ess_base_vpid "1" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prte-ip-13-100-66-218-16368@0.0;tcp://13.100.66.218:35725:26" --prtemca plm_slurm_args "--external-launcher" --prtemca prte_tmpdir_base "/usr/local/tmp" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1"

and see if the prted barfs. It won't be able to startup because prte itself isn't running, but the fact that the srun cmd immediately errors out implies that something in this cmd isn't happy.

efweber999 commented 5 months ago

It doesn't barf, it does nothing.

Ralph, Wenduo, Thanks for your help.

openpmix / prrte

Can't launch prted #1960

Environment

Details of the problem

Installed Packages

Built Software (In this order)