Segfault when using a hostfile in a SLURM reservation

mkre commented 4 days ago

Background information

What version of the PMIx Reference RTE (PRRTE) are you using? (e.g., v2.0, v3.0, git master @ hash, etc.)

v3.0.7

What version of PMIx are you using? (e.g., v4.2.0, git branch name and hash, etc.)

v5.0.3

Details of the problem

We are hitting a segfault with Open MPI 5.0.5 which seems to be in PRRTE, which I can also reproduce when manually injecting v3.0.7 instead of the default version shipped with Open MPI 5.0.5.

8-node job on a SLURM cluster with ppn=60 salloc -N 8 --ntasks-per-node 60
Within the job, generate a hostfile containing the first 2 (or more) nodes: scontrol show hostname | head -n2 > hostfile.txt

Observe the error:

$ bin/prterun --hostfile hostfile.txt hostname
free(): invalid next size (fast)
Aborted (core dumped)

What works:
- Disabling Slurm tight integration with bin/prterun --hostfile hostfile.txt --prtemca ras ^slurm hostname
- Having only 1 host in the hostfile: scontrol show hostname | head -n1 > hostfile.txt
- Not using a hostfile at all: bin/prterun --hostfile hostfile.txt hostname

Some debugging

Valgrind output: valgrind.txt
The issue seems to be in this file: https://github.com/openpmix/prrte/blob/v3.0.7/src/util/nidmap.c
Using GDB also takes us to nidmap.c:209
vpids is allocated with a size of 8 bytes here: https://github.com/openpmix/prrte/blob/v3.0.7/src/util/nidmap.c#L82
```
(gdb) p nbytes
$1 = 8
(gdb) p prte_process_info.num_daemons
$3 = 2
(gdb) p sizeof(pmix_rank_t)
$4 = 4
```
vpids[8] is accessed here: https://github.com/openpmix/prrte/blob/v3.0.7/src/util/nidmap.c#L105
```
(gdb) p ndaemons
$2 = 8
```
Judging from the code comment at https://github.com/openpmix/prrte/blob/v3.0.7/src/util/nidmap.c#L73-L80, perhaps we shouldn’t multiply prte_process_info.num_daemons when computing nbytes? It looks like prte_process_info.num_daemons is deduced from the hostfile (which contains two hosts), but maybe we should rather multiply with the number of hosts in the entire SLURM reservation to avoid the segfault downstream?
When running with --prtemca ras ^slurm, we never hit the case that ndaemons == 8, so we never access invalid memory in vpids (pool->size is still 64, so it looks like we hit the continue statement in https://github.com/openpmix/prrte/blob/v3.0.7/src/util/nidmap.c#L87 for all ndaemons >= 8.

rhc54 commented 4 days ago

I'll try to find some time to take a look. Please be advised: as I stated on the mailing list multiple times, PRRTE releases are approaching an end-state position. In other words, there are no plans for continued releases going forward.

So testing release candidates is CRITICAL - otherwise, even if I do produce a fix (which won't be on any guaranteed timeline), you will not see it in an official release. It will just be on the branch in the GitHub repo.

Please factor that into your plans. I believe there will indeed be ONE more release as there are (non-pmix/prrte) problems in OMPI that necessitate a re-release of their latest version, so I'll provide an update for their use.

mkre commented 4 days ago

Thanks for taking a look!

Does that imply that going forward, we will only have PRRTE available from within the Open MPI tarballs (just found this page which sounds like that's the case)? That is absolutely fine with me, testing this bug was the first time I have ever downloaded a standalone PRRTE release. Also, I expect the fix for this bug to be rather limited so I should be able to easily cherry-pick it into my Open MPI 5.0.5 sources.

I have tested multiple early Open MPI 5 RCs, but this is more of a corner case issues so I haven't caught it before.

rhc54 commented 4 days ago

Does that imply that going forward, we will only have PRRTE available from within the Open MPI tarballs

Absolutely not. PRRTE remains a separate, independent project. There are people working on it for use in truly dynamic environments (i.e., when schedulers are able to shift resources on-the-fly, add/subtract from running jobs, etc).

There are a lot of talks ongoing about what to do with PRRTE relative to OMPI (the page you cite was just some thoughts - I for one am not convinced that approach is viable), and what to do with PMIx in general. I'm trying very hard to finally "retire" (after several years of not doing very well at it), and so I have drawn a line in the sand with respect to providing immediate response to bug reports. However, I (and others) recognize that both PMIx and PRRTE have cemented a place in the HPC/computing environment, and folks are talking about how to ensure long-term support for those packages. The conclusion, though, is not yet known.

So what I'm trying to do is get both PMIx and PRRTE into a stable landing zone, generating one final "known good" official production release. Once that happens, then slow development will incrementally move the repos - but not result in new releases. If/when folks come up with a working plan for more active support, then that's great.

I'm just not willing to bet on it quite yet. 🤷‍♂️

rhc54 commented 4 days ago

What works:

Disabling Slurm tight integration with bin/prterun --hostfile hostfile.txt --prtemca ras ^slurm hostname

This doesn't disable Slurm integration. It just tells PRRTE to not try to obtain the allocation from Slurm. We would still use the Slurm-integrated launcher.

Not using a hostfile at all: bin/prterun --hostfile hostfile.txt hostname

I'm confused - that cmd line specifically does use a hostfile. I suspect this is a typo?

The issue seems to be in this file: (nidmap)

Actually, I believe the problem may lie elsewhere, probably in the Slurm plm (the launcher component) or in the mapper. If I create a default hostfile, it looks internally just like what I'd end up with if I read the allocation from Slurm. I then create and pass a hostfile with some subset of the nodes - and everything works just fine.

So the Slurm components are passing incorrect info somewhere to the rest of PRRTE. Not sure why as nothing has changed there in...nearly forever? Could be something has changed with Slurm (again...sigh). What version of Slurm are you using?

rhc54 commented 4 days ago

Well, I've spent hours beating my head against the wall trying to get Slurm to run on my containers - without success. It is the most arcane system I've ever encountered. Afraid I must give up and will have to rely on you to help debug.

If you can, please add --prtemca plm_base_verbose 5 to your prterun (or mpirun, if you prefer) cmd line and report the output here.

mkre commented 3 days ago

Hi @rhc54,

Thanks for providing some background information on the state and future of PRRTE.

I'm confused - that cmd line specifically does use a hostfile. I suspect this is a typo?

Yes, that's a stupid typo.

So the Slurm components are passing incorrect info somewhere to the rest of PRRTE. Not sure why as nothing has changed there in...nearly forever? Could be something has changed with Slurm (again...sigh). What version of Slurm are you using?

We are using Slurm 21.08.6. Please note that Open MPI 4.1.5 and the default bundled PMIX and PRRTE are working just fine on the same system, so it looks like a regression in one of these components rather than a SLURM change.

Afraid I must give up and will have to rely on you to help debug.

I am happy to help :)

If you can, please add --prtemca plm_base_verbose 5 to your prterun (or mpirun, if you prefer) cmd line and report the output here.

$ bin/stock/prterun --hostfile hostfile.txt --prtemca plm_base_verbose 5 hostname
[snnhpc02n009:1836625] [[INVALID],UNDEFINED] plm:slurm: available for selection
[snnhpc02n009:1836625] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[snnhpc02n009:1836625] [prterun-snnhpc02n009-1836625@0,0] plm:base:receive start comm
[snnhpc02n009:1836625] [prterun-snnhpc02n009-1836625@0,0] plm:slurm: LAUNCH DAEMONS CALLED
[snnhpc02n009:1836625] [prterun-snnhpc02n009-1836625@0,0] plm:base:setup_vm
[snnhpc02n009:1836625] [prterun-snnhpc02n009-1836625@0,0] plm:base:setup_vm creating map
[snnhpc02n009:1836625] [prterun-snnhpc02n009-1836625@0,0] plm:base:setup_vm add new daemon [prterun-snnhpc02n009-1836625@0,1]
[snnhpc02n009:1836625] [prterun-snnhpc02n009-1836625@0,0] plm:base:setup_vm assigning new daemon [prterun-snnhpc02n009-1836625@0,1] to node snnhpc02n018
[snnhpc02n009:1836625] [prterun-snnhpc02n009-1836625@0,0] plm:slurm: launching on nodes snnhpc02n018
[snnhpc02n009:1836625] [prterun-snnhpc02n009-1836625@0,0] plm:slurm: Set prefix:/u/ydfb4q/TPD-219/STAR-CCM+20.01.068-8-g837d9e2/star/.nexus/mpi/openmpi/5.0.5-cda-001/linux-x86_64-2.28/gnu11.2-prrte-3.0.7
[snnhpc02n009:1836625] [prterun-snnhpc02n009-1836625@0,0] plm:slurm: final top-level argv:
        srun --ntasks-per-node=1 --kill-on-bad-exit --mpi=none --cpu-bind=none --nodes=1 --nodelist=snnhpc02n018 --ntasks=1 prted --prtemca ess "slurm" --prtemca ess_base_nspace "prterun-snnhpc02n009-1836625@0" --prtemca ess_base_vpid "1" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-snnhpc02n009-1836625@0.0;tcp://10.144.0.9,10.145.0.9:58181:16,16" --prtemca PREFIXES "errmgr,ess,filem,grpcomm,iof,odls,oob,plm,prtebacktrace,prtedl,prteinstalldirs,prtereachable,ras,rmaps,rtc,schizo,state,hwloc,if,reachable" --pmixmca mca_base_component_show_load_errors "0" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1"
[snnhpc02n009:1836625] [prterun-snnhpc02n009-1836625@0,0] plm:slurm: reset PATH: /u/ydfb4q/TPD-219/STAR-CCM+20.01.068-8-g837d9e2/star/.nexus/mpi/openmpi/5.0.5-cda-001/linux-x86_64-2.28/gnu11.2-prrte-3.0.7/bin:/u/ydfb4q/spack/bin:/panfs/snnhpc02panfs/u/ydfb4q/spack/opt/spack/linux-rhel8-zen2/gcc-11.2.0/environment-modules-5.2.0-26mlrcomr6xbrgafhw5sdh6ejowg2avc/bin:/u/ydfb4q/spack/bin:/cm/shared/apps/slurm/current/sbin:/cm/shared/apps/slurm/current/bin:/cm/local/apps/gcc/11.2.0/bin:/u/ydfb4q/.local/bin:/u/ydfb4q/bin:/cm/local/apps/environment-modules/4.5.3//bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/sbin:/cm/local/apps/environment-modules/4.5.3/bin:/opt/dell/srvadmin/bin:/panfs/snnhpc02panfs/u/ydfb4q/HPC-4181/prrte-3.0.7/install/bin/:/panfs/snnhpc02panfs/u/ydfb4q/HPC-4181/prrte-3.0.7/install/bin/
[snnhpc02n009:1836625] [prterun-snnhpc02n009-1836625@0,0] plm:slurm: reset LD_LIBRARY_PATH: /u/ydfb4q/TPD-219/STAR-CCM+20.01.068-8-g837d9e2/star/.nexus/mpi/openmpi/5.0.5-cda-001/linux-x86_64-2.28/gnu11.2-prrte-3.0.7/lib:/panfs/snnhpc02panfs/install/STAR-CCMP/lin64/20.01.077_01/STAR-CCM+20.01.077/mpi/openmpi/4.1.5-cda-003/linux-x86_64-2.17/gnu11.2/lib:/panfs/snnhpc02panfs/u/ydfb4q/TPD-219/STAR-CCM+20.01.068-8-g837d9e2/star/.nexus/mpi/openmpi/5.0.5-cda-001/linux-x86_64-2.28/gnu11.2-prrte-3.0.7/lib:/panfs/snnhpc02panfs/u/ydfb4q/HPC-4181/hwloc-2.11.2/install/lib:/panfs/snnhpc02panfs/u/ydfb4q/HPC-4181/hwloc-2.11.2/install/lib:/cm/shared/apps/slurm/current/lib64/slurm:/cm/shared/apps/slurm/current/lib64:/cm/local/apps/gcc/11.2.0/lib:/cm/local/apps/gcc/11.2.0/lib64:/cm/local/apps/gcc/11.2.0/lib32
[snnhpc02n009:1836625] [prterun-snnhpc02n009-1836625@0,0] plm:base:orted_report_launch from daemon [prterun-snnhpc02n009-1836625@0,1]
[snnhpc02n009:1836625] [prterun-snnhpc02n009-1836625@0,0] plm:base:orted_report_launch from daemon [prterun-snnhpc02n009-1836625@0,1] on node snnhpc02n018
[snnhpc02n009:1836625] ALIASES FOR NODE snnhpc02n018 (snnhpc02n018)
[snnhpc02n009:1836625]  ALIAS: 10.144.0.18
[snnhpc02n009:1836625]  ALIAS: 10.145.0.18
[snnhpc02n009:1836625] [prterun-snnhpc02n009-1836625@0,0] RECEIVED TOPOLOGY SIG 8N:2S:32L3:64L2:64L1:64C:64H:0-63::x86_64:le FROM NODE snnhpc02n018
[snnhpc02n009:1836625] [prterun-snnhpc02n009-1836625@0,0] TOPOLOGY SIGNATURE ALREADY RECORDED IN POSN 0
[snnhpc02n009:1836625] [prterun-snnhpc02n009-1836625@0,0] plm:base:orted_report_launch completed for daemon [prterun-snnhpc02n009-1836625@0,1] at contact prterun-snnhpc02n009-1836625@0.1;tcp://10.144.0.18,10.145.0.18:43835:16,16
[snnhpc02n009:1836625] [prterun-snnhpc02n009-1836625@0,0] plm:base:orted_report_launch job prterun-snnhpc02n009-1836625@0 recvd 2 of 2 reported daemons
free(): invalid next size (fast)
Aborted (core dumped)
[ydfb4q@snnhpc02n009 gnu11.2-prrte-3.0.7]$ srun: error: snnhpc02n018: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=173472.9

For reference, I am logged in to snnhpc02n009 and the hostfile is:

snnhpc02n009
snnhpc02n018

I just noticed another thing: Unlike stated above, the problem also appears with a single host in the hostfile, but it must not be the host I am logged in to. So, if I just keep the second line I can also reproduce it with a single host in the hostfile. Still, the reservation must have 8 or more nodes in order for the problem to appear.

rhc54 commented 3 days ago

Yeah, I was asking about Slurm version because we know there was a major change there in 23.11 that can cause problems, so just wanted to see if we were dealing with that scenario.

Interesting output - shows that you actually launched the daemons just fine. Looks like the problem comes when you subsequently attempt to launch the app, which really has nothing to do with Slurm. Puzzling as it should affect all launch environments, but clearly only impacts this one. I wonder if it has something to do with parsing your node names? May have to see if I can replicate that pattern. It is odd as the nidmap code you pointed at hasn't changed in years, including migrating across the OMPI v4 and v5 series.

Open MPI 4.1.5 and the default bundled PMIX and PRRTE are working just fine

I'm a little confused by this statement. OMPI v4 doesn't use PRRTE. I gather that OMPI v4 works fine as-is? Are you also saying that OMPI v5 works fine with the bundled PMIx/PRRTE versions?

Could you check that you configured PRRTE with --enable-debug just so we get all the available output? Then let's add --prtemca rmaps_base_verbose 5 --prtemca state_base_verbose 5 to the cmd line and see if we can get more info as to where the problem is being hit.

mkre commented 3 days ago

I'm a little confused by this statement. OMPI v4 doesn't use PRRTE. I gather that OMPI v4 works fine as-is?

Yes, Open MPI 4.1.5 is working fine as-is.

Are you also saying that OMPI v5 works fine with the bundled PMIx/PRRTE versions?

No, Open MPI 5.0.5 is not working fine as-is. I narrowed the problem down to PRRTE (I think) and confirmed it with the latest version (copied into the Open MPI source tree before building), which is why I have opened the ticket here instead of Open MPI.

I have configured Open MPI with --enable-debug --enable-mem-debug, which I think should propagate down to PRRTE.

Then let's add --prtemca rmaps_base_verbose 5 --prtemca state_base_verbose 5 to the cmd line and see if we can get more info as to where the problem is being hit.

I'll report back on these tests once I've gotten an 8-node job on the cluster.

rhc54 commented 3 days ago

Hold on a bit on the tests. Looking more at your gdb results and the code itself, I can see where memory corruption would occur. What I cannot yet understand is why it wouldn't happen every time an allocation was subdivided - yet I cannot reproduce it. There are a couple of places I can look still, so I'll poke around some more first. Hopefully have a patch you can try soon.

rhc54 commented 3 days ago

Can you please give the following patch a try:

diff --git a/src/util/nidmap.c b/src/util/nidmap.c
index 9abc8ad76e..fad223dbc7 100644
--- a/src/util/nidmap.c
+++ b/src/util/nidmap.c
@@ -86,6 +86,9 @@ int prte_util_nidmap_create(pmix_pointer_array_t *pool, pmix_data_buffer_t *buff
         if (NULL == (nptr = (prte_node_t *) pmix_pointer_array_get_item(pool, n))) {
             continue;
         }
+        if (NULL == nptr->daemon) {
+            continue;
+        }
         /* add the hostname to the argv */
         PMIX_ARGV_APPEND_NOSIZE_COMPAT(&names, nptr->name);
         als = NULL;
@@ -101,11 +104,7 @@ int prte_util_nidmap_create(pmix_pointer_array_t *pool, pmix_data_buffer_t *buff
             PMIX_ARGV_APPEND_NOSIZE_COMPAT(&aliases, "PRTENONE");
         }
         /* store the vpid */
-        if (NULL == nptr->daemon) {
-            vpids[ndaemons] = PMIX_RANK_INVALID;
-        } else {
-            vpids[ndaemons] = nptr->daemon->name.rank;
-        }
+        vpids[ndaemons] = nptr->daemon->name.rank;
         ++ndaemons;
     }

@@ -398,22 +397,20 @@ int prte_util_decode_nidmap(pmix_data_buffer_t *buf)
         /* set the topology - always default to homogeneous
          * as that is the most common scenario */
         nd->topology = t;
-        /* see if it has a daemon on it */
-        if (PMIX_RANK_INVALID != vpid[n]) {
-            proc = (prte_proc_t *) pmix_pointer_array_get_item(daemons->procs, vpid[n]);
-            if (NULL == proc) {
-                proc = PMIX_NEW(prte_proc_t);
-                PMIX_LOAD_PROCID(&proc->name, PRTE_PROC_MY_NAME->nspace, vpid[n]);
-                proc->state = PRTE_PROC_STATE_RUNNING;
-                PRTE_FLAG_SET(proc, PRTE_PROC_FLAG_ALIVE);
-                daemons->num_procs++;
-                pmix_pointer_array_set_item(daemons->procs, proc->name.rank, proc);
-            }
-            PMIX_RETAIN(nd);
-            proc->node = nd;
-            PMIX_RETAIN(proc);
-            nd->daemon = proc;
+        /* record the daemon on it */
+        proc = (prte_proc_t *) pmix_pointer_array_get_item(daemons->procs, vpid[n]);
+        if (NULL == proc) {
+            proc = PMIX_NEW(prte_proc_t);
+            PMIX_LOAD_PROCID(&proc->name, PRTE_PROC_MY_NAME->nspace, vpid[n]);
+            proc->state = PRTE_PROC_STATE_RUNNING;
+            PRTE_FLAG_SET(proc, PRTE_PROC_FLAG_ALIVE);
+            daemons->num_procs++;
+            pmix_pointer_array_set_item(daemons->procs, proc->name.rank, proc);
         }
+        PMIX_RETAIN(nd);
+        proc->node = nd;
+        PMIX_RETAIN(proc);
+        nd->daemon = proc;
     }

     /* update num procs */

mkre commented 3 days ago

I'll try that early next week and report back here as soon as possible.

mkre commented 22 hours ago

Just to close the loop here: Your patch fixes the issue. Thanks for the quick turnaround!

rhc54 commented 16 hours ago

Thanks! Just to clarify the problem: this only hit when someone subdivided an allocation in a managed environment (i.e., one that was defined by a scheduler), which is why I had trouble reproducing it. I was able to finally do so by hacking the allocation code to fake it into thinking a scheduler had assigned the allocation.

I could see how/when the bug got there, and why it was initially introduced (changes elsewhere in the code that were later revised out). Oddly enough, it appears this isn't something people do a lot as it remained undiscovered for quite a long time!

So thanks again for finding and reporting it.

openpmix / prrte