openpmix / prrte

PMIx Reference RunTime Environment (PRRTE)
https://pmix.org

Segfaulting in hwloc_get_cpubind #1400

Closed wckzhang closed 2 years ago

wckzhang commented 2 years ago

Background information

PMIx: 41f4225d6fb806ff218eb229a9a25baf5a97c5fa
PRRTE: 0b580da7c8952a95a39a2cdb5d13b3453fb934ce

Please describe the system on which you are running

Amazon Linux 2 c5n.18xlarge

Details of the problem

It segfaults in hwloc_get_cpubind fairly frequently when I run this command:

[ec2-user@ip-172-31-10-115 ~]$ ~/install2/bin/mpirun  -np 16 -N 36  --leave-session-attached --prtemca plm_base_verbose 5   --hostfile ~/hostfile /bin/true
[ip-172-31-10-115:01639] [prterun-ip-172-31-10-115-1639@0,0] plm:ssh: final template argv:
    /usr/bin/ssh <template> PRTE_PREFIX=/home/ec2-user/install2;export PRTE_PREFIX;LD_LIBRARY_PATH=/home/ec2-user/install2/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/home/ec2-user/install2/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/home/ec2-user/install2/bin/prted --leave-session-attached --prtemca ess "env" --prtemca ess_base_nspace "prterun-ip-172-31-10-115-1639@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "5" --prtemca prte_hnp_uri "prterun-ip-172-31-10-115-1639@0.0;tcp://127.0.0.1,172.31.10.115:51579:8,20" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-ip-172-31-10-115-1639@0.0;tcp://127.0.0.1,172.31.10.115:51579:8,20"
[ip-172-31-10-115:01639] ALIASES FOR NODE compute-st-c5n18xlarge-4 (compute-st-c5n18xlarge-4)
[ip-172-31-10-115:01639]    ALIAS: 172.31.12.127
[ip-172-31-10-115:01639] ALIASES FOR NODE compute-st-c5n18xlarge-3 (compute-st-c5n18xlarge-3)
[ip-172-31-10-115:01639]    ALIAS: 172.31.13.250
Segmentation fault (core dumped)

Sometimes it also throws this error:

--------------------------------------------------------------------------
Your job has requested more processes than the ppr for
this topology can support:

  App: /bin/true
  Number of procs:  16
  Procs mapped:  16
  Total number of procs:  36
  PPR: 36:node

Please revise the conflict and try again.
--------------------------------------------------------------------------
rhc54 commented 2 years ago

Well, it isn't illegal - but that's a pretty weird cmd line. Run 36ppn but only launch 16? If you just tell it --np 16, does it work?

I suspect the mapper is getting confused, hence the question.

wckzhang commented 2 years ago

I still get the segfault with:

[ec2-user@ip-172-31-10-115 ~]$ ~/install2/bin/mpirun  --np 16  --leave-session-attached --prtemca plm_base_verbose 5   --hostfile ~/hostfile /bin/true
[ip-172-31-10-115:07266] [prterun-ip-172-31-10-115-7266@0,0] plm:ssh: final template argv:
    /usr/bin/ssh <template> PRTE_PREFIX=/home/ec2-user/install2;export PRTE_PREFIX;LD_LIBRARY_PATH=/home/ec2-user/install2/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/home/ec2-user/install2/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/home/ec2-user/install2/bin/prted --leave-session-attached --prtemca ess "env" --prtemca ess_base_nspace "prterun-ip-172-31-10-115-7266@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "5" --prtemca prte_hnp_uri "prterun-ip-172-31-10-115-7266@0.0;tcp://127.0.0.1,172.31.10.115:56949:8,20" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-ip-172-31-10-115-7266@0.0;tcp://127.0.0.1,172.31.10.115:56949:8,20"
[ip-172-31-10-115:07266] ALIASES FOR NODE compute-st-c5n18xlarge-4 (compute-st-c5n18xlarge-4)
[ip-172-31-10-115:07266]    ALIAS: 172.31.12.127
[ip-172-31-10-115:07266] ALIASES FOR NODE compute-st-c5n18xlarge-1 (compute-st-c5n18xlarge-1)
[ip-172-31-10-115:07266]    ALIAS: 172.31.4.124
Segmentation fault

I'm not getting the warning. (The -N 36 was because I was fiddling with different np counts; this issue occurs fairly frequently, but not 100% of the time.)

rhc54 commented 2 years ago

Okay, anything different about these machines? They all have at least 16 cores? Can you give me a backtrace or something to work with?

Obviously, I'm not able to reproduce this locally, so I need something to work with here.

wckzhang commented 2 years ago

Yeah, the backtrace is:

(gdb) bt
#0  0x00007f22712672d7 in hwloc_get_cpubind () from /lib64/libhwloc.so.5
#1  0x00007f2272884f31 in prte_hwloc_base_setup_summary () from /home/ec2-user/install2/lib/libprrte.so.2
#2  0x00007f22728eb609 in prte_plm_base_daemon_callback () from /home/ec2-user/install2/lib/libprrte.so.2
#3  0x00007f22728897b6 in prte_rml_base_process_msg () from /home/ec2-user/install2/lib/libprrte.so.2
#4  0x00007f22716a13ad in event_base_loop () from /lib64/libevent_core-2.0.so.5
#5  0x0000000000404298 in main ()

These machines (c5n.18xlarge) have 36 cores, and the hostfile has 4 different nodes, so --np 16 shouldn't be an issue.

wckzhang commented 2 years ago

There's nothing particularly different about these machines; running with the OMPI main branch prrte & pmix doesn't hit this segfault.

rhc54 commented 2 years ago

Wait - what?

> There's nothing particularly different about these machines; running with the OMPI main branch prrte & pmix doesn't hit this segfault.

So where is this problem coming from? The v5 branch? If so, we know that the PRRTE/PMIx code there is stale - not worth trying to chase this down.

wckzhang commented 2 years ago

> Wait - what?
>
> So where is this problem coming from? The v5 branch? If so, we know that the PRRTE/PMIx code there is stale - not worth trying to chase this down.

This was with PR #10611 (Open MPI main plus the new submodule pointers). Without the submodule pointer updates, it works fine.

wckzhang commented 2 years ago

Since #10611 has been merged, let me try with the head of main again.

rhc54 commented 2 years ago

Configure it with --enable-debug so we can get line numbers if it continues to segfault.
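
For example, something along these lines from the source tree - the prefix path here is just illustrative:

$ ./configure --prefix=$HOME/install --enable-debug
$ make -j install

With a debug build, the backtrace will show file names and line numbers instead of bare symbols.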

wckzhang commented 2 years ago

I am receiving those segfaults on ompi main now. Command:

[ec2-user@ip-172-31-10-115 ompi]$ ~/install/bin/mpirun --np 2 --hostfile ~/hostfile /bin/true
Segmentation fault (core dumped)

backtrace:

#0  0x00007f0a967c92d7 in hwloc_get_cpubind () from /lib64/libhwloc.so.5
#1  0x00007f0a97f191e0 in prte_hwloc_base_setup_summary (topo=0x0) at hwloc/hwloc_base_util.c:173
#2  0x00007f0a97f1930e in prte_hwloc_base_filter_cpus (topo=0x0) at hwloc/hwloc_base_util.c:211
#3  0x00007f0a97fbf525 in prte_plm_base_daemon_callback (status=0, sender=0x16bd930, buffer=0x16bda40, tag=10, cbdata=0x0) at base/plm_base_launch_support.c:1683
#4  0x00007f0a97f211e9 in prte_rml_base_process_msg (fd=-1, flags=4, cbdata=0x16bd800) at rml/rml_base_msg_handlers.c:202
#5  0x00007f0a96c033ad in event_base_loop () from /lib64/libevent_core-2.0.so.5
#6  0x0000000000406594 in main (argc=6, argv=0x7ffd1b1e46c8) at prte.c:727
rhc54 commented 2 years ago

Pretty easy to see why - the topo being passed to "setup_summary" is NULL. Question is why. Perhaps add --prtemca plm_base_verbose 5 to the cmd line and let's see which daemon is calling back and causing the problem.
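
For reference, hwloc_get_cpubind dereferences the topology handle without validating it, so passing a NULL topology crashes inside libhwloc rather than returning an error code. A minimal sketch of the failure mode and a defensive guard - illustrative only, not the actual PRRTE code path (safe_get_cpubind is a hypothetical helper):

    #include <hwloc.h>
    #include <stdio.h>

    /* Hypothetical guard: hwloc does not validate the topology handle,
     * so a NULL check is needed before calling in - otherwise we crash
     * inside libhwloc exactly as in the backtrace above. */
    static int safe_get_cpubind(hwloc_topology_t topo, hwloc_cpuset_t set)
    {
        if (NULL == topo) {
            fprintf(stderr, "topology not initialized\n");
            return -1;
        }
        return hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS);
    }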

wckzhang commented 2 years ago

This seems to be an issue when I have a hostfile with at least 2 nodes in it.
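For reference, the hostfile format in play just lists one hostname per line; a minimal two-node version built from the node names in the log below would be (illustrative - the actual file may also carry slots annotations):

    compute-st-c5n18xlarge-3
    compute-st-c5n18xlarge-4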

[ec2-user@ip-172-31-10-115 ompi]$ ~/install/bin/mpirun --np 2 --prtemca plm_base_verbose 5 --hostfile ~/hostfile /bin/true
[ip-172-31-10-115:18049] [[INVALID],0] plm:ssh_lookup on agent ssh : rsh path NULL
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:receive start comm
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:setup_vm
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:setup_vm creating map
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] setup:vm: working unmanaged allocation
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] using hostfile /home/ec2-user/hostfile
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] checking node compute-st-c5n18xlarge-3
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] checking node compute-st-c5n18xlarge-4
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:setup_vm add new daemon [prterun-ip-172-31-10-115-18049@0,1]
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:setup_vm assigning new daemon [prterun-ip-172-31-10-115-18049@0,1] to node compute-st-c5n18xlarge-3
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:setup_vm add new daemon [prterun-ip-172-31-10-115-18049@0,2]
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:setup_vm assigning new daemon [prterun-ip-172-31-10-115-18049@0,2] to node compute-st-c5n18xlarge-4
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: launching vm
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: local shell: 0 (bash)
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: assuming same remote shell as local shell
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: remote shell: 0 (bash)
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: final template argv:
    /usr/bin/ssh <template> PRTE_PREFIX=/home/ec2-user/install;export PRTE_PREFIX;LD_LIBRARY_PATH=/home/ec2-user/install/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/home/ec2-user/install/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/home/ec2-user/install/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-ip-172-31-10-115-18049@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "3" --prtemca prte_hnp_uri "prterun-ip-172-31-10-115-18049@0.0;tcp://127.0.0.1,172.31.10.115:58999:8,20" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-ip-172-31-10-115-18049@0.0;tcp://127.0.0.1,172.31.10.115:58999:8,20"
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh:launch daemon 0 not a child of mine
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: adding node compute-st-c5n18xlarge-3 to launch list
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: adding node compute-st-c5n18xlarge-4 to launch list
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: activating launch event
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: recording launch of daemon [prterun-ip-172-31-10-115-18049@0,1]
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: recording launch of daemon [prterun-ip-172-31-10-115-18049@0,2]
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh compute-st-c5n18xlarge-3 PRTE_PREFIX=/home/ec2-user/install;export PRTE_PREFIX;LD_LIBRARY_PATH=/home/ec2-user/install/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/home/ec2-user/install/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/home/ec2-user/install/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-ip-172-31-10-115-18049@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "3" --prtemca prte_hnp_uri "prterun-ip-172-31-10-115-18049@0.0;tcp://127.0.0.1,172.31.10.115:58999:8,20" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-ip-172-31-10-115-18049@0.0;tcp://127.0.0.1,172.31.10.115:58999:8,20"]
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh compute-st-c5n18xlarge-4 PRTE_PREFIX=/home/ec2-user/install;export PRTE_PREFIX;LD_LIBRARY_PATH=/home/ec2-user/install/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/home/ec2-user/install/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/home/ec2-user/install/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-ip-172-31-10-115-18049@0" --prtemca ess_base_vpid 2 --prtemca ess_base_num_procs "3" --prtemca prte_hnp_uri "prterun-ip-172-31-10-115-18049@0.0;tcp://127.0.0.1,172.31.10.115:58999:8,20" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-ip-172-31-10-115-18049@0.0;tcp://127.0.0.1,172.31.10.115:58999:8,20"]
Warning: Permanently added 'compute-st-c5n18xlarge-3,172.31.13.250' (ECDSA) to the list of known hosts.
Warning: Permanently added 'compute-st-c5n18xlarge-4,172.31.12.127' (ECDSA) to the list of known hosts.
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-18049@0,2]
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-18049@0,2] on node compute-st-c5n18xlarge-4
[ip-172-31-10-115:18049] ALIASES FOR NODE compute-st-c5n18xlarge-4 (compute-st-c5n18xlarge-4)
[ip-172-31-10-115:18049]    ALIAS: 172.31.12.127
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] RECEIVED TOPOLOGY SIG 2N:2S:0L3:0L2:0L1:36C:36H:0-35::x86_64:le FROM NODE compute-st-c5n18xlarge-4
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] NEW TOPOLOGY - ADDING
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] REQUESTING TOPOLOGY FROM [prterun-ip-172-31-10-115-18049@0,2]
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-18049@0,1]
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-18049@0,1] on node compute-st-c5n18xlarge-3
[ip-172-31-10-115:18049] ALIASES FOR NODE compute-st-c5n18xlarge-3 (compute-st-c5n18xlarge-3)
[ip-172-31-10-115:18049]    ALIAS: 172.31.13.250
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] RECEIVED TOPOLOGY SIG 2N:2S:0L3:0L2:0L1:36C:36H:0-35::x86_64:le FROM NODE compute-st-c5n18xlarge-3
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] TOPOLOGY ALREADY RECORDED
Segmentation fault (core dumped)
rhc54 commented 2 years ago

Okay, I see what's going on - the second node is responding before the first one, and that creates a race condition that you lose. Let me poke at it a bit - shouldn't be too hard to fix.
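
For illustration, the shape of the problem - hypothetical code, not the actual PRRTE data structures: the HNP records a topology signature as soon as the first daemon reports it, but the full topology object only arrives in a later message (the "REQUESTING TOPOLOGY" step in the log), so a lookup that trusts "signature already recorded" can hand back an entry whose topology pointer is still NULL:

    #include <hwloc.h>
    #include <string.h>

    /* Hypothetical cache entry: the signature is stored when a daemon
     * first reports it, but topo stays NULL until the follow-up
     * topology message arrives. */
    typedef struct {
        const char *signature;
        hwloc_topology_t topo;
    } topo_entry_t;

    /* If a second daemon with the same signature reports before the
     * first daemon's topology message has arrived, this returns an
     * entry whose topo is still NULL - and the caller then feeds NULL
     * into the hwloc routines, as seen in the backtrace. */
    static hwloc_topology_t lookup_topo(topo_entry_t *cache, size_t n,
                                        const char *sig)
    {
        for (size_t i = 0; i < n; i++) {
            if (0 == strcmp(cache[i].signature, sig)) {
                return cache[i].topo;
            }
        }
        return NULL;
    }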

rhc54 commented 2 years ago

Hmmm...it appears that the node where you are running mpirun must have a different topology than the other two nodes - correct?

wckzhang commented 2 years ago

Yes

wckzhang commented 2 years ago

(head node is small node, compute nodes are in hostfile)

wckzhang commented 2 years ago

Running mpirun from one of the compute nodes, I don't hit this segfault, so the topology difference is most likely the issue.

rhc54 commented 2 years ago

Funny thing is that this race condition has existed for many, many years - dating way back to the ORTE days. Even though you were losing the race all that time, it didn't harm anything other than causing you to request and store extra topologies - an efficiency loss, but not a segfault, so you wouldn't have noticed unless you were looking carefully at the verbose output.

The fix was a tad more involved than I had hoped, but ultimately seems to be working (I was able to hack things up enough to reproduce the problem locally). Please give it a shot when you get a chance.

wckzhang commented 2 years ago

Updated to the prrte master head. Now I'm seeing a hang instead of a segfault. Command:

while ~/install/bin/mpirun  --np 2 --prtemca plm_base_verbose 5 --hostfile ~/hostfile /bin/true; do :; done
.
.
.
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-19307@0,1]
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-19307@0,1] on node compute-st-c5n18xlarge-1
[ip-172-31-10-115:19307] ALIASES FOR NODE compute-st-c5n18xlarge-1 (compute-st-c5n18xlarge-1)
[ip-172-31-10-115:19307]    ALIAS: 172.31.4.124
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] RECEIVED TOPOLOGY SIG 2N:2S:0L3:0L2:0L1:36C:36H:0-35::x86_64:le FROM NODE compute-st-c5n18xlarge-1
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] NEW TOPOLOGY - ADDING
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] REQUESTING TOPOLOGY FROM [prterun-ip-172-31-10-115-19307@0,1]
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-19307@0,2]
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-19307@0,2] on node compute-st-c5n18xlarge-2
[ip-172-31-10-115:19307] ALIASES FOR NODE compute-st-c5n18xlarge-2 (compute-st-c5n18xlarge-2)
[ip-172-31-10-115:19307]    ALIAS: 172.31.12.27
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] RECEIVED TOPOLOGY SIG 2N:2S:0L3:0L2:0L1:36C:36H:0-35::x86_64:le FROM NODE compute-st-c5n18xlarge-2
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] TOPOLOGY SIGNATURE ALREADY RECORDED
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] NEW TOPOLOGY - ADDING
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:daemon_topology recvd for daemon [prterun-ip-172-31-10-115-19307@0,1]
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted:report_topo launch completed for daemon [prterun-ip-172-31-10-115-19307@0,1]
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch recvd 2 of 5 reported daemons
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-19307@0,4]
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-19307@0,4] on node compute-st-c5n18xlarge-4
[ip-172-31-10-115:19307] ALIASES FOR NODE compute-st-c5n18xlarge-4 (compute-st-c5n18xlarge-4)
[ip-172-31-10-115:19307]    ALIAS: 172.31.12.127
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] RECEIVED TOPOLOGY SIG 2N:2S:0L3:0L2:0L1:36C:36H:0-35::x86_64:le FROM NODE compute-st-c5n18xlarge-4
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] TOPOLOGY SIGNATURE ALREADY RECORDED
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch completed for daemon [prterun-ip-172-31-10-115-19307@0,4] at contact (null)
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch job prterun-ip-172-31-10-115-19307@0 recvd 3 of 5 reported daemons
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-19307@0,3]
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-19307@0,3] on node compute-st-c5n18xlarge-3
[ip-172-31-10-115:19307] ALIASES FOR NODE compute-st-c5n18xlarge-3 (compute-st-c5n18xlarge-3)
[ip-172-31-10-115:19307]    ALIAS: 172.31.13.250
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] RECEIVED TOPOLOGY SIG 2N:2S:0L3:0L2:0L1:36C:36H:0-35::x86_64:le FROM NODE compute-st-c5n18xlarge-3
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] TOPOLOGY SIGNATURE ALREADY RECORDED
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch completed for daemon [prterun-ip-172-31-10-115-19307@0,3] at contact (null)
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch job prterun-ip-172-31-10-115-19307@0 recvd 4 of 5 reported daemons
<hangs here>
rhc54 commented 2 years ago

Sigh - I wish I could get access to these bloody machines! What is the hostfile?

wckzhang commented 2 years ago

I can let you access my head node if that makes it easier. What e-mail do you use now? (Doesn't look like you're in the open-mpi Slack anymore.)

rhc54 commented 2 years ago

rhc at pmix.org

rhc54 commented 2 years ago

errr...what did I do differently? I just tried this:

mpirun -np 2 --hostfile hostfile --prtemca plm_base_verbose 5 /bin/true

on your machine and it ran perfectly. I set the following in the environment:

$ export PATH=/home/ec2-user/install/bin:$PATH
$ export LD_LIBRARY_PATH=/home/ec2-user/install/lib:$LD_LIBRARY_PATH
$ which mpirun
~/install/bin/mpirun
wckzhang commented 2 years ago

@rhc54 It doesn't occur on every run; that's why I set up the loop. It occurs maybe once every 20 runs.

rhc54 commented 2 years ago

Okay, I got it - thanks!