wckzhang closed this issue 2 years ago
Well, it isn't illegal - but that's a pretty weird cmd line. Run 36 ppn but only launch 16? If you just tell it --np 16, does it work?
I suspect the mapper is getting confused, hence the question.
I still get the segfault with:
[ec2-user@ip-172-31-10-115 ~]$ ~/install2/bin/mpirun --np 16 --leave-session-attached --prtemca plm_base_verbose 5 --hostfile ~/hostfile /bin/true
[ip-172-31-10-115:07266] [prterun-ip-172-31-10-115-7266@0,0] plm:ssh: final template argv:
/usr/bin/ssh <template> PRTE_PREFIX=/home/ec2-user/install2;export PRTE_PREFIX;LD_LIBRARY_PATH=/home/ec2-user/install2/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/home/ec2-user/install2/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/home/ec2-user/install2/bin/prted --leave-session-attached --prtemca ess "env" --prtemca ess_base_nspace "prterun-ip-172-31-10-115-7266@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "5" --prtemca prte_hnp_uri "prterun-ip-172-31-10-115-7266@0.0;tcp://127.0.0.1,172.31.10.115:56949:8,20" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-ip-172-31-10-115-7266@0.0;tcp://127.0.0.1,172.31.10.115:56949:8,20"
[ip-172-31-10-115:07266] ALIASES FOR NODE compute-st-c5n18xlarge-4 (compute-st-c5n18xlarge-4)
[ip-172-31-10-115:07266] ALIAS: 172.31.12.127
[ip-172-31-10-115:07266] ALIASES FOR NODE compute-st-c5n18xlarge-1 (compute-st-c5n18xlarge-1)
[ip-172-31-10-115:07266] ALIAS: 172.31.4.124
Segmentation fault
I'm not getting the warning. (The -N 36 was because I was experimenting with different np counts; this issue occurs fairly frequently, but not 100% of the time.)
Okay, is there anything different about these machines? Do they all have at least 16 cores? Can you give me a backtrace or something to work with?
Obviously, I'm not able to reproduce this locally, so I need something to work with here.
Yeah, the backtrace is:
(gdb) bt
#0 0x00007f22712672d7 in hwloc_get_cpubind () from /lib64/libhwloc.so.5
#1 0x00007f2272884f31 in prte_hwloc_base_setup_summary () from /home/ec2-user/install2/lib/libprrte.so.2
#2 0x00007f22728eb609 in prte_plm_base_daemon_callback () from /home/ec2-user/install2/lib/libprrte.so.2
#3 0x00007f22728897b6 in prte_rml_base_process_msg () from /home/ec2-user/install2/lib/libprrte.so.2
#4 0x00007f22716a13ad in event_base_loop () from /lib64/libevent_core-2.0.so.5
#5 0x0000000000404298 in main ()
These machines (c5n.18xlarge) have 36 cores, the hostfile also has 4 different nodes, so --np 16 shouldn't have an issue.
There's nothing particularly different about these machines; running with the OMPI main branch PRRTE & PMIx doesn't encounter this segfault.
Wait - what?
There's nothing particularly different with these machines, running on OMPI main branch prrte & pmix doesn't encounter this segfault
So where is this problem coming from? The v5 branch? If so, we know that PRRTE/PMIx code is stale - not worth trying to chase this down.
This was with PR #10611 (Open MPI main plus the new submodule pointers). Without the submodule pointer updates, it works fine.
Since #10611 has been merged, let me try with the head of main again.
Configure it with --enable-debug so we can get line numbers if it continues to segfault.
I am now seeing those segfaults on OMPI main. Command:
[ec2-user@ip-172-31-10-115 ompi]$ ~/install/bin/mpirun --np 2 --hostfile ~/hostfile /bin/true
Segmentation fault (core dumped)
backtrace:
#0 0x00007f0a967c92d7 in hwloc_get_cpubind () from /lib64/libhwloc.so.5
#1 0x00007f0a97f191e0 in prte_hwloc_base_setup_summary (topo=0x0) at hwloc/hwloc_base_util.c:173
#2 0x00007f0a97f1930e in prte_hwloc_base_filter_cpus (topo=0x0) at hwloc/hwloc_base_util.c:211
#3 0x00007f0a97fbf525 in prte_plm_base_daemon_callback (status=0, sender=0x16bd930, buffer=0x16bda40, tag=10, cbdata=0x0) at base/plm_base_launch_support.c:1683
#4 0x00007f0a97f211e9 in prte_rml_base_process_msg (fd=-1, flags=4, cbdata=0x16bd800) at rml/rml_base_msg_handlers.c:202
#5 0x00007f0a96c033ad in event_base_loop () from /lib64/libevent_core-2.0.so.5
#6 0x0000000000406594 in main (argc=6, argv=0x7ffd1b1e46c8) at prte.c:727
Pretty easy to see why - the topo being passed to "setup_summary" is NULL. Question is why. Perhaps add --prtemca plm_base_verbose 5 to the cmd line and let's see which daemon is calling back and causing the problem.
This seems to be an issue when I have a hostfile with at least 2 nodes in it.
[ec2-user@ip-172-31-10-115 ompi]$ ~/install/bin/mpirun --np 2 --prtemca plm_base_verbose 5 --hostfile ~/hostfile /bin/true
[ip-172-31-10-115:18049] [[INVALID],0] plm:ssh_lookup on agent ssh : rsh path NULL
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:receive start comm
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:setup_vm
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:setup_vm creating map
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] setup:vm: working unmanaged allocation
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] using hostfile /home/ec2-user/hostfile
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] checking node compute-st-c5n18xlarge-3
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] checking node compute-st-c5n18xlarge-4
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:setup_vm add new daemon [prterun-ip-172-31-10-115-18049@0,1]
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:setup_vm assigning new daemon [prterun-ip-172-31-10-115-18049@0,1] to node compute-st-c5n18xlarge-3
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:setup_vm add new daemon [prterun-ip-172-31-10-115-18049@0,2]
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:setup_vm assigning new daemon [prterun-ip-172-31-10-115-18049@0,2] to node compute-st-c5n18xlarge-4
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: launching vm
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: local shell: 0 (bash)
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: assuming same remote shell as local shell
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: remote shell: 0 (bash)
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: final template argv:
/usr/bin/ssh <template> PRTE_PREFIX=/home/ec2-user/install;export PRTE_PREFIX;LD_LIBRARY_PATH=/home/ec2-user/install/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/home/ec2-user/install/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/home/ec2-user/install/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-ip-172-31-10-115-18049@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "3" --prtemca prte_hnp_uri "prterun-ip-172-31-10-115-18049@0.0;tcp://127.0.0.1,172.31.10.115:58999:8,20" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-ip-172-31-10-115-18049@0.0;tcp://127.0.0.1,172.31.10.115:58999:8,20"
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh:launch daemon 0 not a child of mine
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: adding node compute-st-c5n18xlarge-3 to launch list
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: adding node compute-st-c5n18xlarge-4 to launch list
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: activating launch event
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: recording launch of daemon [prterun-ip-172-31-10-115-18049@0,1]
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: recording launch of daemon [prterun-ip-172-31-10-115-18049@0,2]
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh compute-st-c5n18xlarge-3 PRTE_PREFIX=/home/ec2-user/install;export PRTE_PREFIX;LD_LIBRARY_PATH=/home/ec2-user/install/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/home/ec2-user/install/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/home/ec2-user/install/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-ip-172-31-10-115-18049@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "3" --prtemca prte_hnp_uri "prterun-ip-172-31-10-115-18049@0.0;tcp://127.0.0.1,172.31.10.115:58999:8,20" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-ip-172-31-10-115-18049@0.0;tcp://127.0.0.1,172.31.10.115:58999:8,20"]
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh compute-st-c5n18xlarge-4 PRTE_PREFIX=/home/ec2-user/install;export PRTE_PREFIX;LD_LIBRARY_PATH=/home/ec2-user/install/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/home/ec2-user/install/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/home/ec2-user/install/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-ip-172-31-10-115-18049@0" --prtemca ess_base_vpid 2 --prtemca ess_base_num_procs "3" --prtemca prte_hnp_uri "prterun-ip-172-31-10-115-18049@0.0;tcp://127.0.0.1,172.31.10.115:58999:8,20" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-ip-172-31-10-115-18049@0.0;tcp://127.0.0.1,172.31.10.115:58999:8,20"]
Warning: Permanently added 'compute-st-c5n18xlarge-3,172.31.13.250' (ECDSA) to the list of known hosts.
Warning: Permanently added 'compute-st-c5n18xlarge-4,172.31.12.127' (ECDSA) to the list of known hosts.
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-18049@0,2]
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-18049@0,2] on node compute-st-c5n18xlarge-4
[ip-172-31-10-115:18049] ALIASES FOR NODE compute-st-c5n18xlarge-4 (compute-st-c5n18xlarge-4)
[ip-172-31-10-115:18049] ALIAS: 172.31.12.127
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] RECEIVED TOPOLOGY SIG 2N:2S:0L3:0L2:0L1:36C:36H:0-35::x86_64:le FROM NODE compute-st-c5n18xlarge-4
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] NEW TOPOLOGY - ADDING
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] REQUESTING TOPOLOGY FROM [prterun-ip-172-31-10-115-18049@0,2]
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-18049@0,1]
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-18049@0,1] on node compute-st-c5n18xlarge-3
[ip-172-31-10-115:18049] ALIASES FOR NODE compute-st-c5n18xlarge-3 (compute-st-c5n18xlarge-3)
[ip-172-31-10-115:18049] ALIAS: 172.31.13.250
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] RECEIVED TOPOLOGY SIG 2N:2S:0L3:0L2:0L1:36C:36H:0-35::x86_64:le FROM NODE compute-st-c5n18xlarge-3
[ip-172-31-10-115:18049] [prterun-ip-172-31-10-115-18049@0,0] TOPOLOGY ALREADY RECORDED
Segmentation fault (core dumped)
Okay, I see what's going on - the second node is responding before the first one, and that creates a race condition that you lose. Let me poke at it a bit - shouldn't be too hard to fix.
Hmmm...it appears that the node where you are running mpirun must have a different topology than the other two nodes - correct?
Yes
(The head node is a smaller instance; the compute nodes are the ones in the hostfile.)
Running mpirun from a compute node, I don't hit this segfault either, so that's most likely the issue.
Funny thing is that this race condition has existed for many, many years - dating way back into the ORTE days. Even though you were losing the race all that time, it didn't harm anything other than causing you to request and store extra topologies - an efficiency loss, but not a segfault so you wouldn't have noticed unless you were looking carefully at the verbose output.
The fix was a tad more involved than I had hoped, but ultimately seems to be working (I was able to hack things up enough to reproduce the problem locally). Please give it a shot when you get a chance.
Updated to the head of PRRTE master. Now I'm seeing a hang instead of a segfault. Command:
while ~/install/bin/mpirun --np 2 --prtemca plm_base_verbose 5 --hostfile ~/hostfile /bin/true; do :; done
...
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-19307@0,1]
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-19307@0,1] on node compute-st-c5n18xlarge-1
[ip-172-31-10-115:19307] ALIASES FOR NODE compute-st-c5n18xlarge-1 (compute-st-c5n18xlarge-1)
[ip-172-31-10-115:19307] ALIAS: 172.31.4.124
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] RECEIVED TOPOLOGY SIG 2N:2S:0L3:0L2:0L1:36C:36H:0-35::x86_64:le FROM NODE compute-st-c5n18xlarge-1
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] NEW TOPOLOGY - ADDING
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] REQUESTING TOPOLOGY FROM [prterun-ip-172-31-10-115-19307@0,1]
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-19307@0,2]
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-19307@0,2] on node compute-st-c5n18xlarge-2
[ip-172-31-10-115:19307] ALIASES FOR NODE compute-st-c5n18xlarge-2 (compute-st-c5n18xlarge-2)
[ip-172-31-10-115:19307] ALIAS: 172.31.12.27
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] RECEIVED TOPOLOGY SIG 2N:2S:0L3:0L2:0L1:36C:36H:0-35::x86_64:le FROM NODE compute-st-c5n18xlarge-2
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] TOPOLOGY SIGNATURE ALREADY RECORDED
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] NEW TOPOLOGY - ADDING
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:daemon_topology recvd for daemon [prterun-ip-172-31-10-115-19307@0,1]
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted:report_topo launch completed for daemon [prterun-ip-172-31-10-115-19307@0,1]
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch recvd 2 of 5 reported daemons
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-19307@0,4]
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-19307@0,4] on node compute-st-c5n18xlarge-4
[ip-172-31-10-115:19307] ALIASES FOR NODE compute-st-c5n18xlarge-4 (compute-st-c5n18xlarge-4)
[ip-172-31-10-115:19307] ALIAS: 172.31.12.127
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] RECEIVED TOPOLOGY SIG 2N:2S:0L3:0L2:0L1:36C:36H:0-35::x86_64:le FROM NODE compute-st-c5n18xlarge-4
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] TOPOLOGY SIGNATURE ALREADY RECORDED
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch completed for daemon [prterun-ip-172-31-10-115-19307@0,4] at contact (null)
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch job prterun-ip-172-31-10-115-19307@0 recvd 3 of 5 reported daemons
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-19307@0,3]
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch from daemon [prterun-ip-172-31-10-115-19307@0,3] on node compute-st-c5n18xlarge-3
[ip-172-31-10-115:19307] ALIASES FOR NODE compute-st-c5n18xlarge-3 (compute-st-c5n18xlarge-3)
[ip-172-31-10-115:19307] ALIAS: 172.31.13.250
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] RECEIVED TOPOLOGY SIG 2N:2S:0L3:0L2:0L1:36C:36H:0-35::x86_64:le FROM NODE compute-st-c5n18xlarge-3
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] TOPOLOGY SIGNATURE ALREADY RECORDED
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch completed for daemon [prterun-ip-172-31-10-115-19307@0,3] at contact (null)
[ip-172-31-10-115:19307] [prterun-ip-172-31-10-115-19307@0,0] plm:base:orted_report_launch job prterun-ip-172-31-10-115-19307@0 recvd 4 of 5 reported daemons
<hangs here>
Sigh - I wish I could get access to these bloody machines! What is the hostfile?
I can let you access my head node if that makes it easier. What e-mail do you use now? (It doesn't look like you're in the Open MPI Slack anymore.)
rhc at pmix.org
errr...what did I do differently? I just tried this:
mpirun -np 2 --hostfile hostfile --prtemca plm_base_verbose 5 /bin/true
on your machine and it ran perfectly. I set the following in the environ:
$ export PATH=/home/ec2-user/install/bin:$PATH
$ export LD_LIBRARY_PATH=/home/ec2-user/install/lib:$LD_LIBRARY_PATH
$ which mpirun
~/install/bin/mpirun
@rhc54 It doesn't occur on every run, that's why I set the loop. It occurs maybe once every 20 runs.
Okay, I got it - thanks!
Background information
PMIx: 41f4225d6fb806ff218eb229a9a25baf5a97c5fa
PRRTE: 0b580da7c8952a95a39a2cdb5d13b3453fb934ce
Please describe the system on which you are running
Amazon Linux 2 c5n.18xlarge
Details of the problem
Segfaults occur in hwloc_get_cpubind fairly frequently. I run this command:
Sometimes it also throws this error: