That's impossible on 2.1.0rc3 as the code in the cited commit hasn't come across to the 2.x branch, and never will. It could be possible on master, but then we have two failure modes exhibiting the same behavior - possible, but odd.
Here's the backtrace that the user sees on master when using the PBS/Torque cluster:
hn003[1](~) ps x
PID TTY STAT TIME COMMAND
5759 pts/0 S 0:00 -bash
5761 pts/0 S 0:00 /cm/shared/apps/torque/current/sbin/pbs_mom -p -d /cm/local/app
5762 pts/0 S 0:00 pbs_demux
5800 pts/0 SLl+ 0:00 mpirun ./a.out
5844 ? S 0:00 sshd:
5845 pts/1 Ss 0:00 -bash
5881 pts/1 R+ 0:00 ps x
@hn003[1](~) gdb -p 5800
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-75.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 5800
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/bin/orterun...done.
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/libopen-rte.so.0...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/libopen-rte.so.0
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/libopen-pal.so.0...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/libopen-pal.so.0
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /usr/lib64/libnuma.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnuma.so.1
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libutil.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libutil.so.1
Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 5812]
[New LWP 5811]
[New LWP 5810]
[New LWP 5809]
[New LWP 5808]
[New LWP 5807]
[New LWP 5806]
[New LWP 5805]
[New LWP 5804]
[New LWP 5803]
[New LWP 5802]
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_schizo_flux.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_schizo_flux.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_schizo_ompi.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_schizo_ompi.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_schizo_orte.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_schizo_orte.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_shmem_mmap.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_shmem_mmap.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_ess_hnp.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_ess_hnp.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_pstat_linux.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_pstat_linux.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_state_hnp.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_state_hnp.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_errmgr_default_hnp.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_errmgr_default_hnp.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_plm_tm.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_plm_tm.so
Reading symbols from /cm/shared/apps/torque/5.1.0/lib/libtorque.so.2...done.
Loaded symbols for /cm/shared/apps/torque/5.1.0/lib/libtorque.so.2
Reading symbols from /usr/lib64/libxml2.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libxml2.so.2
Reading symbols from /usr/lib64/libcrypto.so.10...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libcrypto.so.10
Reading symbols from /usr/lib64/libssl.so.10...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libssl.so.10
Reading symbols from /usr/lib64/libstdc++.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libstdc++.so.6
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Reading symbols from /lib64/libgssapi_krb5.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgssapi_krb5.so.2
Reading symbols from /lib64/libkrb5.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib64/libkrb5.so.3
Reading symbols from /lib64/libcom_err.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcom_err.so.2
Reading symbols from /lib64/libk5crypto.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib64/libk5crypto.so.3
Reading symbols from /lib64/libkrb5support.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib64/libkrb5support.so.0
Reading symbols from /lib64/libkeyutils.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libkeyutils.so.1
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /lib64/libselinux.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libselinux.so.1
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_routed_binomial.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_routed_binomial.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_routed_debruijn.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_routed_debruijn.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_routed_direct.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_routed_direct.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_routed_radix.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_routed_radix.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_oob_tcp.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_oob_tcp.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_oob_ud.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_oob_ud.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/libmca_common_verbs.so.0...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/libmca_common_verbs.so.0
Reading symbols from /usr/lib64/libosmcomp.so.3...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libosmcomp.so.3
Reading symbols from /usr/lib64/libibverbs.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libibverbs.so.1
Reading symbols from /usr/lib64/libibumad.so.3...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libibumad.so.3
Reading symbols from /lib64/libnl.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnl.so.1
Reading symbols from /usr/lib64/libmlx4-rdmav2.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libmlx4-rdmav2.so
Reading symbols from /usr/lib64/libmlx5-rdmav2.so...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libmlx5-rdmav2.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_rml_oob.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_rml_oob.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_grpcomm_direct.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_grpcomm_direct.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_ras_tm.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_ras_tm.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_rmaps_mindist.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_rmaps_mindist.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_rmaps_ppr.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_rmaps_ppr.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_rmaps_rank_file.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_rmaps_rank_file.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_rmaps_resilient.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_rmaps_resilient.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_rmaps_round_robin.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_rmaps_round_robin.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_rmaps_seq.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_rmaps_seq.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_odls_default.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_odls_default.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_rtc_hwloc.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_rtc_hwloc.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_pmix_pmix2x.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_pmix_pmix2x.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/pmix/mca_ptl_tcp.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/pmix/mca_ptl_tcp.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/pmix/mca_ptl_usock.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/pmix/mca_ptl_usock.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/pmix/mca_psec_native.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/pmix/mca_psec_native.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/pmix/mca_psec_none.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/pmix/mca_psec_none.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_iof_hnp.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_iof_hnp.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_filem_raw.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_filem_raw.so
Reading symbols from /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_dfs_orted.so...(no debugging symbols found)...done.
Loaded symbols for /gpfs/home/arcurtis/opt/ompi-ucx/git/lib/openmpi/mca_dfs_orted.so
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
0x0000003b84adf1b3 in poll () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.149.el6.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-33.el6.x86_64 libcom_err-1.41.12-21.el6.x86_64 libgcc-4.4.7-11.el6.x86_64 libibumad-1.3.9.MLNX20140817.485ffa6-0.1.x86_64 libibverbs-1.1.8mlnx1-OFED.2.4.45.ga305acd.x86_64 libmlx4-1.0.6mlnx1-OFED.2.4.0.1.2.x86_64 libmlx5-1.0.1mlnx2-OFED.2.4.46.g727de14.x86_64 libnl-1.1.4-2.el6.x86_64 libselinux-2.0.94-5.8.el6.x86_64 libstdc++-4.4.7-11.el6.x86_64 libxml2-2.7.6-14.el6_5.2.x86_64 numactl-2.0.9-2.el6.x86_64 opensm-libs-4.3.0.MLNX20141222.713c9d5-0.1.x86_64 openssl-1.0.1e-30.el6.11.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) where
#0 0x0000003b84adf1b3 in poll () from /lib64/libc.so.6
#1 0x00002aaaaadd173a in poll_dispatch (base=0x651450, tv=0x0)
at ../../../../../../openmpi-git/opal/mca/event/libevent2022/libevent/poll.c:165
#2 0x00002aaaaadcb6f6 in opal_libevent2022_event_base_loop (base=0x651450, flags=1)
at ../../../../../../openmpi-git/opal/mca/event/libevent2022/libevent/event.c:1630
#3 0x0000000000401259 in orterun (argc=2, argv=0x7fffffffd858)
at ../../../../openmpi-git/orte/tools/orterun/orterun.c:197
#4 0x0000000000400e14 in main (argc=2, argv=0x7fffffffd858)
at ../../../../openmpi-git/orte/tools/orterun/main.c:13
I'm afraid that doesn't tell us anything - it just shows that mpirun is idling while it waits for something to happen (likely waiting for a daemon to finish). What we would need is the output from adding "-mca state_base_verbose 5 -mca plm_base_verbose 5" so we can see what is happening on the backend nodes.
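(For illustration only: that poll() frame is just libevent's dispatch loop parked while it waits for activity. A minimal standalone sketch against the stock libevent 2.x API - not ORTE's embedded copy - that idles in the same frame:)

```c
#include <event2/event.h>
#include <signal.h>

/* Fires when SIGINT arrives; breaks out of the dispatch loop. */
static void on_sigint(evutil_socket_t sig, short what, void *arg)
{
    (void)sig; (void)what;
    event_base_loopbreak((struct event_base *)arg);
}

int main(void)
{
    struct event_base *base = event_base_new();

    /* Register one persistent signal event so the loop has something
     * to wait for, then dispatch: the process now sits in poll()
     * inside event_base_loop(), the same idle frame seen in the
     * mpirun backtrace above. */
    struct event *ev = evsignal_new(base, SIGINT, on_sigint, base);
    event_add(ev, NULL);
    event_base_loop(base, 0);

    event_free(ev);
    event_base_free(base);
    return 0;
}
```

Attaching gdb to a process like this shows the identical poll -> event_base_loop frames, which is why the backtrace alone can't say what mpirun is waiting for.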
**arcurtis@hn003[1](~/mpi-test) mpirun -mca state_base_verbose 5 -mca plm_base_verbose 5 ./a.out**
[hn003:06829] [[35367,0],0] ACTIVATE JOB [INVALID] STATE PENDING INIT AT ../../../../../openmpi-git/orte/mca/plm/tm/plm_tm_module.c:155
[hn003:06829] [[35367,0],0] ACTIVATE JOB [35367,1] STATE INIT_COMPLETE AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:348
[hn003:06829] [[35367,0],0] ACTIVATE JOB [35367,1] STATE PENDING ALLOCATION AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:359
[hn003:06829] [[35367,0],0] ACTIVATE JOB [35367,1] STATE ALLOCATION COMPLETE AT ../../../../openmpi-git/orte/mca/ras/base/ras_base_allocate.c:444
[hn003:06829] [[35367,0],0] ACTIVATE JOB [35367,1] STATE PENDING DAEMON LAUNCH AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:185
[hn003:06829] [[35367,0],0] ACTIVATE JOB [35367,1] STATE ALL DAEMONS REPORTED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:1212
[hn003:06829] [[35367,0],0] ACTIVATE JOB [35367,1] STATE VM READY AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:173
[hn003:06829] [[35367,0],0] ACTIVATE JOB [35367,1] STATE PENDING FINAL SYSTEM PREP AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:210
[hn003:06829] [[35367,0],0] complete_setup on job [35367,1]
[hn003:06829] [[35367,0],0] ACTIVATE JOB [35367,1] STATE PENDING APP LAUNCH AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:454
[hn003:06829] [[35367,0],0] ACTIVATE JOB [35367,1] STATE LOCAL LAUNCH COMPLETE AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:1126
[hn003:06829] [[35367,0],0] ACTIVATE PROC [[35367,1],0] STATE RUNNING AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:764
[hn003:06829] [[35367,0],0] ACTIVATE PROC [[35367,1],1] STATE RUNNING AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:764
[hn003:06829] [[35367,0],0] state:base:track_procs called for proc [[35367,1],0] state RUNNING
[hn003:06829] [[35367,0],0] state:base:track_procs called for proc [[35367,1],1] state RUNNING
[hn003:06829] [[35367,0],0] ACTIVATE PROC [[35367,1],0] STATE SYNC REGISTERED AT ../../openmpi-git/orte/orted/pmix/pmix_server_gen.c:82
[hn003:06829] [[35367,0],0] state:base:track_procs called for proc [[35367,1],0] state SYNC REGISTERED
[hn003:06829] [[35367,0],0] ACTIVATE PROC [[35367,1],1] STATE SYNC REGISTERED AT ../../openmpi-git/orte/orted/pmix/pmix_server_gen.c:82
[hn003:06829] [[35367,0],0] state:base:track_procs called for proc [[35367,1],1] state SYNC REGISTERED
[hn004:31748] [[35367,0],1] ACTIVATE JOB [35367,1] STATE LOCAL LAUNCH COMPLETE AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:1126
[hn004:31748] [[35367,0],1] ACTIVATE PROC [[35367,1],0] STATE RUNNING AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:764
[hn004:31748] [[35367,0],1] ACTIVATE PROC [[35367,1],1] STATE RUNNING AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:764
[hn003:06829] [[35367,0],0] plm:base:receive update proc state command from [[35367,0],1]
[hn003:06829] [[35367,0],0] plm:base:receive got update_proc_state for job [35367,1]
[hn003:06829] [[35367,0],0] ACTIVATE PROC [[35367,1],0] STATE RUNNING AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:351
[hn003:06829] [[35367,0],0] ACTIVATE PROC [[35367,1],1] STATE RUNNING AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:351
[hn003:06829] [[35367,0],0] state:base:track_procs called for proc [[35367,1],0] state RUNNING
[hn003:06829] [[35367,0],0] state:base:track_procs called for proc [[35367,1],1] state RUNNING
[hn003:06829] [[35367,0],0] ACTIVATE JOB [35367,1] STATE RUNNING AT ../../../../openmpi-git/orte/mca/state/base/state_base_fns.c:618
[hn004:31748] [[35367,0],1] ACTIVATE PROC [[35367,1],0] STATE SYNC REGISTERED AT ../../openmpi-git/orte/orted/pmix/pmix_server_gen.c:82
[hn004:31748] [[35367,0],1] ACTIVATE PROC [[35367,1],1] STATE SYNC REGISTERED AT ../../openmpi-git/orte/orted/pmix/pmix[hn003:06829] [[35367,0],0] ACTIVATE PROC [[35367,1],0] STATE SYNC REGISTERED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:390
_serve[hn003:06829] [[35367,0],0] ACTIVATE PROC [[35367,1],1] STATE SYNC REGISTERED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:390
r_gen.c:82[hn003:06829] [[35367,0],0] state:base:track_procs called for proc [[35367,1],0] state SYNC REGISTERED
[hn003:06829] [[35367,0],0] state:base:track_procs called for proc [[35367,1],1] state SYNC REGISTERED
[hn003:06829] [[35367,0],0] ACTIVATE JOB [35367,1] STATE SYNC REGISTERED AT ../../../../openmpi-git/orte/mca/state/base/state_base_fns.c:628
[hn003:06829] [[35367,0],0] ACTIVATE JOB [35367,1] STATE READY FOR DEBUGGERS AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:797
Hello from 0 of 4 on "hn003"
Hello from 1 of 4 on "hn003"
Hello from 0 of 2 on "hn004"
Hello from 1 of 2 on "hn004"
[hn003:06829] [[35367,0],0] ACTIVATE PROC [[35367,1],1] STATE IOF COMPLETE AT ../../../../../openmpi-git/orte/mca/iof/hnp/iof_hnp_read.c:265
[hn003:06829] [[35367,0],0] ACTIVATE PROC [[35367,1],0] STATE IOF COMPLETE AT ../../../../../openmpi-git/orte/mca/iof/hnp/iof_hnp_read.c:265
[hn003:06829] [[35367,0],0] state:base:track_procs called for proc [[35367,1],1] state IOF COMPLETE
[hn004:31748] [[35367,0],1] ACTIVATE PROC [[35367,1],0] STATE IOF COMPLETE AT ../../../../../openmpi-git/orte/mca/iof/orted/iof_orted_read.c:170
[hn004:31748] [[35367,0],1] ACTIVATE PROC [[35367,1],1] STATE IOF COMPLETE AT ../../../../.[hn003:06829] [[35367,0],0] state:base:track_procs called for proc [[35367,1],0] state IOF COMPLETE
./openmpi-git/orte/mca/iof/orted/iof_orted_read.c:170
[hn004:31748] [[35367,0],1] ACTIVATE PROC [[35367,1],0] STATE WAITPID FIRED AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:1394
[hn004:31748] [[35367,0],1] ACTIVATE PROC [[35367,1],1] STATE WAITPID FIRED AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:1394
[hn004:31748] [[35367,0],1] ACTIVATE PROC [[35367,1],0] STATE NORMALLY TERMINATED AT ../../../../../openmpi-git/orte/mca/state/orted/state_orted.c:355
[hn004:31748] [[35367,0],1] ACTIVATE PROC [[35367,1],1] STATE NORMALLY TERMINATED AT ../../../../../openmpi-git/orte/mca/state/orted/state_orted.c:355
[hn003:06829] [[35367,0],0] plm:base:receive update proc state command from [[35367,0],1]
[hn003:06829] [[35367,0],0] plm:base:receive got update_proc_state for job [35367,1]
[hn003:06829] [[35367,0],0] ACTIVATE PROC [[35367,1],0] STATE NORMALLY TERMINATED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:351
[hn003:06829] [[35367,0],0] ACTIVATE PROC [[35367,1],1] STATE NORMALLY TERMINATED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:351
[hn003:06829] [[35367,0],0] state:base:track_procs called for proc [[35367,1],0] state NORMALLY TERMINATED
[hn003:06829] [[35367,0],0] state:base:track_procs called for proc [[35367,1],1] state NORMALLY TERMINATED
... and hangs here...
The user notices that on the other node the app process has exited and only the orted process remains. The user double-checked 2.1.0rc3 and does not see the problem there.
@tonycurtis would you mind running with the mpirun options from the comment above:
--mca state_base_verbose 5 -mca plm_base_verbose 5
Sorry to be a pain - but could you configure OMPI with --enable-debug? We aren't seeing some of the debug output. It looks like both daemons thought they should be launching ranks 0 and 1 (note the "Hello from 0 of 2" / "Hello from 1 of 2" lines from hn004 above), and so nobody actually launched ranks 2 and 3.
arcurtis@hn003[1](~/mpi-test) mpirun --mca state_base_verbose 5 -mca plm_base_verbose 5 ./a.out
[hn003:08073] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[hn003:08073] plm:base:set_hnp_name: initial bias 8073 nodename hash 1075630230
[hn003:08073] plm:base:set_hnp_name: final jobfam 36611
[hn003:08073] [[36611,0],0] plm:base:receive start comm
[hn003:08073] [[36611,0],0] ACTIVATE JOB [INVALID] STATE PENDING INIT AT ../../../../../openmpi-git/orte/mca/plm/tm/plm_tm_module.c:155
[hn003:08073] [[36611,0],0] ACTIVATING JOB [INVALID] STATE PENDING INIT PRI 4
[hn003:08073] [[36611,0],0] plm:base:setup_job
[hn003:08073] [[36611,0],0] ACTIVATE JOB [36611,1] STATE INIT_COMPLETE AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:348
[hn003:08073] [[36611,0],0] ACTIVATING JOB [36611,1] STATE INIT_COMPLETE PRI 4
[hn003:08073] [[36611,0],0] ACTIVATE JOB [36611,1] STATE PENDING ALLOCATION AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:359
[hn003:08073] [[36611,0],0] ACTIVATING JOB [36611,1] STATE PENDING ALLOCATION PRI 4
[hn003:08073] [[36611,0],0] ACTIVATE JOB [36611,1] STATE ALLOCATION COMPLETE AT ../../../../openmpi-git/orte/mca/ras/base/ras_base_allocate.c:444
[hn003:08073] [[36611,0],0] ACTIVATING JOB [36611,1] STATE ALLOCATION COMPLETE PRI 4
[hn003:08073] [[36611,0],0] ACTIVATE JOB [36611,1] STATE PENDING DAEMON LAUNCH AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:185
[hn003:08073] [[36611,0],0] ACTIVATING JOB [36611,1] STATE PENDING DAEMON LAUNCH PRI 4
[hn003:08073] [[36611,0],0] plm:base:setup_vm
[hn003:08073] [[36611,0],0] plm:base:setup_vm creating map
[hn003:08073] [[36611,0],0] plm:base:setup_vm add new daemon [[36611,0],1]
[hn003:08073] [[36611,0],0] plm:base:setup_vm assigning new daemon [[36611,0],1] to node hn004
[hn003:08073] [[36611,0],0] plm:tm: launching vm
[hn003:08073] [[36611,0],0] plm:tm: final top-level argv:
orted -mca ess tm -mca ess_base_jobid 2399338496 -mca ess_base_vpid <template> -mca ess_base_num_procs 2 -mca orte_hnp_uri 2399338496.0;tcp://10.10.0.203,10.10.4.203:57772;ud://12930.120.1 -mca orte_node_regex hn004 --mca state_base_verbose 5 -mca plm_base_verbose 5
[hn003:08073] [[36611,0],0] plm:tm: launching on node hn004
[hn003:08073] [[36611,0],0] plm:tm: executing:
orted -mca ess tm -mca ess_base_jobid 2399338496 -mca ess_base_vpid 1 -mca ess_base_num_procs 2 -mca orte_hnp_uri 2399338496.0;tcp://10.10.0.203,10.10.4.203:57772;ud://12930.120.1 -mca orte_node_regex hn004 --mca state_base_verbose 5 -mca plm_base_verbose 5
[hn003:08073] [[36611,0],0] plm:tm:launch: finished spawning orteds
[hn004:00432] [[36611,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
[hn004:00432] [[36611,0],1] plm:rsh_setup on agent ssh : rsh path NULL
[hn004:00432] [[36611,0],1] plm:base:receive start comm
[hn003:08073] [[36611,0],0] plm:base:orted_report_launch from daemon [[36611,0],1]
[hn003:08073] [[36611,0],0] plm:base:orted_report_launch from daemon [[36611,0],1] on node hn004
[hn003:08073] [[36611,0],0] RECEIVED TOPOLOGY SIG 2N:2S:2L3:16L2:16L1:16C:16H:x86_64 FROM NODE hn004
[hn003:08073] [[36611,0],0] TOPOLOGY ALREADY RECORDED
[hn003:08073] [[36611,0],0] plm:base:orted_report_launch completed for daemon [[36611,0],1] at contact 2399338496.1;tcp://10.10.0.204,10.10.4.204:36810;ud://7974.118.1
[hn003:08073] [[36611,0],0] plm:base:orted_report_launch recvd 2 of 2 reported daemons
[hn003:08073] [[36611,0],0] ACTIVATE JOB [36611,1] STATE ALL DAEMONS REPORTED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:1212
[hn003:08073] [[36611,0],0] ACTIVATING JOB [36611,1] STATE ALL DAEMONS REPORTED PRI 4
[hn003:08073] [[36611,0],0] ACTIVATE JOB [36611,1] STATE VM READY AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:173
[hn003:08073] [[36611,0],0] ACTIVATING JOB [36611,1] STATE VM READY PRI 4
[hn003:08073] [[36611,0],0] ACTIVATE JOB [36611,1] STATE PENDING FINAL SYSTEM PREP AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:210
[hn003:08073] [[36611,0],0] ACTIVATING JOB [36611,1] STATE PENDING FINAL SYSTEM PREP PRI 4
[hn003:08073] [[36611,0],0] complete_setup on job [36611,1]
[hn003:08073] [[36611,0],0] ACTIVATE JOB [36611,1] STATE PENDING APP LAUNCH AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:454
[hn003:08073] [[36611,0],0] ACTIVATING JOB [36611,1] STATE PENDING APP LAUNCH PRI 4
[hn003:08073] [[36611,0],0] plm:base:launch_apps for job [36611,1]
[hn004:00432] [[36611,0],1] ACTIVATE JOB [36611,1] STATE LOCAL LAUNCH COMPLETE AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:1126
[hn004:00432] [[36611,0],1] ACTIVATING JOB [36611,1] STATE LOCAL LAUNCH COMPLETE PRI 4
[hn003:08073] [[36611,0],0] ACTIVATE JOB [36611,1] STATE LOCAL LAUNCH COMPLETE AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:1126
[hn003:08073] [[36611,0],0] ACTIVATING JOB [36611,1] STATE LOCAL LAUNCH COMPLETE PRI 4
[hn004:00432] [[36611,0],1] ACTIVATE PROC [[36611,1],0] STATE RUNNING AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:764
[hn004:00432] [[36611,0],1] ACTIVATING PROC [[36611,1],0] STATE RUNNING PRI 4
[hn003:08073] [[36611,0],0] ACTIVATE PROC [[36611,1],0] STATE RUNNING AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:764
[hn003:08073] [[36611,0],0] ACTIVATING PROC [[36611,1],0] STATE RUNNING PRI 4
[hn004:00432] [[36611,0],1] ACTIVATE PROC [[36611,1],1] STATE RUNNING AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:764
[hn004:00432] [[36611,0],1] ACTIVATING PROC [[36611,1],1] STATE RUNNING PRI 4
[hn004:00432] [[36611,0],1] state:orted:track_jobs sending local launch complete for job [36611,1]
[hn004:00432] [[36611,0],1] state:orted:track_procs called for proc [[36611,1],0] state RUNNING
[hn004:00432] [[36611,0],1] state:orted:track_procs called for proc [[36611,1],1] state RUNNING
[hn003:08073] [[36611,0],0] ACTIVATE PROC [[36611,1],1] STATE RUNNING AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:764
[hn003:08073] [[36611,0],0] ACTIVATING PROC [[36611,1],1] STATE RUNNING PRI 4
[hn003:08073] [[36611,0],0] plm:base:receive processing msg
[hn003:08073] [[36611,0],0] plm:base:receive update proc state command from [[36611,0],1]
[hn003:08073] [[36611,0],0] plm:base:receive got update_proc_state for job [36611,1]
[hn003:08073] [[36611,0],0] plm:base:receive got update_proc_state for vpid 0 state RUNNING exit_code 0
[hn003:08073] [[36611,0],0] ACTIVATE PROC [[36611,1],0] STATE RUNNING AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:351
[hn003:08073] [[36611,0],0] ACTIVATING PROC [[36611,1],0] STATE RUNNING PRI 4
[hn003:08073] [[36611,0],0] plm:base:receive got update_proc_state for vpid 1 state RUNNING exit_code 0
[hn003:08073] [[36611,0],0] ACTIVATE PROC [[36611,1],1] STATE RUNNING AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:351
[hn003:08073] [[36611,0],0] ACTIVATING PROC [[36611,1],1] STATE RUNNING PRI 4
[hn003:08073] [[36611,0],0] plm:base:receive done processing commands
[hn003:08073] [[36611,0],0] state:base:track_procs called for proc [[36611,1],0] state RUNNING
[hn003:08073] [[36611,0],0] state:base:track_procs called for proc [[36611,1],1] state RUNNING
[hn003:08073] [[36611,0],0] state:base:track_procs called for proc [[36611,1],0] state RUNNING
[hn003:08073] [[36611,0],0] state:base:track_procs called for proc [[36611,1],1] state RUNNING
[hn003:08073] [[36611,0],0] ACTIVATE JOB [36611,1] STATE RUNNING AT ../../../../openmpi-git/orte/mca/state/base/state_base_fns.c:618
[hn003:08073] [[36611,0],0] ACTIVATING JOB [36611,1] STATE RUNNING PRI 4
[hn003:08073] [[36611,0],0] plm:base:launch wiring up iof for job [36611,1]
[hn003:08073] [[36611,0],0] plm:base:launch job [36611,1] is not a dynamic spawn
[hn004:00432] [[36611,0],1] ACTIVATE PROC [[36611,1],0] STATE SYNC REGISTERED AT ../../openmpi-git/orte/orted/pmix/pmix_server_gen.c:82
[hn004:00432] [[36611,0],1] ACTIVATING PROC [[36611,1],0] STATE SYNC REGISTERED PRI 4
[hn004:00432] [[36611,0],1] state:orted:track_procs called for proc [[36611,1],0] state SYNC REGISTERED
[hn004:00432] [[36611,0],1] ACTIVATE PROC [[36611,1],1] STATE SYNC REGISTERED AT ../../openmpi-git/orte/orted/pmix/pmix_server_gen.c:82
[hn004:00432] [[36611,0],1] ACTIVATING PROC [[36611,1],1] STATE SYNC REGISTERED PRI 4
[hn004:00432] [[36611,0],1[hn003:08073] [[36611,0],0] plm:base:receive processing msg
] stat[hn003:08073] [[36611,0],0] ACTIVATE PROC [[36611,1],0] STATE SYNC REGISTERED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:390
e:or[hn003:08073] [[36611,0],0] ACTIVATING PROC [[36611,1],0] STATE SYNC REGISTERED PRI 4
ted:tra[hn003:08073] [[36611,0],0] ACTIVATE PROC [[36611,1],1] STATE SYNC REGISTERED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:390
ck_[hn003:08073] [[36611,0],0] ACTIVATING PROC [[36611,1],1] STATE SYNC REGISTERED PRI 4
pro[hn003:08073] [[36611,0],0] plm:base:receive done processing commands
cs called fo[hn003:08073] [[36611,0],0] state:base:track_procs called for proc [[36611,1],0] state SYNC REGISTERED
r pr[hn003:08073] [[36611,0],0] state:base:track_procs called for proc [[36611,1],1] state SYNC REGISTERED
oc [[36611,1],1] state SYNC REGISTERED
[hn004:00432] [[36611,0],1] state:orted: notifying HNP all local registered
[hn003:08073] [[36611,0],0] ACTIVATE PROC [[36611,1],0] STATE SYNC REGISTERED AT ../../openmpi-git/orte/orted/pmix/pmix_server_gen.c:82
[hn003:08073] [[36611,0],0] ACTIVATING PROC [[36611,1],0] STATE SYNC REGISTERED PRI 4
[hn003:08073] [[36611,0],0] state:base:track_procs called for proc [[36611,1],0] state SYNC REGISTERED
[hn003:08073] [[36611,0],0] ACTIVATE PROC [[36611,1],1] STATE SYNC REGISTERED AT ../../openmpi-git/orte/orted/pmix/pmix_server_gen.c:82
[hn003:08073] [[36611,0],0] ACTIVATING PROC [[36611,1],1] STATE SYNC REGISTERED PRI 4
[hn003:08073] [[36611,0],0] state:base:track_procs called for proc [[36611,1],1] state SYNC REGISTERED
[hn003:08073] [[36611,0],0] ACTIVATE JOB [36611,1] STATE SYNC REGISTERED AT ../../../../openmpi-git/orte/mca/state/base/state_base_fns.c:628
[hn003:08073] [[36611,0],0] ACTIVATING JOB [36611,1] STATE SYNC REGISTERED PRI 4
[hn003:08073] [[36611,0],0] plm:base:launch [36611,1] registered
[hn003:08073] [[36611,0],0] plm:base:launch job [36611,1] is not a dynamic spawn
[hn003:08073] [[36611,0],0] ACTIVATE JOB [36611,1] STATE READY FOR DEBUGGERS AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:797
[hn003:08073] [[36611,0],0] ACTIVATING JOB [36611,1] STATE READY FOR DEBUGGERS PRI 4
Hello from 0 of 4 on "hn003"
Hello from 1 of 4 on "hn003"
Hello from 0 of 2 on "hn004"
Hello from 1 of 2 on "hn004"
[hn004:00432] [[36611,0],1] ACTIVATE PROC [[36611,1],0] STATE IOF COMPLETE AT ../../../../../openmpi-git/orte/mca/iof/orted/iof_orted_read.c:170
[hn004:00432] [[36611,0],1] ACTIVATING PROC [[36611,1],0] STATE IOF COMPLETE PRI 4
[hn004:00432] [[36611,0],1] ACTIVATE PROC [[36611,1],1] STATE IOF COMPLETE AT ../../../../../openmpi-git/orte/mca/iof/orted/iof_orted_read.c:170
[hn004:00432] [[36611,0],1] ACTIVATING PROC [[36611,1],1] STATE IOF COMPLETE PRI 4
[hn004:00432] [[36611,0],1] state:orted:track_procs called for proc [[36611,1],0] state IOF COMPLETE
[hn004:00432] [[36611,0],1] ACTIVATE PROC [[36611,1],0] STATE WAITPID FIRED AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:1394
[hn004:00432] [[36611,0],1] ACTIVATING PROC [[36611,1],0] STATE WAITPID FIRED PRI 4
[hn004:00432] [[36611,0],1] ACTIVATE PROC [[36611,1],1] STATE WAITPID FIRED AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:1394
[hn004:00432] [[36611,0],1] ACTIVATING PROC [[36611,1],1] STATE WAITPID FIRED PRI 4
[hn004:00432] [[36611,0],1] state:orted:track_procs called for proc [[36611,1],1] state IOF COMPLETE
[hn004:00432] [[36611,0],1] state:orted:track_procs called for proc [[36611,1],0] state WAITPID FIRED
[hn004:00432] [[36611,0],1] ACTIVATE PROC [[36611,1],0] STATE NORMALLY TERMINATED AT ../../../../../openmpi-git/orte/mca/state/orted/state_orted.c:355
[hn004:00432] [[36611,0],1] ACTIVATING PROC [[36611,1],0] STATE NORMALLY TERMINATED PRI 4
[hn004:00432] [[36611,0],1] state:orted:track_procs called for proc [[36611,1],1] state WAITPID FIRED
[hn004:00432] [[36611,0],1] ACTIVATE PROC [[36611,1],1] STATE NORMALLY TERMINATED AT ../../../../../openmpi-git/orte/mca/state/orted/state_orted.c:355
[hn004:00432] [[36611,0],1] ACTIVATING PROC [[36611,1],1] STATE NORMALLY TERMINATED PRI 4
[hn004:00432] [[36611,0],1] state:orted:track_procs called for proc [[36611,1],0] state NORMALLY TERMINATED
[hn004:00432] [[36611,0],1] state:orted:track_procs called for proc [[36611,1],1] state NORMALLY TERMINATED
[hn004:00432] [[36611,0],1] state:orted: SENDING JOB LOCAL TERMINATION UPDATE FOR JOB [36611,1]
[hn003:08073] [[36611,0],0] ACTIVATE PROC [[36611,1],0] STATE IOF COMPLETE AT ../../../../../openmpi-git/orte/mca/iof/hnp/iof_hnp_read.c:265
[hn003:08073] [[36611,0],0] ACTIVATING PROC [[36611,1],0] STATE IOF COMPLETE PRI 4
[hn003:08073] [[36611,0],0] ACTIVATE PROC [[36611,1],1] STATE IOF COMPLETE AT ../../../../../openmpi-git/orte/mca/iof/hnp/iof_hnp_read.c:265
[hn003:08073] [[36611,0],0] ACTIVATING PROC [[36611,1],1] STATE IOF COMPLETE PRI 4
[hn003:08073] [[36611,0],0] state:base:track_procs called for proc [[36611,1],0] state IOF COMPLETE
[hn003:08073] [[36611,0],0] state:base:track_procs called for proc [[36611,1],1] state IOF COMPLETE
[hn004:00432] [[36611,0],1] state:orted releasing procs from node hn004
[hn004:00432] [[36611,0],1] state:orted releasing proc [[36611,1],0] from node hn004
[hn004:00432] [[36611,0],1] state:orted releasing proc [[36611,1],1] from node hn004
[hn003:08073] [[36611,0],0] plm:base:receive processing msg
[hn003:08073] [[36611,0],0] plm:base:receive update proc state command from [[36611,0],1]
[hn003:08073] [[36611,0],0] plm:base:receive got update_proc_state for job [36611,1]
[hn003:08073] [[36611,0],0] plm:base:receive got update_proc_state for vpid 0 state NORMALLY TERMINATED exit_code 0
[hn003:08073] [[36611,0],0] ACTIVATE PROC [[36611,1],0] STATE NORMALLY TERMINATED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:351
[hn003:08073] [[36611,0],0] ACTIVATING PROC [[36611,1],0] STATE NORMALLY TERMINATED PRI 4
[hn003:08073] [[36611,0],0] plm:base:receive got update_proc_state for vpid 1 state NORMALLY TERMINATED exit_code 0
[hn003:08073] [[36611,0],0] ACTIVATE PROC [[36611,1],1] STATE NORMALLY TERMINATED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:351
[hn003:08073] [[36611,0],0] ACTIVATING PROC [[36611,1],1] STATE NORMALLY TERMINATED PRI 4
[hn003:08073] [[36611,0],0] plm:base:receive done processing commands
[hn003:08073] [[36611,0],0] state:base:track_procs called for proc [[36611,1],0] state NORMALLY TERMINATED
[hn003:08073] [[36611,0],0] state:base:cleanup_node on proc [[36611,1],0]
[hn003:08073] [[36611,0],0] state:base:track_procs called for proc [[36611,1],1] state NORMALLY TERMINATED
[hn003:08073] [[36611,0],0] state:base:cleanup_node on proc [[36611,1],1]
... and it hangs, with the remaining processes as shown below:
arcurtis@lired[1](~) ssh hn003 pstree -Aa $USER
bash
|-mpirun --mca state_base_verbose 5 -mca plm_base_verbose 5 ./a.out
| |-{mpirun}
| |-{mpirun}
| |-{mpirun}
| |-{mpirun}
| |-{mpirun}
| |-{mpirun}
| |-{mpirun}
| |-{mpirun}
| |-{mpirun}
| |-{mpirun}
| `-{mpirun}
|-pbs_demux
`-pbs_mom -p -d /cm/local/apps/torque/var/spool
sshd
`-pstree -Aa arcurtis
arcurtis@lired[1](~) ssh hn004 pstree -Aa $USER
orted -mca ess tm -mca ess_base_jobid 2955149312 -mca ess_base_vpid 1 -mca ess_base_num_procs 2 -mca orte_hnp_uri 2955149312.0;tcp:/
|-{orted}
|-{orted}
|-{orted}
|-{orted}
|-{orted}
|-{orted}
|-{orted}
|-{orted}
|-{orted}
`-{orted}
sshd
`-pstree -Aa arcurtis
Okay, I see the problem - this PR should fix it: https://github.com/open-mpi/ompi/pull/3152
@tonycurtis Could you please update from master and verify this problem is fixed?
Same problem.
arcurtis@hn003[1](~/mpi-test) ompi_info | head -20
Package: Open MPI arcurtis@lired Distribution
Open MPI: 3.0.0a1
Open MPI repo revision: v2.x-dev-3866-ge4a35f2
Open MPI release date: Unreleased developer copy
Open RTE: 3.0.0a1
Open RTE repo revision: v2.x-dev-3866-ge4a35f2
Open RTE release date: Unreleased developer copy
OPAL: 3.0.0a1
OPAL repo revision: v2.x-dev-3866-ge4a35f2
OPAL release date: Unreleased developer copy
MPI API: 3.1.0
Ident string: 3.0.0a1
Prefix: /gpfs/home/arcurtis/opt/ompi-ucx/git
Configured architecture: x86_64-unknown-linux-gnu
Configure host: lired
Configured by: arcurtis
Configured on: Mon Mar 13 17:30:20 EDT 2017
Trace:
arcurtis@hn003[1](~/mpi-test) mpirun --mca state_base_verbose 5 --mca plm_base_verbose 5 ./a.out
[hn003:14316] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[hn003:14316] plm:base:set_hnp_name: initial bias 14316 nodename hash 1075630230
[hn003:14316] plm:base:set_hnp_name: final jobfam 42854
[hn003:14316] [[42854,0],0] plm:base:receive start comm
[hn003:14316] [[42854,0],0] ACTIVATE JOB [INVALID] STATE PENDING INIT AT ../../../../../openmpi-git/orte/mca/plm/tm/plm_tm_module.c:155
[hn003:14316] [[42854,0],0] ACTIVATING JOB [INVALID] STATE PENDING INIT PRI 4
[hn003:14316] [[42854,0],0] plm:base:setup_job
[hn003:14316] [[42854,0],0] ACTIVATE JOB [42854,1] STATE INIT_COMPLETE AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:348
[hn003:14316] [[42854,0],0] ACTIVATING JOB [42854,1] STATE INIT_COMPLETE PRI 4
[hn003:14316] [[42854,0],0] ACTIVATE JOB [42854,1] STATE PENDING ALLOCATION AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:359
[hn003:14316] [[42854,0],0] ACTIVATING JOB [42854,1] STATE PENDING ALLOCATION PRI 4
[hn003:14316] [[42854,0],0] ACTIVATE JOB [42854,1] STATE ALLOCATION COMPLETE AT ../../../../openmpi-git/orte/mca/ras/base/ras_base_allocate.c:444
[hn003:14316] [[42854,0],0] ACTIVATING JOB [42854,1] STATE ALLOCATION COMPLETE PRI 4
[hn003:14316] [[42854,0],0] ACTIVATE JOB [42854,1] STATE PENDING DAEMON LAUNCH AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:185
[hn003:14316] [[42854,0],0] ACTIVATING JOB [42854,1] STATE PENDING DAEMON LAUNCH PRI 4
[hn003:14316] [[42854,0],0] plm:base:setup_vm
[hn003:14316] [[42854,0],0] plm:base:setup_vm creating map
[hn003:14316] [[42854,0],0] plm:base:setup_vm add new daemon [[42854,0],1]
[hn003:14316] [[42854,0],0] plm:base:setup_vm assigning new daemon [[42854,0],1] to node hn004
[hn003:14316] [[42854,0],0] plm:tm: launching vm
[hn003:14316] [[42854,0],0] plm:tm: final top-level argv:
orted -mca ess tm -mca ess_base_jobid 2808479744 -mca ess_base_vpid <template> -mca ess_base_num_procs 2 -mca orte_hnp_uri 2808479744.0;tcp://10.10.0.203,10.10.4.203:45869;ud://13573.120.1 -mca orte_node_regex hn[3:3-4] --mca state_base_verbose 5 --mca plm_base_verbose 5
[hn003:14316] [[42854,0],0] plm:tm: launching on node hn004
[hn003:14316] [[42854,0],0] plm:tm: executing:
orted -mca ess tm -mca ess_base_jobid 2808479744 -mca ess_base_vpid 1 -mca ess_base_num_procs 2 -mca orte_hnp_uri 2808479744.0;tcp://10.10.0.203,10.10.4.203:45869;ud://13573.120.1 -mca orte_node_regex hn[3:3-4] --mca state_base_verbose 5 --mca plm_base_verbose 5
[hn003:14316] [[42854,0],0] plm:tm:launch: finished spawning orteds
[hn004:06288] [[42854,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
[hn004:06288] [[42854,0],1] plm:rsh_setup on agent ssh : rsh path NULL
[hn004:06288] [[42854,0],1] plm:base:receive start comm
[hn003:14316] [[42854,0],0] plm:base:orted_report_launch from daemon [[42854,0],1]
[hn003:14316] [[42854,0],0] plm:base:orted_report_launch from daemon [[42854,0],1] on node hn004
[hn003:14316] [[42854,0],0] RECEIVED TOPOLOGY SIG 2N:2S:2L3:16L2:16L1:16C:16H:x86_64 FROM NODE hn004
[hn003:14316] [[42854,0],0] TOPOLOGY ALREADY RECORDED
[hn003:14316] [[42854,0],0] plm:base:orted_report_launch completed for daemon [[42854,0],1] at contact 2808479744.1;tcp://10.10.0.204,10.10.4.204:51831;ud://8515.118.1
[hn003:14316] [[42854,0],0] plm:base:orted_report_launch recvd 2 of 2 reported daemons
[hn003:14316] [[42854,0],0] ACTIVATE JOB [42854,1] STATE ALL DAEMONS REPORTED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:1212
[hn003:14316] [[42854,0],0] ACTIVATING JOB [42854,1] STATE ALL DAEMONS REPORTED PRI 4
[hn003:14316] [[42854,0],0] ACTIVATE JOB [42854,1] STATE VM READY AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:173
[hn003:14316] [[42854,0],0] ACTIVATING JOB [42854,1] STATE VM READY PRI 4
[hn003:14316] [[42854,0],0] ACTIVATE JOB [42854,1] STATE PENDING FINAL SYSTEM PREP AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:210
[hn003:14316] [[42854,0],0] ACTIVATING JOB [42854,1] STATE PENDING FINAL SYSTEM PREP PRI 4
[hn003:14316] [[42854,0],0] complete_setup on job [42854,1]
[hn003:14316] [[42854,0],0] ACTIVATE JOB [42854,1] STATE PENDING APP LAUNCH AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:454
[hn003:14316] [[42854,0],0] ACTIVATING JOB [42854,1] STATE PENDING APP LAUNCH PRI 4
[hn003:14316] [[42854,0],0] plm:base:launch_apps for job [42854,1]
[hn004:06288] [[42854,0],1] ACTIVATE JOB [42854,1] STATE LOCAL LAUNCH COMPLETE AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:1136
[hn004:06288] [[42854,0],1] ACTIVATING JOB [42854,1] STATE LOCAL LAUNCH COMPLETE PRI 4
[hn003:14316] [[42854,0],0] ACTIVATE JOB [42854,1] STATE LOCAL LAUNCH COMPLETE AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:1136
[hn003:14316] [[42854,0],0] ACTIVATING JOB [42854,1] STATE LOCAL LAUNCH COMPLETE PRI 4
[hn004:06288] [[42854,0],1] ACTIVATE PROC [[42854,1],0] STATE RUNNING AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:771
[hn004:06288] [[42854,0],1] ACTIVATING PROC [[42854,1],0] STATE RUNNING PRI 4
[hn003:14316] [[42854,0],0] ACTIVATE PROC [[42854,1],0] STATE RUNNING AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:771
[hn003:14316] [[42854,0],0] ACTIVATING PROC [[42854,1],0] STATE RUNNING PRI 4
[hn004:06288] [[42854,0],1] ACTIVATE PROC [[42854,1],1] STATE RUNNING AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:771
[hn004:06288] [[42854,0],1] ACTIVATING PROC [[42854,1],1] STATE RUNNING PRI 4
[hn004:06288] [[42854,0],1] state:orted:track_jobs sending local launch complete for job [42854,1]
[hn004:06288] [[42854,0],1] state:orted:track_procs called for proc [[42854,1],0] state RUNNING
[hn004:06288] [[42854,0],1] state:orted:track_procs called for proc [[42854,1],1] state RUNNING
[hn003:14316] [[42854,0],0] ACTIVATE PROC [[42854,1],1] STATE RUNNING AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:771
[hn003:14316] [[42854,0],0] ACTIVATING PROC [[42854,1],1] STATE RUNNING PRI 4
[hn003:14316] [[42854,0],0] plm:base:receive processing msg
[hn003:14316] [[42854,0],0] plm:base:receive update proc state command from [[42854,0],1]
[hn003:14316] [[42854,0],0] plm:base:receive got update_proc_state for job [42854,1]
[hn003:14316] [[42854,0],0] plm:base:receive got update_proc_state for vpid 0 state RUNNING exit_code 0
[hn003:14316] [[42854,0],0] ACTIVATE PROC [[42854,1],0] STATE RUNNING AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:351
[hn003:14316] [[42854,0],0] ACTIVATING PROC [[42854,1],0] STATE RUNNING PRI 4
[hn003:14316] [[42854,0],0] plm:base:receive got update_proc_state for vpid 1 state RUNNING exit_code 0
[hn003:14316] [[42854,0],0] ACTIVATE PROC [[42854,1],1] STATE RUNNING AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:351
[hn003:14316] [[42854,0],0] ACTIVATING PROC [[42854,1],1] STATE RUNNING PRI 4
[hn003:14316] [[42854,0],0] plm:base:receive done processing commands
[hn003:14316] [[42854,0],0] state:base:track_procs called for proc [[42854,1],0] state RUNNING
[hn003:14316] [[42854,0],0] state:base:track_procs called for proc [[42854,1],1] state RUNNING
[hn003:14316] [[42854,0],0] state:base:track_procs called for proc [[42854,1],0] state RUNNING
[hn003:14316] [[42854,0],0] state:base:track_procs called for proc [[42854,1],1] state RUNNING
[hn003:14316] [[42854,0],0] ACTIVATE JOB [42854,1] STATE RUNNING AT ../../../../openmpi-git/orte/mca/state/base/state_base_fns.c:618
[hn003:14316] [[42854,0],0] ACTIVATING JOB [42854,1] STATE RUNNING PRI 4
[hn003:14316] [[42854,0],0] plm:base:launch wiring up iof for job [42854,1]
[hn003:14316] [[42854,0],0] plm:base:launch job [42854,1] is not a dynamic spawn
[hn004:06288] [[42854,0],1] ACTIVATE PROC [[42854,1],0] STATE SYNC REGISTERED AT ../../openmpi-git/orte/orted/pmix/pmix_server_gen.c:82
[hn004:06288] [[42854,0],1] ACTIVATING PROC [[42854,1],0] STATE SYNC REGISTERED PRI 4
[hn004:06288] [[42854,0],1] state:orted:track_procs called for proc [[42854,1],0] state SYNC REGISTERED
[hn004:06288] [[42854,0],1] ACTIVATE PROC [[42854,1],1] STATE SYNC REGISTERED AT ../../openmpi-git/orte/orted/pmix/pmix_server_gen.c:82
[hn004:06288] [[42854,0],1] ACTIVATING PROC [[42854,1],1] STATE SYNC REGISTERED PRI 4
[hn004:06288] [[hn003:14316] [[42854,0],0] plm:base:receive processing msg
[42854,0],[hn003:14316] [[42854,0],0] ACTIVATE PROC [[42854,1],0] STATE SYNC REGISTERED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:390
1] st[hn003:14316] [[42854,0],0] ACTIVATING PROC [[42854,1],0] STATE SYNC REGISTERED PRI 4
ate:ort[hn003:14316] [[42854,0],0] ACTIVATE PROC [[42854,1],1] STATE SYNC REGISTERED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:390
ed:tr[hn003:14316] [[42854,0],0] ACTIVATING PROC [[42854,1],1] STATE SYNC REGISTERED PRI 4
ack_[hn003:14316] [[42854,0],0] plm:base:receive done processing commands
procs called f[hn003:14316] [[42854,0],0] state:base:track_procs called for proc [[42854,1],0] state SYNC REGISTERED
or pr[hn003:14316] [[42854,0],0] state:base:track_procs called for proc [[42854,1],1] state SYNC REGISTERED
oc [[42854,1],1] state SYNC REGISTERED
[hn004:06288] [[42854,0],1] state:orted: notifying HNP all local registered
[hn003:14316] [[42854,0],0] ACTIVATE PROC [[42854,1],0] STATE SYNC REGISTERED AT ../../openmpi-git/orte/orted/pmix/pmix_server_gen.c:82
[hn003:14316] [[42854,0],0] ACTIVATING PROC [[42854,1],0] STATE SYNC REGISTERED PRI 4
[hn003:14316] [[42854,0],0] state:base:track_procs called for proc [[42854,1],0] state SYNC REGISTERED
[hn003:14316] [[42854,0],0] ACTIVATE PROC [[42854,1],1] STATE SYNC REGISTERED AT ../../openmpi-git/orte/orted/pmix/pmix_server_gen.c:82
[hn003:14316] [[42854,0],0] ACTIVATING PROC [[42854,1],1] STATE SYNC REGISTERED PRI 4
[hn003:14316] [[42854,0],0] state:base:track_procs called for proc [[42854,1],1] state SYNC REGISTERED
[hn003:14316] [[42854,0],0] ACTIVATE JOB [42854,1] STATE SYNC REGISTERED AT ../../../../openmpi-git/orte/mca/state/base/state_base_fns.c:628
[hn003:14316] [[42854,0],0] ACTIVATING JOB [42854,1] STATE SYNC REGISTERED PRI 4
[hn003:14316] [[42854,0],0] plm:base:launch [42854,1] registered
[hn003:14316] [[42854,0],0] plm:base:launch job [42854,1] is not a dynamic spawn
[hn003:14316] [[42854,0],0] ACTIVATE JOB [42854,1] STATE READY FOR DEBUGGERS AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:797
[hn003:14316] [[42854,0],0] ACTIVATING JOB [42854,1] STATE READY FOR DEBUGGERS PRI 4
Hello from 0 of 4 on "hn003"
Hello from 1 of 4 on "hn003"
Hello from 0 of 2 on "hn004"
Hello from 1 of 2 on "hn004"
[hn003:14316] [[42854,0],0] ACTIVATE PROC [[42854,1],0] STATE IOF COMPLETE AT ../../../../../openmpi-git/orte/mca/iof/hnp/iof_hnp_read.c:265
[hn003:14316] [[42854,0],0] ACTIVATING PROC [[42854,1],0] STATE IOF COMPLETE PRI 4
[hn003:14316] [[42854,0],0] state:base:track_procs called for proc [[42854,1],0] state IOF COMPLETE
[hn004:06288] [[42854,0],1] ACTIVATE PROC [[42854,1],1] STATE IOF COMPLETE AT ../../../../../openmpi-git/orte/mca/iof/orted/iof_o[hn003:14316] [[42854,0],0] ACTIVATE PROC [[42854,1],1] STATE IOF COMPLETE AT ../../../../../openmpi-git/orte/mca/iof/hnp/iof_hnp_read.c:265
rted_[hn003:14316] [[42854,0],0] ACTIVATING PROC [[42854,1],1] STATE IOF COMPLETE PRI 4
read.c:170
[hn004:06288] [[42854,0],1] ACTIVATING PROC [[42854,1],1] STATE IOF COMPLETE PR[hn003:14316] [[42854,0],0] state:base:track_procs called for proc [[42854,1],1] state IOF COMPLETE
I 4
[hn004:06288] [[42854,0],1] ACTIVATE PROC [[42854,1],0] STATE IOF COMPLETE AT ../../../../../openmpi-git/orte/mca/iof/orted/iof_orted_read.c:170
[hn004:06288] [[42854,0],1] ACTIVATING PROC [[42854,1],0] STATE IOF COMPLETE PRI 4
[hn004:06288] [[42854,0],1] state:orted:track_procs called for proc [[42854,1],1] state IOF COMPLETE
[hn004:06288] [[42854,0],1] ACTIVATE PROC [[42854,1],0] STATE WAITPID FIRED AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:1404
[hn004:06288] [[42854,0],1] ACTIVATING PROC [[42854,1],0] STATE WAITPID FIRED PRI 4
[hn004:06288] [[42854,0],1] ACTIVATE PROC [[42854,1],1] STATE WAITPID FIRED AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:1404
[hn004:06288] [[42854,0],1] ACTIVATING PROC [[42854,1],1] STATE WAITPID FIRED PRI 4
[hn004:06288] [[42854,0],1] state:orted:track_procs called for proc [[42854,1],0] state IOF COMPLETE
[hn004:06288] [[42854,0],1] state:orted:track_procs called for proc [[42854,1],0] state WAITPID FIRED
[hn004:06288] [[42854,0],1] ACTIVATE PROC [[42854,1],0] STATE NORMALLY TERMINATED AT ../../../../../openmpi-git/orte/mca/state/orted/state_orted.c:355
[hn004:06288] [[42854,0],1] ACTIVATING PROC [[42854,1],0] STATE NORMALLY TERMINATED PRI 4
[hn004:06288] [[42854,0],1] state:orted:track_procs called for proc [[42854,1],1] state WAITPID FIRED
[hn004:06288] [[42854,0],1] ACTIVATE PROC [[42854,1],1] STATE NORMALLY TERMINATED AT ../../../../../openmpi-git/orte/mca/state/orted/state_orted.c:355
[hn004:06288] [[42854,0],1] ACTIVATING PROC [[42854,1],1] STATE NORMALLY TERMINATED PRI 4
[hn004:06288] [[42854,0],1] state:orted:track_procs called for proc [[42854,1],0] state NORMALLY TERMINATED
[hn004:06288] [[42854,0],1] state:orted:track_procs called for proc [[42854,1],1] state NORMALLY TERMINATED
[hn004:06288] [[42854,0],1] state:orted: SENDING JOB LOCAL TERMINATION UPDATE FOR JOB [42854,1]
[hn004:06288] [[42854,0],1] state:orted releasing procs from node hn004
[hn004:06288] [[42854,0],1] state:orted releasing proc [[42854,1],0] from node hn004
[hn004:06288] [[42854,0],1] state:orted releasing proc [[42854,1],1] from node hn004
[hn003:14316] [[42854,0],0] plm:base:receive processing msg
[hn003:14316] [[42854,0],0] plm:base:receive update proc state command from [[42854,0],1]
[hn003:14316] [[42854,0],0] plm:base:receive got update_proc_state for job [42854,1]
[hn003:14316] [[42854,0],0] plm:base:receive got update_proc_state for vpid 0 state NORMALLY TERMINATED exit_code 0
[hn003:14316] [[42854,0],0] ACTIVATE PROC [[42854,1],0] STATE NORMALLY TERMINATED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:351
[hn003:14316] [[42854,0],0] ACTIVATING PROC [[42854,1],0] STATE NORMALLY TERMINATED PRI 4
[hn003:14316] [[42854,0],0] plm:base:receive got update_proc_state for vpid 1 state NORMALLY TERMINATED exit_code 0
[hn003:14316] [[42854,0],0] ACTIVATE PROC [[42854,1],1] STATE NORMALLY TERMINATED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:351
[hn003:14316] [[42854,0],0] ACTIVATING PROC [[42854,1],1] STATE NORMALLY TERMINATED PRI 4
[hn003:14316] [[42854,0],0] plm:base:receive done processing commands
[hn003:14316] [[42854,0],0] state:base:track_procs called for proc [[42854,1],0] state NORMALLY TERMINATED
[hn003:14316] [[42854,0],0] state:base:cleanup_node on proc [[42854,1],0]
[hn003:14316] [[42854,0],0] state:base:track_procs called for proc [[42854,1],1] state NORMALLY TERMINATED
[hn003:14316] [[42854,0],0] state:base:cleanup_node on proc [[42854,1],1]
And the same program runs fine with an earlier commit:
arcurtis@hn003[1](~/mpi-test) ompi_info | head -20
Package: Open MPI arcurtis@lired Distribution
Open MPI: 3.0.0a1
Open MPI repo revision: v2.x-dev-3834-g3caeda2
Open MPI release date: Unreleased developer copy
Open RTE: 3.0.0a1
Open RTE repo revision: v2.x-dev-3834-g3caeda2
Open RTE release date: Unreleased developer copy
OPAL: 3.0.0a1
OPAL repo revision: v2.x-dev-3834-g3caeda2
OPAL release date: Unreleased developer copy
MPI API: 3.1.0
Ident string: 3.0.0a1
Prefix: /gpfs/home/arcurtis/opt/ompi-ucx/3caeda21dcf75db3aa0effee0964e2d7fdec5453
Configured architecture: x86_64-unknown-linux-gnu
Configure host: lired
Configured by: arcurtis
Configured on: Fri Mar 10 12:08:52 EST 2017
Configure host: lired
Configure command line: '--prefix=/gpfs/home/arcurtis/opt/ompi-ucx/3caeda21dcf75db3aa0effee0964e2d7fdec5453' '--with-knem=/opt/knem-1.1.1.90mlnx' '--without-slurm' '--with-ucx=/gpfs/home/arcurtis/opt/ucx'
Built by: arcurtis
arcurtis@hn003[1](~/mpi-test) mpicc hello-mpi.c
arcurtis@hn003[1](~/mpi-test) mpirun ./a.out
Hello from 0 of 4 on "hn003"
Hello from 1 of 4 on "hn003"
Hello from 2 of 4 on "hn004"
Hello from 3 of 4 on "hn004"
arcurtis@hn003[1](~/mpi-test)
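For reference, hello-mpi.c itself isn't posted in the thread; a minimal sketch consistent with the output above (the exact source is an assumption) would be:

```c
/* hello-mpi.c -- minimal sketch of the test program; the real source
 * isn't shown in the thread, so the details here are assumptions
 * matched to the 'Hello from R of N on "host"' output above. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("Hello from %d of %d on \"%s\"\n", rank, size, host);

    MPI_Finalize();
    return 0;
}
```

The printed world size is the useful tell: in the hanging runs above, hn003 reports "of 4" while hn004 reports "of 2", whereas this working build prints ranks 0-3 of a single 4-rank world.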
Okay, well the regex is now correct and so both nodes should be seeing each other, so that's progress. Next question is: why does the 2nd node not realize that the first node has procs on it?
Let's add -mca rmaps_base_verbose 5 and see what that tells us.
Reduced to 2 nodes, 1 rank per node:
arcurtis@hn003[1](~/mpi-test) mpirun --mca state_base_verbose 5 --mca plm_base_verbose 5 --mca rmaps_base_verbose 5 ./a.out
[hn003:17033] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[hn003:17033] plm:base:set_hnp_name: initial bias 17033 nodename hash 1075630230
[hn003:17033] plm:base:set_hnp_name: final jobfam 53763
[hn003:17033] [[53763,0],0] plm:base:receive start comm
[hn003:17033] [[53763,0],0] rmaps:base set policy with NULL device NONNULL
[hn003:17033] mca:rmaps:select: checking available component mindist
[hn003:17033] mca:rmaps:select: Querying component [mindist]
[hn003:17033] mca:rmaps:select: checking available component ppr
[hn003:17033] mca:rmaps:select: Querying component [ppr]
[hn003:17033] mca:rmaps:select: checking available component rank_file
[hn003:17033] mca:rmaps:select: Querying component [rank_file]
[hn003:17033] mca:rmaps:select: checking available component resilient
[hn003:17033] mca:rmaps:select: Querying component [resilient]
[hn003:17033] mca:rmaps:select: checking available component round_robin
[hn003:17033] mca:rmaps:select: Querying component [round_robin]
[hn003:17033] mca:rmaps:select: checking available component seq
[hn003:17033] mca:rmaps:select: Querying component [seq]
[hn003:17033] [[53763,0],0]: Final mapper priorities
[hn003:17033] Mapper: ppr Priority: 90
[hn003:17033] Mapper: seq Priority: 60
[hn003:17033] Mapper: resilient Priority: 40
[hn003:17033] Mapper: mindist Priority: 20
[hn003:17033] Mapper: round_robin Priority: 10
[hn003:17033] Mapper: rank_file Priority: 0
[hn003:17033] [[53763,0],0] ACTIVATE JOB [INVALID] STATE PENDING INIT AT ../../../../../openmpi-git/orte/mca/plm/tm/plm_tm_module.c:155
[hn003:17033] [[53763,0],0] ACTIVATING JOB [INVALID] STATE PENDING INIT PRI 4
[hn003:17033] [[53763,0],0] plm:base:setup_job
[hn003:17033] [[53763,0],0] ACTIVATE JOB [53763,1] STATE INIT_COMPLETE AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:348
[hn003:17033] [[53763,0],0] ACTIVATING JOB [53763,1] STATE INIT_COMPLETE PRI 4
[hn003:17033] [[53763,0],0] ACTIVATE JOB [53763,1] STATE PENDING ALLOCATION AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:359
[hn003:17033] [[53763,0],0] ACTIVATING JOB [53763,1] STATE PENDING ALLOCATION PRI 4
[hn003:17033] [[53763,0],0] ACTIVATE JOB [53763,1] STATE ALLOCATION COMPLETE AT ../../../../openmpi-git/orte/mca/ras/base/ras_base_allocate.c:444
[hn003:17033] [[53763,0],0] ACTIVATING JOB [53763,1] STATE ALLOCATION COMPLETE PRI 4
[hn003:17033] [[53763,0],0] ACTIVATE JOB [53763,1] STATE PENDING DAEMON LAUNCH AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:185
[hn003:17033] [[53763,0],0] ACTIVATING JOB [53763,1] STATE PENDING DAEMON LAUNCH PRI 4
[hn003:17033] [[53763,0],0] plm:base:setup_vm
[hn003:17033] [[53763,0],0] plm:base:setup_vm creating map
[hn003:17033] [[53763,0],0] plm:base:setup_vm add new daemon [[53763,0],1]
[hn003:17033] [[53763,0],0] plm:base:setup_vm assigning new daemon [[53763,0],1] to node hn004
[hn003:17033] [[53763,0],0] plm:tm: launching vm
[hn003:17033] [[53763,0],0] plm:tm: final top-level argv:
orted -mca ess tm -mca ess_base_jobid 3523411968 -mca ess_base_vpid <template> -mca ess_base_num_procs 2 -mca orte_hnp_uri 3523411968.0;tcp://10.10.0.203,10.10.4.203:47908;ud://13928.120.1 -mca orte_node_regex hn[3:3-4] --mca state_base_verbose 5 --mca plm_base_verbose 5 --mca rmaps_base_verbose 5
[hn003:17033] [[53763,0],0] plm:tm: launching on node hn004
[hn003:17033] [[53763,0],0] plm:tm: executing:
orted -mca ess tm -mca ess_base_jobid 3523411968 -mca ess_base_vpid 1 -mca ess_base_num_procs 2 -mca orte_hnp_uri 3523411968.0;tcp://10.10.0.203,10.10.4.203:47908;ud://13928.120.1 -mca orte_node_regex hn[3:3-4] --mca state_base_verbose 5 --mca plm_base_verbose 5 --mca rmaps_base_verbose 5
[hn003:17033] [[53763,0],0] plm:tm:launch: finished spawning orteds
[hn004:08286] [[53763,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
[hn004:08286] [[53763,0],1] rmaps:base set policy with NULL device NONNULL
[hn004:08286] mca:rmaps:select: checking available component mindist
[hn004:08286] mca:rmaps:select: Querying component [mindist]
[hn004:08286] mca:rmaps:select: checking available component ppr
[hn004:08286] mca:rmaps:select: Querying component [ppr]
[hn004:08286] mca:rmaps:select: checking available component rank_file
[hn004:08286] mca:rmaps:select: Querying component [rank_file]
[hn004:08286] mca:rmaps:select: checking available component resilient
[hn004:08286] mca:rmaps:select: Querying component [resilient]
[hn004:08286] mca:rmaps:select: checking available component round_robin
[hn004:08286] mca:rmaps:select: Querying component [round_robin]
[hn004:08286] mca:rmaps:select: checking available component seq
[hn004:08286] mca:rmaps:select: Querying component [seq]
[hn004:08286] [[53763,0],1]: Final mapper priorities
[hn004:08286] Mapper: ppr Priority: 90
[hn004:08286] Mapper: seq Priority: 60
[hn004:08286] Mapper: resilient Priority: 40
[hn004:08286] Mapper: mindist Priority: 20
[hn004:08286] Mapper: round_robin Priority: 10
[hn004:08286] Mapper: rank_file Priority: 0
[hn004:08286] [[53763,0],1] plm:rsh_setup on agent ssh : rsh path NULL
[hn004:08286] [[53763,0],1] plm:base:receive start comm
[hn003:17033] [[53763,0],0] plm:base:orted_report_launch from daemon [[53763,0],1]
[hn003:17033] [[53763,0],0] plm:base:orted_report_launch from daemon [[53763,0],1] on node hn004
[hn003:17033] [[53763,0],0] RECEIVED TOPOLOGY SIG 2N:2S:2L3:16L2:16L1:16C:16H:x86_64 FROM NODE hn004
[hn003:17033] [[53763,0],0] TOPOLOGY ALREADY RECORDED
[hn003:17033] [[53763,0],0] plm:base:orted_report_launch completed for daemon [[53763,0],1] at contact 3523411968.1;tcp://10.10.0.204,10.10.4.204:39609;ud://8822.118.1
[hn003:17033] [[53763,0],0] plm:base:orted_report_launch recvd 2 of 2 reported daemons
[hn003:17033] [[53763,0],0] ACTIVATE JOB [53763,1] STATE ALL DAEMONS REPORTED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:1212
[hn003:17033] [[53763,0],0] ACTIVATING JOB [53763,1] STATE ALL DAEMONS REPORTED PRI 4
[hn003:17033] [[53763,0],0] ACTIVATE JOB [53763,1] STATE VM READY AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:173
[hn003:17033] [[53763,0],0] ACTIVATING JOB [53763,1] STATE VM READY PRI 4
[hn003:17033] [[53763,0],0] ACTIVATE JOB [53763,1] STATE PENDING FINAL SYSTEM PREP AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:210
[hn003:17033] [[53763,0],0] ACTIVATING JOB [53763,1] STATE PENDING FINAL SYSTEM PREP PRI 4
[hn003:17033] [[53763,0],0] complete_setup on job [53763,1]
[hn003:17033] [[53763,0],0] ACTIVATE JOB [53763,1] STATE PENDING APP LAUNCH AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:454
[hn003:17033] [[53763,0],0] ACTIVATING JOB [53763,1] STATE PENDING APP LAUNCH PRI 4
[hn003:17033] [[53763,0],0] plm:base:launch_apps for job [53763,1]
[hn003:17033] mca:rmaps: mapping job [53763,1]
[hn003:17033] [[53763,0],0] Starting with 2 nodes in list
[hn003:17033] [[53763,0],0] Filtering thru apps
[hn003:17033] [[53763,0],0] Retained 2 nodes in list
[hn003:17033] [[53763,0],0] node hn003 has 1 slots available
[hn003:17033] [[53763,0],0] node hn004 has 1 slots available
[hn003:17033] AVAILABLE NODES FOR MAPPING:
[hn003:17033] node: hn003 daemon: 0
[hn003:17033] node: hn004 daemon: 1
[hn003:17033] mca:rmaps: setting mapping policies for job [53763,1] nprocs 2
[hn003:17033] mca:rmaps[162] mapping not given - using bycore
[hn003:17033] mca:rmaps[302] binding not given - using bycore
[hn003:17033] mca:rmaps:ppr: job [53763,1] not using ppr mapper PPR NULL policy PPR NOTSET
[hn003:17033] [[53763,0],0] rmaps:seq called on job [53763,1]
[hn003:17033] mca:rmaps:seq: job [53763,1] not using seq mapper
[hn003:17033] mca:rmaps:resilient: cannot perform initial map of job [53763,1] - no fault groups
[hn003:17033] mca:rmaps:mindist: job [53763,1] not using mindist mapper
[hn003:17033] mca:rmaps:rr: mapping job [53763,1]
[hn003:17033] [[53763,0],0] Starting with 2 nodes in list
[hn003:17033] [[53763,0],0] Filtering thru apps
[hn003:17033] [[53763,0],0] Retained 2 nodes in list
[hn003:17033] [[53763,0],0] node hn003 has 1 slots available
[hn003:17033] [[53763,0],0] node hn004 has 1 slots available
[hn003:17033] AVAILABLE NODES FOR MAPPING:
[hn003:17033] node: hn003 daemon: 0
[hn003:17033] node: hn004 daemon: 1
[hn003:17033] [[53763,0],0] Starting bookmark at node hn003
[hn003:17033] [[53763,0],0] Starting at node hn003
[hn003:17033] mca:rmaps:rr: mapping no-span by Core for job [53763,1] slots 2 num_procs 2
[hn003:17033] mca:rmaps:rr: found 16 Core objects on node hn003
[hn003:17033] mca:rmaps:rr: calculated nprocs 1
[hn003:17033] mca:rmaps:rr: assigning nprocs 1
[hn003:17033] mca:rmaps:rr: found 16 Core objects on node hn004
[hn003:17033] mca:rmaps:rr: calculated nprocs 1
[hn003:17033] mca:rmaps:rr: assigning nprocs 1
[hn003:17033] mca:rmaps:base: computing vpids by slot for job [53763,1]
[hn003:17033] mca:rmaps:base: assigning rank 0 to node hn003
[hn003:17033] mca:rmaps:base: assigning rank 1 to node hn004
[hn003:17033] [[53763,0],0] rmaps:base:compute_usage
[hn003:17033] mca:rmaps: compute bindings for job [53763,1] with policy CORE:IF-SUPPORTED[1008]
[hn003:17033] mca:rmaps: bindings for job [53763,1] - bind in place
[hn003:17033] mca:rmaps: bind in place for job [53763,1] with bindings CORE:IF-SUPPORTED
[hn003:17033] BINDING PROC [[53763,1],0] TO Core NUMBER 0
[hn003:17033] [[53763,0],0] BOUND PROC [[53763,1],0] TO 0[Core:0] on node hn003
[hn004:08286] mca:rmaps: mapping job [53763,1]
[hn004:08286] [[53763,0],1] using default hostfile /gpfs/home/arcurtis/opt/ompi-ucx/git/etc/openmpi-default-hostfile
[hn004:08286] [[53763,0],1] nothing in default hostfile - using known nodes
[hn004:08286] [[53763,0],1] Starting with 1 nodes in list
[hn004:08286] [[53763,0],1] Filtering thru apps
[hn004:08286] [[53763,0],1] Retained 1 nodes in list
[hn004:08286] [[53763,0],1] node hn004 has 1 slots available
[hn004:08286] AVAILABLE NODES FOR MAPPING:
[hn004:08286] node: hn004 daemon: 1
[hn004:08286] mca:rmaps: setting mapping policies for job [53763,1] nprocs 1
[hn004:08286] mca:rmaps[162] mapping not given - using bycore
[hn004:08286] mca:rmaps[302] binding not given - using bycore
[hn004:08286] mca:rmaps:ppr: job [53763,1] not using ppr mapper PPR NULL policy PPR NOTSET
[hn004:08286] [[53763,0],1] rmaps:seq called on job [53763,1]
[hn004:08286] mca:rmaps:seq: job [53763,1] not using seq mapper
[hn004:08286] mca:rmaps:resilient: cannot perform initial map of job [53763,1] - no fault groups
[hn004:08286] mca:rmaps:mindist: job [53763,1] not using mindist mapper
[hn004:08286] mca:rmaps:rr: mapping job [53763,1]
[hn004:08286] [[53763,0],1] using default hostfile /gpfs/home/arcurtis/opt/ompi-ucx/git/etc/openmpi-default-hostfile
[hn004:08286] [[53763,0],1] nothing in default hostfile - using known nodes
[hn004:08286] [[53763,0],1] Starting with 1 nodes in list
[hn004:08286] [[53763,0],1] Filtering thru apps
[hn004:08286] [[53763,0],1] Retained 1 nodes in list
[hn004:08286] [[53763,0],1] node hn004 has 1 slots available
[hn004:08286] AVAILABLE NODES FOR MAPPING:
[hn004:08286] node: hn004 daemon: 1
[hn004:08286] [[53763,0],1] Starting bookmark at node hn004
[hn004:08286] [[53763,0],1] Starting at node hn004
[hn004:08286] mca:rmaps:rr: mapping no-span by Core for job [53763,1] slots 1 num_procs 1
[hn004:08286] mca:rmaps:rr: found 16 Core objects on node hn004
[hn004:08286] mca:rmaps:rr: calculated nprocs 1
[hn004:08286] mca:rmaps:rr: assigning nprocs 1
[hn004:08286] mca:rmaps:base: computing vpids by slot for job [53763,1]
[hn004:08286] mca:rmaps:base: assigning rank 0 to node hn004
[hn004:08286] [[53763,0],1] rmaps:base:compute_usage
[hn004:08286] mca:rmaps: compute bindings for job [53763,1] with policy CORE:IF-SUPPORTED[1008]
[hn004:08286] mca:rmaps: bindings for job [53763,1] - bind in place
[hn004:08286] mca:rmaps: bind in place for job [53763,1] with bindings CORE:IF-SUPPORTED
[hn004:08286] BINDING PROC [[53763,1],0] TO Core NUMBER 0
[hn004:08286] [[53763,0],1] BOUND PROC [[53763,1],0] TO 0[Core:0] on node hn004
[hn003:17033] [[53763,0],0] ACTIVATE JOB [53763,1] STATE LOCAL LAUNCH COMPLETE AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:1136
[hn003:17033] [[53763,0],0] ACTIVATING JOB [53763,1] STATE LOCAL LAUNCH COMPLETE PRI 4
[hn004:08286] [[53763,0],1] ACTIVATE JOB [53763,1] STATE LOCAL LAUNCH COMPLETE AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:1136
[hn004:08286] [[53763,0],1] ACTIVATING JOB [53763,1] STATE LOCAL LAUNCH COMPLETE PRI 4
[hn004:08286] [[53763,0],1] ACTIVATE PROC [[53763,1],0] STATE RUNNING AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:771
[hn004:08286] [[53763,0],1] ACTIVATING PROC [[53763,1],0] STATE RUNNING PRI 4
[hn004:08286] [[53763,0],1] state:orted:track_jobs sending local launch complete for job [53763,1]
[hn004:08286] [[53763,0],1] state:orted:track_procs called for proc [[53763,1],0] state RUNNING
[hn003:17033] [[53763,0],0] ACTIVATE PROC [[53763,1],0] STATE RUNNING AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:771
[hn003:17033] [[53763,0],0] ACTIVATING PROC [[53763,1],0] STATE RUNNING PRI 4
[hn003:17033] [[53763,0],0] plm:base:receive processing msg
[hn003:17033] [[53763,0],0] plm:base:receive update proc state command from [[53763,0],1]
[hn003:17033] [[53763,0],0] plm:base:receive got update_proc_state for job [53763,1]
[hn003:17033] [[53763,0],0] plm:base:receive got update_proc_state for vpid 0 state RUNNING exit_code 0
[hn003:17033] [[53763,0],0] ACTIVATE PROC [[53763,1],0] STATE RUNNING AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:351
[hn003:17033] [[53763,0],0] ACTIVATING PROC [[53763,1],0] STATE RUNNING PRI 4
[hn003:17033] [[53763,0],0] plm:base:receive done processing commands
[hn003:17033] [[53763,0],0] state:base:track_procs called for proc [[53763,1],0] state RUNNING
[hn003:17033] [[53763,0],0] state:base:track_procs called for proc [[53763,1],0] state RUNNING
[hn003:17033] [[53763,0],0] ACTIVATE JOB [53763,1] STATE RUNNING AT ../../../../openmpi-git/orte/mca/state/base/state_base_fns.c:618
[hn003:17033] [[53763,0],0] ACTIVATING JOB [53763,1] STATE RUNNING PRI 4
[hn003:17033] [[53763,0],0] plm:base:launch wiring up iof for job [53763,1]
[hn003:17033] [[53763,0],0] plm:base:launch job [53763,1] is not a dynamic spawn
[hn004:08286] [[53763,0],1] ACTIVATE PROC [[53763,1],0] STATE SYNC REGISTERED AT ../../openmpi-git/orte/orted/pmix/pmix_server_gen.c:82
[hn003:17033] [[53763,0],0] plm:base:receive processing msg
[hn003:17033] [[53763,0],0] ACTIVATE PROC [[53763,1],0] STATE SYNC REGISTERED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:390
[hn003:17033] [[53763,0],0] ACTIVATING PROC [[53763,1],0] STATE SYNC REGISTERED PRI 4
[hn003:17033] [[53763,0],0] plm:base:receive done processing commands
[hn003:17033] [[53763,0],0] state:base:track_procs called for proc [[53763,1],0] state SYNC REGISTERED
[hn004:08286] [[53763,0],1] ACTIVATING PROC [[53763,1],0] STATE SYNC REGISTERED PRI 4
[hn004:08286] [[53763,0],1] state:orted:track_procs called for proc [[53763,1],0] state SYNC REGISTERED
[hn004:08286] [[53763,0],1] state:orted: notifying HNP all local registered
[hn003:17033] [[53763,0],0] ACTIVATE PROC [[53763,1],0] STATE SYNC REGISTERED AT ../../openmpi-git/orte/orted/pmix/pmix_server_gen.c:82
[hn003:17033] [[53763,0],0] ACTIVATING PROC [[53763,1],0] STATE SYNC REGISTERED PRI 4
[hn003:17033] [[53763,0],0] state:base:track_procs called for proc [[53763,1],0] state SYNC REGISTERED
[hn003:17033] [[53763,0],0] ACTIVATE JOB [53763,1] STATE SYNC REGISTERED AT ../../../../openmpi-git/orte/mca/state/base/state_base_fns.c:628
[hn003:17033] [[53763,0],0] ACTIVATING JOB [53763,1] STATE SYNC REGISTERED PRI 4
[hn003:17033] [[53763,0],0] plm:base:launch [53763,1] registered
[hn003:17033] [[53763,0],0] plm:base:launch job [53763,1] is not a dynamic spawn
[hn003:17033] [[53763,0],0] ACTIVATE JOB [53763,1] STATE READY FOR DEBUGGERS AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:797
[hn003:17033] [[53763,0],0] ACTIVATING JOB [53763,1] STATE READY FOR DEBUGGERS PRI 4
Hello from 0 of 2 on "hn003"
Hello from 0 of 1 on "hn004"
[hn003:17033] [[53763,0],0] ACTIVATE PROC [[53763,1],0] STATE IOF COMPLETE AT ../../../../../openmpi-git/orte/mca/iof/hnp/iof_hnp_read.c:265
[hn003:17033] [[53763,0],0] ACTIVATING PROC [[53763,1],0] STATE IOF COMPLETE PRI 4
[hn003:17033] [[53763,0],0] state:base:track_procs called for proc [[53763,1],0] state IOF COMPLETE
[hn004:08286] [[53763,0],1] ACTIVATE PROC [[53763,1],0] STATE IOF COMPLETE AT ../../../../../openmpi-git/orte/mca/iof/orted/iof_orted_read.c:170
[hn004:08286] [[53763,0],1] ACTIVATING PROC [[53763,1],0] STATE IOF COMPLETE PRI 4
[hn004:08286] [[53763,0],1] state:orted:track_procs called for proc [[53763,1],0] state IOF COMPLETE
[hn004:08286] [[53763,0],1] ACTIVATE PROC [[53763,1],0] STATE WAITPID FIRED AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:1404
[hn004:08286] [[53763,0],1] ACTIVATING PROC [[53763,1],0] STATE WAITPID FIRED PRI 4
[hn004:08286] [[53763,0],1] state:orted:track_procs called for proc [[53763,1],0] state WAITPID FIRED
[hn004:08286] [[53763,0],1] ACTIVATE PROC [[53763,1],0] STATE NORMALLY TERMINATED AT ../../../../../openmpi-git/orte/mca/state/orted/state_orted.c:355
[hn004:08286] [[53763,0],1] ACTIVATING PROC [[53763,1],0] STATE NORMALLY TERMINATED PRI 4
[hn004:08286] [[53763,0],1] state:orted:track_procs called for proc [[53763,1],0] state NORMALLY TERMINATED
[hn004:08286] [[53763,0],1] state:orted: SENDING JOB LOCAL TERMINATION UPDATE FOR JOB [53763,1]
[hn004:08286] [[53763,0],1] state:orted releasing procs from node hn004
[hn004:08286] [[53763,0],1] state:orted releasing proc [[53763,1],0] from node hn004
[hn003:17033] [[53763,0],0] plm:base:receive processing msg
[hn003:17033] [[53763,0],0] plm:base:receive update proc state command from [[53763,0],1]
[hn003:17033] [[53763,0],0] plm:base:receive got update_proc_state for job [53763,1]
[hn003:17033] [[53763,0],0] plm:base:receive got update_proc_state for vpid 0 state NORMALLY TERMINATED exit_code 0
[hn003:17033] [[53763,0],0] ACTIVATE PROC [[53763,1],0] STATE NORMALLY TERMINATED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:351
[hn003:17033] [[53763,0],0] ACTIVATING PROC [[53763,1],0] STATE NORMALLY TERMINATED PRI 4
[hn003:17033] [[53763,0],0] plm:base:receive done processing commands
[hn003:17033] [[53763,0],0] state:base:track_procs called for proc [[53763,1],0] state NORMALLY TERMINATED
[hn003:17033] [[53763,0],0] state:base:cleanup_node on proc [[53763,1],0]
^C[hn003:17033] [[53763,0],0] plm:base:orted_cmd sending orted_exit commands
[hn004:08286] [[53763,0],1] ACTIVATE JOB NULL STATE DAEMONS TERMINATED AT ../../openmpi-git/orte/orted/orted_comm.c:387
[hn004:08286] [[53763,0],1] ACTIVATING JOB NULL STATE DAEMONS TERMINATED PRI 4
[hn004:08286] [[53763,0],1] plm:base:receive stop comm
[hn003:17033] [[53763,0],0] ACTIVATE PROC [[53763,0],1] STATE COMMUNICATION FAILURE AT ../../../../../openmpi-git/orte/mca/oob/tcp/oob_tcp_component.c:1118
[hn003:17033] [[53763,0],0] ACTIVATING PROC [[53763,0],1] STATE COMMUNICATION FAILURE PRI 3
[hn003:17033] [[53763,0],0] ACTIVATE JOB NULL STATE DAEMONS TERMINATED AT ../../../../../openmpi-git/orte/mca/errmgr/default_hnp/errmgr_default_hnp.c:368
[hn003:17033] [[53763,0],0] ACTIVATING JOB NULL STATE DAEMONS TERMINATED PRI 4
[hn003:17033] [[53763,0],0] plm:base:receive stop comm
I found the issue - fix coming
@tonycurtis Please update and confirm this fixed it. Thanks for your assistance and patience.
Sadly, still doing it:
$ qsub -l nodes=4:ppn=2 ...
arcurtis@hn003[1](~/mpi-test) mpirun ./a.out
Hello from 0 of 8 on "hn003"
Hello from 1 of 8 on "hn003"
Hello from 2 of 6 on "hn006"
Hello from 3 of 6 on "hn006"
Hello from 0 of 6 on "hn004"
Hello from 1 of 6 on "hn004"
Hello from 4 of 6 on "hn005"
Hello from 5 of 6 on "hn005"
^C
Don't know if this helps, but I note that the first node gets the correct total rank count, while the other nodes take the remaining slots still to be filled as the job size and enumerate their ranks within that range (above: hn003 sees 8 ranks, while the other nodes see 6, i.e. 8 minus the 2 ranks on mpirun's node). Will attach a trace if required.
Sigh - okay, let's try to get some more info. Sadly, I don't have access to a system running Torque, so would you be willing to apply the following patch and post the resulting output?
diff --git a/orte/util/nidmap.c b/orte/util/nidmap.c
index 6a77aa464e..c06bed676c 100644
--- a/orte/util/nidmap.c
+++ b/orte/util/nidmap.c
@@ -481,6 +481,7 @@ int orte_util_encode_nodemap(opal_buffer_t *buffer)
OBJ_DESTRUCT(&nodenms);
/* pack the string */
+ opal_output(0, "%s NODE REGEX: %s", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), tmp);
if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, &tmp, 1, OPAL_STRING))) {
ORTE_ERROR_LOG(rc);
OPAL_LIST_DESTRUCT(&dvpids);
@@ -517,6 +518,7 @@ int orte_util_encode_nodemap(opal_buffer_t *buffer)
OPAL_LIST_DESTRUCT(&dvpids);
/* pack the string */
+ opal_output(0, "%s VPID REGEX: %s", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME) tmp);
if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, &tmp, 1, OPAL_STRING))) {
ORTE_ERROR_LOG(rc);
OPAL_LIST_DESTRUCT(&slots);
@@ -552,6 +554,7 @@ int orte_util_encode_nodemap(opal_buffer_t *buffer)
OPAL_LIST_DESTRUCT(&slots);
/* pack the string */
+ opal_output(0, "%s SLOTS REGEX: %s", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME) tmp);
if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, &tmp, 1, OPAL_STRING))) {
ORTE_ERROR_LOG(rc);
return rc;
@@ -695,6 +698,7 @@ int orte_util_decode_daemon_nodemap(opal_buffer_t *buffer)
if (NULL == ndnames) {
return ORTE_SUCCESS;
}
+ opal_output(0, "%s DECODE NODE REGEX: %s", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME) ndnames);
OBJ_CONSTRUCT(&dids, opal_list_t);
OBJ_CONSTRUCT(&slts, opal_list_t);
@@ -712,6 +716,7 @@ int orte_util_decode_daemon_nodemap(opal_buffer_t *buffer)
rc = ORTE_ERR_BAD_PARAM;
goto cleanup;
}
+ opal_output(0, "%s DECODE VPID REGEX: %s", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME) dvpids);
/* unpack the slots regex */
n = 1;
@@ -725,6 +730,7 @@ int orte_util_decode_daemon_nodemap(opal_buffer_t *buffer)
rc = ORTE_ERR_BAD_PARAM;
goto cleanup;
}
+ opal_output(0, "%s DECODE SLOTS REGEX: %s", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME) slots);
/* unpack the flags regex */
n = 1;
@@ -883,6 +889,7 @@ int orte_util_decode_daemon_nodemap(opal_buffer_t *buffer)
}
/* set the number of slots */
node->slots = srng->slots;
+ opal_output(0, "%s SET NODE %s SLOTS: %d", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME) node->name, node->slots);
if (srng->endpt == nn) {
srng = (orte_regex_range_t*)opal_list_get_next(&srng->super);
}
(Hopefully I fixed the missing commas in the opal_output calls correctly.) This run is with -l nodes=2:ppn=2. The first trace is with just the patch above applied, followed by the full trace with the verbose MCA flags.
[hn003:03147] [[40129,0],0] NODE REGEX: hn[3:3-4]
[hn003:03147] [[40129,0],0] VPID REGEX: 0-1
[hn003:03147] [[40129,0],0] SLOTS REGEX: 0-1[2]
[hn003:03147] [[40129,0],0] DECODE NODE REGEX: hn[3:3-4]
[hn003:03147] [[40129,0],0] DECODE VPID REGEX: 0-1
[hn003:03147] [[40129,0],0] DECODE SLOTS REGEX: 0-1[2]
[hn004:25047] [[40129,0],1] DECODE NODE REGEX: hn[3:3-4]
[hn004:25047] [[40129,0],1] DECODE VPID REGEX: 0-1
[hn004:25047] [[40129,0],1] DECODE SLOTS REGEX: 0-1[2]
[hn004:25047] [[40129,0],1] SET NODE hn003 SLOTS: 2
[hn004:25047] [[40129,0],1] SET NODE hn004 SLOTS: 2
Hello from 0 of 4 on "hn003"
Hello from 1 of 4 on "hn003"
Hello from 0 of 2 on "hn004"
Hello from 1 of 2 on "hn004"
[hn003:03191] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[hn003:03191] plm:base:set_hnp_name: initial bias 3191 nodename hash 1075630230
[hn003:03191] plm:base:set_hnp_name: final jobfam 40189
[hn003:03191] [[40189,0],0] plm:base:receive start comm
[hn003:03191] [[40189,0],0] rmaps:base set policy with NULL device NONNULL
[hn003:03191] mca:rmaps:select: checking available component mindist
[hn003:03191] mca:rmaps:select: Querying component [mindist]
[hn003:03191] mca:rmaps:select: checking available component ppr
[hn003:03191] mca:rmaps:select: Querying component [ppr]
[hn003:03191] mca:rmaps:select: checking available component rank_file
[hn003:03191] mca:rmaps:select: Querying component [rank_file]
[hn003:03191] mca:rmaps:select: checking available component resilient
[hn003:03191] mca:rmaps:select: Querying component [resilient]
[hn003:03191] mca:rmaps:select: checking available component round_robin
[hn003:03191] mca:rmaps:select: Querying component [round_robin]
[hn003:03191] mca:rmaps:select: checking available component seq
[hn003:03191] mca:rmaps:select: Querying component [seq]
[hn003:03191] [[40189,0],0]: Final mapper priorities
[hn003:03191] Mapper: ppr Priority: 90
[hn003:03191] Mapper: seq Priority: 60
[hn003:03191] Mapper: resilient Priority: 40
[hn003:03191] Mapper: mindist Priority: 20
[hn003:03191] Mapper: round_robin Priority: 10
[hn003:03191] Mapper: rank_file Priority: 0
[hn003:03191] [[40189,0],0] ACTIVATE JOB [INVALID] STATE PENDING INIT AT ../../../../../openmpi-git/orte/mca/plm/tm/plm_tm_module.c:155
[hn003:03191] [[40189,0],0] ACTIVATING JOB [INVALID] STATE PENDING INIT PRI 4
[hn003:03191] [[40189,0],0] plm:base:setup_job
[hn003:03191] [[40189,0],0] ACTIVATE JOB [40189,1] STATE INIT_COMPLETE AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:348
[hn003:03191] [[40189,0],0] ACTIVATING JOB [40189,1] STATE INIT_COMPLETE PRI 4
[hn003:03191] [[40189,0],0] ACTIVATE JOB [40189,1] STATE PENDING ALLOCATION AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:359
[hn003:03191] [[40189,0],0] ACTIVATING JOB [40189,1] STATE PENDING ALLOCATION PRI 4
[hn003:03191] [[40189,0],0] ACTIVATE JOB [40189,1] STATE ALLOCATION COMPLETE AT ../../../../openmpi-git/orte/mca/ras/base/ras_base_allocate.c:444
[hn003:03191] [[40189,0],0] ACTIVATING JOB [40189,1] STATE ALLOCATION COMPLETE PRI 4
[hn003:03191] [[40189,0],0] ACTIVATE JOB [40189,1] STATE PENDING DAEMON LAUNCH AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:185
[hn003:03191] [[40189,0],0] ACTIVATING JOB [40189,1] STATE PENDING DAEMON LAUNCH PRI 4
[hn003:03191] [[40189,0],0] plm:base:setup_vm
[hn003:03191] [[40189,0],0] plm:base:setup_vm creating map
[hn003:03191] [[40189,0],0] plm:base:setup_vm add new daemon [[40189,0],1]
[hn003:03191] [[40189,0],0] plm:base:setup_vm assigning new daemon [[40189,0],1] to node hn004
[hn003:03191] [[40189,0],0] plm:tm: launching vm
[hn003:03191] [[40189,0],0] plm:tm: final top-level argv:
orted -mca ess tm -mca ess_base_jobid 2633826304 -mca ess_base_vpid <template> -mca ess_base_num_procs 2 -mca orte_hnp_uri 2633826304.0;tcp://10.10.0.203,10.10.4.203:42723;ud://19006.120.1 -mca orte_node_regex hn[3:3-4] --mca state_base_verbose 5 --mca plm_base_verbose 5 --mca rmaps_base_verbose 5
[hn003:03191] [[40189,0],0] plm:tm: launching on node hn004
[hn003:03191] [[40189,0],0] plm:tm: executing:
orted -mca ess tm -mca ess_base_jobid 2633826304 -mca ess_base_vpid 1 -mca ess_base_num_procs 2 -mca orte_hnp_uri 2633826304.0;tcp://10.10.0.203,10.10.4.203:42723;ud://19006.120.1 -mca orte_node_regex hn[3:3-4] --mca state_base_verbose 5 --mca plm_base_verbose 5 --mca rmaps_base_verbose 5
[hn003:03191] [[40189,0],0] plm:tm:launch: finished spawning orteds
[hn004:25080] [[40189,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
[hn004:25080] [[40189,0],1] rmaps:base set policy with NULL device NONNULL
[hn004:25080] mca:rmaps:select: checking available component mindist
[hn004:25080] mca:rmaps:select: Querying component [mindist]
[hn004:25080] mca:rmaps:select: checking available component ppr
[hn004:25080] mca:rmaps:select: Querying component [ppr]
[hn004:25080] mca:rmaps:select: checking available component rank_file
[hn004:25080] mca:rmaps:select: Querying component [rank_file]
[hn004:25080] mca:rmaps:select: checking available component resilient
[hn004:25080] mca:rmaps:select: Querying component [resilient]
[hn004:25080] mca:rmaps:select: checking available component round_robin
[hn004:25080] mca:rmaps:select: Querying component [round_robin]
[hn004:25080] mca:rmaps:select: checking available component seq
[hn004:25080] mca:rmaps:select: Querying component [seq]
[hn004:25080] [[40189,0],1]: Final mapper priorities
[hn004:25080] Mapper: ppr Priority: 90
[hn004:25080] Mapper: seq Priority: 60
[hn004:25080] Mapper: resilient Priority: 40
[hn004:25080] Mapper: mindist Priority: 20
[hn004:25080] Mapper: round_robin Priority: 10
[hn004:25080] Mapper: rank_file Priority: 0
[hn004:25080] [[40189,0],1] plm:rsh_setup on agent ssh : rsh path NULL
[hn004:25080] [[40189,0],1] plm:base:receive start comm
[hn003:03191] [[40189,0],0] plm:base:orted_report_launch from daemon [[40189,0],1]
[hn003:03191] [[40189,0],0] plm:base:orted_report_launch from daemon [[40189,0],1] on node hn004
[hn003:03191] [[40189,0],0] RECEIVED TOPOLOGY SIG 2N:2S:2L3:16L2:16L1:16C:16H:x86_64 FROM NODE hn004
[hn003:03191] [[40189,0],0] TOPOLOGY ALREADY RECORDED
[hn003:03191] [[40189,0],0] plm:base:orted_report_launch completed for daemon [[40189,0],1] at contact 2633826304.1;tcp://10.10.0.204,10.10.4.204:40332;ud://13254.118.1
[hn003:03191] [[40189,0],0] plm:base:orted_report_launch recvd 2 of 2 reported daemons
[hn003:03191] [[40189,0],0] ACTIVATE JOB [40189,1] STATE ALL DAEMONS REPORTED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:1212
[hn003:03191] [[40189,0],0] ACTIVATING JOB [40189,1] STATE ALL DAEMONS REPORTED PRI 4
[hn003:03191] [[40189,0],0] ACTIVATE JOB [40189,1] STATE VM READY AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:173
[hn003:03191] [[40189,0],0] ACTIVATING JOB [40189,1] STATE VM READY PRI 4
[hn003:03191] [[40189,0],0] ACTIVATE JOB [40189,1] STATE PENDING FINAL SYSTEM PREP AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:210
[hn003:03191] [[40189,0],0] ACTIVATING JOB [40189,1] STATE PENDING FINAL SYSTEM PREP PRI 4
[hn003:03191] [[40189,0],0] complete_setup on job [40189,1]
[hn003:03191] [[40189,0],0] ACTIVATE JOB [40189,1] STATE PENDING APP LAUNCH AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:454
[hn003:03191] [[40189,0],0] ACTIVATING JOB [40189,1] STATE PENDING APP LAUNCH PRI 4
[hn003:03191] [[40189,0],0] plm:base:launch_apps for job [40189,1]
[hn003:03191] [[40189,0],0] NODE REGEX: hn[3:3-4]
[hn003:03191] [[40189,0],0] VPID REGEX: 0-1
[hn003:03191] [[40189,0],0] SLOTS REGEX: 0-1[2]
[hn003:03191] [[40189,0],0] DECODE NODE REGEX: hn[3:3-4]
[hn003:03191] [[40189,0],0] DECODE VPID REGEX: 0-1
[hn003:03191] [[40189,0],0] DECODE SLOTS REGEX: 0-1[2]
[hn003:03191] mca:rmaps: mapping job [40189,1]
[hn003:03191] [[40189,0],0] Starting with 2 nodes in list
[hn003:03191] [[40189,0],0] Filtering thru apps
[hn003:03191] [[40189,0],0] Retained 2 nodes in list
[hn003:03191] [[40189,0],0] node hn003 has 2 slots available
[hn003:03191] [[40189,0],0] node hn004 has 2 slots available
[hn003:03191] AVAILABLE NODES FOR MAPPING:
[hn003:03191] node: hn003 daemon: 0
[hn003:03191] node: hn004 daemon: 1
[hn003:03191] mca:rmaps: setting mapping policies for job [40189,1] nprocs 4
[hn003:03191] mca:rmaps[169] mapping not set by user - using bynuma
[hn003:03191] mca:rmaps[309] binding not given - using bynuma
[hn003:03191] mca:rmaps:ppr: job [40189,1] not using ppr mapper PPR NULL policy PPR NOTSET
[hn003:03191] [[40189,0],0] rmaps:seq called on job [40189,1]
[hn003:03191] mca:rmaps:seq: job [40189,1] not using seq mapper
[hn003:03191] mca:rmaps:resilient: cannot perform initial map of job [40189,1] - no fault groups
[hn003:03191] mca:rmaps:mindist: job [40189,1] not using mindist mapper
[hn003:03191] mca:rmaps:rr: mapping job [40189,1]
[hn003:03191] [[40189,0],0] Starting with 2 nodes in list
[hn003:03191] [[40189,0],0] Filtering thru apps
[hn003:03191] [[40189,0],0] Retained 2 nodes in list
[hn003:03191] [[40189,0],0] node hn003 has 2 slots available
[hn003:03191] [[40189,0],0] node hn004 has 2 slots available
[hn003:03191] AVAILABLE NODES FOR MAPPING:
[hn003:03191] node: hn003 daemon: 0
[hn003:03191] node: hn004 daemon: 1
[hn003:03191] [[40189,0],0] Starting bookmark at node hn003
[hn003:03191] [[40189,0],0] Starting at node hn003
[hn003:03191] mca:rmaps:rr: mapping no-span by NUMANode for job [40189,1] slots 4 num_procs 4
[hn003:03191] mca:rmaps:rr: found 2 NUMANode objects on node hn003
[hn003:03191] mca:rmaps:rr: calculated nprocs 2
[hn003:03191] mca:rmaps:rr: assigning nprocs 2
[hn003:03191] mca:rmaps:rr: found 2 NUMANode objects on node hn004
[hn003:03191] mca:rmaps:rr: calculated nprocs 2
[hn003:03191] mca:rmaps:rr: assigning nprocs 2
[hn003:03191] mca:rmaps:base: computing vpids by slot for job [40189,1]
[hn003:03191] mca:rmaps:base: assigning rank 0 to node hn003
[hn003:03191] mca:rmaps:base: assigning rank 1 to node hn003
[hn003:03191] mca:rmaps:base: assigning rank 2 to node hn004
[hn003:03191] mca:rmaps:base: assigning rank 3 to node hn004
[hn003:03191] [[40189,0],0] rmaps:base:compute_usage
[hn003:03191] mca:rmaps: compute bindings for job [40189,1] with policy NUMA:IF-SUPPORTED[1003]
[hn003:03191] mca:rmaps: bindings for job [40189,1] - bind in place
[hn003:03191] mca:rmaps: bind in place for job [40189,1] with bindings NUMA:IF-SUPPORTED
[hn003:03191] BINDING PROC [[40189,1],0] TO NUMANode NUMBER 0
[hn003:03191] [[40189,0],0] BOUND PROC [[40189,1],0] TO 0-7[NUMANode:0] on node hn003
[hn003:03191] BINDING PROC [[40189,1],1] TO NUMANode NUMBER 1
[hn003:03191] [[40189,0],0] BOUND PROC [[40189,1],1] TO 8-15[NUMANode:1] on node hn003
[hn004:25080] [[40189,0],1] DECODE NODE REGEX: hn[3:3-4]
[hn004:25080] [[40189,0],1] DECODE VPID REGEX: 0-1
[hn004:25080] [[40189,0],1] DECODE SLOTS REGEX: 0-1[2]
[hn004:25080] [[40189,0],1] SET NODE hn003 SLOTS: 2
[hn004:25080] [[40189,0],1] SET NODE hn004 SLOTS: 2
[hn004:25080] mca:rmaps: mapping job [40189,1]
[hn004:25080] [[40189,0],1] using default hostfile /gpfs/home/arcurtis/opt/ompi-ucx/git/etc/openmpi-default-hostfile
[hn004:25080] [[40189,0],1] nothing in default hostfile - using known nodes
[hn004:25080] [[40189,0],1] Starting with 1 nodes in list
[hn004:25080] [[40189,0],1] Filtering thru apps
[hn004:25080] [[40189,0],1] Retained 1 nodes in list
[hn004:25080] [[40189,0],1] node hn004 has 2 slots available
[hn004:25080] AVAILABLE NODES FOR MAPPING:
[hn004:25080] node: hn004 daemon: 1
[hn004:25080] mca:rmaps: setting mapping policies for job [40189,1] nprocs 2
[hn004:25080] mca:rmaps[162] mapping not given - using bycore
[hn004:25080] mca:rmaps[302] binding not given - using bycore
[hn004:25080] mca:rmaps:ppr: job [40189,1] not using ppr mapper PPR NULL policy PPR NOTSET
[hn004:25080] [[40189,0],1] rmaps:seq called on job [40189,1]
[hn004:25080] mca:rmaps:seq: job [40189,1] not using seq mapper
[hn004:25080] mca:rmaps:resilient: cannot perform initial map of job [40189,1] - no fault groups
[hn004:25080] mca:rmaps:mindist: job [40189,1] not using mindist mapper
[hn004:25080] mca:rmaps:rr: mapping job [40189,1]
[hn004:25080] [[40189,0],1] using default hostfile /gpfs/home/arcurtis/opt/ompi-ucx/git/etc/openmpi-default-hostfile
[hn004:25080] [[40189,0],1] nothing in default hostfile - using known nodes
[hn004:25080] [[40189,0],1] Starting with 1 nodes in list
[hn004:25080] [[40189,0],1] Filtering thru apps
[hn004:25080] [[40189,0],1] Retained 1 nodes in list
[hn004:25080] [[40189,0],1] node hn004 has 2 slots available
[hn004:25080] AVAILABLE NODES FOR MAPPING:
[hn004:25080] node: hn004 daemon: 1
[hn004:25080] [[40189,0],1] Starting bookmark at node hn004
[hn004:25080] [[40189,0],1] Starting at node hn004
[hn004:25080] mca:rmaps:rr: mapping no-span by Core for job [40189,1] slots 2 num_procs 2
[hn004:25080] mca:rmaps:rr: found 16 Core objects on node hn004
[hn004:25080] mca:rmaps:rr: calculated nprocs 2
[hn004:25080] mca:rmaps:rr: assigning nprocs 2
[hn004:25080] mca:rmaps:base: computing vpids by slot for job [40189,1]
[hn004:25080] mca:rmaps:base: assigning rank 0 to node hn004
[hn004:25080] mca:rmaps:base: assigning rank 1 to node hn004
[hn004:25080] [[40189,0],1] rmaps:base:compute_usage
[hn004:25080] mca:rmaps: compute bindings for job [40189,1] with policy CORE:IF-SUPPORTED[1008]
[hn004:25080] mca:rmaps: bindings for job [40189,1] - bind in place
[hn004:25080] mca:rmaps: bind in place for job [40189,1] with bindings CORE:IF-SUPPORTED
[hn004:25080] BINDING PROC [[40189,1],0] TO Core NUMBER 0
[hn004:25080] [[40189,0],1] BOUND PROC [[40189,1],0] TO 0[Core:0] on node hn004
[hn004:25080] BINDING PROC [[40189,1],1] TO Core NUMBER 1
[hn004:25080] [[40189,0],1] BOUND PROC [[40189,1],1] TO 1[Core:1] on node hn004
[hn004:25080] [[40189,0],1] ACTIVATE JOB [40189,1] STATE LOCAL LAUNCH COMPLETE AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:1136
[hn003:03191] [[40189,0],0] ACTIVATE JOB [40189,1] STATE LOCAL LAUNCH COMPLETE AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:1136
[hn003:03191] [[40189,0],0] ACTIVATING JOB [40189,1] STATE LOCAL LAUNCH COMPLETE PRI 4
[hn004:25080] [[40189,0],1] ACTIVATING JOB [40189,1] STATE LOCAL LAUNCH COMPLETE PRI 4
[hn004:25080] [[40189,0],1] ACTIVATE PROC [[40189,1],0] STATE RUNNING AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:771
[hn004:25080] [[40189,0],1] ACTIVATING PROC [[40189,1],0] STATE RUNNING PRI 4
[hn003:03191] [[40189,0],0] ACTIVATE PROC [[40189,1],0] STATE RUNNING AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:771
[hn003:03191] [[40189,0],0] ACTIVATING PROC [[40189,1],0] STATE RUNNING PRI 4
[hn004:25080] [[40189,0],1] ACTIVATE PROC [[40189,1],1] STATE RUNNING AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:771
[hn004:25080] [[40189,0],1] ACTIVATING PROC [[40189,1],1] STATE RUNNING PRI 4
[hn004:25080] [[40189,0],1] state:orted:track_jobs sending local launch complete for job [40189,1]
[hn004:25080] [[40189,0],1] state:orted:track_procs called for proc [[40189,1],0] state RUNNING
[hn004:25080] [[40189,0],1] state:orted:track_procs called for proc [[40189,1],1] state RUNNING
[hn003:03191] [[40189,0],0] ACTIVATE PROC [[40189,1],1] STATE RUNNING AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:771
[hn003:03191] [[40189,0],0] ACTIVATING PROC [[40189,1],1] STATE RUNNING PRI 4
[hn003:03191] [[40189,0],0] plm:base:receive processing msg
[hn003:03191] [[40189,0],0] plm:base:receive update proc state command from [[40189,0],1]
[hn003:03191] [[40189,0],0] plm:base:receive got update_proc_state for job [40189,1]
[hn003:03191] [[40189,0],0] plm:base:receive got update_proc_state for vpid 0 state RUNNING exit_code 0
[hn003:03191] [[40189,0],0] ACTIVATE PROC [[40189,1],0] STATE RUNNING AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:351
[hn003:03191] [[40189,0],0] ACTIVATING PROC [[40189,1],0] STATE RUNNING PRI 4
[hn003:03191] [[40189,0],0] plm:base:receive got update_proc_state for vpid 1 state RUNNING exit_code 0
[hn003:03191] [[40189,0],0] ACTIVATE PROC [[40189,1],1] STATE RUNNING AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:351
[hn003:03191] [[40189,0],0] ACTIVATING PROC [[40189,1],1] STATE RUNNING PRI 4
[hn003:03191] [[40189,0],0] plm:base:receive done processing commands
[hn003:03191] [[40189,0],0] state:base:track_procs called for proc [[40189,1],0] state RUNNING
[hn003:03191] [[40189,0],0] state:base:track_procs called for proc [[40189,1],1] state RUNNING
[hn003:03191] [[40189,0],0] state:base:track_procs called for proc [[40189,1],0] state RUNNING
[hn003:03191] [[40189,0],0] state:base:track_procs called for proc [[40189,1],1] state RUNNING
[hn003:03191] [[40189,0],0] ACTIVATE JOB [40189,1] STATE RUNNING AT ../../../../openmpi-git/orte/mca/state/base/state_base_fns.c:618
[hn003:03191] [[40189,0],0] ACTIVATING JOB [40189,1] STATE RUNNING PRI 4
[hn003:03191] [[40189,0],0] plm:base:launch wiring up iof for job [40189,1]
[hn003:03191] [[40189,0],0] plm:base:launch job [40189,1] is not a dynamic spawn
[hn004:25080] [[40189,0],1] ACTIVATE PROC [[40189,1],0] STATE SYNC REGISTERED AT ../../openmpi-git/orte/orted/pmix/pmix_server_gen.c:82
[hn004:25080] [[40189,0],1] ACTIVATING PROC [[40189,1],0] STATE SYNC REGISTERED PRI 4
[hn004:25080] [[40189,0],1] state:orted:track_procs called for proc [[40189,1],0] state SYNC REGISTERED
[hn004:25080] [[40189,0],1] ACTIVATE PROC [[40189,1],1] STATE SYNC REGISTERED AT ../../openmpi-git/orte/orted/pmix/pmix_server_gen.c:82
[hn004:25080] [[40189,0],1] ACTIVATING PROC [[40189,1],1] STATE SYNC REGISTERED PRI 4
[hn004:25080] [[40189,0],1] state:orted:track_procs called for proc [[40189,1],1] state SYNC REGISTERED
[hn004:25080] [[40189,0],1] state:orted: notifying HNP all local registered
[hn003:03191] [[40189,0],0] plm:base:receive processing msg
[hn003:03191] [[40189,0],0] ACTIVATE PROC [[40189,1],0] STATE SYNC REGISTERED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:390
[hn003:03191] [[40189,0],0] ACTIVATING PROC [[40189,1],0] STATE SYNC REGISTERED PRI 4
[hn003:03191] [[40189,0],0] ACTIVATE PROC [[40189,1],1] STATE SYNC REGISTERED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:390
[hn003:03191] [[40189,0],0] ACTIVATING PROC [[40189,1],1] STATE SYNC REGISTERED PRI 4
[hn003:03191] [[40189,0],0] plm:base:receive done processing commands
[hn003:03191] [[40189,0],0] state:base:track_procs called for proc [[40189,1],0] state SYNC REGISTERED
[hn003:03191] [[40189,0],0] state:base:track_procs called for proc [[40189,1],1] state SYNC REGISTERED
[hn003:03191] [[40189,0],0] ACTIVATE PROC [[40189,1],0] STATE SYNC REGISTERED AT ../../openmpi-git/orte/orted/pmix/pmix_server_gen.c:82
[hn003:03191] [[40189,0],0] ACTIVATING PROC [[40189,1],0] STATE SYNC REGISTERED PRI 4
[hn003:03191] [[40189,0],0] state:base:track_procs called for proc [[40189,1],0] state SYNC REGISTERED
[hn003:03191] [[40189,0],0] ACTIVATE PROC [[40189,1],1] STATE SYNC REGISTERED AT ../../openmpi-git/orte/orted/pmix/pmix_server_gen.c:82
[hn003:03191] [[40189,0],0] ACTIVATING PROC [[40189,1],1] STATE SYNC REGISTERED PRI 4
[hn003:03191] [[40189,0],0] state:base:track_procs called for proc [[40189,1],1] state SYNC REGISTERED
[hn003:03191] [[40189,0],0] ACTIVATE JOB [40189,1] STATE SYNC REGISTERED AT ../../../../openmpi-git/orte/mca/state/base/state_base_fns.c:628
[hn003:03191] [[40189,0],0] ACTIVATING JOB [40189,1] STATE SYNC REGISTERED PRI 4
[hn003:03191] [[40189,0],0] plm:base:launch [40189,1] registered
[hn003:03191] [[40189,0],0] plm:base:launch job [40189,1] is not a dynamic spawn
[hn003:03191] [[40189,0],0] ACTIVATE JOB [40189,1] STATE READY FOR DEBUGGERS AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_launch_support.c:797
[hn003:03191] [[40189,0],0] ACTIVATING JOB [40189,1] STATE READY FOR DEBUGGERS PRI 4
Hello from 0 of 4 on "hn003"
Hello from 1 of 4 on "hn003"
Hello from 0 of 2 on "hn004"
Hello from 1 of 2 on "hn004"
[hn004:25080] [[40189,0],1] ACTIVATE PROC [[40189,1],0] STATE IOF COMPLETE AT ../../../../../openmpi-git/orte/mca/iof/orted/iof_orted_read.c:170
[hn004:25080] [[40189,0],1] ACTIVATING PROC [[40189,1],0] STATE IOF COMPLETE PRI 4
[hn004:25080] [[40189,0],1] ACTIVATE PROC [[40189,1],1] STATE IOF COMPLETE AT ../../../../../openmpi-git/orte/mca/iof/orted/iof_orted_read.c:170
[hn004:25080] [[40189,0],1] ACTIVATING PROC [[40189,1],1] STATE IOF COMPLETE PRI 4
[hn004:25080] [[40189,0],1] state:orted:track_procs called for proc [[40189,1],0] state IOF COMPLETE
[hn004:25080] [[40189,0],1] ACTIVATE PROC [[40189,1],0] STATE WAITPID FIRED AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:1404
[hn004:25080] [[40189,0],1] ACTIVATING PROC [[40189,1],0] STATE WAITPID FIRED PRI 4
[hn003:03191] [[40189,0],0] ACTIVATE PROC [[40189,1],0] STATE IOF COMPLETE AT ../../../../../openmpi-git/orte/mca/iof/hnp/iof_hnp_read.c:265
[hn003:03191] [[40189,0],0] ACTIVATING PROC [[40189,1],0] STATE IOF COMPLETE PRI 4
[hn004:25080] [[40189,0],1] ACTIVATE PROC [[40189,1],1] STATE WAITPID FIRED AT ../../../../openmpi-git/orte/mca/odls/base/odls_base_default_fns.c:1404
[hn004:25080] [[40189,0],1] ACTIVATING PROC [[40189,1],1] STATE WAITPID FIRED PRI 4
[hn004:25080] [[40189,0],1] state:orted:track_procs called for proc [[40189,1],1] state IOF COMPLETE
[hn003:03191] [[40189,0],0] state:base:track_procs called for proc [[40189,1],0] state IOF COMPLETE
[hn004:25080] [[40189,0],1] state:orted:track_procs called for proc [[40189,1],0] state WAITPID FIRED
[hn003:03191] [[40189,0],0] ACTIVATE PROC [[40189,1],1] STATE IOF COMPLETE AT ../../../../../openmpi-git/orte/mca/iof/hnp/iof_hnp_read.c:265
[hn003:03191] [[40189,0],0] ACTIVATING PROC [[40189,1],1] STATE IOF COMPLETE PRI 4
[hn003:03191] [[40189,0],0] state:base:track_procs called for proc [[40189,1],1] state IOF COMPLETE
[hn004:25080] [[40189,0],1] ACTIVATE PROC [[40189,1],0] STATE NORMALLY TERMINATED AT ../../../../../openmpi-git/orte/mca/state/orted/state_orted.c:355
[hn004:25080] [[40189,0],1] ACTIVATING PROC [[40189,1],0] STATE NORMALLY TERMINATED PRI 4
[hn004:25080] [[40189,0],1] state:orted:track_procs called for proc [[40189,1],1] state WAITPID FIRED
[hn004:25080] [[40189,0],1] ACTIVATE PROC [[40189,1],1] STATE NORMALLY TERMINATED AT ../../../../../openmpi-git/orte/mca/state/orted/state_orted.c:355
[hn004:25080] [[40189,0],1] ACTIVATING PROC [[40189,1],1] STATE NORMALLY TERMINATED PRI 4
[hn004:25080] [[40189,0],1] state:orted:track_procs called for proc [[40189,1],0] state NORMALLY TERMINATED
[hn004:25080] [[40189,0],1] state:orted:track_procs called for proc [[40189,1],1] state NORMALLY TERMINATED
[hn004:25080] [[40189,0],1] state:orted: SENDING JOB LOCAL TERMINATION UPDATE FOR JOB [40189,1]
[hn004:25080] [[40189,0],1] state:orted releasing procs from node hn004
[hn004:25080] [[40189,0],1] state:orted releasing proc [[40189,1],0] from node hn004
[hn004:25080] [[40189,0],1] state:orted releasing proc [[40189,1],1] from node hn004
[hn003:03191] [[40189,0],0] plm:base:receive processing msg
[hn003:03191] [[40189,0],0] plm:base:receive update proc state command from [[40189,0],1]
[hn003:03191] [[40189,0],0] plm:base:receive got update_proc_state for job [40189,1]
[hn003:03191] [[40189,0],0] plm:base:receive got update_proc_state for vpid 0 state NORMALLY TERMINATED exit_code 0
[hn003:03191] [[40189,0],0] ACTIVATE PROC [[40189,1],0] STATE NORMALLY TERMINATED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:351
[hn003:03191] [[40189,0],0] ACTIVATING PROC [[40189,1],0] STATE NORMALLY TERMINATED PRI 4
[hn003:03191] [[40189,0],0] plm:base:receive got update_proc_state for vpid 1 state NORMALLY TERMINATED exit_code 0
[hn003:03191] [[40189,0],0] ACTIVATE PROC [[40189,1],1] STATE NORMALLY TERMINATED AT ../../../../openmpi-git/orte/mca/plm/base/plm_base_receive.c:351
[hn003:03191] [[40189,0],0] ACTIVATING PROC [[40189,1],1] STATE NORMALLY TERMINATED PRI 4
[hn003:03191] [[40189,0],0] plm:base:receive done processing commands
[hn003:03191] [[40189,0],0] state:base:track_procs called for proc [[40189,1],0] state NORMALLY TERMINATED
[hn003:03191] [[40189,0],0] state:base:cleanup_node on proc [[40189,1],0]
[hn003:03191] [[40189,0],0] state:base:track_procs called for proc [[40189,1],1] state NORMALLY TERMINATED
[hn003:03191] [[40189,0],0] state:base:cleanup_node on proc [[40189,1],1]
Here is the problem:
[hn004:25080] [[40189,0],1] node hn004 has 2 slots available
[hn004:25080] AVAILABLE NODES FOR MAPPING:
[hn004:25080] node: hn004 daemon: 1
For some reason, the other node doesn't see the node hosting mpirun as being available. Let me see what might be going on.
Okay, there is a different code path for managed allocations, and it looks like we don't pass the flag indicating that the node where mpirun is executing should be included in the job. Sorry to be a pain, but could you please reset your repo to clear the last diff, and then apply this one?
diff --git a/orte/mca/rmaps/base/rmaps_base_support_fns.c b/orte/mca/rmaps/base/rmaps_base_support_fns.c
index 2b1a1ccdc3..353d75f29b 100644
--- a/orte/mca/rmaps/base/rmaps_base_support_fns.c
+++ b/orte/mca/rmaps/base/rmaps_base_support_fns.c
@@ -340,7 +340,7 @@ int orte_rmaps_base_get_target_nodes(opal_list_t *allocated_nodes, orte_std_cntr
goto complete;
}
- addknown:
+ addknown:
/* if the hnp was allocated, include it unless flagged not to */
if (orte_hnp_is_allocated && !(ORTE_GET_MAPPING_DIRECTIVE(policy) & ORTE_MAPPING_NO_USE_LOCAL)) {
if (NULL != (node = (orte_node_t*)opal_pointer_array_get_item(orte_node_pool, 0))) {
@@ -416,7 +416,7 @@ int orte_rmaps_base_get_target_nodes(opal_list_t *allocated_nodes, orte_std_cntr
ORTE_FLAG_UNSET(node, ORTE_NODE_FLAG_MAPPED);
}
if (NULL == nd || NULL == nd->daemon ||
- NULL == node->daemon ||
+ NULL == node->daemon ||
nd->daemon->name.vpid < node->daemon->name.vpid) {
/* just append to end */
opal_list_append(allocated_nodes, &node->super);
@@ -476,7 +476,7 @@ int orte_rmaps_base_get_target_nodes(opal_list_t *allocated_nodes, orte_std_cntr
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
(int)opal_list_get_size(allocated_nodes)));
- complete:
+ complete:
/* remove all nodes that are already at max usage, and
* compute the total number of allocated slots while
* we do so */
diff --git a/orte/util/nidmap.c b/orte/util/nidmap.c
index 6a77aa464e..f329461ced 100644
--- a/orte/util/nidmap.c
+++ b/orte/util/nidmap.c
@@ -186,6 +186,7 @@ int orte_util_encode_nodemap(opal_buffer_t *buffer)
char **regexargs = NULL, *tmp, *tmp2;
orte_node_t *nptr;
int rc;
+ uint8_t ui8;
/* setup the list of results */
OBJ_CONSTRUCT(&nodenms, opal_list_t);
@@ -481,6 +482,7 @@ int orte_util_encode_nodemap(opal_buffer_t *buffer)
OBJ_DESTRUCT(&nodenms);
/* pack the string */
+ opal_output(0, "%s NODE REGEX: %s", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), tmp);
if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, &tmp, 1, OPAL_STRING))) {
ORTE_ERROR_LOG(rc);
OPAL_LIST_DESTRUCT(&dvpids);
@@ -517,6 +519,7 @@ int orte_util_encode_nodemap(opal_buffer_t *buffer)
OPAL_LIST_DESTRUCT(&dvpids);
/* pack the string */
+ opal_output(0, "%s VPID REGEX: %s", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME) tmp);
if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, &tmp, 1, OPAL_STRING))) {
ORTE_ERROR_LOG(rc);
OPAL_LIST_DESTRUCT(&slots);
@@ -552,6 +555,7 @@ int orte_util_encode_nodemap(opal_buffer_t *buffer)
OPAL_LIST_DESTRUCT(&slots);
/* pack the string */
+ opal_output(0, "%s SLOTS REGEX: %s", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME) tmp);
if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, &tmp, 1, OPAL_STRING))) {
ORTE_ERROR_LOG(rc);
return rc;
@@ -594,6 +598,17 @@ int orte_util_encode_nodemap(opal_buffer_t *buffer)
free(tmp);
}
+ /* pack a flag indicating if the HNP was included in the allocation */
+ if (orte_hnp_is_allocated) {
+ ui8 = 1;
+ } else {
+ ui8 = 0;
+ }
+ if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, &ui8, 1, OPAL_UINT8))) {
+ ORTE_ERROR_LOG(rc);
+ return rc;
+ }
+
/* handle the topologies - as the most common case by far
* is to have homogeneous topologies, we only send them
* if something is different */
@@ -684,6 +699,7 @@ int orte_util_decode_daemon_nodemap(opal_buffer_t *buffer)
opal_buffer_t *bptr=NULL;
orte_topology_t *t;
orte_regex_range_t *rng, *drng, *srng, *frng;
+ uint8_t ui8;
/* unpack the node regex */
n = 1;
@@ -695,6 +711,7 @@ int orte_util_decode_daemon_nodemap(opal_buffer_t *buffer)
if (NULL == ndnames) {
return ORTE_SUCCESS;
}
+ opal_output(0, "%s DECODE NODE REGEX: %s", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME) ndnames);
OBJ_CONSTRUCT(&dids, opal_list_t);
OBJ_CONSTRUCT(&slts, opal_list_t);
@@ -712,6 +729,7 @@ int orte_util_decode_daemon_nodemap(opal_buffer_t *buffer)
rc = ORTE_ERR_BAD_PARAM;
goto cleanup;
}
+ opal_output(0, "%s DECODE VPID REGEX: %s", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME) dvpids);
/* unpack the slots regex */
n = 1;
@@ -725,6 +743,7 @@ int orte_util_decode_daemon_nodemap(opal_buffer_t *buffer)
rc = ORTE_ERR_BAD_PARAM;
goto cleanup;
}
+ opal_output(0, "%s DECODE SLOTS REGEX: %s", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME) slots);
/* unpack the flags regex */
n = 1;
@@ -739,6 +758,18 @@ int orte_util_decode_daemon_nodemap(opal_buffer_t *buffer)
goto cleanup;
}
+ /* unpack the flag indicating if the HNP was allocated */
+ n = 1;
+ if (ORTE_SUCCESS != (rc = opal_dss.unpack(buffer, &ui8, &n, OPAL_UINT8))) {
+ ORTE_ERROR_LOG(rc);
+ goto cleanup;
+ }
+ if (0 == ui8) {
+ orte_hnp_is_allocated = false;
+ } else {
+ orte_hnp_is_allocated = true;
+ }
+
/* unpack the topos regex - this may not have been
* provided (e.g., for a homogeneous machine) */
n = 1;
@@ -883,6 +914,7 @@ int orte_util_decode_daemon_nodemap(opal_buffer_t *buffer)
}
/* set the number of slots */
node->slots = srng->slots;
+ opal_output(0, "%s SET NODE %s SLOTS: %d", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME) node->name, node->slots);
if (srng->endpt == nn) {
srng = (orte_regex_range_t*)opal_list_get_next(&srng->super);
}
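(To see why the new flag matters, here is a self-contained sketch - plain C, not ORTE code - of the node-list decision the patch repairs: the mpirun (HNP) node is included only when the daemon believes the HNP is part of the allocation and the "no use local" mapping directive is not set. Before the fix, a remote orted never received orte_hnp_is_allocated and effectively treated it as false, reproducing the hn004-only node list seen above.)
#include <stdbool.h>
#include <stdio.h>
/* model of the guard in orte_rmaps_base_get_target_nodes() */
static void build_node_list(bool hnp_is_allocated, bool no_use_local)
{
    if (hnp_is_allocated && !no_use_local) {
        printf("node: hn003 daemon: 0\n");   /* mpirun's node included */
    }
    printf("node: hn004 daemon: 1\n");       /* the daemon's own node */
}
int main(void)
{
    build_node_list(false, false);  /* remote daemon before the fix: hn004 only */
    build_node_list(true, false);   /* after the fix: both nodes listed */
    return 0;
}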
Do I need to be at a particular git checkout? The rmaps_base_support_fns.c patch is being rejected (on a clean git-clone source tree).
You can just ignore that part - it's the other change, in nidmap.c, that is important.
I actually think I found one more required change, so you might want to hang on a minute.
Someone else hit a similar problem that turned out to have the same root cause, so I have set up a PR to fix it. Once that gets committed, I think you can just update and (hopefully) this time things will actually be fixed.
OK, will wait for that.
Okay, please try it when next you get a chance - the commit is in master.
+1
thanks
A user is reporting a problem with both master and v2.1.0rc3 on their cluster system. The failure signature is a hang in MPI_Finalize when running any MPI program, including hello_c.c from the examples. The user was kind enough to do a git bisect and narrowed the failure down to this commit: https://github.com/open-mpi/ompi/commit/48fc33971870dd73aa8db1ed51466df813a641eb
The user noticed that on systems running SLURM he is unable to reproduce the problem.