open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

ORTE: UNABLE TO SEND MESSAGE: No OOB path to target #12359

Closed: johebll closed this issue 6 months ago

johebll commented 7 months ago

Hello

I have a problem with a very generic and essential mpirun invocation of Open MPI 4.1.5 over Ethernet, which appears to be a routing problem. The underlying network, however, has no access or routing problems outside of the mpirun application.
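For reference, the kinds of checks that do succeed outside of mpirun (a sketch; as noted further below, name resolution, ping, and ssh between the nodes all work):

# Name resolution, ICMP, and SSH between the nodes all succeed:
getent hosts cn21 cn22
ping -c1 cn21 && ping -c1 cn22
ssh cn21 hostname   # returns "cn21" without issue
ssh cn22 hostname   # returns "cn22" without issue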

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

Open MPI provided by OHPC:

Package: Open MPI abuild@ip-172-31-13-34 Distribution
Open MPI: 4.1.5
Open MPI repo revision: v4.1.5
Open MPI release date: Feb 23, 2023
Open RTE: 4.1.5
Open RTE repo revision: v4.1.5
Open RTE release date: Feb 23, 2023
OPAL: 4.1.5
OPAL repo revision: v4.1.5
OPAL release date: Feb 23, 2023
MPI API: 3.1.0
Ident string: 4.1.5
Prefix: /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5
Configured architecture: x86_64-pc-linux-gnu
Configure host: ip-172-31-13-34
Configured by: abuild
Configured on: Thu Aug 3 14:25:40 UTC 2023
Configure host: ip-172-31-13-34
Configure command line: '--prefix=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5' '--disable-static' '--enable-builtin-atomics' '--with-sge' '--enable-mpi-cxx' '--with-hwloc=/opt/ohpc/pub/libs/hwloc' '--with-pmix=/opt/ohpc/admin/pmix' '--with-libevent=external' '--with-libfabric=/opt/ohpc/pub/mpi/libfabric/1.18.0' '--with-ucx=/opt/ohpc/pub/mpi/ucx-ohpc/1.14.0' '--without-verbs' '--with-tm=/opt/pbs/'
Built by: abuild

How Open MPI was installed

Installed via "openmpi4-pmix-gnu12-ohpc.x86_64" provided by OHPC 3.0, on Rocky 9.x.

System environment:


Details of the problem

The core problem, as sampled from the screen log:

ORTE does not know how to route a message to the specified daemon located on the indicated node:
[hpccn21:58388] [[31437,0],1] UNABLE TO SEND MESSAGE TO [[31437,0],0] TAG 63: No OOB path to target 
[mgmt01:355979] [[31261,0],0] ACTIVATE PROC [[31261,0],1] STATE NO PATH TO TARGET AT base/rml_base_frame.c:234

Use case 1: without specifying the network to use:

# No difference between ucx/ob1:
mpirun --allow-run-as-root --host cn21,cn22 -np 2 --mca pml ucx --mca routed direct hostname
mpirun --allow-run-as-root --host cn21,cn22 -np 2 --mca pml ob1 --mca routed direct hostname
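The verbose trace below was captured with MCA verbosity turned up; something along these lines would reproduce it (the exact verbose levels used here are an assumption):

# Hypothetical reproduction with OOB/RML/PLM verbosity enabled:
mpirun --allow-run-as-root --host cn21,cn22 -np 2 --mca pml ob1 \
    --mca oob_base_verbose 100 --mca rml_base_verbose 100 \
    --mca plm_base_verbose 100 hostname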

[cn21:58388] mca:base:select: Auto-selecting odls components
[cn21:58388] mca:base:select:( odls) Querying component [default]
[cn21:58388] mca:base:select:( odls) Query of component [default] set priority to 10
[cn21:58388] mca:base:select:( odls) Querying component [pspawn]
[cn21:58388] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[cn21:58388] mca:base:select:( odls) Selected component [default]
[cn21:58388] mca: base: close: component pspawn closed
[cn21:58388] mca: base: close: unloading component pspawn
[cn21:58388] mca_base_component_repository_open: examining dynamic rtc MCA component "hwloc" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_rtc_hwloc
[cn22:52757] mca_base_component_repository_open: opened dynamic filem MCA component "raw"
[cn22:52757] [[31437,0],2] rml_recv_buffer_nb for peer [[WILDCARD],WILDCARD] tag 21
[cn22:52757] [[31437,0],2] rml_recv_buffer_nb for peer [[WILDCARD],WILDCARD] tag 1
[cn22:52757] [[31437,0],2] rml_recv_buffer_nb for peer [[WILDCARD],WILDCARD] tag 10
[cn22:52757] [[31437,0],2] rml:base:send_buffer_nb() to peer [[31437,0],0] through conduit 0
[cn22:52757] [[31437,0],2] OOB_SEND: rml_oob_send.c:265
[cn22:52757] [[31437,0],2] ext3x:client get on proc [[31437,0],2] key (null)
[cn21:58388] mca_base_component_repository_open: opened dynamic rtc MCA component "hwloc"
[cn22:52757] [[31437,0],2] rml:base:send_buffer_nb() to peer [[31437,0],2] through conduit 0
[cn22:52757] [[31437,0],2] posting recv
[cn22:52757] [[31437,0],2] posting persistent recv on tag 27 for peer [[WILDCARD],WILDCARD]
[cn22:52757] [[31437,0],2] posting recv
[cn22:52757] [[31437,0],2] posting persistent recv on tag 50 for peer [[WILDCARD],WILDCARD]
[cn22:52757] [[31437,0],2] posting recv
[cn22:52757] [[31437,0],2] posting persistent recv on tag 51 for peer [[WILDCARD],WILDCARD]
[cn22:52757] [[31437,0],2] posting recv
[cn22:52757] [[31437,0],2] posting persistent recv on tag 6 for peer [[WILDCARD],WILDCARD]
[cn22:52757] [[31437,0],2] posting recv
[cn22:52757] [[31437,0],2] posting persistent recv on tag 28 for peer [[WILDCARD],WILDCARD]
[cn22:52757] [[31437,0],2] posting recv
[cn22:52757] [[31437,0],2] posting persistent recv on tag 59 for peer [[WILDCARD],WILDCARD]
[cn22:52757] [[31437,0],2] posting recv
[cn22:52757] [[31437,0],2] posting persistent recv on tag 15 for peer [[WILDCARD],WILDCARD]
[cn22:52757] [[31437,0],2] posting recv
[cn22:52757] [[31437,0],2] posting persistent recv on tag 33 for peer [[WILDCARD],WILDCARD]
[cn22:52757] [[31437,0],2] posting recv
[cn22:52757] [[31437,0],2] posting persistent recv on tag 31 for peer [[WILDCARD],WILDCARD]
[cn22:52757] [[31437,0],2] posting recv
[cn22:52757] [[31437,0],2] posting persistent recv on tag 5 for peer [[WILDCARD],WILDCARD]
[cn22:52757] [[31437,0],2] posting recv
[cn22:52757] [[31437,0],2] posting persistent recv on tag 3 for peer [[WILDCARD],WILDCARD]
[cn22:52757] [[31437,0],2] posting recv
[cn22:52757] [[31437,0],2] posting persistent recv on tag 21 for peer [[WILDCARD],WILDCARD]
[cn22:52757] [[31437,0],2] posting recv
[cn22:52757] [[31437,0],2] posting persistent recv on tag 1 for peer [[WILDCARD],WILDCARD]
[cn22:52757] [[31437,0],2] posting recv
[cn22:52757] [[31437,0],2] posting persistent recv on tag 10 for peer [[WILDCARD],WILDCARD]
[cn22:52757] [[31437,0],2] oob:base:send to target [[31437,0],0] - attempt 0
[cn22:52757] [[31437,0],2] oob:base:send unknown peer [[31437,0],0]
[cn22:52757] [[31437,0],2] ext3x:client get on proc [[31437,0],0] key opal.puri
[cn22:52757] [[31437,0],2] oob:tcp:send_nb to peer [[31437,0],0]:63 seq = -1
[cn22:52757] [[31437,0],2]:[oob_tcp.c:188] processing send to peer [[31437,0],0]:63 seq_num = -1 hop [[31437,0],0] unknown
[cn22:52757] [[31437,0],2]:[oob_tcp.c:191] post no route to [[31437,0],0]
[cn22:52757] [[31437,0],2] rml:base:send_buffer_nb() to peer [[31437,0],0] through conduit 0
[cn22:52757] [[31437,0],2] OOB_SEND: rml_oob_send.c:265
[cn22:52757] [[31437,0],2] tcp:no route called for peer [[31437,0],0]
[cn22:52757] [[31437,0],2] OOB_SEND: oob_tcp_component.c:1123
[cn22:52757] [[31437,0],2] oob:base:send to target [[31437,0],0] - attempt 0
[cn22:52757] [[31437,0],2] oob:base:send unknown peer [[31437,0],0]
[cn22:52757] [[31437,0],2] ext3x:client get on proc [[31437,0],0] key opal.puri
[cn22:52757] [[31437,0],2] oob:tcp:send_nb to peer [[31437,0],0]:10 seq = -1
[cn22:52757] [[31437,0],2]:[oob_tcp.c:188] processing send to peer [[31437,0],0]:10 seq_num = -1 hop [[31437,0],0] unknown
[cn22:52757] [[31437,0],2]:[oob_tcp.c:191] post no route to [[31437,0],0]
[cn22:52757] [[31437,0],2] oob:base:send to target [[31437,0],0] - attempt 1
[cn22:52757] [[31437,0],2] oob:base:send known transport for peer [[31437,0],0]
[cn22:52757] [[31437,0],2] oob:tcp:send_nb to peer [[31437,0],0]:63 seq = -1
[cn22:52757] [[31437,0],2]:[oob_tcp.c:188] processing send to peer [[31437,0],0]:63 seq_num = -1 hop [[31437,0],0] unknown
[cn22:52757] [[31437,0],2]:[oob_tcp.c:191] post no route to [[31437,0],0]
[cn22:52757] [[31437,0],2] tcp:no route called for peer [[31437,0],0]
[cn22:52757] [[31437,0],2] OOB_SEND: oob_tcp_component.c:1123
[cn22:52757] [[31437,0],2] tcp:no route called for peer [[31437,0],0]
[cn22:52757] [[31437,0],2] OOB_SEND: oob_tcp_component.c:1123
[cn22:52757] [[31437,0],2] oob:base:send to target [[31437,0],0] - attempt 1
[cn22:52757] [[31437,0],2] oob:base:send unknown peer [[31437,0],0]
[cn22:52757] [[31437,0],2] ext3x:client get on proc [[31437,0],0] key opal.puri
[cn22:52757] [[31437,0],2] oob:tcp:send_nb to peer [[31437,0],0]:10 seq = -1
[cn22:52757] [[31437,0],2]:[oob_tcp.c:188] processing send to peer [[31437,0],0]:10 seq_num = -1 hop [[31437,0],0] unknown
[cn22:52757] [[31437,0],2]:[oob_tcp.c:191] post no route to [[31437,0],0]
[cn22:52757] [[31437,0],2] oob:base:send to target [[31437,0],0] - attempt 2
[cn22:52757] [[31437,0],2] oob:base:send known transport for peer [[31437,0],0]
[cn22:52757] [[31437,0],2] oob:tcp:send_nb to peer [[31437,0],0]:63 seq = -1
[cn22:52757] [[31437,0],2]:[oob_tcp.c:188] processing send to peer [[31437,0],0]:63 seq_num = -1 hop [[31437,0],0] unknown
[cn22:52757] [[31437,0],2]:[oob_tcp.c:191] post no route to [[31437,0],0]
[cn22:52757] [[31437,0],2] tcp:no route called for peer [[31437,0],0]
[cn22:52757] [[31437,0],2] OOB_SEND: oob_tcp_component.c:1123
[cn22:52757] [[31437,0],2] tcp:no route called for peer [[31437,0],0]
[cn22:52757] [[31437,0],2] OOB_SEND: oob_tcp_component.c:1123
[cn22:52757] [[31437,0],2] oob:base:send to target [[31437,0],0] - attempt 2
[cn22:52757] [[31437,0],2] oob:base:send unknown peer [[31437,0],0]
[cn22:52757] [[31437,0],2] ext3x:client get on proc [[31437,0],0] key opal.puri
[cn22:52757] [[31437,0],2] oob:tcp:send_nb to peer [[31437,0],0]:10 seq = -1
[cn22:52757] [[31437,0],2]:[oob_tcp.c:188] processing send to peer [[31437,0],0]:10 seq_num = -1 hop [[31437,0],0] unknown
[cn22:52757] [[31437,0],2]:[oob_tcp.c:191] post no route to [[31437,0],0]
[cn22:52757] [[31437,0],2] oob:base:send to target [[31437,0],0] - attempt 3
[cn22:52757] [[31437,0],2]-[[31437,0],0] Send message complete at base/oob_base_stubs.c:61
[cn22:52757] [[31437,0],2] UNABLE TO SEND MESSAGE TO [[31437,0],0] TAG 63: No OOB path to target
[cn22:52757] [[31437,0],2] ACTIVATE PROC [[31437,0],0] STATE NO PATH TO TARGET AT base/rml_base_frame.c:234
[cn22:52757] [[31437,0],2] Finalizing PMIX server
[cn22:52757] [[31437,0],2] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 50
[cn22:52757] [[31437,0],2] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 51
[cn22:52757] [[31437,0],2] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 6
[cn22:52757] [[31437,0],2] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 28
[cn22:52757] [[31437,0],2] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 59
[cn22:52757] psquash: flex128 finalize
[cn21:58388] mca_base_component_repository_open: examining dynamic rmaps MCA component "seq" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_rmaps_seq
[cn21:58388] mca_base_component_repository_open: opened dynamic rmaps MCA component "seq"
[cn21:58388] mca_base_component_repository_open: examining dynamic rmaps MCA component "resilient" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_rmaps_resilient
[cn21:58388] mca_base_component_repository_open: opened dynamic rmaps MCA component "resilient"
[cn21:58388] mca_base_component_repository_open: examining dynamic rmaps MCA component "rank_file" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_rmaps_rank_file
[cn22:52757] mca: base: close: component ext3x closed
[cn22:52757] mca: base: close: unloading component ext3x
[cn21:58388] mca_base_component_repository_open: opened dynamic rmaps MCA component "rank_file"
[cn21:58388] mca_base_component_repository_open: examining dynamic rmaps MCA component "ppr" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_rmaps_ppr
[cn22:52757] [[31437,0],2] rml:base:close_conduit(0)
[cn22:52757] [[31437,0],2] rml:base:close_conduit(1)
[cn21:58388] mca_base_component_repository_open: opened dynamic rmaps MCA component "ppr"
[cn21:58388] mca_base_component_repository_open: examining dynamic rmaps MCA component "mindist" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_rmaps_mindist
[cn21:58388] mca_base_component_repository_open: opened dynamic rmaps MCA component "mindist"
[cn21:58388] mca_base_component_repository_open: examining dynamic rmaps MCA component "round_robin" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_rmaps_round_robin
[cn22:52757] [[31437,0],2] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 3
[cn21:58388] mca_base_component_repository_open: opened dynamic rmaps MCA component "round_robin"
[cn21:58388] mca_base_component_repository_open: examining dynamic regx MCA component "fwd" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_regx_fwd
[cn21:58388] mca_base_component_repository_open: opened dynamic regx MCA component "fwd"
[cn21:58388] mca_base_component_repository_open: examining dynamic regx MCA component "naive" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_regx_naive
[cn22:52757] [[31437,0],2] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 5
[cn22:52757] mca: base: close: component rsh closed
[cn22:52757] mca: base: close: unloading component rsh
[cn21:58388] mca_base_component_repository_open: opened dynamic regx MCA component "naive"
[cn21:58388] mca_base_component_repository_open: examining dynamic regx MCA component "reverse" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_regx_reverse
[cn21:58388] mca_base_component_repository_open: opened dynamic regx MCA component "reverse"
[cn21:58388] [[31437,0],1] rml_recv_buffer_nb for peer [[WILDCARD],WILDCARD] tag 5
[cn21:58388] mca_base_component_repository_open: examining dynamic iof MCA component "tool" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_iof_tool
[cn22:52757] mca: base: close: component default closed
[cn22:52757] mca: base: close: unloading component default
[cn22:52757] mca: base: close: unloading component direct
[cn22:52757] mca: base: close: component oob closed
[cn22:52757] mca: base: close: unloading component oob
[cn22:52757] [[31437,0],2] TCP SHUTDOWN
[cn22:52757] no hnp or not active
[cn22:52757] [[31437,0],2] TCP SHUTDOWN done
[cn22:52757] mca: base: close: component tcp closed
[cn22:52757] mca: base: close: unloading component tcp
[cn22:52757] mca: base: close: component orted closed
[cn22:52757] mca: base: close: unloading component orted
[cn21:58388] mca_base_component_repository_open: opened dynamic iof MCA component "tool"
[cn21:58388] mca_base_component_repository_open: examining dynamic iof MCA component "orted" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_iof_orted
[cn21:58388] mca_base_component_repository_open: opened dynamic iof MCA component "orted"
[cn21:58388] mca_base_component_repository_open: examining dynamic iof MCA component "hnp" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_iof_hnp
[cn21:58388] mca_base_component_repository_open: opened dynamic iof MCA component "hnp"
[cn21:58388] [[31437,0],1] rml_recv_buffer_nb for peer [[WILDCARD],WILDCARD] tag 3
[cn21:58388] mca_base_component_repository_open: examining dynamic filem MCA component "raw" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_filem_raw
[cn21:58388] mca_base_component_repository_open: opened dynamic filem MCA component "raw"
[cn21:58388] [[31437,0],1] rml_recv_buffer_nb for peer [[WILDCARD],WILDCARD] tag 21
[cn21:58388] [[31437,0],1] rml_recv_buffer_nb for peer [[WILDCARD],WILDCARD] tag 1
[cn21:58388] [[31437,0],1] rml_recv_buffer_nb for peer [[WILDCARD],WILDCARD] tag 10
[cn21:58388] [[31437,0],1] rml:base:send_buffer_nb() to peer [[31437,0],0] through conduit 0
[cn21:58388] [[31437,0],1] OOB_SEND: rml_oob_send.c:265
[cn21:58388] [[31437,0],1] ext3x:client get on proc [[31437,0],1] key (null)
[cn22:52757] mca: base: close: component weighted closed
[cn22:52757] mca: base: close: unloading component weighted
[cn22:52757] mca: base: close: unloading component linux_ipv6
[cn22:52757] mca: base: close: unloading component posix_ipv4
[cn22:52757] mca: base: close: component dlopen closed
[cn22:52757] mca: base: close: unloading component dlopen
[mgmt01:355931] [[31437,0],0] ACTIVATE PROC [[31437,0],2] STATE FAILED TO START AT plm_rsh_module.c:318
[mgmt01:355931] [[31437,0],0] rml:base:send_buffer_nb() to peer [[31437,0],0] through conduit 1
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
[mgmt01:355931] [[31437,0],0] rml:base:send_buffer_nb() to peer [[31437,0],1] through conduit 1
[mgmt01:355931] [[31437,0],0] OOB_SEND: rml_oob_send.c:265
[mgmt01:355931] [[31437,0],0] Message posted at grpcomm_direct.c:627 for tag 1
[mgmt01:355931] [[31437,0],0] oob:base:send to target [[31437,0],1] - attempt 0
[mgmt01:355931] [[31437,0],0] oob:base:send unknown peer [[31437,0],1]
[mgmt01:355931] [[31437,0],0] ext3x:client get on proc [[31437,0],1] key opal.puri
[mgmt01:355931] [[31437,0],0] oob:tcp:send_nb to peer [[31437,0],1]:15 seq = -1
[mgmt01:355931] [[31437,0],0]:[oob_tcp.c:188] processing send to peer [[31437,0],1]:15 seq_num = -1 hop [[31437,0],1] unknown
[mgmt01:355931] [[31437,0],0]:[oob_tcp.c:191] post no route to [[31437,0],1]
[mgmt01:355931] [[31437,0],0] tcp:no route called for peer [[31437,0],1]
[mgmt01:355931] [[31437,0],0] OOB_SEND: oob_tcp_component.c:1123
[mgmt01:355931] [[31437,0],0] oob:base:send to target [[31437,0],1] - attempt 1
[mgmt01:355931] [[31437,0],0] oob:base:send unknown peer [[31437,0],1]
[mgmt01:355931] [[31437,0],0] ext3x:client get on proc [[31437,0],1] key opal.puri
[mgmt01:355931] [[31437,0],0] oob:tcp:send_nb to peer [[31437,0],1]:15 seq = -1
[mgmt01:355931] [[31437,0],0]:[oob_tcp.c:188] processing send to peer [[31437,0],1]:15 seq_num = -1 hop [[31437,0],1] unknown
[mgmt01:355931] [[31437,0],0]:[oob_tcp.c:191] post no route to [[31437,0],1]
[mgmt01:355931] [[31437,0],0] tcp:no route called for peer [[31437,0],1]
[mgmt01:355931] [[31437,0],0] OOB_SEND: oob_tcp_component.c:1123
[mgmt01:355931] [[31437,0],0] oob:base:send to target [[31437,0],1] - attempt 2
[mgmt01:355931] [[31437,0],0] oob:base:send unknown peer [[31437,0],1]
[mgmt01:355931] [[31437,0],0] ext3x:client get on proc [[31437,0],1] key opal.puri
[mgmt01:355931] [[31437,0],0] oob:tcp:send_nb to peer [[31437,0],1]:15 seq = -1
[mgmt01:355931] [[31437,0],0]:[oob_tcp.c:188] processing send to peer [[31437,0],1]:15 seq_num = -1 hop [[31437,0],1] unknown
[mgmt01:355931] [[31437,0],0]:[oob_tcp.c:191] post no route to [[31437,0],1]
[mgmt01:355931] [[31437,0],0] tcp:no route called for peer [[31437,0],1]
[mgmt01:355931] [[31437,0],0] OOB_SEND: oob_tcp_component.c:1123
[mgmt01:355931] [[31437,0],0] oob:base:send to target [[31437,0],1] - attempt 3
[mgmt01:355931] [[31437,0],0]-[[31437,0],1] Send message complete at base/oob_base_stubs.c:61
[mgmt01:355931] [[31437,0],0] UNABLE TO SEND MESSAGE TO [[31437,0],1] TAG 15: No OOB path to target
[cn21:58388] [[31437,0],1] rml:base:send_buffer_nb() to peer [[31437,0],1] through conduit 0
[cn21:58388] [[31437,0],1] posting recv
[cn21:58388] [[31437,0],1] posting persistent recv on tag 27 for peer [[WILDCARD],WILDCARD]
[cn21:58388] [[31437,0],1] posting recv
[cn21:58388] [[31437,0],1] posting persistent recv on tag 50 for peer [[WILDCARD],WILDCARD]
[cn21:58388] [[31437,0],1] posting recv
[cn21:58388] [[31437,0],1] posting persistent recv on tag 51 for peer [[WILDCARD],WILDCARD]
[cn21:58388] [[31437,0],1] posting recv
[cn21:58388] [[31437,0],1] posting persistent recv on tag 6 for peer [[WILDCARD],WILDCARD]
[cn21:58388] [[31437,0],1] posting recv
[cn21:58388] [[31437,0],1] posting persistent recv on tag 28 for peer [[WILDCARD],WILDCARD]
[cn21:58388] [[31437,0],1] posting recv
[cn21:58388] [[31437,0],1] posting persistent recv on tag 59 for peer [[WILDCARD],WILDCARD]
[cn21:58388] [[31437,0],1] posting recv
[cn21:58388] [[31437,0],1] posting persistent recv on tag 15 for peer [[WILDCARD],WILDCARD]
[cn21:58388] [[31437,0],1] posting recv
[mgmt01:355931] [[31437,0],0] ACTIVATE PROC [[31437,0],1] STATE NO PATH TO TARGET AT base/rml_base_frame.c:234
[cn21:58388] [[31437,0],1] posting persistent recv on tag 33 for peer [[WILDCARD],WILDCARD]
[cn21:58388] [[31437,0],1] posting recv
[cn21:58388] [[31437,0],1] posting persistent recv on tag 31 for peer [[WILDCARD],WILDCARD]
[cn21:58388] [[31437,0],1] posting recv
[cn21:58388] [[31437,0],1] posting persistent recv on tag 5 for peer [[WILDCARD],WILDCARD]
[cn21:58388] [[31437,0],1] posting recv
[cn21:58388] [[31437,0],1] posting persistent recv on tag 3 for peer [[WILDCARD],WILDCARD]
[cn21:58388] [[31437,0],1] posting recv
[cn21:58388] [[31437,0],1] posting persistent recv on tag 21 for peer [[WILDCARD],WILDCARD]
[cn21:58388] [[31437,0],1] posting recv
[cn21:58388] [[31437,0],1] posting persistent recv on tag 1 for peer [[WILDCARD],WILDCARD]
[cn21:58388] [[31437,0],1] posting recv
[cn21:58388] [[31437,0],1] posting persistent recv on tag 10 for peer [[WILDCARD],WILDCARD]
[cn21:58388] [[31437,0],1] oob:base:send to target [[31437,0],0] - attempt 0
[cn21:58388] [[31437,0],1] oob:base:send unknown peer [[31437,0],0]
[cn21:58388] [[31437,0],1] ext3x:client get on proc [[31437,0],0] key opal.puri
[cn21:58388] [[31437,0],1] oob:tcp:send_nb to peer [[31437,0],0]:63 seq = -1
[cn21:58388] [[31437,0],1]:[oob_tcp.c:188] processing send to peer [[31437,0],0]:63 seq_num = -1 hop [[31437,0],0] unknown
[cn21:58388] [[31437,0],1]:[oob_tcp.c:191] post no route to [[31437,0],0]
[cn21:58388] [[31437,0],1] rml:base:send_buffer_nb() to peer [[31437,0],0] through conduit 0
[cn21:58388] [[31437,0],1] OOB_SEND: rml_oob_send.c:265
[cn21:58388] [[31437,0],1] tcp:no route called for peer [[31437,0],0]
[cn21:58388] [[31437,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn21:58388] [[31437,0],1] oob:base:send to target [[31437,0],0] - attempt 0
[cn21:58388] [[31437,0],1] oob:base:send unknown peer [[31437,0],0]
[cn21:58388] [[31437,0],1] ext3x:client get on proc [[31437,0],0] key opal.puri
[cn21:58388] [[31437,0],1] oob:tcp:send_nb to peer [[31437,0],0]:10 seq = -1
[cn21:58388] [[31437,0],1]:[oob_tcp.c:188] processing send to peer [[31437,0],0]:10 seq_num = -1 hop [[31437,0],0] unknown
[cn21:58388] [[31437,0],1]:[oob_tcp.c:191] post no route to [[31437,0],0]
[cn21:58388] [[31437,0],1] oob:base:send to target [[31437,0],0] - attempt 1
[cn21:58388] [[31437,0],1] oob:base:send known transport for peer [[31437,0],0]
[cn21:58388] [[31437,0],1] oob:tcp:send_nb to peer [[31437,0],0]:63 seq = -1
[cn21:58388] [[31437,0],1]:[oob_tcp.c:188] processing send to peer [[31437,0],0]:63 seq_num = -1 hop [[31437,0],0] unknown
[cn21:58388] [[31437,0],1]:[oob_tcp.c:191] post no route to [[31437,0],0]
[cn21:58388] [[31437,0],1] tcp:no route called for peer [[31437,0],0]
[cn21:58388] [[31437,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn21:58388] [[31437,0],1] tcp:no route called for peer [[31437,0],0]
[cn21:58388] [[31437,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn21:58388] [[31437,0],1] oob:base:send to target [[31437,0],0] - attempt 1
[cn21:58388] [[31437,0],1] oob:base:send unknown peer [[31437,0],0]
[cn21:58388] [[31437,0],1] ext3x:client get on proc [[31437,0],0] key opal.puri
[cn21:58388] [[31437,0],1] oob:tcp:send_nb to peer [[31437,0],0]:10 seq = -1
[cn21:58388] [[31437,0],1]:[oob_tcp.c:188] processing send to peer [[31437,0],0]:10 seq_num = -1 hop [[31437,0],0] unknown
[cn21:58388] [[31437,0],1]:[oob_tcp.c:191] post no route to [[31437,0],0]
[cn21:58388] [[31437,0],1] oob:base:send to target [[31437,0],0] - attempt 2
[cn21:58388] [[31437,0],1] oob:base:send known transport for peer [[31437,0],0]
[cn21:58388] [[31437,0],1] oob:tcp:send_nb to peer [[31437,0],0]:63 seq = -1
[cn21:58388] [[31437,0],1]:[oob_tcp.c:188] processing send to peer [[31437,0],0]:63 seq_num = -1 hop [[31437,0],0] unknown
[cn21:58388] [[31437,0],1]:[oob_tcp.c:191] post no route to [[31437,0],0]
[cn21:58388] [[31437,0],1] tcp:no route called for peer [[31437,0],0]
[cn21:58388] [[31437,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn21:58388] [[31437,0],1] tcp:no route called for peer [[31437,0],0]
[cn21:58388] [[31437,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn21:58388] [[31437,0],1] oob:base:send to target [[31437,0],0] - attempt 2
[cn21:58388] [[31437,0],1] oob:base:send unknown peer [[31437,0],0]
[cn21:58388] [[31437,0],1] ext3x:client get on proc [[31437,0],0] key opal.puri
[cn21:58388] [[31437,0],1] oob:tcp:send_nb to peer [[31437,0],0]:10 seq = -1
[cn21:58388] [[31437,0],1]:[oob_tcp.c:188] processing send to peer [[31437,0],0]:10 seq_num = -1 hop [[31437,0],0] unknown
[cn21:58388] [[31437,0],1]:[oob_tcp.c:191] post no route to [[31437,0],0]
[cn21:58388] [[31437,0],1] oob:base:send to target [[31437,0],0] - attempt 3
[cn21:58388] [[31437,0],1]-[[31437,0],0] Send message complete at base/oob_base_stubs.c:61
[cn21:58388] [[31437,0],1] UNABLE TO SEND MESSAGE TO [[31437,0],0] TAG 63: No OOB path to target
[cn21:58388] [[31437,0],1] ACTIVATE PROC [[31437,0],0] STATE NO PATH TO TARGET AT base/rml_base_frame.c:234
[cn21:58388] [[31437,0],1] Finalizing PMIX server
[cn21:58388] [[31437,0],1] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 50
[cn21:58388] [[31437,0],1] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 51
[cn21:58388] [[31437,0],1] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 6
[cn21:58388] [[31437,0],1] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 28
[cn21:58388] [[31437,0],1] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 59
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   mgmt01
  target node:  cn21

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[mgmt01:355931] [[31437,0],0] ACTIVATE JOB NULL STATE DAEMONS TERMINATED AT errmgr_default_hnp.c:756
[cn21:58388] psquash: flex128 finalize
[cn21:58388] mca: base: close: component ext3x closed
[cn21:58388] mca: base: close: unloading component ext3x
[cn21:58388] [[31437,0],1] rml:base:close_conduit(0)
[cn21:58388] [[31437,0],1] rml:base:close_conduit(1)
[cn21:58388] [[31437,0],1] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 3
[cn21:58388] [[31437,0],1] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 5
[cn21:58388] mca: base: close: component rsh closed
[cn21:58388] mca: base: close: unloading component rsh

Use case 2: specifying the network to use:

mpirun --allow-run-as-root --host cn21,cn22 -np 2 --mca pml ucx --mca routed direct --mca oob_tcp_if_include "10.10.90.0/16" /usr/bin/hostname

[cn22:52924] [[31261,0],2] oob:base:send to target [[31261,0],0] - attempt 2
[cn22:52924] [[31261,0],2] oob:base:send unknown peer [[31261,0],0]
[cn22:52924] [[31261,0],2] ext3x:client get on proc [[31261,0],0] key opal.puri
[cn22:52924] [[31261,0],2] oob:tcp:send_nb to peer [[31261,0],0]:10 seq = -1
[cn22:52924] [[31261,0],2]:[oob_tcp.c:188] processing send to peer [[31261,0],0]:10 seq_num = -1 hop [[31261,0],0] unknown
[cn22:52924] [[31261,0],2]:[oob_tcp.c:191] post no route to [[31261,0],0]
[cn22:52924] [[31261,0],2] oob:base:send to target [[31261,0],0] - attempt 3
[cn22:52924] [[31261,0],2]-[[31261,0],0] Send message complete at base/oob_base_stubs.c:61
[cn22:52924] [[31261,0],2] UNABLE TO SEND MESSAGE TO [[31261,0],0] TAG 63: No OOB path to target
[cn22:52924] [[31261,0],2] ACTIVATE PROC [[31261,0],0] STATE NO PATH TO TARGET AT base/rml_base_frame.c:234
[cn22:52924] [[31261,0],2] Finalizing PMIX server
[cn22:52924] [[31261,0],2] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 50
[cn22:52924] [[31261,0],2] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 51
[cn22:52924] [[31261,0],2] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 6
[cn22:52924] [[31261,0],2] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 28
[cn22:52924] [[31261,0],2] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 59
[cn22:52924] psquash: flex128 finalize
[cn21:58564] mca_base_component_repository_open: opened dynamic rtc MCA component "hwloc"
[cn22:52924] mca: base: close: component ext3x closed
[cn22:52924] mca: base: close: unloading component ext3x
[cn22:52924] [[31261,0],2] rml:base:close_conduit(0)
[cn22:52924] [[31261,0],2] rml:base:close_conduit(1)
[cn22:52924] [[31261,0],2] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 3
[cn22:52924] [[31261,0],2] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 5
[cn22:52924] mca: base: close: component rsh closed
[cn22:52924] mca: base: close: unloading component rsh
[cn22:52924] mca: base: close: component default closed
[cn22:52924] mca: base: close: unloading component default
[cn22:52924] mca: base: close: unloading component direct
[cn21:58564] mca_base_component_repository_open: examining dynamic rmaps MCA component "seq" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_rmaps_seq
[cn22:52924] mca: base: close: component oob closed
[cn22:52924] mca: base: close: unloading component oob
[cn22:52924] [[31261,0],2] TCP SHUTDOWN
[cn22:52924] no hnp or not active
[cn22:52924] [[31261,0],2] TCP SHUTDOWN done
[cn22:52924] mca: base: close: component tcp closed
[cn22:52924] mca: base: close: unloading component tcp
[cn22:52924] mca: base: close: component orted closed
[cn22:52924] mca: base: close: unloading component orted
[cn21:58564] mca_base_component_repository_open: opened dynamic rmaps MCA component "seq"
[cn21:58564] mca_base_component_repository_open: examining dynamic rmaps MCA component "resilient" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_rmaps_resilient
[cn21:58564] mca_base_component_repository_open: opened dynamic rmaps MCA component "resilient"
[cn21:58564] mca_base_component_repository_open: examining dynamic rmaps MCA component "rank_file" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_rmaps_rank_file
[cn21:58564] mca_base_component_repository_open: opened dynamic rmaps MCA component "rank_file"
[cn21:58564] mca_base_component_repository_open: examining dynamic rmaps MCA component "ppr" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_rmaps_ppr
[cn21:58564] mca_base_component_repository_open: opened dynamic rmaps MCA component "ppr"
[cn21:58564] mca_base_component_repository_open: examining dynamic rmaps MCA component "mindist" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_rmaps_mindist
[cn22:52924] mca: base: close: component weighted closed
[cn22:52924] mca: base: close: unloading component weighted
[cn22:52924] mca: base: close: unloading component linux_ipv6
[cn22:52924] mca: base: close: unloading component posix_ipv4
[cn21:58564] mca_base_component_repository_open: opened dynamic rmaps MCA component "mindist"
[cn21:58564] mca_base_component_repository_open: examining dynamic rmaps MCA component "round_robin" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_rmaps_round_robin
[cn21:58564] mca_base_component_repository_open: opened dynamic rmaps MCA component "round_robin"
[cn21:58564] mca_base_component_repository_open: examining dynamic regx MCA component "fwd" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_regx_fwd
[cn21:58564] mca_base_component_repository_open: opened dynamic regx MCA component "fwd"
[cn21:58564] mca_base_component_repository_open: examining dynamic regx MCA component "naive" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_regx_naive
[cn21:58564] mca_base_component_repository_open: opened dynamic regx MCA component "naive"
[cn21:58564] mca_base_component_repository_open: examining dynamic regx MCA component "reverse" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_regx_reverse
[cn22:52924] mca: base: close: component dlopen closed
[cn22:52924] mca: base: close: unloading component dlopen
[cn21:58564] mca_base_component_repository_open: opened dynamic regx MCA component "reverse"
[cn21:58564] [[31261,0],1] rml_recv_buffer_nb for peer [[WILDCARD],WILDCARD] tag 5
[cn21:58564] mca_base_component_repository_open: examining dynamic iof MCA component "tool" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_iof_tool
[cn21:58564] mca_base_component_repository_open: opened dynamic iof MCA component "tool"
[cn21:58564] mca_base_component_repository_open: examining dynamic iof MCA component "orted" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_iof_orted
[cn21:58564] mca_base_component_repository_open: opened dynamic iof MCA component "orted"
[cn21:58564] mca_base_component_repository_open: examining dynamic iof MCA component "hnp" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_iof_hnp
[cn21:58564] mca_base_component_repository_open: opened dynamic iof MCA component "hnp"
[cn21:58564] [[31261,0],1] rml_recv_buffer_nb for peer [[WILDCARD],WILDCARD] tag 3
[cn21:58564] mca_base_component_repository_open: examining dynamic filem MCA component "raw" at path /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/openmpi/mca_filem_raw
[mgmt01:355979] [[31261,0],0] ACTIVATE PROC [[31261,0],2] STATE FAILED TO START AT plm_rsh_module.c:318
[cn21:58564] mca_base_component_repository_open: opened dynamic filem MCA component "raw"
[cn21:58564] [[31261,0],1] rml_recv_buffer_nb for peer [[WILDCARD],WILDCARD] tag 21
[cn21:58564] [[31261,0],1] rml_recv_buffer_nb for peer [[WILDCARD],WILDCARD] tag 1
[cn21:58564] [[31261,0],1] rml_recv_buffer_nb for peer [[WILDCARD],WILDCARD] tag 10
[cn21:58564] [[31261,0],1] rml:base:send_buffer_nb() to peer [[31261,0],0] through conduit 0
[mgmt01:355979] [[31261,0],0] rml:base:send_buffer_nb() to peer [[31261,0],0] through conduit 1
[cn21:58564] [[31261,0],1] OOB_SEND: rml_oob_send.c:265
[cn21:58564] [[31261,0],1] ext3x:client get on proc [[31261,0],1] key (null)
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
[mgmt01:355979] [[31261,0],0] rml:base:send_buffer_nb() to peer [[31261,0],1] through conduit 1
[mgmt01:355979] [[31261,0],0] OOB_SEND: rml_oob_send.c:265
[mgmt01:355979] [[31261,0],0] Message posted at grpcomm_direct.c:627 for tag 1
[mgmt01:355979] [[31261,0],0] oob:base:send to target [[31261,0],1] - attempt 0
[mgmt01:355979] [[31261,0],0] oob:base:send unknown peer [[31261,0],1]
[mgmt01:355979] [[31261,0],0] ext3x:client get on proc [[31261,0],1] key opal.puri
[mgmt01:355979] [[31261,0],0] oob:tcp:send_nb to peer [[31261,0],1]:15 seq = -1
[mgmt01:355979] [[31261,0],0]:[oob_tcp.c:188] processing send to peer [[31261,0],1]:15 seq_num = -1 hop [[31261,0],1] unknown
[mgmt01:355979] [[31261,0],0]:[oob_tcp.c:191] post no route to [[31261,0],1]
[mgmt01:355979] [[31261,0],0] tcp:no route called for peer [[31261,0],1]
[mgmt01:355979] [[31261,0],0] OOB_SEND: oob_tcp_component.c:1123
[mgmt01:355979] [[31261,0],0] oob:base:send to target [[31261,0],1] - attempt 1
[mgmt01:355979] [[31261,0],0] oob:base:send unknown peer [[31261,0],1]
[mgmt01:355979] [[31261,0],0] ext3x:client get on proc [[31261,0],1] key opal.puri
[mgmt01:355979] [[31261,0],0] oob:tcp:send_nb to peer [[31261,0],1]:15 seq = -1
[mgmt01:355979] [[31261,0],0]:[oob_tcp.c:188] processing send to peer [[31261,0],1]:15 seq_num = -1 hop [[31261,0],1] unknown
[mgmt01:355979] [[31261,0],0]:[oob_tcp.c:191] post no route to [[31261,0],1]
[mgmt01:355979] [[31261,0],0] tcp:no route called for peer [[31261,0],1]
[mgmt01:355979] [[31261,0],0] OOB_SEND: oob_tcp_component.c:1123
[mgmt01:355979] [[31261,0],0] oob:base:send to target [[31261,0],1] - attempt 2
[mgmt01:355979] [[31261,0],0] oob:base:send unknown peer [[31261,0],1]
[mgmt01:355979] [[31261,0],0] ext3x:client get on proc [[31261,0],1] key opal.puri
[mgmt01:355979] [[31261,0],0] oob:tcp:send_nb to peer [[31261,0],1]:15 seq = -1
[mgmt01:355979] [[31261,0],0]:[oob_tcp.c:188] processing send to peer [[31261,0],1]:15 seq_num = -1 hop [[31261,0],1] unknown
[mgmt01:355979] [[31261,0],0]:[oob_tcp.c:191] post no route to [[31261,0],1]
[mgmt01:355979] [[31261,0],0] tcp:no route called for peer [[31261,0],1]
[mgmt01:355979] [[31261,0],0] OOB_SEND: oob_tcp_component.c:1123
[mgmt01:355979] [[31261,0],0] oob:base:send to target [[31261,0],1] - attempt 3
[mgmt01:355979] [[31261,0],0]-[[31261,0],1] Send message complete at base/oob_base_stubs.c:61
[mgmt01:355979] [[31261,0],0] UNABLE TO SEND MESSAGE TO [[31261,0],1] TAG 15: No OOB path to target
[mgmt01:355979] [[31261,0],0] ACTIVATE PROC [[31261,0],1] STATE NO PATH TO TARGET AT base/rml_base_frame.c:234
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   mgmt01
  target node:  cn21

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[mgmt01:355979] [[31261,0],0] ACTIVATE JOB NULL STATE DAEMONS TERMINATED AT errmgr_default_hnp.c:756
[mgmt01:355979] [[31261,0],0] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 36
[mgmt01:355979] [[31261,0],0] Finalizing PMIX server
[mgmt01:355979] [[31261,0],0] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 50
[mgmt01:355979] [[31261,0],0] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 51
[mgmt01:355979] [[31261,0],0] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 6
[mgmt01:355979] [[31261,0],0] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 28
[mgmt01:355979] [[31261,0],0] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 59
[mgmt01:355979] psquash: flex128 finalize
[mgmt01:355979] mca: base: close: component ext3x closed
[mgmt01:355979] mca: base: close: unloading component ext3x
[mgmt01:355979] [[31261,0],0] rml:base:close_conduit(0)
[mgmt01:355979] [[31261,0],0] rml:base:close_conduit(1)
[mgmt01:355979] mca: base: close: component default closed
[mgmt01:355979] mca: base: close: unloading component default
[mgmt01:355979] mca: base: close: unloading component direct
[mgmt01:355979] [[31261,0],0] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 5
[mgmt01:355979] [[31261,0],0] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 10
[mgmt01:355979] [[31261,0],0] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 12
[mgmt01:355979] [[31261,0],0] rml_recv_cancel for peer [[WILDCARD],WILDCARD] tag 62

Because network routing and name resolution work on the CLI, I suspect the problem is at the application level.

Any help would be great, since I am totally stuck at this point.

rhc54 commented 7 months ago

Remove --mca routed direct and see if it works.

johebll commented 7 months ago

Hello rhc54,

thank you very much for chiming in!

I actually did, but it made no difference in the outcome. It did change how we get there, though, as follows:

a) "--mca routed" unspecifies, defaults to radix:

### a)  /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun -N 1 -n 2 -host cn22 /usr/bin/hostname -x UCX_NET_DEVICES=enp7s0 -x UCX_TLS=tcp,sm,self --mca pml ucx > result
[cn21:63007] mca:base:select: Auto-selecting odls components
[cn21:63007] mca:base:select:( odls) Querying component [default]
[cn21:63007] mca:base:select:( odls) Query of component [default] set priority to 10
[cn21:63007] mca:base:select:( odls) Querying component [pspawn]
[cn21:63007] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[cn21:63007] mca:base:select:( odls) Selected component [default]
[cn21:63007] mca: base: close: component pspawn closed
[cn21:63007] mca: base: close: unloading component pspawn
[cn21:63007] [[65266,0],0] Monitoring debugger attach fifo /tmp/ompi.cn21.2001/pid.63007/0/debugger_attach_fifo
[cn21:63007] [[65266,0],0] ACTIVATE JOB [INVALID] STATE PENDING INIT AT plm_rsh_module.c:974
[cn21:63007] [[65266,0],0] ACTIVATE JOB [65266,1] STATE INIT_COMPLETE AT base/plm_base_launch_support.c:376
[cn21:63007] [[65266,0],0] ACTIVATE JOB [65266,1] STATE PENDING ALLOCATION AT base/plm_base_launch_support.c:389
[cn21:63007] [[65266,0],0] ACTIVATE JOB [65266,1] STATE ALLOCATION COMPLETE AT base/ras_base_allocate.c:473
[cn21:63007] [[65266,0],0] ACTIVATE JOB [65266,1] STATE PENDING DAEMON LAUNCH AT base/plm_base_launch_support.c:204
[cn21:63007] [[65266,0],0]: parent -1 num_children 1
[cn21:63007] [[65266,0],0]:     child 1
[cn21:63007] [[65266,0],0]: parent 0 num_children 1
[cn21:63007] [[65266,0],0]:     child 1
[cn21:63007] [[65266,0],0] plm:rsh: final template argv:
    /usr/bin/ssh <template>           PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:${LD_LIBRARY_PATH:-} ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:${DYLD_LIBRARY_PATH:-} ; export DYLD_LIBRARY_PATH ;   /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/orted -mca ess "env" -mca ess_base_jobid "4277272576" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca orte_node_regex "hpccn[2:21-22]@0(2)" -mca orte_hnp_uri "4277272576.0;tcp://10.10.4.121,10.10.60.121,10.10.90.121,10.10.91.121,10.10.80.121:46969" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "4277272576.0;tcp://10.10.4.121,10.10.60.121,10.10.90.121,10.10.91.121,10.10.80.121:46969" -mca pml "ucx" -mca rmaps_ppr_n_pernode "1" -mca pmix "^s1,s2,cray,isolated"
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified

b) "--mca routed direct" specified:

### b)  /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun -N 1 -n 2 -host cn22 /usr/bin/hostname -x UCX_NET_DEVICES=enp7s0 -x UCX_TLS=tcp,sm,self --mca routed direct --mca pml ucx > result
[cn21:63018] mca:base:select: Auto-selecting odls components
[cn21:63018] mca:base:select:( odls) Querying component [default]
[cn21:63018] mca:base:select:( odls) Query of component [default] set priority to 10
[cn21:63018] mca:base:select:( odls) Querying component [pspawn]
[cn21:63018] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[cn21:63018] mca:base:select:( odls) Selected component [default]
[cn21:63018] mca: base: close: component pspawn closed
[cn21:63018] mca: base: close: unloading component pspawn
[cn21:63018] [[65223,0],0] Monitoring debugger attach fifo /tmp/ompi.cn21.2001/pid.63018/0/debugger_attach_fifo
[cn21:63018] [[65223,0],0] ACTIVATE JOB [INVALID] STATE PENDING INIT AT plm_rsh_module.c:974
[cn21:63018] [[65223,0],0] ACTIVATE JOB [65223,1] STATE INIT_COMPLETE AT base/plm_base_launch_support.c:376
[cn21:63018] [[65223,0],0] ACTIVATE JOB [65223,1] STATE PENDING ALLOCATION AT base/plm_base_launch_support.c:389
[cn21:63018] [[65223,0],0] ACTIVATE JOB [65223,1] STATE ALLOCATION COMPLETE AT base/ras_base_allocate.c:473
[cn21:63018] [[65223,0],0] ACTIVATE JOB [65223,1] STATE PENDING DAEMON LAUNCH AT base/plm_base_launch_support.c:204
[cn21:63018] [[65223,0],0] plm:rsh: final template argv:
    /usr/bin/ssh <template>           PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:${LD_LIBRARY_PATH:-} ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:${DYLD_LIBRARY_PATH:-} ; export DYLD_LIBRARY_PATH ;   /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/orted -mca ess "env" -mca ess_base_jobid "4274454528" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca orte_node_regex "hpccn[2:21-22]@0(2)" -mca orte_hnp_uri "4274454528.0;tcp://10.10.4.121,10.10.60.121,10.10.90.121,10.10.91.121,10.10.80.121:33987" -mca plm "rsh" --tree-spawn -mca routed "direct" -mca orte_parent_uri "4274454528.0;tcp://10.10.4.121,10.10.60.121,10.10.90.121,10.10.91.121,10.10.80.121:33987" -mca pml "ucx" -mca rmaps_ppr_n_pernode "1" -mca pmix "^s1,s2,cray,isolated"
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified

From there, both a) and b) continue identically, except for the first line:

a) [cn22:57335] orte_routed_base_select: Initializing routed component direct
b) [cn22:57290] orte_routed_base_select: Initializing routed component radix

a) & b)
[cn22:57290] [[65266,0],1]: Final routed priorities
[cn22:57290]    Component: radix Priority: 70
[cn22:57290] mca: base: components_register: registering framework oob components
[cn22:57290] mca: base: components_register: found loaded component tcp
[cn22:57290] mca: base: components_register: component tcp register function successful
[cn22:57290] mca: base: components_open: opening oob components
[cn22:57290] mca: base: components_open: found loaded component tcp
[cn22:57290] mca: base: components_open: component tcp open function successful
[cn22:57290] mca:oob:select: checking available component tcp
[cn22:57290] mca:oob:select: Querying component [tcp]
[cn22:57290] oob:tcp: component_available called
[cn22:57290] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[cn22:57290] [[65266,0],1] oob:tcp:init rejecting loopback interface lo
[cn22:57290] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[cn22:57290] [[65266,0],1] oob:tcp:init adding 10.10.4.122 to our list of V4 connections
[cn22:57290] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[cn22:57290] [[65266,0],1] oob:tcp:init adding 10.10.60.122 to our list of V4 connections
[cn22:57290] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
[cn22:57290] [[65266,0],1] oob:tcp:init adding 10.10.90.122 to our list of V4 connections
[cn22:57290] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
[cn22:57290] [[65266,0],1] oob:tcp:init adding 10.10.91.122 to our list of V4 connections
[cn22:57290] WORKING INTERFACE 6 KERNEL INDEX 9 FAMILY: V4
[cn22:57290] [[65266,0],1] oob:tcp:init adding 10.10.80.122 to our list of V4 connections
[cn22:57290] [[65266,0],1] TCP STARTUP
[cn22:57290] [[65266,0],1] attempting to bind to IPv4 port 0
[cn22:57290] [[65266,0],1] assigned IPv4 port 38569
[cn22:57290] mca:oob:select: Adding component to end
[cn22:57290] mca:oob:select: Found 1 active transports
[cn22:57290] [[65266,0],1]: get transports
[cn22:57290] [[65266,0],1]:get transports for component tcp
[cn22:57290] mca: base: components_register: registering framework odls components
[cn22:57290] mca: base: components_register: found loaded component default
[cn22:57290] mca: base: components_register: component default register function successful
[cn22:57290] mca: base: components_register: found loaded component pspawn
[cn22:57290] mca: base: components_register: component pspawn has no register or open function
[cn22:57290] mca: base: components_open: opening odls components
[cn22:57290] mca: base: components_open: found loaded component default
[cn22:57290] mca: base: components_open: component default open function successful
[cn22:57290] mca: base: components_open: found loaded component pspawn
[cn22:57290] mca: base: components_open: component pspawn open function successful
[cn22:57290] mca:base:select: Auto-selecting odls components
[cn22:57290] mca:base:select:( odls) Querying component [default]
[cn22:57290] mca:base:select:( odls) Query of component [default] set priority to 10
[cn22:57290] mca:base:select:( odls) Querying component [pspawn]
[cn22:57290] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[cn22:57290] mca:base:select:( odls) Selected component [default]
[cn22:57290] mca: base: close: component pspawn closed
[cn22:57290] mca: base: close: unloading component pspawn
[cn22:57290] [[65266,0],1]: parent 0 num_children 0
[cn22:57290] [[65266,0],1] OOB_SEND: rml_oob_send.c:265
[cn22:57290] [[65266,0],1] ext3x:client get on proc [[65266,0],1] key (null)
[cn22:57290] [[65266,0],1] oob:base:send to target [[65266,0],0] - attempt 0
[cn22:57290] [[65266,0],1] oob:base:send unknown peer [[65266,0],0]
[cn22:57290] [[65266,0],1] ext3x:client get on proc [[65266,0],0] key opal.puri
[cn22:57290] [[65266,0],1] oob:tcp:send_nb to peer [[65266,0],0]:63 seq = -1
[cn22:57290] [[65266,0],1]:[oob_tcp.c:188] processing send to peer [[65266,0],0]:63 seq_num = -1 hop [[65266,0],0] unknown
[cn22:57290] [[65266,0],1]:[oob_tcp.c:191] post no route to [[65266,0],0]
[cn22:57290] [[65266,0],1] OOB_SEND: rml_oob_send.c:265
[cn22:57290] [[65266,0],1] tcp:no route called for peer [[65266,0],0]
[cn22:57290] [[65266,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:57290] [[65266,0],1] oob:base:send to target [[65266,0],0] - attempt 0
[cn22:57290] [[65266,0],1] oob:base:send unknown peer [[65266,0],0]
[cn22:57290] [[65266,0],1] ext3x:client get on proc [[65266,0],0] key opal.puri
[cn22:57290] [[65266,0],1] oob:tcp:send_nb to peer [[65266,0],0]:10 seq = -1
[cn22:57290] [[65266,0],1]:[oob_tcp.c:188] processing send to peer [[65266,0],0]:10 seq_num = -1 hop [[65266,0],0] unknown
[cn22:57290] [[65266,0],1]:[oob_tcp.c:191] post no route to [[65266,0],0]
[cn22:57290] [[65266,0],1] oob:base:send to target [[65266,0],0] - attempt 1
[cn22:57290] [[65266,0],1] oob:base:send known transport for peer [[65266,0],0]
[cn22:57290] [[65266,0],1] oob:tcp:send_nb to peer [[65266,0],0]:63 seq = -1
[cn22:57290] [[65266,0],1]:[oob_tcp.c:188] processing send to peer [[65266,0],0]:63 seq_num = -1 hop [[65266,0],0] unknown
[cn22:57290] [[65266,0],1]:[oob_tcp.c:191] post no route to [[65266,0],0]
[cn22:57290] [[65266,0],1] tcp:no route called for peer [[65266,0],0]
[cn22:57290] [[65266,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:57290] [[65266,0],1] tcp:no route called for peer [[65266,0],0]
[cn22:57290] [[65266,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:57290] [[65266,0],1] oob:base:send to target [[65266,0],0] - attempt 1
[cn22:57290] [[65266,0],1] oob:base:send unknown peer [[65266,0],0]
[cn22:57290] [[65266,0],1] ext3x:client get on proc [[65266,0],0] key opal.puri
[cn22:57290] [[65266,0],1] oob:tcp:send_nb to peer [[65266,0],0]:10 seq = -1
[cn22:57290] [[65266,0],1]:[oob_tcp.c:188] processing send to peer [[65266,0],0]:10 seq_num = -1 hop [[65266,0],0] unknown
[cn22:57290] [[65266,0],1]:[oob_tcp.c:191] post no route to [[65266,0],0]
[cn22:57290] [[65266,0],1] oob:base:send to target [[65266,0],0] - attempt 2
[cn22:57290] [[65266,0],1] oob:base:send known transport for peer [[65266,0],0]
[cn22:57290] [[65266,0],1] oob:tcp:send_nb to peer [[65266,0],0]:63 seq = -1
[cn22:57290] [[65266,0],1]:[oob_tcp.c:188] processing send to peer [[65266,0],0]:63 seq_num = -1 hop [[65266,0],0] unknown
[cn22:57290] [[65266,0],1]:[oob_tcp.c:191] post no route to [[65266,0],0]
[cn22:57290] [[65266,0],1] tcp:no route called for peer [[65266,0],0]
[cn22:57290] [[65266,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:57290] [[65266,0],1] tcp:no route called for peer [[65266,0],0]
[cn22:57290] [[65266,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:57290] [[65266,0],1] oob:base:send to target [[65266,0],0] - attempt 2
[cn22:57290] [[65266,0],1] oob:base:send unknown peer [[65266,0],0]
[cn22:57290] [[65266,0],1] ext3x:client get on proc [[65266,0],0] key opal.puri
[cn22:57290] [[65266,0],1] oob:tcp:send_nb to peer [[65266,0],0]:10 seq = -1
[cn22:57290] [[65266,0],1]:[oob_tcp.c:188] processing send to peer [[65266,0],0]:10 seq_num = -1 hop [[65266,0],0] unknown
[cn22:57290] [[65266,0],1]:[oob_tcp.c:191] post no route to [[65266,0],0]
[cn22:57290] [[65266,0],1] oob:base:send to target [[65266,0],0] - attempt 3
[cn22:57290] [[65266,0],1] ACTIVATE PROC [[65266,0],0] STATE NO PATH TO TARGET AT base/rml_base_frame.c:234
[cn22:57290] psquash: flex128 finalize
[cn22:57290] mca: base: close: component ext3x closed
[cn22:57290] mca: base: close: unloading component ext3x
[cn22:57290] mca: base: close: component rsh closed
[cn22:57290] mca: base: close: unloading component rsh
[cn22:57290] mca: base: close: component default closed
[cn22:57290] mca: base: close: unloading component default
[cn22:57290] mca: base: close: unloading component radix
[cn22:57290] [[65266,0],1] TCP SHUTDOWN
[cn22:57290] no hnp or not active
[cn22:57290] [[65266,0],1] TCP SHUTDOWN done
[cn22:57290] mca: base: close: component tcp closed
[cn22:57290] mca: base: close: unloading component tcp
[cn22:57290] mca: base: close: component orted closed
[cn22:57290] mca: base: close: unloading component orted
[cn22:57290] mca: base: close: component weighted closed
[cn22:57290] mca: base: close: unloading component weighted
[cn22:57290] mca: base: close: unloading component linux_ipv6
[cn22:57290] mca: base: close: unloading component posix_ipv4
[cn22:57290] mca: base: close: component dlopen closed
[cn22:57290] mca: base: close: unloading component dlopen
[cn21:63007] [[65266,0],0] ACTIVATE PROC [[65266,0],1] STATE FAILED TO START AT plm_rsh_module.c:318
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
[ ... ]
--------------------------------------------------------------------------
[cn21:63007] [[65266,0],0] ACTIVATE JOB NULL STATE DAEMONS TERMINATED AT orted/orted_comm.c:420
[cn21:63007] psquash: flex128 finalize
[cn21:63007] mca: base: close: component ext3x closed
[cn21:63007] mca: base: close: unloading component ext3x
[cn21:63007] mca: base: close: component default closed
[cn21:63007] mca: base: close: unloading component default
[cn21:63007] mca: base: close: unloading component radix
[cn21:63007] mca: base: close: unloading component direct
[cn21:63007] mca: base: close: unloading component binomial
[cn21:63007] mca: base: close: component rsh closed
[cn21:63007] mca: base: close: unloading component rsh
[cn21:63007] mca: base: close: component hnp closed
[cn21:63007] mca: base: close: unloading component hnp
[cn21:63007] [[65266,0],0] TCP SHUTDOWN
[cn21:63007] [[65266,0],0] TCP SHUTDOWN done
[cn21:63007] mca: base: close: component tcp closed
[cn21:63007] mca: base: close: unloading component tcp
[cn21:63007] mca: base: close: component weighted closed
[cn21:63007] mca: base: close: unloading component weighted
[cn21:63007] mca: base: close: unloading component linux_ipv6
[cn21:63007] mca: base: close: unloading component posix_ipv4
[cn21:63007] mca: base: close: component dlopen closed
[cn21:63007] mca: base: close: unloading component dlopen

I also tried to enforce which network to use via additional MCA parameters on top, but without any improvement:

/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun -N 1 -n 2 -host cn22 /usr/bin/hostname --mca oob_tcp_if_include "10.10.90.0/16" --mca routed "direct" --mca pml "ucx" -x UCX_NET_DEVICES=enp7s0 -x UCX_TLS=tcp,sm,self > result

So to me it currently looks like this:

  1. "It" finds all the interfaces,
  2. switches to the tcp transport properly,
  3. but then fails to find a route to the peer:
++++++++++++++++
[cn22:57290] [[65266,0],1] oob:base:send unknown peer [[65266,0],0]
[cn22:57290] [[65266,0],1] ext3x:client get on proc [[65266,0],0] key opal.puri
[cn22:57290] [[65266,0],1] oob:tcp:send_nb to peer [[65266,0],0]:10 seq = -1
[cn22:57290] [[65266,0],1]:[oob_tcp.c:188] processing send to peer [[65266,0],0]:10 seq_num = -1 hop [[65266,0],0] unknown
[cn22:57290] [[65266,0],1]:[oob_tcp.c:191] post no route to [[65266,0],0]
[cn22:57290] [[65266,0],1] oob:base:send to target [[65266,0],0] - attempt 3
[cn22:57290] [[65266,0],1] ACTIVATE PROC [[65266,0],0] STATE NO PATH TO TARGET AT base/rml_base_frame.c:234
++++++++++++++++

@ [cn22:57290] [[65266,0],1]:[oob_tcp.c:191] post no route to [[65266,0],0]
## /open-mpi/ompi/blob/v4.1.x/orte/mca/oob/tcp/oob_tcp.c:176
    /* do we have a route to this peer (could be direct)? */
    hop = orte_routed.get_route(msg->routed, &msg->dst);
    /* do we know this hop? */
    if (NULL == (peer = mca_oob_tcp_peer_lookup(&hop))) {
        /* push this back to the component so it can try
         * another module within this transport. If no
         * module can be found, the component can push back
         * to the framework so another component can try
         */
        opal_output_verbose(2, orte_oob_base_framework.framework_output,
                            "%s:[%s:%d] processing send to peer %s:%d seq_num = %d hop %s unknown",
                            ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                            __FILE__, __LINE__,
                            ORTE_NAME_PRINT(&msg->dst), msg->tag, msg->seq_num,
                            ORTE_NAME_PRINT(&hop));
        ORTE_ACTIVATE_TCP_NO_ROUTE(msg, &hop, mca_oob_tcp_component_no_route);
        return;
    }

@ [cn22:57290] [[65266,0],1] ACTIVATE PROC [[65266,0],0] STATE NO PATH TO TARGET AT base/rml_base_frame.c:234
## /open-mpi/ompi/blob/v4.1.x/orte/mca/rml/base/rml_base_frame.c:234
            ORTE_ACTIVATE_PROC_STATE(peer, ORTE_PROC_STATE_NO_PATH_TO_TARGET);
        } else if (ORTE_ERR_ADDRESSEE_UNKNOWN == status) {
            ORTE_ACTIVATE_PROC_STATE(peer, ORTE_PROC_STATE_PEER_UNKNOWN);
        } else {
            ORTE_ACTIVATE_PROC_STATE(peer, ORTE_PROC_STATE_UNABLE_TO_SEND_MSG);
        }

The PATH looks fine, I believe, so the required libraries and binaries should be available:

/home/mpitestuser/.local/bin:/home/mpitestuser/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin:/opt/ohpc/pub/mpi/ucx-ohpc/1.14.0/bin:/opt/ohpc/pub/compiler/gcc/12.2.0/bin
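
As a quick additional sanity check, one could verify by hand that the remote node resolves the daemon binary's shared libraries (a sketch; the grep prints nothing if everything resolves):

# hypothetical check: list any unresolved shared libraries of orted on the remote node
ssh cn22 'ldd /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/orted | grep "not found"'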

If I understand it right, it can't find the peer node, but I really can't see why this fails, because name resolution, ping, and ssh (as the user executing the MPI job) work flawlessly on/to each of the nodes...

Cheers

johebll commented 7 months ago

Hello, I just realised I had the wrong netmask when restricting the network (--mca oob_tcp_if_include "10.10.90.0/16"). Changing it to --mca oob_tcp_if_include "10.10.90.0/24" did not solve the problem, although it now properly restricts the OOB traffic to the intended interface:
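
For reference, such prefixes can be double-checked quickly from the shell, e.g. with a Python one-liner (a sketch; ipaddress is in the standard library):

python3 -c 'import ipaddress as i; print(i.ip_network("10.10.90.0/16", strict=False))'
# -> 10.10.0.0/16  (the /16 covers far more than just 10.10.90.x)
python3 -c 'import ipaddress as i; print(i.ip_address("10.10.90.122") in i.ip_network("10.10.90.0/24"))'
# -> True  (the interface address still matches the corrected /24)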

[cn22:58351] oob:tcp: component_available called
[cn22:58351] [[62077,0],1] oob:tcp: Searching for include address+prefix: 10.10.90.0 / 24
[cn22:58351] oob:tcp: Found match: 10.10.90.122 (enp7s0)
[cn22:58351] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[cn22:58351] [[62077,0],1] oob:tcp:init rejecting interface lo (not in include list)
[cn22:58351] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[cn22:58351] [[62077,0],1] oob:tcp:init rejecting interface eno1 (not in include list)
[cn22:58351] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[cn22:58351] [[62077,0],1] oob:tcp:init rejecting interface enp193s0f0 (not in include list)
[cn22:58351] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
[cn22:58351] [[62077,0],1] oob:tcp:init adding 10.10.90.122 to our list of V4 connections
[cn22:58351] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
[cn22:58351] [[62077,0],1] oob:tcp:init rejecting interface enp1s0f0 (not in include list)
[cn22:58351] WORKING INTERFACE 6 KERNEL INDEX 9 FAMILY: V4
[cn22:58351] [[62077,0],1] oob:tcp:init rejecting interface enp33s0f0 (not in include list)
[cn22:58351] [[62077,0],1] TCP STARTUP

It then still keeps failing in the same way:

[cn22:58351] [[62077,0],1] TCP STARTUP
[cn22:58351] [[62077,0],1] attempting to bind to IPv4 port 0
[cn22:58351] [[62077,0],1] assigned IPv4 port 54033
[cn22:58351] mca:oob:select: Adding component to end
[cn22:58351] mca:oob:select: Found 1 active transports
[cn22:58351] [[62077,0],1]: get transports
[cn22:58351] [[62077,0],1]:get transports for component tcp
[cn22:58351] mca: base: components_register: registering framework odls components
[cn22:58351] mca: base: components_register: found loaded component default
[cn22:58351] mca: base: components_register: component default register function successful
[cn22:58351] mca: base: components_register: found loaded component pspawn
[cn22:58351] mca: base: components_register: component pspawn has no register or open function
[cn22:58351] mca: base: components_open: opening odls components
[cn22:58351] mca: base: components_open: found loaded component default
[cn22:58351] mca: base: components_open: component default open function successful
[cn22:58351] mca: base: components_open: found loaded component pspawn
[cn22:58351] mca: base: components_open: component pspawn open function successful
[cn22:58351] mca:base:select: Auto-selecting odls components
[cn22:58351] mca:base:select:( odls) Querying component [default]
[cn22:58351] mca:base:select:( odls) Query of component [default] set priority to 10
[cn22:58351] mca:base:select:( odls) Querying component [pspawn]
[cn22:58351] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[cn22:58351] mca:base:select:( odls) Selected component [default]
[cn22:58351] mca: base: close: component pspawn closed
[cn22:58351] mca: base: close: unloading component pspawn
[cn22:58351] [[62077,0],1] OOB_SEND: rml_oob_send.c:265
[cn22:58351] [[62077,0],1] ext3x:client get on proc [[62077,0],1] key (null)
[cn22:58351] [[62077,0],1] oob:base:send to target [[62077,0],0] - attempt 0
[cn22:58351] [[62077,0],1] oob:base:send unknown peer [[62077,0],0]
[cn22:58351] [[62077,0],1] ext3x:client get on proc [[62077,0],0] key opal.puri
[cn22:58351] [[62077,0],1] oob:tcp:send_nb to peer [[62077,0],0]:63 seq = -1
[cn22:58351] [[62077,0],1]:[oob_tcp.c:188] processing send to peer [[62077,0],0]:63 seq_num = -1 hop [[62077,0],0] unknown
[cn22:58351] [[62077,0],1]:[oob_tcp.c:191] post no route to [[62077,0],0]
[cn22:58351] [[62077,0],1] OOB_SEND: rml_oob_send.c:265
[cn22:58351] [[62077,0],1] tcp:no route called for peer [[62077,0],0]
[cn22:58351] [[62077,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:58351] [[62077,0],1] oob:base:send to target [[62077,0],0] - attempt 0

( # More of the same... )

[cn22:58351] [[62077,0],1]:[oob_tcp.c:191] post no route to [[62077,0],0]
[cn22:58351] [[62077,0],1] oob:base:send to target [[62077,0],0] - attempt 3
[cn22:58351] [[62077,0],1] ACTIVATE PROC [[62077,0],0] STATE NO PATH TO TARGET AT base/rml_base_frame.c:234
[cn22:58351] psquash: flex128 finalize
[cn22:58351] mca: base: close: component ext3x closed
[cn22:58351] mca: base: close: unloading component ext3x
[cn22:58351] mca: base: close: component rsh closed
[cn22:58351] mca: base: close: unloading component rsh
[cn22:58351] mca: base: close: component default closed
[cn22:58351] mca: base: close: unloading component default
[cn22:58351] mca: base: close: unloading component direct
[cn22:58351] [[62077,0],1] TCP SHUTDOWN
[cn22:58351] no hnp or not active
[cn22:58351] [[62077,0],1] TCP SHUTDOWN done
[cn22:58351] mca: base: close: component tcp closed
[cn22:58351] mca: base: close: unloading component tcp
[cn22:58351] mca: base: close: component orted closed
[cn22:58351] mca: base: close: unloading component orted
[cn22:58351] mca: base: close: component weighted closed
[cn22:58351] mca: base: close: unloading component weighted
[cn22:58351] mca: base: close: unloading component linux_ipv6
[cn22:58351] mca: base: close: unloading component posix_ipv4
[cn22:58351] mca: base: close: component dlopen closed
[cn22:58351] mca: base: close: unloading component dlopen
rhc54 commented 7 months ago

Wait a minute - you have an error on your cmd line:

mpirun -N 1 -n 2 -host cn22 /usr/bin/hostname -x UCX_NET_DEVICES=enp7s0 -x UCX_TLS=tcp,sm,self --mca routed direct --mca pml ucx > result

You put the application (/usr/bin/hostname) right in the middle of the command, which means that the rest of the cmd line is ignored. So all those -x and --mca options are being passed to hostname and not being interpreted by OMPI.

Fix your cmd line and try it again.
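
For illustration, the corrected ordering would look something like this (a sketch; all options before the application, which comes last):

mpirun -N 1 -n 2 -host cn22 --mca routed direct --mca pml ucx \
    -x UCX_NET_DEVICES=enp7s0 -x UCX_TLS=tcp,sm,self /usr/bin/hostname > result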

johebll commented 7 months ago

Thank you very much for this!

I corrected it as documented in the following, but the problem still seems to be fundamentally the same. I ran the same command in two variants, direct and radix. Both fail...

I added some comments where something looked striking.

At this point, it appears to me that the OOB component itself is in trouble.

Maybe I misread it, but this section puzzles me the most:

[cn22:63069] [[28544,0],1]:[oob_tcp.c:188] processing send to peer [[28544,0],0]:10 seq_num = -1 hop [[28544,0],0] unknown
[cn22:63069] [[28544,0],1]:[oob_tcp.c:191] post no route to [[28544,0],0]

I'm not sure whether these are actual L3 routes, or whether this happens at L7 (between processes). On the CLI, routing is perfectly fine, and ORTE apparently uses just the single network specified in the command...

Strange to me...
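
For what it is worth, the kernel (L3) routes can be checked independently of ORTE's logical routing, e.g. (a sketch; 10.10.90.100 is the mpirun node's address as seen later in the launch command):

ip route get 10.10.90.100
# expected on cn22: something like "10.10.90.100 dev enp7s0 src 10.10.90.122", i.e. a direct on-link route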

Here the details:

/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun -N 1 -n 2 -host cn22 --mca oob_tcp_if_include "10.10.90.0/24" --mca oob_base_verbose "100" --mca pml "ucx" -x UCX_NET_DEVICES=enp7s0 -x UCX_TLS=tcp,sm,self /usr/bin/hostname > result

execve("/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun", 
["/opt/ohpc/pub/mpi/openmpi4-gnu12"..., 
"-N", "1", "-n", "2", "-host", "cn22", 
"--mca", "oob_tcp_if_include", "10.10.90.0/24", 
"--mca", "oob_base_verbose", "100", 
"--mca", "pml", "ucx", 
"-x", "UCX_NET_DEVICES=enp7s0", 
"-x", "UCX_TLS=tcp,sm,self", 
"/usr/bin/hostname"], 0x7ffd96605130 /* 56 vars */) = 0

[ etc. ] 

socket(AF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, NETLINK_ROUTE) = 19
bind(19, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 0
getsockname(19, {sa_family=AF_NETLINK, nl_pid=359190, nl_groups=00000000}, [12]) = 0
sendto(19, [{nlmsg_len=20, nlmsg_type=RTM_GETLINK, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1708684694, nlmsg_pid=0}, {ifi_family=AF_UNSPEC, ...}], 20, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 20
recvmsg(19, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=1388, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1708684694, nlmsg_pid=359190}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_LOOPBACK, ifi_index=if_nametoindex("lo"), ifi_flags=IFF_UP|IFF_LOOPBACK|IFF_RUNNING|IFF_LOWER_UP, ifi_change=0}, [[{nla_len=7, nla_type=IFLA_IFNAME}, "lo"], [{nla_len=8, nla_type=IFLA_TXQLEN}, 1000], [{nla_len=5, nla_type=IFLA_OPERSTATE}, 0], [{nla_len=5, nla_type=IFLA_LINKMODE}, 0], [{nla_len=8, nla_type=IFLA_MTU}, 65536], [{nla_len=8, nla_type=IFLA_MIN_MTU}, 0], [{nla_len=8, nla_type=IFLA_MAX_MTU}, 0], [{nla_len=8, nla_type=IFLA_GROUP}, 0], [{nla_len=8, nla_type=IFLA_PROMISCUITY}, 0], [{nla_len=8, nla_type=0x3d /* IFLA_??? */}, "\x00\x00\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_TX_QUEUES}, 1], [{nla_len=8, nla_type=IFLA_GSO_MAX_SEGS}, 65535], [{nla_len=8, nla_type=IFLA_GSO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=IFLA_GRO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=0x3f /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x40 /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x3b /* IFLA_??? */}, "\xf8\xff\x07\x00"], [{nla_len=8, nla_type=0x3c /* IFLA_??? */}, "\xff\xff\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_RX_QUEUES}, 1], [{nla_len=5, nla_type=IFLA_CARRIER}, 1], [{nla_len=12, nla_type=IFLA_QDISC}, "noqueue"], [{nla_len=8, nla_type=IFLA_CARRIER_CHANGES}, 0], [{nla_len=8, nla_type=IFLA_CARRIER_UP_COUNT}, 0], [{nla_len=8, nla_type=IFLA_CARRIER_DOWN_COUNT}, 0], [{nla_len=5, nla_type=IFLA_PROTO_DOWN}, 0], [{nla_len=36, nla_type=IFLA_MAP}, {mem_start=0, mem_end=0, base_addr=0, irq=0, dma=0, port=0}], [{nla_len=10, nla_type=IFLA_ADDRESS}, 00:00:00:00:00:00], [{nla_len=10, nla_type=IFLA_BROADCAST}, 00:00:00:00:00:00], [{nla_len=196, nla_type=IFLA_STATS64}, {rx_packets=92358, tx_packets=92358, rx_bytes=8296122, tx_bytes=8296122, rx_errors=0, tx_errors=0, rx_dropped=0, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=100, nla_type=IFLA_STATS}, {rx_packets=92358, tx_packets=92358, rx_bytes=8296122, tx_bytes=8296122, rx_errors=0, tx_errors=0, rx_dropped=0, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=12, nla_type=IFLA_XDP}, [{nla_len=5, nla_type=IFLA_XDP_ATTACHED}, XDP_ATTACHED_NONE]], [{nla_len=776, nla_type=IFLA_AF_SPEC}, [[{nla_len=136, nla_type=AF_INET}, [{nla_len=132, nla_type=IFLA_INET_CONF}, [[IPV4_DEVCONF_FORWARDING-1] = 0, [IPV4_DEVCONF_MC_FORWARDING-1] = 0, [IPV4_DEVCONF_PROXY_ARP-1] = 0, [IPV4_DEVCONF_ACCEPT_REDIRECTS-1] = 1, [IPV4_DEVCONF_SECURE_REDIRECTS-1] = 1, [IPV4_DEVCONF_SEND_REDIRECTS-1] = 1, [IPV4_DEVCONF_SHARED_MEDIA-1] = 1, [IPV4_DEVCONF_RP_FILTER-1] = 1, [IPV4_DEVCONF_ACCEPT_SOURCE_ROUTE-1] = 0, [IPV4_DEVCONF_BOOTP_RELAY-1] = 0, [IPV4_DEVCONF_LOG_MARTIANS-1] = 0, [IPV4_DEVCONF_TAG-1] = 0, [IPV4_DEVCONF_ARPFILTER-1] = 0, [IPV4_DEVCONF_MEDIUM_ID-1] = 0, [IPV4_DEVCONF_NOXFRM-1] = 1, [IPV4_DEVCONF_NOPOLICY-1] = 1, [IPV4_DEVCONF_FORCE_IGMP_VERSION-1] = 0, 
[IPV4_DEVCONF_ARP_ANNOUNCE-1] = 0, [IPV4_DEVCONF_ARP_IGNORE-1] = 0, [IPV4_DEVCONF_PROMOTE_SECONDARIES-1] = 1, [IPV4_DEVCONF_ARP_ACCEPT-1] = 0, [IPV4_DEVCONF_ARP_NOTIFY-1] = 0, [IPV4_DEVCONF_ACCEPT_LOCAL-1] = 0, [IPV4_DEVCONF_SRC_VMARK-1] = 0, [IPV4_DEVCONF_PROXY_ARP_PVLAN-1] = 0, [IPV4_DEVCONF_ROUTE_LOCALNET-1] = 0, [IPV4_DEVCONF_IGMPV2_UNSOLICITED_REPORT_INTERVAL-1] = 10000, [IPV4_DEVCONF_IGMPV3_UNSOLICITED_REPORT_INTERVAL-1] = 1000, [IPV4_DEVCONF_IGNORE_ROUTES_WITH_LINKDOWN-1] = 0, [IPV4_DEVCONF_DROP_UNICAST_IN_L2_MULTICAST-1] = 0, [IPV4_DEVCONF_DROP_GRATUITOUS_ARP-1] = 0, [IPV4_DEVCONF_BC_FORWARDING-1] = 0]]], [{nla_len=636, nla_type=AF_INET6}, [[{nla_len=8, nla_type=IFLA_INET6_FLAGS}, IF_READY], [{nla_len=20, nla_type=IFLA_INET6_CACHEINFO}, {max_reasm_len=65535, tstamp=76376014, reachable_time=24780, retrans_time=1000}], [{nla_len=216, nla_type=IFLA_INET6_CONF}, [[DEVCONF_FORWARDING] = 0, [DEVCONF_HOPLIMIT] = 64, [DEVCONF_MTU6] = 65536, [DEVCONF_ACCEPT_RA] = 0, [DEVCONF_ACCEPT_REDIRECTS] = 1, [DEVCONF_AUTOCONF] = 1, [DEVCONF_DAD_TRANSMITS] = 1, [DEVCONF_RTR_SOLICITS] = -1, [DEVCONF_RTR_SOLICIT_INTERVAL] = 4000, [DEVCONF_RTR_SOLICIT_DELAY] = 1000, [DEVCONF_USE_TEMPADDR] = 0, [DEVCONF_TEMP_VALID_LFT] = 604800, [DEVCONF_TEMP_PREFERED_LFT] = 86400, [DEVCONF_REGEN_MAX_RETRY] = 3, [DEVCONF_MAX_DESYNC_FACTOR] = 600, [DEVCONF_MAX_ADDRESSES] = 16, [DEVCONF_FORCE_MLD_VERSION] = 0, [DEVCONF_ACCEPT_RA_DEFRTR] = 1, [DEVCONF_ACCEPT_RA_PINFO] = 1, [DEVCONF_ACCEPT_RA_RTR_PREF] = 1, [DEVCONF_RTR_PROBE_INTERVAL] = 60000, [DEVCONF_ACCEPT_RA_RT_INFO_MAX_PLEN] = 0, [DEVCONF_PROXY_NDP] = 0, [DEVCONF_OPTIMISTIC_DAD] = 0, [DEVCONF_ACCEPT_SOURCE_ROUTE] = 0, [DEVCONF_MC_FORWARDING] = 0, [DEVCONF_DISABLE_IPV6] = 0, [DEVCONF_ACCEPT_DAD] = -1, [DEVCONF_FORCE_TLLAO] = 0, [DEVCONF_NDISC_NOTIFY] = 0, [DEVCONF_MLDV1_UNSOLICITED_REPORT_INTERVAL] = 10000, [DEVCONF_MLDV2_UNSOLICITED_REPORT_INTERVAL] = 1000, ...]], [{nla_len=300, nla_type=IFLA_INET6_STATS}, [[IPSTATS_MIB_NUM] = 37, [IPSTATS_MIB_INPKTS] = 3, [IPSTATS_MIB_INOCTETS] = 147, [IPSTATS_MIB_INDELIVERS] = 3, [IPSTATS_MIB_OUTFORWDATAGRAMS] = 0, [IPSTATS_MIB_OUTPKTS] = 3, [IPSTATS_MIB_OUTOCTETS] = 147, [IPSTATS_MIB_INHDRERRORS] = 0, [IPSTATS_MIB_INTOOBIGERRORS] = 0, [IPSTATS_MIB_INNOROUTES] = 0, [IPSTATS_MIB_INADDRERRORS] = 0, [IPSTATS_MIB_INUNKNOWNPROTOS] = 0, [IPSTATS_MIB_INTRUNCATEDPKTS] = 0, [IPSTATS_MIB_INDISCARDS] = 0, [IPSTATS_MIB_OUTDISCARDS] = 0, [IPSTATS_MIB_OUTNOROUTES] = 0, [IPSTATS_MIB_REASMTIMEOUT] = 0, [IPSTATS_MIB_REASMREQDS] = 0, [IPSTATS_MIB_REASMOKS] = 0, [IPSTATS_MIB_REASMFAILS] = 0, [IPSTATS_MIB_FRAGOKS] = 0, [IPSTATS_MIB_FRAGFAILS] = 0, [IPSTATS_MIB_FRAGCREATES] = 0, [IPSTATS_MIB_INMCASTPKTS] = 0, [IPSTATS_MIB_OUTMCASTPKTS] = 0, [IPSTATS_MIB_INBCASTPKTS] = 0, [IPSTATS_MIB_OUTBCASTPKTS] = 0, [IPSTATS_MIB_INMCASTOCTETS] = 0, [IPSTATS_MIB_OUTMCASTOCTETS] = 0, [IPSTATS_MIB_INBCASTOCTETS] = 0, [IPSTATS_MIB_OUTBCASTOCTETS] = 0, [IPSTATS_MIB_CSUMERRORS] = 0, ...]], [{nla_len=60, nla_type=IFLA_INET6_ICMP6STATS}, [[ICMP6_MIB_NUM] = 7, [ICMP6_MIB_INMSGS] = 0, [ICMP6_MIB_INERRORS] = 0, [ICMP6_MIB_OUTMSGS] = 0, [ICMP6_MIB_OUTERRORS] = 0, [ICMP6_MIB_CSUMERRORS] = 0, [6 /* ICMP6_MIB_??? 
*/] = 0]], [{nla_len=20, nla_type=IFLA_INET6_TOKEN}, inet_pton(AF_INET6, "::")], [{nla_len=5, nla_type=IFLA_INET6_ADDR_GEN_MODE}, IN6_ADDR_GEN_MODE_NONE]]]]], ...]], [{nlmsg_len=1432, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1708684694, nlmsg_pid=359190}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_ETHER, ifi_index=if_nametoindex("enp1s0"), ifi_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST|IFF_LOWER_UP, ifi_change=0}, [[{nla_len=11, nla_type=IFLA_IFNAME}, "enp1s0"], [{nla_len=8, nla_type=IFLA_TXQLEN}, 1000], [{nla_len=5, nla_type=IFLA_OPERSTATE}, 6], [{nla_len=5, nla_type=IFLA_LINKMODE}, 0], [{nla_len=8, nla_type=IFLA_MTU}, 1500], [{nla_len=8, nla_type=IFLA_MIN_MTU}, 68], [{nla_len=8, nla_type=IFLA_MAX_MTU}, 65535], [{nla_len=8, nla_type=IFLA_GROUP}, 0], [{nla_len=8, nla_type=IFLA_PROMISCUITY}, 0], [{nla_len=8, nla_type=0x3d /* IFLA_??? */}, "\x00\x00\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_TX_QUEUES}, 1], [{nla_len=8, nla_type=IFLA_GSO_MAX_SEGS}, 65535], [{nla_len=8, nla_type=IFLA_GSO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=IFLA_GRO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=0x3f /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x40 /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x3b /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x3c /* IFLA_??? */}, "\xff\xff\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_RX_QUEUES}, 1], [{nla_len=5, nla_type=IFLA_CARRIER}, 1], [{nla_len=13, nla_type=IFLA_QDISC}, "fq_codel"], [{nla_len=8, nla_type=IFLA_CARRIER_CHANGES}, 2], [{nla_len=8, nla_type=IFLA_CARRIER_UP_COUNT}, 1], [{nla_len=8, nla_type=IFLA_CARRIER_DOWN_COUNT}, 1], [{nla_len=5, nla_type=IFLA_PROTO_DOWN}, 0], [{nla_len=36, nla_type=IFLA_MAP}, {mem_start=0, mem_end=0, base_addr=0, irq=0, dma=0, port=0}], [{nla_len=10, nla_type=IFLA_ADDRESS}, 52:54:00:04:96:11], [{nla_len=10, nla_type=IFLA_BROADCAST}, ff:ff:ff:ff:ff:ff], [{nla_len=196, nla_type=IFLA_STATS64}, {rx_packets=3355897, tx_packets=2327858, rx_bytes=2231387175, tx_bytes=464981524, rx_errors=0, tx_errors=0, rx_dropped=1153810, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=100, nla_type=IFLA_STATS}, {rx_packets=3355897, tx_packets=2327858, rx_bytes=2231387175, tx_bytes=464981524, rx_errors=0, tx_errors=0, rx_dropped=1153810, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=12, nla_type=IFLA_XDP}, [{nla_len=5, nla_type=IFLA_XDP_ATTACHED}, XDP_ATTACHED_NONE]], [{nla_len=10, nla_type=IFLA_PERM_ADDRESS}, 52:54:00:04:96:11], ...]]], iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 2820
recvmsg(19, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=1428, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1708684694, nlmsg_pid=359190}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_ETHER, ifi_index=if_nametoindex("enp7s0"), ifi_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST|IFF_LOWER_UP, ifi_change=0}, [[{nla_len=11, nla_type=IFLA_IFNAME}, "enp7s0"], [{nla_len=8, nla_type=IFLA_TXQLEN}, 1000], [{nla_len=5, nla_type=IFLA_OPERSTATE}, 6], [{nla_len=5, nla_type=IFLA_LINKMODE}, 0], [{nla_len=8, nla_type=IFLA_MTU}, 9000], [{nla_len=8, nla_type=IFLA_MIN_MTU}, 68], [{nla_len=8, nla_type=IFLA_MAX_MTU}, 9702], [{nla_len=8, nla_type=IFLA_GROUP}, 0], [{nla_len=8, nla_type=IFLA_PROMISCUITY}, 0], [{nla_len=8, nla_type=0x3d /* IFLA_??? */}, "\x00\x00\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_TX_QUEUES}, 16], [{nla_len=8, nla_type=IFLA_GSO_MAX_SEGS}, 65535], [{nla_len=8, nla_type=IFLA_GSO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=IFLA_GRO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=0x3f /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x40 /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x3b /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x3c /* IFLA_??? */}, "\xff\xff\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_RX_QUEUES}, 16], [{nla_len=5, nla_type=IFLA_CARRIER}, 1], [{nla_len=7, nla_type=IFLA_QDISC}, "mq"], [{nla_len=8, nla_type=IFLA_CARRIER_CHANGES}, 3], [{nla_len=8, nla_type=IFLA_CARRIER_UP_COUNT}, 2], [{nla_len=8, nla_type=IFLA_CARRIER_DOWN_COUNT}, 1], [{nla_len=5, nla_type=IFLA_PROTO_DOWN}, 0], [{nla_len=36, nla_type=IFLA_MAP}, {mem_start=0, mem_end=0, base_addr=0, irq=0, dma=0, port=0}], [{nla_len=10, nla_type=IFLA_ADDRESS}, 24:6e:96:37:91:a0], [{nla_len=10, nla_type=IFLA_BROADCAST}, ff:ff:ff:ff:ff:ff], [{nla_len=196, nla_type=IFLA_STATS64}, {rx_packets=29880428, tx_packets=91010231, rx_bytes=5677654305, tx_bytes=139188405615, rx_errors=0, tx_errors=0, rx_dropped=0, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=100, nla_type=IFLA_STATS}, {rx_packets=29880428, tx_packets=91010231, rx_bytes=1382687009, tx_bytes=1749452143, rx_errors=0, tx_errors=0, rx_dropped=0, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=12, nla_type=IFLA_XDP}, [{nla_len=5, nla_type=IFLA_XDP_ATTACHED}, XDP_ATTACHED_NONE]], [{nla_len=10, nla_type=IFLA_PERM_ADDRESS}, 24:6e:96:37:91:a0], ...]], [{nlmsg_len=1428, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1708684694, nlmsg_pid=359190}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_ETHER, ifi_index=if_nametoindex("enp8s0"), ifi_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST|IFF_LOWER_UP, ifi_change=0}, [[{nla_len=11, nla_type=IFLA_IFNAME}, "enp8s0"], [{nla_len=8, nla_type=IFLA_TXQLEN}, 1000], [{nla_len=5, nla_type=IFLA_OPERSTATE}, 6], [{nla_len=5, nla_type=IFLA_LINKMODE}, 0], [{nla_len=8, nla_type=IFLA_MTU}, 9000], [{nla_len=8, nla_type=IFLA_MIN_MTU}, 68], [{nla_len=8, nla_type=IFLA_MAX_MTU}, 9702], [{nla_len=8, 
nla_type=IFLA_GROUP}, 0], [{nla_len=8, nla_type=IFLA_PROMISCUITY}, 0], [{nla_len=8, nla_type=0x3d /* IFLA_??? */}, "\x00\x00\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_TX_QUEUES}, 16], [{nla_len=8, nla_type=IFLA_GSO_MAX_SEGS}, 65535], [{nla_len=8, nla_type=IFLA_GSO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=IFLA_GRO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=0x3f /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x40 /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x3b /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x3c /* IFLA_??? */}, "\xff\xff\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_RX_QUEUES}, 16], [{nla_len=5, nla_type=IFLA_CARRIER}, 1], [{nla_len=7, nla_type=IFLA_QDISC}, "mq"], [{nla_len=8, nla_type=IFLA_CARRIER_CHANGES}, 3], [{nla_len=8, nla_type=IFLA_CARRIER_UP_COUNT}, 2], [{nla_len=8, nla_type=IFLA_CARRIER_DOWN_COUNT}, 1], [{nla_len=5, nla_type=IFLA_PROTO_DOWN}, 0], [{nla_len=36, nla_type=IFLA_MAP}, {mem_start=0, mem_end=0, base_addr=0, irq=0, dma=0, port=0}], [{nla_len=10, nla_type=IFLA_ADDRESS}, 24:6e:96:37:91:b0], [{nla_len=10, nla_type=IFLA_BROADCAST}, ff:ff:ff:ff:ff:ff], [{nla_len=196, nla_type=IFLA_STATS64}, {rx_packets=3064344, tx_packets=2705253, rx_bytes=235395248, tx_bytes=200285367, rx_errors=0, tx_errors=0, rx_dropped=0, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=100, nla_type=IFLA_STATS}, {rx_packets=3064344, tx_packets=2705253, rx_bytes=235395248, tx_bytes=200285367, rx_errors=0, tx_errors=0, rx_dropped=0, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=12, nla_type=IFLA_XDP}, [{nla_len=5, nla_type=IFLA_XDP_ATTACHED}, XDP_ATTACHED_NONE]], [{nla_len=10, nla_type=IFLA_PERM_ADDRESS}, 24:6e:96:37:91:b0], ...]]], iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 2856
close(19)                               = 0

socket(AF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, NETLINK_ROUTE) = 19
bind(19, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 0
getsockname(19, {sa_family=AF_NETLINK, nl_pid=359190, nl_groups=00000000}, [12]) = 0
sendto(19, [{nlmsg_len=20, nlmsg_type=RTM_GETLINK, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1708684694, nlmsg_pid=0}, {ifi_family=AF_UNSPEC, ...}], 20, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 20
recvmsg(19, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=1388, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1708684694, nlmsg_pid=359190}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_LOOPBACK, ifi_index=if_nametoindex("lo"), ifi_flags=IFF_UP|IFF_LOOPBACK|IFF_RUNNING|IFF_LOWER_UP, ifi_change=0}, [[{nla_len=7, nla_type=IFLA_IFNAME}, "lo"], [{nla_len=8, nla_type=IFLA_TXQLEN}, 1000], [{nla_len=5, nla_type=IFLA_OPERSTATE}, 0], [{nla_len=5, nla_type=IFLA_LINKMODE}, 0], [{nla_len=8, nla_type=IFLA_MTU}, 65536], [{nla_len=8, nla_type=IFLA_MIN_MTU}, 0], [{nla_len=8, nla_type=IFLA_MAX_MTU}, 0], [{nla_len=8, nla_type=IFLA_GROUP}, 0], [{nla_len=8, nla_type=IFLA_PROMISCUITY}, 0], [{nla_len=8, nla_type=0x3d /* IFLA_??? */}, "\x00\x00\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_TX_QUEUES}, 1], [{nla_len=8, nla_type=IFLA_GSO_MAX_SEGS}, 65535], [{nla_len=8, nla_type=IFLA_GSO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=IFLA_GRO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=0x3f /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x40 /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x3b /* IFLA_??? */}, "\xf8\xff\x07\x00"], [{nla_len=8, nla_type=0x3c /* IFLA_??? */}, "\xff\xff\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_RX_QUEUES}, 1], [{nla_len=5, nla_type=IFLA_CARRIER}, 1], [{nla_len=12, nla_type=IFLA_QDISC}, "noqueue"], [{nla_len=8, nla_type=IFLA_CARRIER_CHANGES}, 0], [{nla_len=8, nla_type=IFLA_CARRIER_UP_COUNT}, 0], [{nla_len=8, nla_type=IFLA_CARRIER_DOWN_COUNT}, 0], [{nla_len=5, nla_type=IFLA_PROTO_DOWN}, 0], [{nla_len=36, nla_type=IFLA_MAP}, {mem_start=0, mem_end=0, base_addr=0, irq=0, dma=0, port=0}], [{nla_len=10, nla_type=IFLA_ADDRESS}, 00:00:00:00:00:00], [{nla_len=10, nla_type=IFLA_BROADCAST}, 00:00:00:00:00:00], [{nla_len=196, nla_type=IFLA_STATS64}, {rx_packets=92358, tx_packets=92358, rx_bytes=8296122, tx_bytes=8296122, rx_errors=0, tx_errors=0, rx_dropped=0, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=100, nla_type=IFLA_STATS}, {rx_packets=92358, tx_packets=92358, rx_bytes=8296122, tx_bytes=8296122, rx_errors=0, tx_errors=0, rx_dropped=0, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=12, nla_type=IFLA_XDP}, [{nla_len=5, nla_type=IFLA_XDP_ATTACHED}, XDP_ATTACHED_NONE]], [{nla_len=776, nla_type=IFLA_AF_SPEC}, [[{nla_len=136, nla_type=AF_INET}, [{nla_len=132, nla_type=IFLA_INET_CONF}, [[IPV4_DEVCONF_FORWARDING-1] = 0, [IPV4_DEVCONF_MC_FORWARDING-1] = 0, [IPV4_DEVCONF_PROXY_ARP-1] = 0, [IPV4_DEVCONF_ACCEPT_REDIRECTS-1] = 1, [IPV4_DEVCONF_SECURE_REDIRECTS-1] = 1, [IPV4_DEVCONF_SEND_REDIRECTS-1] = 1, [IPV4_DEVCONF_SHARED_MEDIA-1] = 1, [IPV4_DEVCONF_RP_FILTER-1] = 1, [IPV4_DEVCONF_ACCEPT_SOURCE_ROUTE-1] = 0, [IPV4_DEVCONF_BOOTP_RELAY-1] = 0, [IPV4_DEVCONF_LOG_MARTIANS-1] = 0, [IPV4_DEVCONF_TAG-1] = 0, [IPV4_DEVCONF_ARPFILTER-1] = 0, [IPV4_DEVCONF_MEDIUM_ID-1] = 0, [IPV4_DEVCONF_NOXFRM-1] = 1, [IPV4_DEVCONF_NOPOLICY-1] = 1, [IPV4_DEVCONF_FORCE_IGMP_VERSION-1] = 0, 
[IPV4_DEVCONF_ARP_ANNOUNCE-1] = 0, [IPV4_DEVCONF_ARP_IGNORE-1] = 0, [IPV4_DEVCONF_PROMOTE_SECONDARIES-1] = 1, [IPV4_DEVCONF_ARP_ACCEPT-1] = 0, [IPV4_DEVCONF_ARP_NOTIFY-1] = 0, [IPV4_DEVCONF_ACCEPT_LOCAL-1] = 0, [IPV4_DEVCONF_SRC_VMARK-1] = 0, [IPV4_DEVCONF_PROXY_ARP_PVLAN-1] = 0, [IPV4_DEVCONF_ROUTE_LOCALNET-1] = 0, [IPV4_DEVCONF_IGMPV2_UNSOLICITED_REPORT_INTERVAL-1] = 10000, [IPV4_DEVCONF_IGMPV3_UNSOLICITED_REPORT_INTERVAL-1] = 1000, [IPV4_DEVCONF_IGNORE_ROUTES_WITH_LINKDOWN-1] = 0, [IPV4_DEVCONF_DROP_UNICAST_IN_L2_MULTICAST-1] = 0, [IPV4_DEVCONF_DROP_GRATUITOUS_ARP-1] = 0, [IPV4_DEVCONF_BC_FORWARDING-1] = 0]]], [{nla_len=636, nla_type=AF_INET6}, [[{nla_len=8, nla_type=IFLA_INET6_FLAGS}, IF_READY], [{nla_len=20, nla_type=IFLA_INET6_CACHEINFO}, {max_reasm_len=65535, tstamp=76376014, reachable_time=24780, retrans_time=1000}], [{nla_len=216, nla_type=IFLA_INET6_CONF}, [[DEVCONF_FORWARDING] = 0, [DEVCONF_HOPLIMIT] = 64, [DEVCONF_MTU6] = 65536, [DEVCONF_ACCEPT_RA] = 0, [DEVCONF_ACCEPT_REDIRECTS] = 1, [DEVCONF_AUTOCONF] = 1, [DEVCONF_DAD_TRANSMITS] = 1, [DEVCONF_RTR_SOLICITS] = -1, [DEVCONF_RTR_SOLICIT_INTERVAL] = 4000, [DEVCONF_RTR_SOLICIT_DELAY] = 1000, [DEVCONF_USE_TEMPADDR] = 0, [DEVCONF_TEMP_VALID_LFT] = 604800, [DEVCONF_TEMP_PREFERED_LFT] = 86400, [DEVCONF_REGEN_MAX_RETRY] = 3, [DEVCONF_MAX_DESYNC_FACTOR] = 600, [DEVCONF_MAX_ADDRESSES] = 16, [DEVCONF_FORCE_MLD_VERSION] = 0, [DEVCONF_ACCEPT_RA_DEFRTR] = 1, [DEVCONF_ACCEPT_RA_PINFO] = 1, [DEVCONF_ACCEPT_RA_RTR_PREF] = 1, [DEVCONF_RTR_PROBE_INTERVAL] = 60000, [DEVCONF_ACCEPT_RA_RT_INFO_MAX_PLEN] = 0, [DEVCONF_PROXY_NDP] = 0, [DEVCONF_OPTIMISTIC_DAD] = 0, [DEVCONF_ACCEPT_SOURCE_ROUTE] = 0, [DEVCONF_MC_FORWARDING] = 0, [DEVCONF_DISABLE_IPV6] = 0, [DEVCONF_ACCEPT_DAD] = -1, [DEVCONF_FORCE_TLLAO] = 0, [DEVCONF_NDISC_NOTIFY] = 0, [DEVCONF_MLDV1_UNSOLICITED_REPORT_INTERVAL] = 10000, [DEVCONF_MLDV2_UNSOLICITED_REPORT_INTERVAL] = 1000, ...]], [{nla_len=300, nla_type=IFLA_INET6_STATS}, [[IPSTATS_MIB_NUM] = 37, [IPSTATS_MIB_INPKTS] = 3, [IPSTATS_MIB_INOCTETS] = 147, [IPSTATS_MIB_INDELIVERS] = 3, [IPSTATS_MIB_OUTFORWDATAGRAMS] = 0, [IPSTATS_MIB_OUTPKTS] = 3, [IPSTATS_MIB_OUTOCTETS] = 147, [IPSTATS_MIB_INHDRERRORS] = 0, [IPSTATS_MIB_INTOOBIGERRORS] = 0, [IPSTATS_MIB_INNOROUTES] = 0, [IPSTATS_MIB_INADDRERRORS] = 0, [IPSTATS_MIB_INUNKNOWNPROTOS] = 0, [IPSTATS_MIB_INTRUNCATEDPKTS] = 0, [IPSTATS_MIB_INDISCARDS] = 0, [IPSTATS_MIB_OUTDISCARDS] = 0, [IPSTATS_MIB_OUTNOROUTES] = 0, [IPSTATS_MIB_REASMTIMEOUT] = 0, [IPSTATS_MIB_REASMREQDS] = 0, [IPSTATS_MIB_REASMOKS] = 0, [IPSTATS_MIB_REASMFAILS] = 0, [IPSTATS_MIB_FRAGOKS] = 0, [IPSTATS_MIB_FRAGFAILS] = 0, [IPSTATS_MIB_FRAGCREATES] = 0, [IPSTATS_MIB_INMCASTPKTS] = 0, [IPSTATS_MIB_OUTMCASTPKTS] = 0, [IPSTATS_MIB_INBCASTPKTS] = 0, [IPSTATS_MIB_OUTBCASTPKTS] = 0, [IPSTATS_MIB_INMCASTOCTETS] = 0, [IPSTATS_MIB_OUTMCASTOCTETS] = 0, [IPSTATS_MIB_INBCASTOCTETS] = 0, [IPSTATS_MIB_OUTBCASTOCTETS] = 0, [IPSTATS_MIB_CSUMERRORS] = 0, ...]], [{nla_len=60, nla_type=IFLA_INET6_ICMP6STATS}, [[ICMP6_MIB_NUM] = 7, [ICMP6_MIB_INMSGS] = 0, [ICMP6_MIB_INERRORS] = 0, [ICMP6_MIB_OUTMSGS] = 0, [ICMP6_MIB_OUTERRORS] = 0, [ICMP6_MIB_CSUMERRORS] = 0, [6 /* ICMP6_MIB_??? 
*/] = 0]], [{nla_len=20, nla_type=IFLA_INET6_TOKEN}, inet_pton(AF_INET6, "::")], [{nla_len=5, nla_type=IFLA_INET6_ADDR_GEN_MODE}, IN6_ADDR_GEN_MODE_NONE]]]]], ...]], [{nlmsg_len=1432, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1708684694, nlmsg_pid=359190}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_ETHER, ifi_index=if_nametoindex("enp1s0"), ifi_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST|IFF_LOWER_UP, ifi_change=0}, [[{nla_len=11, nla_type=IFLA_IFNAME}, "enp1s0"], [{nla_len=8, nla_type=IFLA_TXQLEN}, 1000], [{nla_len=5, nla_type=IFLA_OPERSTATE}, 6], [{nla_len=5, nla_type=IFLA_LINKMODE}, 0], [{nla_len=8, nla_type=IFLA_MTU}, 1500], [{nla_len=8, nla_type=IFLA_MIN_MTU}, 68], [{nla_len=8, nla_type=IFLA_MAX_MTU}, 65535], [{nla_len=8, nla_type=IFLA_GROUP}, 0], [{nla_len=8, nla_type=IFLA_PROMISCUITY}, 0], [{nla_len=8, nla_type=0x3d /* IFLA_??? */}, "\x00\x00\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_TX_QUEUES}, 1], [{nla_len=8, nla_type=IFLA_GSO_MAX_SEGS}, 65535], [{nla_len=8, nla_type=IFLA_GSO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=IFLA_GRO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=0x3f /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x40 /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x3b /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x3c /* IFLA_??? */}, "\xff\xff\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_RX_QUEUES}, 1], [{nla_len=5, nla_type=IFLA_CARRIER}, 1], [{nla_len=13, nla_type=IFLA_QDISC}, "fq_codel"], [{nla_len=8, nla_type=IFLA_CARRIER_CHANGES}, 2], [{nla_len=8, nla_type=IFLA_CARRIER_UP_COUNT}, 1], [{nla_len=8, nla_type=IFLA_CARRIER_DOWN_COUNT}, 1], [{nla_len=5, nla_type=IFLA_PROTO_DOWN}, 0], [{nla_len=36, nla_type=IFLA_MAP}, {mem_start=0, mem_end=0, base_addr=0, irq=0, dma=0, port=0}], [{nla_len=10, nla_type=IFLA_ADDRESS}, 52:54:00:04:96:11], [{nla_len=10, nla_type=IFLA_BROADCAST}, ff:ff:ff:ff:ff:ff], [{nla_len=196, nla_type=IFLA_STATS64}, {rx_packets=3355910, tx_packets=2327885, rx_bytes=2231388033, tx_bytes=465000502, rx_errors=0, tx_errors=0, rx_dropped=1153810, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=100, nla_type=IFLA_STATS}, {rx_packets=3355910, tx_packets=2327885, rx_bytes=2231388033, tx_bytes=465000502, rx_errors=0, tx_errors=0, rx_dropped=1153810, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=12, nla_type=IFLA_XDP}, [{nla_len=5, nla_type=IFLA_XDP_ATTACHED}, XDP_ATTACHED_NONE]], [{nla_len=10, nla_type=IFLA_PERM_ADDRESS}, 52:54:00:04:96:11], ...]]], iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 2820
close(19)                               = 0

write(2, "[mgmt01:359190] [[28544,0],0]: "..., 57[mgmt01:359190] [[28544,0],0]: parent -1 num_children 1
) = 57
write(2, "[mgmt01:359190] [[28544,0],0]: "..., 41[mgmt01:359190] [[28544,0],0]:     child 1
) = 41
write(2, "[mgmt01:359190] [[28544,0],0]: "..., 56[mgmt01:359190] [[28544,0],0]: parent 0 num_children 1
) = 56
write(2, "[mgmt01:359190] [[28544,0],0]: "..., 41[mgmt01:359190] [[28544,0],0]:     child 1
) = 41
getuid()                                = 2001

socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 19
connect(19, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(19)                               = 0

socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 19
connect(19, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(19)                               = 0
newfstatat(AT_FDCWD, "/etc/nsswitch.conf", {st_mode=S_IFREG|0644, st_size=2973, ...}, 0) = 0
openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = 19
newfstatat(19, "", {st_mode=S_IFREG|0644, st_size=1542, ...}, AT_EMPTY_PATH) = 0
lseek(19, 0, SEEK_SET)                  = 0
read(19, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1542
close(19)                               = 0

write(12, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x1c0bb30, FUTEX_WAKE_PRIVATE, 1) = 1
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0

## reformatted for better readability

write(2, "[mgmt01:359190] [[28544,0],0] p"..., 911[mgmt01:359190] [[28544,0],0] plm:rsh: final template argv:
    /usr/bin/ssh <template>           PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin:$PATH ; 
    export PATH ; LD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:${LD_LIBRARY_PATH:-} ; 
    export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:${DYLD_LIBRARY_PATH:-} ; 
    export DYLD_LIBRARY_PATH ;   
    /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/orted 
    -mca ess "env" 
    -mca ess_base_jobid "1870659584" 
    -mca ess_base_vpid "<template>" 
    -mca ess_base_num_procs "2" 
    -mca orte_node_regex "mgmt[2:1],cn[2:22]@0(2)" 
    -mca orte_hnp_uri "1870659584.0;tcp://10.10.90.100:39333" 
    --mca oob_tcp_if_include "10.10.90.0/24" 
    --mca oob_base_verbose "100" 
    --mca pml "ucx" 
    -mca plm "rsh" 
    --tree-spawn 
    -mca routed "radix" 
    -mca orte_parent_uri "1870659584.0;tcp://10.10.90.100:39333" 
    -mca rmaps_ppr_n_pernode "1" 
    -mca pmix "^s1,s2,cray,isolated"
) = 911

clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f2ca440aa10) = 359193
setpgid(359193, 359193)                 = 0
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, -1tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
[cn22:63069] mca: base: components_register: registering framework dl components
[cn22:63069] mca: base: components_register: found loaded component dlopen
[cn22:63069] mca: base: components_register: component dlopen register function successful
[cn22:63069] mca: base: components_open: opening dl components
[cn22:63069] mca: base: components_open: found loaded component dlopen
[cn22:63069] mca: base: components_open: component dlopen open function successful
[cn22:63069] mca:base:select: Auto-selecting dl components
[cn22:63069] mca:base:select:(   dl) Querying component [dlopen]
[cn22:63069] mca:base:select:(   dl) Query of component [dlopen] set priority to 80
[cn22:63069] mca:base:select:(   dl) Selected component [dlopen]
[cn22:63069] mca: base: components_register: registering framework if components
[cn22:63069] mca: base: components_register: found loaded component linux_ipv6
[cn22:63069] mca: base: components_register: component linux_ipv6 has no register or open function
[cn22:63069] mca: base: components_register: found loaded component posix_ipv4
[cn22:63069] mca: base: components_register: component posix_ipv4 has no register or open function
[cn22:63069] mca: base: components_open: opening if components
[cn22:63069] mca: base: components_open: found loaded component linux_ipv6
[cn22:63069] mca: base: components_open: component linux_ipv6 open function successful
[cn22:63069] mca: base: components_open: found loaded component posix_ipv4
[cn22:63069] found interface lo
[cn22:63069] found interface eno1
[cn22:63069] found interface enp193s0f0
[cn22:63069] found interface enp7s0
[cn22:63069] found interface enp1s0f0
[cn22:63069] found interface enp33s0f0
[cn22:63069] mca: base: components_open: component posix_ipv4 open function successful
[cn22:63069] mca: base: components_register: registering framework reachable components
[cn22:63069] mca: base: components_register: found loaded component weighted
[cn22:63069] mca: base: components_register: component weighted register function successful
[cn22:63069] mca: base: components_open: opening reachable components
[cn22:63069] mca: base: components_open: found loaded component weighted
[cn22:63069] mca: base: components_open: component weighted open function successful
[cn22:63069] mca:base:select: Auto-selecting reachable components
[cn22:63069] mca:base:select:(reachable) Querying component [weighted]
[cn22:63069] mca:base:select:(reachable) Query of component [weighted] set priority to 1
[cn22:63069] mca:base:select:(reachable) Selected component [weighted]
[cn22:63069] mca: base: components_register: registering framework state components
[cn22:63069] mca: base: components_register: found loaded component tool
[cn22:63069] mca: base: components_register: component tool has no register or open function
[cn22:63069] mca: base: components_register: found loaded component orted
[cn22:63069] mca: base: components_register: component orted has no register or open function
[cn22:63069] mca: base: components_register: found loaded component hnp
[cn22:63069] mca: base: components_register: component hnp has no register or open function
[cn22:63069] mca: base: components_register: found loaded component app
[cn22:63069] mca: base: components_register: component app has no register or open function
[cn22:63069] mca: base: components_register: found loaded component novm
[cn22:63069] mca: base: components_register: component novm has no register or open function
[cn22:63069] mca: base: components_open: opening state components
[cn22:63069] mca: base: components_open: found loaded component tool
[cn22:63069] mca: base: components_open: component tool open function successful
[cn22:63069] mca: base: components_open: found loaded component orted
[cn22:63069] mca: base: components_open: component orted open function successful
[cn22:63069] mca: base: components_open: found loaded component hnp
[cn22:63069] mca: base: components_open: component hnp open function successful
[cn22:63069] mca: base: components_open: found loaded component app
[cn22:63069] mca: base: components_open: component app open function successful
[cn22:63069] mca: base: components_open: found loaded component novm
[cn22:63069] mca: base: components_open: component novm open function successful
[cn22:63069] mca:base:select: Auto-selecting state components
[cn22:63069] mca:base:select:(state) Querying component [tool]
[cn22:63069] mca:base:select:(state) Querying component [orted]
[cn22:63069] mca:base:select:(state) Query of component [orted] set priority to 100
[cn22:63069] mca:base:select:(state) Querying component [hnp]
[cn22:63069] mca:base:select:(state) Querying component [app]
[cn22:63069] mca:base:select:(state) Querying component [novm]
[cn22:63069] mca:base:select:(state) Selected component [orted]
[cn22:63069] mca: base: close: component tool closed
[cn22:63069] mca: base: close: unloading component tool
[cn22:63069] mca: base: close: component hnp closed
[cn22:63069] mca: base: close: unloading component hnp
[cn22:63069] mca: base: close: component app closed
[cn22:63069] mca: base: close: unloading component app
[cn22:63069] mca: base: close: component novm closed
[cn22:63069] mca: base: close: unloading component novm
[cn22:63069] ORTE_JOB_STATE_MACHINE:
[cn22:63069]    State: LOCAL LAUNCH COMPLETE cbfunc: DEFINED
[cn22:63069]    State: FORCED EXIT cbfunc: DEFINED
[cn22:63069]    State: DAEMONS TERMINATED cbfunc: DEFINED
[cn22:63069] ORTE_PROC_STATE_MACHINE:
[cn22:63069]    State: RUNNING cbfunc: DEFINED
[cn22:63069]    State: SYNC REGISTERED cbfunc: DEFINED
[cn22:63069]    State: IOF COMPLETE cbfunc: DEFINED
[cn22:63069]    State: WAITPID FIRED cbfunc: DEFINED
[cn22:63069]    State: NORMALLY TERMINATED cbfunc: DEFINED
[cn22:63069] mca: base: components_register: registering framework plm components
[cn22:63069] mca: base: components_register: found loaded component rsh
[cn22:63069] mca: base: components_register: component rsh register function successful
[cn22:63069] mca: base: components_open: opening plm components
[cn22:63069] mca: base: components_open: found loaded component rsh
[cn22:63069] mca: base: components_open: component rsh open function successful
[cn22:63069] mca:base:select: Auto-selecting plm components
[cn22:63069] mca:base:select:(  plm) Querying component [rsh]
[cn22:63069] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[cn22:63069] mca:base:select:(  plm) Selected component [rsh]
[cn22:63069] mca: base: components_register: registering framework pmix components
[cn22:63069] mca: base: components_register: found loaded component flux
[cn22:63069] mca: base: components_register: component flux register function successful
[cn22:63069] mca: base: components_register: found loaded component ext3x
[cn22:63069] mca: base: components_register: component ext3x register function successful
[cn22:63069] mca: base: components_open: opening pmix components
[cn22:63069] mca: base: components_open: found loaded component flux
[cn22:63069] mca: base: components_open: found loaded component ext3x
[cn22:63069] mca: base: components_open: component ext3x open function successful
[cn22:63069] mca:base:select: Auto-selecting pmix components
[cn22:63069] mca:base:select:( pmix) Querying component [flux]
[cn22:63069] mca:base:select:( pmix) Querying component [ext3x]
[cn22:63069] mca:base:select:( pmix) Query of component [ext3x] set priority to 5
[cn22:63069] mca:base:select:( pmix) Selected component [ext3x]
[cn22:63069] mca: base: close: unloading component flux
[cn22:63069] psquash: flex128 init
[cn22:63069] psquash: native init
[cn22:63069] psquash: flex128 init
[cn22:63069] PMIX server errreg_cbfunc - error handler registered status=0, reference=1
[cn22:63069] mca: base: components_register: registering framework routed components
[cn22:63069] mca: base: components_register: found loaded component radix
[cn22:63069] mca: base: components_register: component radix register function successful
[cn22:63069] mca: base: components_open: opening routed components
[cn22:63069] mca: base: components_open: found loaded component radix
[cn22:63069] orte_routed_base_select: Initializing routed component radix
[cn22:63069] [[28544,0],1]: Final routed priorities
[cn22:63069]    Component: radix Priority: 70
[cn22:63069] mca: base: components_register: registering framework oob components
[cn22:63069] mca: base: components_register: found loaded component tcp
[cn22:63069] mca: base: components_register: component tcp register function successful
[cn22:63069] mca: base: components_open: opening oob components
[cn22:63069] mca: base: components_open: found loaded component tcp
[cn22:63069] mca: base: components_open: component tcp open function successful
[cn22:63069] mca:oob:select: checking available component tcp
[cn22:63069] mca:oob:select: Querying component [tcp]
[cn22:63069] oob:tcp: component_available called
[cn22:63069] [[28544,0],1] oob:tcp: Searching for include address+prefix: 10.10.90.0 / 24
[cn22:63069] oob:tcp: Found match: 10.10.90.122 (enp7s0)
[cn22:63069] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[cn22:63069] [[28544,0],1] oob:tcp:init rejecting interface lo (not in include list)
[cn22:63069] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[cn22:63069] [[28544,0],1] oob:tcp:init rejecting interface eno1 (not in include list)
[cn22:63069] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[cn22:63069] [[28544,0],1] oob:tcp:init rejecting interface enp193s0f0 (not in include list)
[cn22:63069] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
[cn22:63069] [[28544,0],1] oob:tcp:init adding 10.10.90.122 to our list of V4 connections
[cn22:63069] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
[cn22:63069] [[28544,0],1] oob:tcp:init rejecting interface enp1s0f0 (not in include list)
[cn22:63069] WORKING INTERFACE 6 KERNEL INDEX 9 FAMILY: V4
[cn22:63069] [[28544,0],1] oob:tcp:init rejecting interface enp33s0f0 (not in include list)
[cn22:63069] [[28544,0],1] TCP STARTUP
[cn22:63069] [[28544,0],1] attempting to bind to IPv4 port 0
[cn22:63069] [[28544,0],1] assigned IPv4 port 52207
[cn22:63069] mca:oob:select: Adding component to end
[cn22:63069] mca:oob:select: Found 1 active transports
[cn22:63069] [[28544,0],1]: get transports
[cn22:63069] [[28544,0],1]:get transports for component tcp
[cn22:63069] mca: base: components_register: registering framework odls components
[cn22:63069] mca: base: components_register: found loaded component default
[cn22:63069] mca: base: components_register: component default register function successful
[cn22:63069] mca: base: components_register: found loaded component pspawn
[cn22:63069] mca: base: components_register: component pspawn has no register or open function
[cn22:63069] mca: base: components_open: opening odls components
[cn22:63069] mca: base: components_open: found loaded component default
[cn22:63069] mca: base: components_open: component default open function successful
[cn22:63069] mca: base: components_open: found loaded component pspawn
[cn22:63069] mca: base: components_open: component pspawn open function successful
[cn22:63069] mca:base:select: Auto-selecting odls components
[cn22:63069] mca:base:select:( odls) Querying component [default]
[cn22:63069] mca:base:select:( odls) Query of component [default] set priority to 10
[cn22:63069] mca:base:select:( odls) Querying component [pspawn]
[cn22:63069] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[cn22:63069] mca:base:select:( odls) Selected component [default]

At this point it seems to fail (the following log line is a direct continuation from the previous log line):

[cn22:63069] mca: base: close: component pspawn closed
[cn22:63069] mca: base: close: unloading component pspawn
[cn22:63069] [[28544,0],1]: parent 0 num_children 0
[cn22:63069] [[28544,0],1] OOB_SEND: rml_oob_send.c:265
[cn22:63069] [[28544,0],1] ext3x:client get on proc [[28544,0],1] key (null)
[cn22:63069] [[28544,0],1] oob:base:send to target [[28544,0],0] - attempt 0
[cn22:63069] [[28544,0],1] oob:base:send unknown peer [[28544,0],0]
[cn22:63069] [[28544,0],1] ext3x:client get on proc [[28544,0],0] key opal.puri
[cn22:63069] [[28544,0],1] oob:tcp:send_nb to peer [[28544,0],0]:63 seq = -1
[cn22:63069] [[28544,0],1]:[oob_tcp.c:188] processing send to peer [[28544,0],0]:63 seq_num = -1 hop [[28544,0],0] unknown
[cn22:63069] [[28544,0],1]:[oob_tcp.c:191] post no route to [[28544,0],0]
[cn22:63069] [[28544,0],1] OOB_SEND: rml_oob_send.c:265
[cn22:63069] [[28544,0],1] tcp:no route called for peer [[28544,0],0]
[cn22:63069] [[28544,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:63069] [[28544,0],1] oob:base:send to target [[28544,0],0] - attempt 0
[cn22:63069] [[28544,0],1] oob:base:send unknown peer [[28544,0],0]
[cn22:63069] [[28544,0],1] ext3x:client get on proc [[28544,0],0] key opal.puri
[cn22:63069] [[28544,0],1] oob:tcp:send_nb to peer [[28544,0],0]:10 seq = -1
[cn22:63069] [[28544,0],1]:[oob_tcp.c:188] processing send to peer [[28544,0],0]:10 seq_num = -1 hop [[28544,0],0] unknown
[cn22:63069] [[28544,0],1]:[oob_tcp.c:191] post no route to [[28544,0],0]
[cn22:63069] [[28544,0],1] oob:base:send to target [[28544,0],0] - attempt 1
[cn22:63069] [[28544,0],1] oob:base:send known transport for peer [[28544,0],0]
[cn22:63069] [[28544,0],1] oob:tcp:send_nb to peer [[28544,0],0]:63 seq = -1
[cn22:63069] [[28544,0],1]:[oob_tcp.c:188] processing send to peer [[28544,0],0]:63 seq_num = -1 hop [[28544,0],0] unknown
[cn22:63069] [[28544,0],1]:[oob_tcp.c:191] post no route to [[28544,0],0]
[cn22:63069] [[28544,0],1] tcp:no route called for peer [[28544,0],0]
[cn22:63069] [[28544,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:63069] [[28544,0],1] tcp:no route called for peer [[28544,0],0]
[cn22:63069] [[28544,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:63069] [[28544,0],1] oob:base:send to target [[28544,0],0] - attempt 1
[cn22:63069] [[28544,0],1] oob:base:send unknown peer [[28544,0],0]
[cn22:63069] [[28544,0],1] ext3x:client get on proc [[28544,0],0] key opal.puri
[cn22:63069] [[28544,0],1] oob:tcp:send_nb to peer [[28544,0],0]:10 seq = -1
[cn22:63069] [[28544,0],1]:[oob_tcp.c:188] processing send to peer [[28544,0],0]:10 seq_num = -1 hop [[28544,0],0] unknown
[cn22:63069] [[28544,0],1]:[oob_tcp.c:191] post no route to [[28544,0],0]
[cn22:63069] [[28544,0],1] oob:base:send to target [[28544,0],0] - attempt 2
[cn22:63069] [[28544,0],1] oob:base:send known transport for peer [[28544,0],0]
[cn22:63069] [[28544,0],1] oob:tcp:send_nb to peer [[28544,0],0]:63 seq = -1
[cn22:63069] [[28544,0],1]:[oob_tcp.c:188] processing send to peer [[28544,0],0]:63 seq_num = -1 hop [[28544,0],0] unknown
[cn22:63069] [[28544,0],1]:[oob_tcp.c:191] post no route to [[28544,0],0]
[cn22:63069] [[28544,0],1] tcp:no route called for peer [[28544,0],0]
[cn22:63069] [[28544,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:63069] [[28544,0],1] tcp:no route called for peer [[28544,0],0]
[cn22:63069] [[28544,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:63069] [[28544,0],1] oob:base:send to target [[28544,0],0] - attempt 2
[cn22:63069] [[28544,0],1] oob:base:send unknown peer [[28544,0],0]
[cn22:63069] [[28544,0],1] ext3x:client get on proc [[28544,0],0] key opal.puri
[cn22:63069] [[28544,0],1] oob:tcp:send_nb to peer [[28544,0],0]:10 seq = -1
[cn22:63069] [[28544,0],1]:[oob_tcp.c:188] processing send to peer [[28544,0],0]:10 seq_num = -1 hop [[28544,0],0] unknown
[cn22:63069] [[28544,0],1]:[oob_tcp.c:191] post no route to [[28544,0],0]
[cn22:63069] [[28544,0],1] oob:base:send to target [[28544,0],0] - attempt 3
[cn22:63069] [[28544,0],1] ACTIVATE PROC [[28544,0],0] STATE NO PATH TO TARGET AT base/rml_base_frame.c:234
[cn22:63069] psquash: flex128 finalize
[cn22:63069] mca: base: close: component ext3x closed
[cn22:63069] mca: base: close: unloading component ext3x
[cn22:63069] mca: base: close: component rsh closed
[cn22:63069] mca: base: close: unloading component rsh
[cn22:63069] mca: base: close: component default closed
[cn22:63069] mca: base: close: unloading component default
[cn22:63069] mca: base: close: unloading component radix
[cn22:63069] [[28544,0],1] TCP SHUTDOWN
[cn22:63069] no hnp or not active
[cn22:63069] [[28544,0],1] TCP SHUTDOWN done
[cn22:63069] mca: base: close: component tcp closed
[cn22:63069] mca: base: close: unloading component tcp
[cn22:63069] mca: base: close: component orted closed
[cn22:63069] mca: base: close: unloading component orted
[cn22:63069] mca: base: close: component weighted closed
[cn22:63069] mca: base: close: unloading component weighted
[cn22:63069] mca: base: close: unloading component linux_ipv6
[cn22:63069] mca: base: close: unloading component posix_ipv4
[cn22:63069] mca: base: close: component dlopen closed
[cn22:63069] mca: base: close: unloading component dlopen
) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=359193, si_uid=2001, si_status=1, si_utime=5, si_stime=1} ---
write(4, "\21", 1)                      = 1
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, -1) = 1 ([{fd=3, revents=POLLIN}])
read(3, "\21", 1024)                    = 1
read(3, 0x7f2ca444c360, 1024)           = -1 EAGAIN (Resource temporarily unavailable)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], WNOHANG, NULL) = 359193
wait4(-1, 0x7ffc3d7ebd44, WNOHANG, NULL) = -1 ECHILD (No child processes)
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, 0) = 0 (Timeout)
write(2, "[mgmt01:359190] [[28544,0],0] A"..., 105[mgmt01:359190] [[28544,0],0] ACTIVATE PROC [[28544,0],1] STATE FAILED TO START AT plm_rsh_module.c:318
) = 105
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, 0) = 0 (Timeout)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/share/openmpi/help-errmgr-base.txt", O_RDONLY) = 19
ioctl(19, TCGETS, 0x7ffc3d7eba90)       = -1 ENOTTY (Inappropriate ioctl for device)
newfstatat(19, "", {st_mode=S_IFREG|0644, st_size=4147, ...}, AT_EMPTY_PATH) = 0
read(19, "# -*- text -*-\n#\n# Copyright (c)"..., 8192) = 4147
read(19, "", 4096)                      = 0
close(19)                               = 0
write(2, "--------------------------------"..., 1137--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on

[ etc.: the typical error message ] 
rhc54 commented 7 months ago

I'm afraid you are misunderstanding the error message - this has nothing to do with the network. The OOB is complaining that it was never given the connection information for calling back to mpirun. Hence, it has no way of connecting back. The question is why wasn't it given the info?

You might look at the output from --mca plm_base_verbose 5 and see what the ssh command line looks like - it should be given there.
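As a side note, a minimal way to isolate just that line (a sketch; the grep pattern matches the "final template argv" marker that appears in the verbose output):

$ mpirun --mca plm_base_verbose 5 --host cn21,cn22 -np 2 hostname 2>&1 | grep -A 25 "final template argv"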

johebll commented 6 months ago

Thank you again for your time and help!

OOB is complaining that it was never given the connection information

Ahh, i understand. I would have preferred to debug an L2 problem ;-)

Following your proposal, i repeated the mpirun with more verbose logging and also added some "-x" exports:

/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun 
  --nolocal 
  --mca plm_base_verbose 5
  --mca oob_tcp_if_include "10.10.90.0/24" 
  --mca oob "tcp" 
  --mca btl "tcp,vader,self,sm" 
  --mca plm "rsh" 
  --mca routed "direct" 
  --mca pml "ucx" 
  -x UCX_TLS=tcp,vader,sm,self 
  -x UCX_TCP_AF_PRIO=inet
  -x UCX_NET_DEVICES=enp7s0 
  -x UCX_SHM_DEVICES=enp7s0 
  -x UCX_ACC_DEVICES=enp7s0 
  -x UCX_SELF_DEVICES=enp7s0 
  -x UCX_PROTOS=all 
  -x UCX_SOCKADDR_TLS_PRIORITY=tcp,sockcm 
  -x UCX_WARN_INVALID_CONFIG=y 
  -x UCX_ADDRESS_DEBUG_INFO=y 
  --host cn22 
  -np 2 
  /usr/bin/hostname 
  > result

The resulting SSH setup:

write(2, "[cn21:77506] [[9774,0],0] plm"..., 1138[cn21:77506] [[9774,0],0] plm:rsh: final template argv:
    /usr/bin/ssh <template>
    PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin:$PATH ; export PATH ; 
    LD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:${LD_LIBRARY_PATH:-} ; export LD_LIBRARY_PATH ; 
    DYLD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:${DYLD_LIBRARY_PATH:-} ; export DYLD_LIBRARY_PATH ;   
    /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/orted 
    -mca ess "env" 
    -mca ess_base_jobid "640548864" 
    -mca ess_base_vpid "<template>" 
    -mca ess_base_num_procs "2" 
    -mca orte_node_regex "cn[2:21-22]@0(2)" 
    -mca orte_hnp_uri "640548864.0;tcp://10.10.90.121:45851" 
    --mca plm_base_verbose "5" 
    --mca oob_tcp_if_include "10.10.90.0/24" 
    --mca oob "tcp" 
    --mca btl "tcp,vader,self,sm" 
    --mca pml "ucx" 
    -mca plm "rsh" 
    --tree-spawn 
    -mca routed "direct" 
    -mca orte_parent_uri "640548864.0;tcp://10.10.90.121:45851" 
    -mca hwloc_base_report_bindings "1" 
    -mca orte_display_alloc "1" 
    -mca rmaps_base_no_schedule_local "1" 
    -mca pmix "^s1,s2,cray,isolated"
) = 1138
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7ff529d04a10) = 77509
setpgid(77509, 77509)                   = 0
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, -1tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified

Restricting the network works:

write(2, "[cn21:77506] mca: base: compo"..., 85[cn21:77506] mca: base: components_register: registering framework oob components
write(2, "[cn21:77506] mca: base: compo"..., 75[cn21:77506] mca: base: components_register: found loaded component tcp
write(2, "[cn21:77506] mca: base: compo"..., 91[cn21:77506] mca: base: components_register: component tcp register function successful
write(2, "[cn21:77506] mca: base: compo"..., 67[cn21:77506] mca: base: components_open: opening oob components
write(2, "[cn21:77506] mca: base: compo"..., 71[cn21:77506] mca: base: components_open: found loaded component tcp
write(2, "[cn21:77506] mca: base: compo"..., 83[cn21:77506] mca: base: components_open: component tcp open function successful
write(2, "[cn21:77506] mca:oob:select: "..., 65[cn21:77506] mca:oob:select: checking available component tcp
write(2, "[cn21:77506] mca:oob:select: "..., 57[cn21:77506] mca:oob:select: Querying component [tcp]
write(2, "[cn21:77506] oob:tcp: compone"..., 52[cn21:77506] oob:tcp: component_available called
write(2, "[cn21:77506] [[9774,0],0] oob"..., 92[cn21:77506] [[9774,0],0] oob:tcp: Searching for include address+prefix: 10.10.90.0 / 24
write(2, "[cn21:77506] oob:tcp: Found m"..., 60[cn21:77506] oob:tcp: Found match: 10.10.90.121 (enp7s0)
write(2, "[cn21:77506] WORKING INTERFAC"..., 62[cn21:77506] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
write(2, "[cn21:77506] [[9774,0],0] oob"..., 87[cn21:77506] [[9774,0],0] oob:tcp:init rejecting interface lo (not in include list)
write(2, "[cn21:77506] [[9774,0],0] oob"..., 89[cn21:77506] [[9774,0],0] oob:tcp:init rejecting interface eno1 (not in include list)
write(2, "[cn21:77506] [[9774,0],0] oob"..., 95[cn21:77506] [[9774,0],0] oob:tcp:init rejecting interface enp193s0f0 (not in include list)
write(2, "[cn21:77506] [[9774,0],0] oob"..., 92[cn21:77506] [[9774,0],0] oob:tcp:init adding 10.10.90.121 to our list of V4 connections
write(2, "[cn21:77506] [[9774,0],0] oob"..., 93[cn21:77506] [[9774,0],0] oob:tcp:init rejecting interface enp1s0f0 (not in include list)
write(2, "[cn21:77506] [[9774,0],0] oob"..., 94[cn21:77506] [[9774,0],0] oob:tcp:init rejecting interface enp33s0f0 (not in include list)

All MCA components load successfully:

[cn22:70820] mca: base: components_register: registering framework plm components
[cn22:70820] mca: base: components_register: found loaded component rsh
[cn22:70820] mca: base: components_register: component rsh register function successful
[cn22:70820] mca: base: components_open: opening plm components
[cn22:70820] mca: base: components_open: found loaded component rsh
[cn22:70820] mca: base: components_open: component rsh open function successful
[cn22:70820] mca:base:select: Auto-selecting plm components
[cn22:70820] mca:base:select:(  plm) Querying component [rsh]
[cn22:70820] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[cn22:70820] mca:base:select:(  plm) Selected component [rsh]
[cn22:70820] mca: base: components_register: registering framework routed components
[cn22:70820] mca: base: components_register: found loaded component direct
[cn22:70820] mca: base: components_register: component direct has no register or open function
[cn22:70820] mca: base: components_open: opening routed components
[cn22:70820] mca: base: components_open: found loaded component direct
[cn22:70820] orte_routed_base_select: Initializing routed component direct
[cn22:70820] [[9774,0],1]: Final routed priorities
[cn22:70820]    Component: direct Priority: 0
[cn22:70820] mca: base: components_register: registering framework oob components
[cn22:70820] mca: base: components_register: found loaded component tcp
[cn22:70820] mca: base: components_register: component tcp register function successful
[cn22:70820] mca: base: components_open: opening oob components
[cn22:70820] mca: base: components_open: found loaded component tcp
[cn22:70820] mca: base: components_open: component tcp open function successful
[cn22:70820] mca:oob:select: checking available component tcp
[cn22:70820] mca:oob:select: Querying component [tcp]
[cn22:70820] oob:tcp: component_available called

TCP startup successful:

[cn22:70820] [[9774,0],1] TCP STARTUP
[cn22:70820] [[9774,0],1] attempting to bind to IPv4 port 0
[cn22:70820] [[9774,0],1] assigned IPv4 port 60373
[cn22:70820] mca:oob:select: Adding component to end
[cn22:70820] mca:oob:select: Found 1 active transports
[cn22:70820] [[9774,0],1]: get transports
[cn22:70820] [[9774,0],1]:get transports for component tcp
[cn22:70820] [[9774,0],1] OOB_SEND: rml_oob_send.c:265
[cn22:70820] [[9774,0],1] oob:base:send to target [[9774,0],0] - attempt 0
[cn22:70820] [[9774,0],1] oob:base:send unknown peer [[9774,0],0]
[cn22:70820] [[9774,0],1] oob:tcp:send_nb to peer [[9774,0],0]:63 seq = -1
[cn22:70820] [[9774,0],1]:[oob_tcp.c:188] processing send to peer [[9774,0],0]:63 seq_num = -1 hop [[9774,0],0] unknown
[cn22:70820] [[9774,0],1]:[oob_tcp.c:191] post no route to [[9774,0],0]
[cn22:70820] [[9774,0],1] OOB_SEND: rml_oob_send.c:265
[cn22:70820] [[9774,0],1] tcp:no route called for peer [[9774,0],0]
[cn22:70820] [[9774,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:70820] [[9774,0],1] oob:base:send to target [[9774,0],0] - attempt 0

[ etc. ]

[cn22:70820] [[9774,0],1] oob:base:send to target [[9774,0],0] - attempt 3
[cn22:70820] mca: base: close: component rsh closed
[cn22:70820] mca: base: close: unloading component rsh
[cn22:70820] mca: base: close: unloading component direct
[cn22:70820] [[9774,0],1] TCP SHUTDOWN
[cn22:70820] no hnp or not active
[cn22:70820] [[9774,0],1] TCP SHUTDOWN done
[cn22:70820] mca: base: close: component tcp closed
[cn22:70820] mca: base: close: unloading component tcp

Then the same errors as before:

) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=77509, si_uid=2001, si_status=1, si_utime=1, si_stime=0} ---
write(4, "\21", 1)                      = 1
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, -1) = 1 ([{fd=3, revents=POLLIN}])
read(3, "\21", 1024)                    = 1
read(3, 0x7ff529f4f360, 1024)           = -1 EAGAIN (Resource temporarily unavailable)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], WNOHANG, NULL) = 77509
wait4(-1, 0x7ffde776f8c4, WNOHANG, NULL) = -1 ECHILD (No child processes)
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, 0) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, 0) = 0 (Timeout)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/share/openmpi/help-errmgr-base.txt", O_RDONLY) = 19
ioctl(19, TCGETS, 0x7ffde776f610)       = -1 ENOTTY (Inappropriate ioctl for device)
newfstatat(19, "", {st_mode=S_IFREG|0644, st_size=4147, ...}, AT_EMPTY_PATH) = 0
read(19, "# -*- text -*-\n#\n# Copyright (c)"..., 8192) = 4147
read(19, "", 8192)                      = 0
close(19)                               = 0
write(2, "--------------------------------"..., 1137--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
) = 1137

1) I wonder: the "template" in the ssh setup line:

    -mca ess_base_vpid "<template>" 

is hopefully correct? I assume the number/ID one would expect here is filled in upon execution?

2) I also executed that ssh line from cn21 to cn22 step by step including the env settings, and was able to launch ORTED without any error message.

tcp        0      0 0.0.0.0:48261           0.0.0.0:*               LISTEN      2001       591373     71120/orted         
tcp        0      0 127.0.0.1:52737         0.0.0.0:*               LISTEN      2001       591372     71120/orted  

3) I double-checked the actual user-facing error message with its proposals for resolving the problem and compared it to the configuration of the OpenMPI build:

Configure command line: 
'--prefix=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5'
'--disable-static' 
'--enable-builtin-atomics'
'--with-sge' 
'--enable-mpi-cxx'
'--with-hwloc=/opt/ohpc/pub/libs/hwloc'
'--with-pmix=/opt/ohpc/admin/pmix'
'--with-libevent=external'
'--with-libfabric=/opt/ohpc/pub/mpi/libfabric/1.18.0'
'--with-ucx=/opt/ohpc/pub/mpi/ucx-ohpc/1.14.0'
'--without-verbs' 
'--with-tm=/opt/pbs/'

The library paths are all present and accessible on both nodes.

4) And just to make sure: this OpenMPI is installed together with Slurm, but i disabled Slurm intentionally (all daemons shut down, Slurm's PAM lock for SSH is also disabled, so passwordless logins work flawlessly) to keep the debugging environment simple. Also: i executed the mpirun directly on the compute nodes, not on the login node. I hope there is no remote chance left that a "dead" Slurm is still blocking access?

5)

The OOB is complaining that it was never given the connection information for calling back to mpirun. Hence, it has no way of connecting back. The question is why wasn't it given the info?

I believe i cranked up all logging to its maximum in another session afterwards, but could not obtain any additional info. Could you please give me an idea where/what to look for in the output of the mpirun execution? Is everything after "TCP STARTUP" possibly already too late?

If you have any idea, i would be tremendously grateful, because currently i have the impression the problem lies in the OpenMPI-internal process communication, which i unfortunately have little idea how to look into beyond examining the logs...

Thank you.

ggouaillardet commented 6 months ago

First you should make sure there is no firewall between the hosts. For example, on cn21

$ nc -l 45851

and then, from another terminal on cn22

$ echo hello | nc 10.10.90.121 45851

hello should be displayed on the first terminal.

Then you can run

$ ifconfig -a
$ netstat -nr

on cn22 to check whether there is routing between the two nodes or not.
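As a complementary check, the iproute2 tools can show which route and source interface the kernel would actually pick for the call-back address (a minimal sketch; 10.10.90.121 is the HNP address from the logs above):

$ ip route get 10.10.90.121
$ ip -4 route show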

Then if you want to use strace, what you really want is to strace the orted daemon spawned on cn22.

create a script orted.sh like this

#!/bin/sh

strace -f /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/orted "$@"

make it executable and then from cn21

$ /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun --mca orte_launch_agent `pwd`/orted.sh --mca host cn22 --mca oob_tcp_if_include 10.10.90.0/24 -np 1 hostname
johebll commented 6 months ago

Hello @ggouaillardet, thank you very much for your proposal!

1) Routes are enabled for each of the interfaces of both multihomed hosts. The one used for messaging does not have the highest priority, but ping still works well.
2) No firewall is installed.
3) The netcat test succeeds in both directions.

I will now test your proposal with strace, which sounds very promising.

On this occasion: from "OpenMPI_easybuild_tech_talks_01_OpenMPI_part2" i gathered that "ethernet-only" networks (as of 2020) should rather use ob1 with the tcp btl instead of UCX.

Should i therefore better focus the strace on

--mca pml "obi" --mca btl "tcp,vader,sm,self"

?

Best

ggouaillardet commented 6 months ago

pml is used by the MPI application only, and hostname is obviously not one, so long story short, it does not matter here.

btw, there was a typo, it should be --mca pml ob1

johebll commented 6 months ago

Hello @ggouaillardet

@typo: ha, incomplete cognitive spillover ;-)

I hope i understood you right.

The wrapper:

orted_wrapper.sh
#!/bin/sh
time=$(date +%y%m%d_%H%M%S)
echo -e "\n[$(date +%y%m%d_%H%M%S)] $0 launched on $(hostname -s)\n\n" | tee -a orted_sh_${time}.log
# note: strace writes its trace to stderr, so ">>" captures only orted's stdout
strace -f /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/orted "$@" >> orted_sh_${time}.log
# note: "exit 0" masks orted's real exit status towards mpirun
exit 0

Because of different behaviour, i tested 2 syntax variants.

1) mpitestuser@cn21: mpirun --mca orte_launch_agent "/home/mpitestuser/orted_wrapper.sh" --mca oob_tcp_if_include "10.10.90.0/24" -np 1 --mca host cn22 hostname
2) mpitestuser@cn21: mpirun --mca orte_launch_agent "/home/mpitestuser/orted_wrapper.sh" --mca oob_tcp_if_include "10.10.90.0/24" -np 1 --host cn22 hostname

Test 1:

[mpitestuser@cn21:tty0]()[~]$
  date; strace /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun --mca orte_launch_agent "/home/mpitestuser/orted_wrapper.sh" --mca oob_tcp_if_include "10.10.90.0/24" -np 1 --mca host cn22 /usr/bin/hostname; date
Sun Feb 25 07:18:42 PM CET 2024
execve("/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun", ["/opt/ohpc/pub/mpi/openmpi4-gnu12"..., "--mca", "orte_launch_agent", "/home/mpitestuser/orted_wrapper.sh", "--mca", "oob_tcp_if_include", "10.10.90.0/24", "-np", "1", "--mca", "host", "cn22", "/usr/bin/hostname"], 0x7ffe60fe02b0 /* 39 vars */) = 0
brk(NULL)    

[ etc. ]

chdir("/home/mpitestuser")                   = 0
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=23, events=POLLIN}], 6, 0) = 0 (Timeout)
chdir("/home/mpitestuser")                   = 0
write(2, "[cn21:84550] [[17066,0],0] od"..., 71[cn21:84550] [[17066,0],0] odls:launch spawning child [[17066,1],0]
) = 71
write(2, "[cn21:84550] \n Data for app_c"..., 5483[cn21:84550] 
 Data for app_context: index 0  app: /usr/bin/hostname
    Num procs: 1    FirstRank: 0
    Argv[0]: /usr/bin/hostname
    Env[0]: OMPI_MCA_orte_launch_agent=/home/mpitestuser/orted_wrapper.sh
    Env[1]: OMPI_MCA_oob_tcp_if_include=10.10.90.0/24
    Env[2]: OMPI_MCA_host=cn22
    Env[3]: OMPI_MCA_pmix=^s1,s2,cray,isolated
    Env[4]: PMIX_MCA_mca_base_component_show_load_errors=1
    Env[5]: PMIX_DEBUG=100
    Env[6]: OMPI_COMMAND=hostname
    Env[7]: OMPI_MCA_orte_precondition_transports=51fe5876dec26368-9d75d092c8cfa2bd
    Env[8]: SHELL=/bin/bash
    Env[9]: GREP_COLOR=7;31;43
    Env[10]: HISTCONTROL=ignoredups
    Env[11]: HISTSIZE=
    Env[12]: HOSTNAME=cn21
    Env[13]: HISTTIMEFORMAT=[%F %T] 
    Env[14]: PWD=/home/mpitestuser
    Env[15]: LOGNAME=mpitestuser
    Env[16]: XDG_SESSION_TYPE=tty
    Env[17]: MOTD_SHOWN=pam
    Env[18]: HOME=/home/mpitestuser
    Env[19]: LANG=en_US.UTF-8
    Env[20]: HISTFILE=/home/mpitestuser/.bash_history_hf
    Env[21]: LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.m4a=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.oga=01;36:*.opus=01;36:*.spx=01;36:*.xspf=01;36::di=96:su=30;41:sg=30;41
    Env[22]: SSH_CONNECTION=10.10.90.100 38490 10.10.90.121 22
    Env[23]: SLRMDEFENVRS=/usr/local/bin/slurm/slurmd/slrmdefenvvars
    Env[24]: XDG_SESSION_CLASS=user
    Env[25]: SELINUX_ROLE_REQUESTED=
    Env[26]: TERM=xterm-256color
    Env[27]: LESSOPEN=||/usr/bin/lesspipe.sh %s
    Env[28]: USER=mpitestuser
    Env[29]: SELINUX_USE_CURRENT_RANGE=
    Env[30]: SHLVL=1
    Env[31]: XDG_SESSION_ID=596
    Env[32]: XDG_RUNTIME_DIR=/run/user/2001
    Env[33]: S_COLORS=auto
    Env[34]: PS1=\n\n\[\[\033[38;5;11m\][\u@\H:\[\]\[\033[38;5;190m\]tty\l\[\[\033[38;5;11m\]\[\033[38;5;11m\]\]]($(date +%y%m%d_%H%M%S))[\w]\[\033[38;5;81m\]$\[\033[0m\]\n \[\033[38;5;220m\]\[\033[48;5;24m\] \!.\#: \[\033[0m\]\[\] 
    Env[35]: SSH_CLIENT=10.10.90.100 38490 22
    Env[36]: DEBUGINFOD_URLS=https://debuginfod.centos.org/ 
    Env[37]: which_declare=declare -f
    Env[38]: XDG_DATA_DIRS=/home/mpitestuser/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share:/usr/share
    Env[39]: PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin:/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin:/home/mpitestuser/.local/bin:/home/mpitestuser/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin:/opt/ohpc/pub/mpi/ucx-ohpc/1.14.0/bin:/opt/ohpc/pub/compiler/gcc/12.2.0/bin
    Env[40]: SELINUX_LEVEL_REQUESTED=
    Env[41]: HISTFILESIZE=
    Env[42]: DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/2001/bus
    Env[43]: MAIL=/var/spool/mail/mpitestuser
    Env[44]: SSH_TTY=/dev/pts/0
    Env[45]: BASH_FUNC_which%%=() {  ( alias;
 eval ${which_declare} ) | /usr/bin/which --tty-only --read-alias --read-functions --show-tilde --show-dot $@
}
    Env[46]: _=/usr/bin/strace
    Env[47]: IPATH_NO_BACKTRACE=1
    Env[48]: HFI_NO_BACKTRACE=1
    Env[49]: LD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib
    Env[50]: OMPI_MCA_orte_local_daemon_uri=1118437376.0;tcp://10.10.90.121:44861
    Env[51]: OMPI_MCA_orte_hnp_uri=1118437376.0;tcp://10.10.90.121:44861
    Env[52]: OMPI_MCA_mpi_oversubscribe=0
    Env[53]: OMPI_MCA_orte_app_num=0
    Env[54]: OMPI_UNIVERSE_SIZE=96
    Env[55]: OMPI_MCA_orte_num_nodes=1
    Env[56]: OMPI_MCA_shmem_RUNTIME_QUERY_hint=mmap
    Env[57]: OMPI_MCA_orte_bound_at_launch=1
    Env[58]: OMPI_MCA_ess=^singleton
    Env[59]: OMPI_MCA_orte_ess_num_procs=1
    Env[60]: OMPI_COMM_WORLD_SIZE=1
    Env[61]: OMPI_COMM_WORLD_LOCAL_SIZE=1
    Env[62]: OMPI_MCA_orte_tmpdir_base=/tmp
    Env[63]: OMPI_MCA_orte_top_session_dir=/tmp/ompi.cn21.2001
    Env[64]: OMPI_MCA_orte_jobfam_session_dir=/tmp/ompi.cn21.2001/pid.84550
    Env[65]: OMPI_NUM_APP_CTX=1
    Env[66]: OMPI_FIRST_RANKS=0
    Env[67]: OMPI_APP_CTX_NUM_PROCS=1
    Env[68]: OMPI_MCA_initial_wdir=/home/mpitestuser
    Env[69]: OMPI_MCA_orte_launch=1
    Working dir: /home/mpitestuser
    Prefix: /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5
    Used on node: TRUE
 ORTE_ATTR: GLOBAL Data type: OPAL_STRING   Key: APP-PREFIX-DIR Value: /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5
 ORTE_ATTR: LOCAL Data type: OPAL_INT32 Key: APP-MAX-RESTARTS   Value: 0
) = 5483
pipe([25, 26])                          = 0

[ etc. ]

openat(AT_FDCWD, "/tmp/ompi.cn21.2001", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 9
newfstatat(9, "", {st_mode=S_IFDIR|0700, st_size=180, ...}, AT_EMPTY_PATH) = 0
getdents64(9, 0x8d07e0 /* 9 entries */, 32768) = 272
close(9)                                = 0
munmap(0x7fbdeabf1000, 38280)           = 0
munmap(0x7fbdeafcf000, 16912)           = 0
munmap(0x7fbdeac2d000, 16912)           = 0
munmap(0x7fbdeac28000, 16976)           = 0
munmap(0x7fbdeac1c000, 48408)           = 0
munmap(0x7fbdeac17000, 16912)           = 0
write(2, "[cn21:84550] mca: base: close"..., 60[cn21:84550] mca: base: close: component weighted closed
) = 60
write(2, "[cn21:84550] mca: base: close"..., 63[cn21:84550] mca: base: close: unloading component weighted
) = 63
munmap(0x7fbdeac0d000, 16792)           = 0
close(5)                                = 0
rt_sigaction(SIGCHLD, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fbdeac89db0}, NULL, 8) = 0
close(3)                                = 0
close(4)                                = 0
munmap(0x7fbdeac12000, 17048)           = 0
write(2, "[cn21:84550] mca: base: close"..., 65[cn21:84550] mca: base: close: unloading component posix_ipv4
) = 65
munmap(0x7fbdeab4c000, 21072)           = 0
write(2, "[cn21:84550] mca: base: close"..., 58[cn21:84550] mca: base: close: component dlopen closed
) = 58
write(2, "[cn21:84550] mca: base: close"..., 61[cn21:84550] mca: base: close: unloading component dlopen
) = 61
newfstatat(AT_FDCWD, "/tmp/ompi.cn21.2001/pid.84550", 0x7fff09af80f0, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/ompi.cn21.2001/pid.84550", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/ompi.cn21.2001/pid.84550", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/ompi.cn21.2001", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
newfstatat(3, "", {st_mode=S_IFDIR|0700, st_size=180, ...}, AT_EMPTY_PATH) = 0
getdents64(3, 0x8d07e0 /* 9 entries */, 32768) = 272
close(3)                                = 0
newfstatat(AT_FDCWD, "/tmp/ompi.cn21.2001", {st_mode=S_IFDIR|0700, st_size=180, ...}, 0) = 0
openat(AT_FDCWD, "/tmp/ompi.cn21.2001", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
newfstatat(3, "", {st_mode=S_IFDIR|0700, st_size=180, ...}, AT_EMPTY_PATH) = 0
getdents64(3, 0x8d07e0 /* 9 entries */, 32768) = 272
newfstatat(AT_FDCWD, "/tmp/ompi.cn21.2001/jf.9774", {st_mode=S_IFDIR|0700, st_size=120, ...}, 0) = 0
newfstatat(AT_FDCWD, "/tmp/ompi.cn21.2001/pid.69893", {st_mode=S_IFDIR|0700, st_size=140, ...}, 0) = 0
newfstatat(AT_FDCWD, "/tmp/ompi.cn21.2001/jf.31671", {st_mode=S_IFDIR|0700, st_size=120, ...}, 0) = 0
newfstatat(AT_FDCWD, "/tmp/ompi.cn21.2001/jf.31261", {st_mode=S_IFDIR|0700, st_size=120, ...}, 0) = 0
newfstatat(AT_FDCWD, "/tmp/ompi.cn21.2001/jf.31483", {st_mode=S_IFDIR|0700, st_size=120, ...}, 0) = 0
newfstatat(AT_FDCWD, "/tmp/ompi.cn21.2001/jf.31431", {st_mode=S_IFDIR|0700, st_size=120, ...}, 0) = 0
newfstatat(AT_FDCWD, "/tmp/ompi.cn21.2001/jf.29941", {st_mode=S_IFDIR|0700, st_size=100, ...}, 0) = 0
getdents64(3, 0x8d07e0 /* 0 entries */, 32768) = 0
close(3)                                = 0
openat(AT_FDCWD, "/tmp/ompi.cn21.2001", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
newfstatat(3, "", {st_mode=S_IFDIR|0700, st_size=180, ...}, AT_EMPTY_PATH) = 0
getdents64(3, 0x8d07e0 /* 9 entries */, 32768) = 272
close(3)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++
Sun Feb 25 07:18:43 PM CET 2024

[mpitestuser@cn21:tty0]()[~]$

STDOUT went only to the CLI of the submitting user on cn21. No logfile orted_sh_${time}.log was generated, therefore the wrapper was apparently not executed.

Test 2:

[mpitestuser@cn21:tty0]()[~]$
  date; strace /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun --mca orte_launch_agent "/home/mpitestuser/orted_wrapper.sh" --mca oob_tcp_if_include "10.10.90.0/24" -np 1 --host cn22 /usr/bin/hostname; date
Sun Feb 25 07:20:31 PM CET 2024
execve("/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun", ["/opt/ohpc/pub/mpi/openmpi4-gnu12"..., "--mca", "orte_launch_agent", "/home/mpitestuser/orted_wrapper.sh", "--mca", "oob_tcp_if_include", "10.10.90.0/24", "-np", "1", "--host", "cn22", "/usr/bin/hostname"], 0x7ffee456e278 /* 39 vars */) = 0
brk(NULL)                               = 0x1d2f000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe425f4e70) = -1 EINVAL (Invalid argument)

mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2d9d8b1000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)

[ etc. ]

futex(0x1d67d58, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x1d67d08, FUTEX_WAKE_PRIVATE, 1) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0

write(2, "[cn21:84646] [[16970,0],0] pl"..., 865[cn21:84646] [[16970,0],0] plm:rsh: final template argv:
    /usr/bin/ssh <template>           
    PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin:$PATH ; export PATH ; 
    LD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:${LD_LIBRARY_PATH:-} ; export LD_LIBRARY_PATH ; 
    DYLD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:${DYLD_LIBRARY_PATH:-} ; export DYLD_LIBRARY_PATH ;   
    /home/mpitestuser/orted_wrapper.sh 
    -mca ess "env" 
    -mca ess_base_jobid "1112145920" 
    -mca ess_base_vpid "<template>" 
    -mca ess_base_num_procs "2" 
    -mca orte_node_regex "cn[2:21-22]@0(2)" 
    -mca orte_hnp_uri "1112145920.0;tcp://10.10.90.121:58463" 
    --mca orte_launch_agent "/home/mpitestuser/orted_wrapper.sh" 
    --mca oob_tcp_if_include "10.10.90.0/24" 
    -mca plm "rsh" 
    --tree-spawn 
    -mca routed "radix" 
    -mca orte_parent_uri "1112145920.0;tcp://10.10.90.121:58463" 
    -mca pmix "^s1,s2,cray,isolated"

) = 865
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f2d9d39ea10) = 84649
setpgid(84649, 84649)                   = 0
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, -1tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified

[240225_192033] bash launched on cn22

execve("/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/orted", 
["/opt/ohpc/pub/mpi/openmpi4-gnu12"..., 
"-mca", "ess", "env", 
"-mca", "ess_base_jobid", "1112145920", 
"-mca", "ess_base_vpid", "1", 
"-mca", "ess_base_num_procs", "2", 
"-mca", "orte_node_regex", "cn[2:21-22]@0(2)", 
"-mca", "orte_hnp_uri", "1112145920.0;tcp://10.10.90.121:"..., 
"--mca", "orte_launch_agent", "/home/mpitestuser/orted_wrapper.sh", 
"--mca", "oob_tcp_if_include", "10.10.90.0/24", 
"-mca", "plm", "rsh", 
"--tree-spawn", 
"-mca", "routed", "radix", ...], 0x7ffd7bad58f0 /* 35 vars */) = 0
brk(NULL)                               = 0x10cc000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffecf1fe750) = -1 EINVAL (Invalid argument)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2cc55c2000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/glibc-hwcaps/x86-64-v3/libopen-rte.so.40", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/glibc-hwcaps/x86-64-v3", 0x7ffecf1fd980, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/glibc-hwcaps/x86-64-v2/libopen-rte.so.40", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/glibc-hwcaps/x86-64-v2", 0x7ffecf1fd980, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/tls/x86_64/x86_64/libopen-rte.so.40", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/tls/x86_64/x86_64", 0x7ffecf1fd980, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/tls/x86_64/libopen-rte.so.40", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/tls/x86_64", 0x7ffecf1fd980, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/tls/x86_64/libopen-rte.so.40", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/tls/x86_64", 0x7ffecf1fd980, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/tls/libopen-rte.so.40", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/tls", 0x7ffecf1fd980, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/x86_64/x86_64/libopen-rte.so.40", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/x86_64/x86_64", 0x7ffecf1fd980, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/x86_64/libopen-rte.so.40", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/x86_64", 0x7ffecf1fd980, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/x86_64/libopen-rte.so.40", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/x86_64", 0x7ffecf1fd980, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/libopen-rte.so.40", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\273\1\0\0\0\0\0"..., 832) = 832
newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=834408, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 773144, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f2cc5505000
mmap(0x7f2cc551f000, 512000, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a000) = 0x7f2cc551f000
mmap(0x7f2cc559c000, 122880, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x97000) = 0x7f2cc559c000
mmap(0x7f2cc55ba000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xb4000) = 0x7f2cc55ba000
mmap(0x7f2cc55c0000, 7192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f2cc55c0000
close(3)                                = 0

[ etc. ]

write(2, "[cn22:77707] mca: base: close"..., 60[cn22:77707] mca: base: close: component weighted closed
) = 60
write(2, "[cn22:77707] mca: base: close"..., 63[cn22:77707] mca: base: close: unloading component weighted
) = 63
munmap(0x7f2cc50aa000, 16792)           = 0
close(5)                                = 0
rt_sigaction(SIGCHLD, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f2cc5106db0}, NULL, 8) = 0
close(3)                                = 0
close(4)                                = 0
munmap(0x7f2cc544c000, 17048)           = 0
write(2, "[cn22:77707] mca: base: close"..., 65[cn22:77707] mca: base: close: unloading component posix_ipv4
) = 65
munmap(0x7f2cc507e000, 25888)           = 0
munmap(0x7f2cc506c000, 21072)           = 0
munmap(0x7f2cc45f0000, 25296)           = 0
munmap(0x7f2cc45e8000, 29368)           = 0
munmap(0x7f2cc45dd000, 42200)           = 0
munmap(0x7f2cc45d6000, 25256)           = 0
munmap(0x7f2cc45cf000, 25136)           = 0
munmap(0x7f2cc45c7000, 29232)           = 0
write(2, "[cn22:77707] mca: base: close"..., 58[cn22:77707] mca: base: close: component dlopen closed
) = 58
write(2, "[cn22:77707] mca: base: close"..., 61[cn22:77707] mca: base: close: unloading component dlopen
) = 61
newfstatat(AT_FDCWD, "/tmp/ompi.cn22.2001/jf.16970", 0x7ffecf1fe320, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/ompi.cn22.2001/jf.16970", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/ompi.cn22.2001/jf.16970", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/ompi.cn22.2001", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
newfstatat(3, "", {st_mode=S_IFDIR|0700, st_size=60, ...}, AT_EMPTY_PATH) = 0
getdents64(3, 0x12b2080 /* 3 entries */, 32768) = 80
close(3)                                = 0
newfstatat(AT_FDCWD, "/tmp/ompi.cn22.2001", {st_mode=S_IFDIR|0700, st_size=60, ...}, 0) = 0
openat(AT_FDCWD, "/tmp/ompi.cn22.2001", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
newfstatat(3, "", {st_mode=S_IFDIR|0700, st_size=60, ...}, AT_EMPTY_PATH) = 0
getdents64(3, 0x12b2080 /* 3 entries */, 32768) = 80
newfstatat(AT_FDCWD, "/tmp/ompi.cn22.2001/jf.9774", {st_mode=S_IFDIR|0700, st_size=120, ...}, 0) = 0
getdents64(3, 0x12b2080 /* 0 entries */, 32768) = 0
close(3)                                = 0
openat(AT_FDCWD, "/tmp/ompi.cn22.2001", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
newfstatat(3, "", {st_mode=S_IFDIR|0700, st_size=60, ...}, AT_EMPTY_PATH) = 0
getdents64(3, 0x12b2080 /* 3 entries */, 32768) = 80
close(3)                                = 0
exit_group(1)                           = ?
+++ exited with 1 +++
) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=84649, si_uid=2001, si_status=0, si_utime=15, si_stime=63} ---
write(4, "\21", 1)                      = 1
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, -1) = 1 ([{fd=3, revents=POLLIN}])
read(3, "\21", 1024)                    = 1
read(3, 0x7f2d9d5e9360, 1024)           = -1 EAGAIN (Resource temporarily unavailable)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 84649
wait4(-1, 0x7ffe425f4ce4, WNOHANG, NULL) = -1 ECHILD (No child processes)
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, 0) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, 0) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, -1

++++++ CLI STDOUT stuck here

^C ++ Cancelled on CLI

strace: Process 84646 detached
 <detached ...>
[cn21:84646] [[16970,0],0] OOB_SEND: rml_oob_send.c:265
[cn21:84646] [[16970,0],0] OOB_SEND: rml_oob_send.c:265
[cn21:84646] [[16970,0],0] oob:base:send to target [[16970,0],1] - attempt 0
[cn21:84646] [[16970,0],0] oob:base:send unknown peer [[16970,0],1]
[cn21:84646] [[16970,0],0] ext3x:client get on proc [[16970,0],1] key opal.puri
[cn21:84646] [[16970,0],0] oob:tcp:send_nb to peer [[16970,0],1]:15 seq = -1
[cn21:84646] [[16970,0],0]:[oob_tcp.c:188] processing send to peer [[16970,0],1]:15 seq_num = -1 hop [[16970,0],1] unknown
[cn21:84646] [[16970,0],0]:[oob_tcp.c:191] post no route to [[16970,0],1]
[cn21:84646] [[16970,0],0] oob:base:send to target [[16970,0],1] - attempt 0
[cn21:84646] [[16970,0],0] oob:base:send known transport for peer [[16970,0],1]
[cn21:84646] [[16970,0],0] oob:tcp:send_nb to peer [[16970,0],1]:15 seq = -1
[cn21:84646] [[16970,0],0]:[oob_tcp.c:188] processing send to peer [[16970,0],1]:15 seq_num = -1 hop [[16970,0],1] unknown
[cn21:84646] [[16970,0],0]:[oob_tcp.c:191] post no route to [[16970,0],1]
[cn21:84646] [[16970,0],0] tcp:no route called for peer [[16970,0],1]
[cn21:84646] [[16970,0],0] OOB_SEND: oob_tcp_component.c:1123
[cn21:84646] [[16970,0],0] tcp:no route called for peer [[16970,0],1]
[cn21:84646] [[16970,0],0] OOB_SEND: oob_tcp_component.c:1123
[cn21:84646] [[16970,0],0] oob:base:send to target [[16970,0],1] - attempt 1
[cn21:84646] [[16970,0],0] oob:base:send unknown peer [[16970,0],1]
[cn21:84646] [[16970,0],0] ext3x:client get on proc [[16970,0],1] key opal.puri
[cn21:84646] [[16970,0],0] oob:tcp:send_nb to peer [[16970,0],1]:15 seq = -1
[cn21:84646] [[16970,0],0]:[oob_tcp.c:188] processing send to peer [[16970,0],1]:15 seq_num = -1 hop [[16970,0],1] unknown
[cn21:84646] [[16970,0],0]:[oob_tcp.c:191] post no route to [[16970,0],1]
[cn21:84646] [[16970,0],0] oob:base:send to target [[16970,0],1] - attempt 1
[cn21:84646] [[16970,0],0] oob:base:send known transport for peer [[16970,0],1]
[cn21:84646] [[16970,0],0] oob:tcp:send_nb to peer [[16970,0],1]:15 seq = -1
[cn21:84646] [[16970,0],0]:[oob_tcp.c:188] processing send to peer [[16970,0],1]:15 seq_num = -1 hop [[16970,0],1] unknown
[cn21:84646] [[16970,0],0]:[oob_tcp.c:191] post no route to [[16970,0],1]
[cn21:84646] [[16970,0],0] tcp:no route called for peer [[16970,0],1]
[cn21:84646] [[16970,0],0] OOB_SEND: oob_tcp_component.c:1123
[cn21:84646] [[16970,0],0] tcp:no route called for peer [[16970,0],1]
[cn21:84646] [[16970,0],0] OOB_SEND: oob_tcp_component.c:1123
[cn21:84646] [[16970,0],0] oob:base:send to target [[16970,0],1] - attempt 2
[cn21:84646] [[16970,0],0] oob:base:send unknown peer [[16970,0],1]
[cn21:84646] [[16970,0],0] ext3x:client get on proc [[16970,0],1] key opal.puri

[cn21:84646] [[16970,0],0] oob:tcp:send_nb to peer [[16970,0],1]:15 seq = -1
[cn21:84646] [[16970,0],0]:[oob_tcp.c:188] processing send to peer [[16970,0],1]:15 seq_num = -1 hop [[16970,0],1] unknown
[cn21:84646] [[16970,0],0]:[oob_tcp.c:191] post no route to [[16970,0],1]
[cn21:84646] [[16970,0],0] oob:base:send to target [[16970,0],1] - attempt 2
[cn21:84646] [[16970,0],0] oob:base:send known transport for peer [[16970,0],1]
[cn21:84646] [[16970,0],0] oob:tcp:send_nb to peer [[16970,0],1]:15 seq = -1
[cn21:84646] [[16970,0],0]:[oob_tcp.c:188] processing send to peer [[16970,0],1]:15 seq_num = -1 hop [[16970,0],1] unknown
[cn21:84646] [[16970,0],0]:[oob_tcp.c:191] post no route to [[16970,0],1]
[cn21:84646] [[16970,0],0] tcp:no route called for peer [[16970,0],1]
[cn21:84646] [[16970,0],0] OOB_SEND: oob_tcp_component.c:1123
[cn21:84646] [[16970,0],0] tcp:no route called for peer [[16970,0],1]
[cn21:84646] [[16970,0],0] OOB_SEND: oob_tcp_component.c:1123
[cn21:84646] [[16970,0],0] oob:base:send to target [[16970,0],1] - attempt 3
[cn21:84646] [[16970,0],0] ACTIVATE PROC [[16970,0],1] STATE NO PATH TO TARGET AT base/rml_base_frame.c:234
[cn21:84646] [[16970,0],0] ACTIVATE JOB NULL STATE DAEMONS TERMINATED AT errmgr_default_hnp.c:756
[cn21:84646] [[16970,0],0] oob:base:send to target [[16970,0],1] - attempt 3
[cn21:84646] [[16970,0],0] ACTIVATE PROC [[16970,0],1] STATE NO PATH TO TARGET AT base/rml_base_frame.c:234
[cn21:84646] [[16970,0],0] ACTIVATE JOB NULL STATE DAEMONS TERMINATED AT errmgr_default_hnp.c:756
[cn21:84646] psquash: flex128 finalize
[cn21:84646] mca: base: close: component ext3x closed
[cn21:84646] mca: base: close: unloading component ext3x
[cn21:84646] mca: base: close: component default closed
[cn21:84646] mca: base: close: unloading component default
[cn21:84646] mca: base: close: unloading component radix
[cn21:84646] mca: base: close: unloading component direct
[cn21:84646] mca: base: close: unloading component binomial
[cn21:84646] mca: base: close: component rsh closed
[cn21:84646] mca: base: close: unloading component rsh
[cn21:84646] mca: base: close: component hnp closed
[cn21:84646] mca: base: close: unloading component hnp
[cn21:84646] [[16970,0],0] TCP SHUTDOWN
[cn21:84646] [[16970,0],0] TCP SHUTDOWN done
[cn21:84646] mca: base: close: component tcp closed
[cn21:84646] mca: base: close: unloading component tcp
[cn21:84646] mca: base: close: component weighted closed
[cn21:84646] mca: base: close: unloading component weighted
[cn21:84646] mca: base: close: unloading component posix_ipv4
[cn21:84646] mca: base: close: component dlopen closed
[cn21:84646] mca: base: close: unloading component dlopen

[mpitestuser@cn21:tty0]()[~]$

This generated the following logfile:

[mpitestuser@cn22:tty0]()[~]$
  cat orted_sh_240225_192033.log

[240225_192033] bash launched on cn22

[mpitestuser@cn22:tty0]()[~]$

"Test 2" got stuck near its end, and i had to cancel the command. At that time i could not find any process related to the submitting user on cn22 any more.

Because the preview of this editor errored when pasting more log content, i attached a more complete log to this comment. If the complete (big) log would help, i can upload it any time.

Thank you for having a look!

Best

issue12359.txt.zip

ggouaillardet commented 6 months ago

Sorry for my mistake in the command line, the second test was the correct one. The output is still messy, so let's try this:

orted_wrapper.sh

#!/bin/sh

export OMPI_MCA_oob_base_verbose=100
exec strace -f -o orted.strace -s 512 -- /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/orted "$@"

and on cn21, simply run

/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun --mca orte_launch_agent "/home/mpitestuser/orted_wrapper.sh" --mca oob_tcp_if_include "10.10.90.0/24" -np 1 --host cn22 /usr/bin/hostname

(please do not strace /.../mpirun)

Then you can compress and upload the orted.strace log file.

johebll commented 6 months ago

Thank you, and sorry for the cluttered output. I attached the strace according to your post. If i can check anything else, i'll be happy to do so any time. Best issue12359_2.zip

ggouaillardet commented 6 months ago

Thanks, I suspect something fishy that involves PMIx (e.g. opal.puri is not set for mpirun)

Can you please run

/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun --mca oob_base_verbose 100 --mca pmix_base_verbose 100 --mca oob_tcp_if_include "10.10.90.0/24" -np 1 --host cn22 /usr/bin/hostname

and compress and share the output?

johebll commented 6 months ago

I attached the strace for your command. Just in case it matters to you, here are details about pmix:

srun --mpi=list
MPI plugin types are...
    cray_shasta
    pmix
    none
    pmi2
specific pmix plugin versions available: pmix_v4
/opt/ohpc/admin/pmix/bin/pmix_info
                 Package: PMIx abuild@ip-172-31-13-34 Distribution
                    PMIX: 4.2.6
      PMIX repo revision: gitf20e0d5d
       PMIX release date: Sep 09, 2023
           PMIX Standard: 4.2
       PMIX Standard ABI: Stable (0.0), Provisional (0.0)
                  Prefix: /opt/ohpc/admin/pmix
 Configured architecture: pmix.arch
          Configure host: ip-172-31-13-34
           Configured by: abuild
           Configured on: Sun Sep 10 16:30:23 UTC 2023
          Configure host: ip-172-31-13-34
  Configure command line: '--prefix=/opt/ohpc/admin/pmix'
                          '--with-hwloc=/opt/ohpc/pub/libs/hwloc'
                Built by: abuild
                Built on: Sun Sep 10 16:31:56 UTC 2023
              Built host: ip-172-31-13-34
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
  C compiler family name: GNU
      C compiler version: "11" "." "3" "." "1"
  Internal debug support: no
              dl support: yes
     Symbol vis. support: yes
          Manpages built: yes
              MCA bfrops: v12 (MCA v2.1.0, API v1.0.0, Component v4.2.6)
              MCA bfrops: v20 (MCA v2.1.0, API v1.0.0, Component v4.2.6)
              MCA bfrops: v21 (MCA v2.1.0, API v1.0.0, Component v4.2.6)
              MCA bfrops: v3 (MCA v2.1.0, API v1.0.0, Component v4.2.6)
              MCA bfrops: v4 (MCA v2.1.0, API v1.0.0, Component v4.2.6)
              MCA bfrops: v41 (MCA v2.1.0, API v1.0.0, Component v4.2.6)
                 MCA gds: hash (MCA v2.1.0, API v1.0.0, Component v4.2.6)
                 MCA gds: ds12 (MCA v2.1.0, API v1.0.0, Component v4.2.6)
                 MCA gds: ds21 (MCA v2.1.0, API v1.0.0, Component v4.2.6)
           MCA pcompress: zlib (MCA v2.1.0, API v2.0.0, Component v4.2.6)
                 MCA pdl: pdlopen (MCA v2.1.0, API v1.0.0, Component v4.2.6)
              MCA pfexec: linux (MCA v2.1.0, API v1.0.0, Component v4.2.6)
                 MCA pif: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v4.2.6)
                 MCA pif: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v4.2.6)
        MCA pinstalldirs: env (MCA v2.1.0, API v1.0.0, Component v4.2.6)
        MCA pinstalldirs: config (MCA v2.1.0, API v1.0.0, Component v4.2.6)
                MCA plog: default (MCA v2.1.0, API v1.0.0, Component v4.2.6)
                MCA plog: stdfd (MCA v2.1.0, API v1.0.0, Component v4.2.6)
                MCA plog: syslog (MCA v2.1.0, API v1.0.0, Component v4.2.6)
                MCA pmdl: ompi (MCA v2.1.0, API v1.0.0, Component v4.2.6)
                MCA pmdl: oshmem (MCA v2.1.0, API v1.0.0, Component v4.2.6)
                MCA pnet: opa (MCA v2.1.0, API v1.0.0, Component v4.2.6)
                MCA preg: compress (MCA v2.1.0, API v1.0.0, Component v4.2.6)
                MCA preg: native (MCA v2.1.0, API v1.0.0, Component v4.2.6)
                MCA preg: raw (MCA v2.1.0, API v1.0.0, Component v4.2.6)
                 MCA prm: slurm (MCA v2.1.0, API v1.0.0, Component v4.2.6)
                 MCA prm: default (MCA v2.1.0, API v1.0.0, Component v4.2.6)
                MCA psec: native (MCA v2.1.0, API v1.0.0, Component v4.2.6)
                MCA psec: none (MCA v2.1.0, API v1.0.0, Component v4.2.6)
             MCA psensor: file (MCA v2.1.0, API v1.0.0, Component v4.2.6)
             MCA psensor: heartbeat (MCA v2.1.0, API v1.0.0, Component
                          v4.2.6)
              MCA pshmem: mmap (MCA v2.1.0, API v1.0.0, Component v4.2.6)
             MCA psquash: flex128 (MCA v2.1.0, API v1.0.0, Component v4.2.6)
             MCA psquash: native (MCA v2.1.0, API v1.0.0, Component v4.2.6)
               MCA pstat: linux (MCA v2.1.0, API v1.0.0, Component v4.2.6)
                 MCA ptl: client (MCA v2.1.0, API v2.0.0, Component v4.2.6)
                 MCA ptl: server (MCA v2.1.0, API v2.0.0, Component v4.2.6)
                 MCA ptl: tool (MCA v2.1.0, API v2.0.0, Component v4.2.6)

issue12359_3.txt.zip Thank you again.

ggouaillardet commented 6 months ago

I do not need strace for now.

But it seems you forgot to pass --mca pmix_base_verbose 100 to the mpirun command line.

johebll commented 6 months ago

Oh, damn, here it comes: issue12359_4.txt.zip

ggouaillardet commented 6 months ago

I am running out of ideas for today...

what if you export PMIX_MCA_gds=hash and try again?
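For completeness, a minimal sketch of that test, reusing the mpirun command line from above:

$ export PMIX_MCA_gds=hash
$ /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun --mca oob_base_verbose 100 --mca pmix_base_verbose 100 --mca oob_tcp_if_include "10.10.90.0/24" -np 1 --host cn22 /usr/bin/hostname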

johebll commented 6 months ago

I don't see any earth-shattering differences, unfortunately :-/ I don't know whether it matters, but OHPC provides a pmi library for Slurm as a separate package, and i have not installed it so far, because it seems to overwrite native pmi files.

So currently my system has installed:

openmpi4-pmix-gnu12-ohpc.x86_64                                        4.1.5-300.ohpc.5.2                                         @OpenHPC  
pmix-ohpc.x86_64                                                       4.2.6-300.ohpc.3.1                                         @OpenHPC  

But not:

slurm-libpmi-ohpc.x86_64                22.05.11-302.ohpc.1.1    OpenHPC-updates

I attached the output for what you asked for in your last comment. issue12359_5.txt.zip

Should you have an idea what could be tried in the meantime, i would definitely give it a try.

ggouaillardet commented 6 months ago

you do not need slurm-libpmi-ohpc, it should only be useful if you do something like srun --mpi=pmi2 ...
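For illustration, the only launch mode where that package would come into play looks like this (a hedged sketch; the application name is a placeholder):

$ srun --mpi=pmi2 -N 2 -n 2 ./my_mpi_app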

I am running out of ideas, and pmix was not built with --enable-debug, so we miss useful traces. There was an issue with PMIx > 4.2.2 (on the Open MPI side) and the fix is only available in Open MPI 4.1.6.

At this stage, I would recommend you build Open MPI 4.1.6 with the same options (but in your $HOME directory) and see if this fixes the issue.
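A minimal sketch of such a build, reusing the configure options quoted earlier in this thread but with a user-local prefix (the prefix path and tarball name are assumptions):

$ tar xf openmpi-4.1.6.tar.bz2 && cd openmpi-4.1.6
$ ./configure --prefix=$HOME/openmpi-4.1.6 \
    --disable-static --enable-builtin-atomics --with-sge --enable-mpi-cxx \
    --with-hwloc=/opt/ohpc/pub/libs/hwloc --with-pmix=/opt/ohpc/admin/pmix \
    --with-libevent=external --with-libfabric=/opt/ohpc/pub/mpi/libfabric/1.18.0 \
    --with-ucx=/opt/ohpc/pub/mpi/ucx-ohpc/1.14.0 --without-verbs --with-tm=/opt/pbs/
$ make -j$(nproc) && make install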

johebll commented 6 months ago

Ahh...

something like srun --mpi=pmi2 ...

Does this refer to specifying a particular pmi ("pmi2" in particular), or would "slurm-libpmi-ohpc" be mandatory for all MPI application executions as soon as they are launched through the scheduler? I could not find explicit info on how to deal with the "slurm-libpmi-ohpc" package on the OHPC side, but gathered warnings elsewhere about native PMI files being overwritten by their Slurm version...

In our environment, it would have to be Slurm all the way.

build Open MPI 4.1.6 with the same options

That is a good idea, and i will go this route, unless installing "slurm-libpmi-ohpc" would be mandatory anyway, in which case i would try that first to see whether it changes the landscape.

johebll commented 6 months ago

I just wanted to update that i ended up preferring a different approach, because i give priority to sticking with OHPC packages.

My current approach to deal with this now is:

  1. Uninstalling: "openmpi4-pmix-gnu12-ohpc"
  2. Installing: "openmpi4-gnu12-ohpc.x86_64" "pmix-ohpc.x86_64"

Using "openmpi4-gnu12-ohpc.x86_64" immediately works. pmix-ohpc.x86_64 allows me to use pmix3.
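For reference, a sketch of that package swap (package names as listed above; the dnf invocation is an assumption for a Rocky 9 / OHPC 3.0 system):

$ dnf -y remove openmpi4-pmix-gnu12-ohpc
$ dnf -y install openmpi4-gnu12-ohpc pmix-ohpc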

The implied downside of not being able to srun MPI applications directly is tolerable for the time being, but i will keep testing "openmpi4-pmix-gnu<X>-ohpc" and switch to it as soon as it works.

I would therefore close the ticket, then?