Closed johebll closed 6 months ago
Remove --mca routed direct and see if it works.
Hello rhc54,
thank you very much for chiming in!
I actually did, but it made no difference to the outcome. The path to the failure did differ, though, like this:
a) "--mca routed" unspecified, defaults to radix:
### a) /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun -N 1 -n 2 -host cn22 /usr/bin/hostname -x UCX_NET_DEVICES=enp7s0 -x UCX_TLS=tcp,sm,self --mca pml ucx > result
[cn21:63007] mca:base:select: Auto-selecting odls components
[cn21:63007] mca:base:select:( odls) Querying component [default]
[cn21:63007] mca:base:select:( odls) Query of component [default] set priority to 10
[cn21:63007] mca:base:select:( odls) Querying component [pspawn]
[cn21:63007] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[cn21:63007] mca:base:select:( odls) Selected component [default]
[cn21:63007] mca: base: close: component pspawn closed
[cn21:63007] mca: base: close: unloading component pspawn
[cn21:63007] [[65266,0],0] Monitoring debugger attach fifo /tmp/ompi.cn21.2001/pid.63007/0/debugger_attach_fifo
[cn21:63007] [[65266,0],0] ACTIVATE JOB [INVALID] STATE PENDING INIT AT plm_rsh_module.c:974
[cn21:63007] [[65266,0],0] ACTIVATE JOB [65266,1] STATE INIT_COMPLETE AT base/plm_base_launch_support.c:376
[cn21:63007] [[65266,0],0] ACTIVATE JOB [65266,1] STATE PENDING ALLOCATION AT base/plm_base_launch_support.c:389
[cn21:63007] [[65266,0],0] ACTIVATE JOB [65266,1] STATE ALLOCATION COMPLETE AT base/ras_base_allocate.c:473
[cn21:63007] [[65266,0],0] ACTIVATE JOB [65266,1] STATE PENDING DAEMON LAUNCH AT base/plm_base_launch_support.c:204
[cn21:63007] [[65266,0],0]: parent -1 num_children 1
[cn21:63007] [[65266,0],0]: child 1
[cn21:63007] [[65266,0],0]: parent 0 num_children 1
[cn21:63007] [[65266,0],0]: child 1
[cn21:63007] [[65266,0],0] plm:rsh: final template argv:
/usr/bin/ssh <template> PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:${LD_LIBRARY_PATH:-} ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:${DYLD_LIBRARY_PATH:-} ; export DYLD_LIBRARY_PATH ; /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/orted -mca ess "env" -mca ess_base_jobid "4277272576" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca orte_node_regex "hpccn[2:21-22]@0(2)" -mca orte_hnp_uri "4277272576.0;tcp://10.10.4.121,10.10.60.121,10.10.90.121,10.10.91.121,10.10.80.121:46969" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "4277272576.0;tcp://10.10.4.121,10.10.60.121,10.10.90.121,10.10.91.121,10.10.80.121:46969" -mca pml "ucx" -mca rmaps_ppr_n_pernode "1" -mca pmix "^s1,s2,cray,isolated"
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
b) "--mca routed direct" specified:
### b) /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun -N 1 -n 2 -host cn22 /usr/bin/hostname -x UCX_NET_DEVICES=enp7s0 -x UCX_TLS=tcp,sm,self --mca routed direct --mca pml ucx > result
[cn21:63018] mca:base:select: Auto-selecting odls components
[cn21:63018] mca:base:select:( odls) Querying component [default]
[cn21:63018] mca:base:select:( odls) Query of component [default] set priority to 10
[cn21:63018] mca:base:select:( odls) Querying component [pspawn]
[cn21:63018] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[cn21:63018] mca:base:select:( odls) Selected component [default]
[cn21:63018] mca: base: close: component pspawn closed
[cn21:63018] mca: base: close: unloading component pspawn
[cn21:63018] [[65223,0],0] Monitoring debugger attach fifo /tmp/ompi.cn21.2001/pid.63018/0/debugger_attach_fifo
[cn21:63018] [[65223,0],0] ACTIVATE JOB [INVALID] STATE PENDING INIT AT plm_rsh_module.c:974
[cn21:63018] [[65223,0],0] ACTIVATE JOB [65223,1] STATE INIT_COMPLETE AT base/plm_base_launch_support.c:376
[cn21:63018] [[65223,0],0] ACTIVATE JOB [65223,1] STATE PENDING ALLOCATION AT base/plm_base_launch_support.c:389
[cn21:63018] [[65223,0],0] ACTIVATE JOB [65223,1] STATE ALLOCATION COMPLETE AT base/ras_base_allocate.c:473
[cn21:63018] [[65223,0],0] ACTIVATE JOB [65223,1] STATE PENDING DAEMON LAUNCH AT base/plm_base_launch_support.c:204
[cn21:63018] [[65223,0],0] plm:rsh: final template argv:
/usr/bin/ssh <template> PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:${LD_LIBRARY_PATH:-} ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:${DYLD_LIBRARY_PATH:-} ; export DYLD_LIBRARY_PATH ; /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/orted -mca ess "env" -mca ess_base_jobid "4274454528" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca orte_node_regex "hpccn[2:21-22]@0(2)" -mca orte_hnp_uri "4274454528.0;tcp://10.10.4.121,10.10.60.121,10.10.90.121,10.10.91.121,10.10.80.121:33987" -mca plm "rsh" --tree-spawn -mca routed "direct" -mca orte_parent_uri "4274454528.0;tcp://10.10.4.121,10.10.60.121,10.10.90.121,10.10.91.121,10.10.80.121:33987" -mca pml "ucx" -mca rmaps_ppr_n_pernode "1" -mca pmix "^s1,s2,cray,isolated"
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
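The identifiers in the two logs above can be cross-checked against each other. The following is a minimal sketch, assuming the orte_hnp_uri value has the shape "jobid.vpid;tcp://addr1,addr2,...:port" as it appears in the orted command lines, and that the upper 16 bits of the jobid are the job family printed in names such as [[65266,0],0]. The function name parse_hnp_uri is hypothetical and not part of Open MPI.

```python
def parse_hnp_uri(uri):
    """Split an orte_hnp_uri string as it appears on the orted command line."""
    name, _, contact = uri.partition(";")
    jobid_text, _, vpid_text = name.partition(".")
    jobid = int(jobid_text)
    # Drop the "tcp://" scheme, then separate the address list from the port.
    _, _, rest = contact.partition("://")
    addresses, _, port = rest.rpartition(":")
    return {
        "job_family": jobid >> 16,      # assumed: upper 16 bits of the jobid
        "local_jobid": jobid & 0xFFFF,  # assumed: lower 16 bits of the jobid
        "vpid": int(vpid_text),
        "addresses": addresses.split(","),
        "port": int(port),
    }


# Values copied from the two orted command lines in the logs.
uri_a = ("4277272576.0;tcp://10.10.4.121,10.10.60.121,10.10.90.121,"
         "10.10.91.121,10.10.80.121:46969")
uri_b = ("4274454528.0;tcp://10.10.4.121,10.10.60.121,10.10.90.121,"
         "10.10.91.121,10.10.80.121:33987")

info_a = parse_hnp_uri(uri_a)
info_b = parse_hnp_uri(uri_b)
print(info_a["job_family"], info_a["vpid"], info_a["port"])
print(info_b["job_family"], info_b["vpid"], info_b["port"])
```

Under these assumptions, run a) decodes to job family 65266 and run b) to 65223, which matches the [[65266,0],0] and [[65223,0],0] names in the respective logs, and both URIs advertise all five addresses of cn21.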
From there, both a) and b) continue identically, except for the first line:
a) [cn22:57290] orte_routed_base_select: Initializing routed component radix
b) [cn22:57335] orte_routed_base_select: Initializing routed component direct
a) & b)
[cn22:57290] [[65266,0],1]: Final routed priorities
[cn22:57290] Component: radix Priority: 70
[cn22:57290] mca: base: components_register: registering framework oob components
[cn22:57290] mca: base: components_register: found loaded component tcp
[cn22:57290] mca: base: components_register: component tcp register function successful
[cn22:57290] mca: base: components_open: opening oob components
[cn22:57290] mca: base: components_open: found loaded component tcp
[cn22:57290] mca: base: components_open: component tcp open function successful
[cn22:57290] mca:oob:select: checking available component tcp
[cn22:57290] mca:oob:select: Querying component [tcp]
[cn22:57290] oob:tcp: component_available called
[cn22:57290] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[cn22:57290] [[65266,0],1] oob:tcp:init rejecting loopback interface lo
[cn22:57290] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[cn22:57290] [[65266,0],1] oob:tcp:init adding 10.10.4.122 to our list of V4 connections
[cn22:57290] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[cn22:57290] [[65266,0],1] oob:tcp:init adding 10.10.60.122 to our list of V4 connections
[cn22:57290] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
[cn22:57290] [[65266,0],1] oob:tcp:init adding 10.10.90.122 to our list of V4 connections
[cn22:57290] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
[cn22:57290] [[65266,0],1] oob:tcp:init adding 10.10.91.122 to our list of V4 connections
[cn22:57290] WORKING INTERFACE 6 KERNEL INDEX 9 FAMILY: V4
[cn22:57290] [[65266,0],1] oob:tcp:init adding 10.10.80.122 to our list of V4 connections
[cn22:57290] [[65266,0],1] TCP STARTUP
[cn22:57290] [[65266,0],1] attempting to bind to IPv4 port 0
[cn22:57290] [[65266,0],1] assigned IPv4 port 38569
[cn22:57290] mca:oob:select: Adding component to end
[cn22:57290] mca:oob:select: Found 1 active transports
[cn22:57290] [[65266,0],1]: get transports
[cn22:57290] [[65266,0],1]:get transports for component tcp
[cn22:57290] mca: base: components_register: registering framework odls components
[cn22:57290] mca: base: components_register: found loaded component default
[cn22:57290] mca: base: components_register: component default register function successful
[cn22:57290] mca: base: components_register: found loaded component pspawn
[cn22:57290] mca: base: components_register: component pspawn has no register or open function
[cn22:57290] mca: base: components_open: opening odls components
[cn22:57290] mca: base: components_open: found loaded component default
[cn22:57290] mca: base: components_open: component default open function successful
[cn22:57290] mca: base: components_open: found loaded component pspawn
[cn22:57290] mca: base: components_open: component pspawn open function successful
[cn22:57290] mca:base:select: Auto-selecting odls components
[cn22:57290] mca:base:select:( odls) Querying component [default]
[cn22:57290] mca:base:select:( odls) Query of component [default] set priority to 10
[cn22:57290] mca:base:select:( odls) Querying component [pspawn]
[cn22:57290] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[cn22:57290] mca:base:select:( odls) Selected component [default]
[cn22:57290] mca: base: close: component pspawn closed
[cn22:57290] mca: base: close: unloading component pspawn
[cn22:57290] [[65266,0],1]: parent 0 num_children 0
[cn22:57290] [[65266,0],1] OOB_SEND: rml_oob_send.c:265
[cn22:57290] [[65266,0],1] ext3x:client get on proc [[65266,0],1] key (null)
[cn22:57290] [[65266,0],1] oob:base:send to target [[65266,0],0] - attempt 0
[cn22:57290] [[65266,0],1] oob:base:send unknown peer [[65266,0],0]
[cn22:57290] [[65266,0],1] ext3x:client get on proc [[65266,0],0] key opal.puri
[cn22:57290] [[65266,0],1] oob:tcp:send_nb to peer [[65266,0],0]:63 seq = -1
[cn22:57290] [[65266,0],1]:[oob_tcp.c:188] processing send to peer [[65266,0],0]:63 seq_num = -1 hop [[65266,0],0] unknown
[cn22:57290] [[65266,0],1]:[oob_tcp.c:191] post no route to [[65266,0],0]
[cn22:57290] [[65266,0],1] OOB_SEND: rml_oob_send.c:265
[cn22:57290] [[65266,0],1] tcp:no route called for peer [[65266,0],0]
[cn22:57290] [[65266,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:57290] [[65266,0],1] oob:base:send to target [[65266,0],0] - attempt 0
[cn22:57290] [[65266,0],1] oob:base:send unknown peer [[65266,0],0]
[cn22:57290] [[65266,0],1] ext3x:client get on proc [[65266,0],0] key opal.puri
[cn22:57290] [[65266,0],1] oob:tcp:send_nb to peer [[65266,0],0]:10 seq = -1
[cn22:57290] [[65266,0],1]:[oob_tcp.c:188] processing send to peer [[65266,0],0]:10 seq_num = -1 hop [[65266,0],0] unknown
[cn22:57290] [[65266,0],1]:[oob_tcp.c:191] post no route to [[65266,0],0]
[cn22:57290] [[65266,0],1] oob:base:send to target [[65266,0],0] - attempt 1
[cn22:57290] [[65266,0],1] oob:base:send known transport for peer [[65266,0],0]
[cn22:57290] [[65266,0],1] oob:tcp:send_nb to peer [[65266,0],0]:63 seq = -1
[cn22:57290] [[65266,0],1]:[oob_tcp.c:188] processing send to peer [[65266,0],0]:63 seq_num = -1 hop [[65266,0],0] unknown
[cn22:57290] [[65266,0],1]:[oob_tcp.c:191] post no route to [[65266,0],0]
[cn22:57290] [[65266,0],1] tcp:no route called for peer [[65266,0],0]
[cn22:57290] [[65266,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:57290] [[65266,0],1] tcp:no route called for peer [[65266,0],0]
[cn22:57290] [[65266,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:57290] [[65266,0],1] oob:base:send to target [[65266,0],0] - attempt 1
[cn22:57290] [[65266,0],1] oob:base:send unknown peer [[65266,0],0]
[cn22:57290] [[65266,0],1] ext3x:client get on proc [[65266,0],0] key opal.puri
[cn22:57290] [[65266,0],1] oob:tcp:send_nb to peer [[65266,0],0]:10 seq = -1
[cn22:57290] [[65266,0],1]:[oob_tcp.c:188] processing send to peer [[65266,0],0]:10 seq_num = -1 hop [[65266,0],0] unknown
[cn22:57290] [[65266,0],1]:[oob_tcp.c:191] post no route to [[65266,0],0]
[cn22:57290] [[65266,0],1] oob:base:send to target [[65266,0],0] - attempt 2
[cn22:57290] [[65266,0],1] oob:base:send known transport for peer [[65266,0],0]
[cn22:57290] [[65266,0],1] oob:tcp:send_nb to peer [[65266,0],0]:63 seq = -1
[cn22:57290] [[65266,0],1]:[oob_tcp.c:188] processing send to peer [[65266,0],0]:63 seq_num = -1 hop [[65266,0],0] unknown
[cn22:57290] [[65266,0],1]:[oob_tcp.c:191] post no route to [[65266,0],0]
[cn22:57290] [[65266,0],1] tcp:no route called for peer [[65266,0],0]
[cn22:57290] [[65266,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:57290] [[65266,0],1] tcp:no route called for peer [[65266,0],0]
[cn22:57290] [[65266,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:57290] [[65266,0],1] oob:base:send to target [[65266,0],0] - attempt 2
[cn22:57290] [[65266,0],1] oob:base:send unknown peer [[65266,0],0]
[cn22:57290] [[65266,0],1] ext3x:client get on proc [[65266,0],0] key opal.puri
[cn22:57290] [[65266,0],1] oob:tcp:send_nb to peer [[65266,0],0]:10 seq = -1
[cn22:57290] [[65266,0],1]:[oob_tcp.c:188] processing send to peer [[65266,0],0]:10 seq_num = -1 hop [[65266,0],0] unknown
[cn22:57290] [[65266,0],1]:[oob_tcp.c:191] post no route to [[65266,0],0]
[cn22:57290] [[65266,0],1] oob:base:send to target [[65266,0],0] - attempt 3
[cn22:57290] [[65266,0],1] ACTIVATE PROC [[65266,0],0] STATE NO PATH TO TARGET AT base/rml_base_frame.c:234
[cn22:57290] psquash: flex128 finalize
[cn22:57290] mca: base: close: component ext3x closed
[cn22:57290] mca: base: close: unloading component ext3x
[cn22:57290] mca: base: close: component rsh closed
[cn22:57290] mca: base: close: unloading component rsh
[cn22:57290] mca: base: close: component default closed
[cn22:57290] mca: base: close: unloading component default
[cn22:57290] mca: base: close: unloading component radix
[cn22:57290] [[65266,0],1] TCP SHUTDOWN
[cn22:57290] no hnp or not active
[cn22:57290] [[65266,0],1] TCP SHUTDOWN done
[cn22:57290] mca: base: close: component tcp closed
[cn22:57290] mca: base: close: unloading component tcp
[cn22:57290] mca: base: close: component orted closed
[cn22:57290] mca: base: close: unloading component orted
[cn22:57290] mca: base: close: component weighted closed
[cn22:57290] mca: base: close: unloading component weighted
[cn22:57290] mca: base: close: unloading component linux_ipv6
[cn22:57290] mca: base: close: unloading component posix_ipv4
[cn22:57290] mca: base: close: component dlopen closed
[cn22:57290] mca: base: close: unloading component dlopen
[cn21:63007] [[65266,0],0] ACTIVATE PROC [[65266,0],1] STATE FAILED TO START AT plm_rsh_module.c:318
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
[ ... ]
--------------------------------------------------------------------------
[cn21:63007] [[65266,0],0] ACTIVATE JOB NULL STATE DAEMONS TERMINATED AT orted/orted_comm.c:420
[cn21:63007] psquash: flex128 finalize
[cn21:63007] mca: base: close: component ext3x closed
[cn21:63007] mca: base: close: unloading component ext3x
[cn21:63007] mca: base: close: component default closed
[cn21:63007] mca: base: close: unloading component default
[cn21:63007] mca: base: close: unloading component radix
[cn21:63007] mca: base: close: unloading component direct
[cn21:63007] mca: base: close: unloading component binomial
[cn21:63007] mca: base: close: component rsh closed
[cn21:63007] mca: base: close: unloading component rsh
[cn21:63007] mca: base: close: component hnp closed
[cn21:63007] mca: base: close: unloading component hnp
[cn21:63007] [[65266,0],0] TCP SHUTDOWN
[cn21:63007] [[65266,0],0] TCP SHUTDOWN done
[cn21:63007] mca: base: close: component tcp closed
[cn21:63007] mca: base: close: unloading component tcp
[cn21:63007] mca: base: close: component weighted closed
[cn21:63007] mca: base: close: unloading component weighted
[cn21:63007] mca: base: close: unloading component linux_ipv6
[cn21:63007] mca: base: close: unloading component posix_ipv4
[cn21:63007] mca: base: close: component dlopen closed
[cn21:63007] mca: base: close: unloading component dlopen
I also tried forcing the network to use via an MCA parameter on top of that, but without any improvement:
/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun -N 1 -n 2 -host cn22 /usr/bin/hostname --mca oob_tcp_if_include "10.10.90.0/16" --mca routed "direct" --mca pml "ucx" -x UCX_NET_DEVICES=enp7s0 -x UCX_TLS=tcp,sm,self > result
So to me it currently looks like this:
++++++++++++++++
[cn22:57290] [[65266,0],1] oob:base:send unknown peer [[65266,0],0]
[cn22:57290] [[65266,0],1] ext3x:client get on proc [[65266,0],0] key opal.puri
[cn22:57290] [[65266,0],1] oob:tcp:send_nb to peer [[65266,0],0]:10 seq = -1
[cn22:57290] [[65266,0],1]:[oob_tcp.c:188] processing send to peer [[65266,0],0]:10 seq_num = -1 hop [[65266,0],0] unknown
[cn22:57290] [[65266,0],1]:[oob_tcp.c:191] post no route to [[65266,0],0]
[cn22:57290] [[65266,0],1] oob:base:send to target [[65266,0],0] - attempt 3
[cn22:57290] [[65266,0],1] ACTIVATE PROC [[65266,0],0] STATE NO PATH TO TARGET AT base/rml_base_frame.c:234
++++++++++++++++
@ [cn22:57290] [[65266,0],1]:[oob_tcp.c:191] post no route to [[65266,0],0]
## /open-mpi/ompi/blob/v4.1.x/orte/mca/oob/tcp/oob_tcp.c:176
    /* do we have a route to this peer (could be direct)? */
    hop = orte_routed.get_route(msg->routed, &msg->dst);
    /* do we know this hop? */
    if (NULL == (peer = mca_oob_tcp_peer_lookup(&hop))) {
        /* push this back to the component so it can try
         * another module within this transport. If no
         * module can be found, the component can push back
         * to the framework so another component can try
         */
        opal_output_verbose(2, orte_oob_base_framework.framework_output,
                            "%s:[%s:%d] processing send to peer %s:%d seq_num = %d hop %s unknown",
                            ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                            __FILE__, __LINE__,
                            ORTE_NAME_PRINT(&msg->dst), msg->tag, msg->seq_num,
                            ORTE_NAME_PRINT(&hop));
        ORTE_ACTIVATE_TCP_NO_ROUTE(msg, &hop, mca_oob_tcp_component_no_route);
        return;
    }
@ [cn22:57290] [[65266,0],1] ACTIVATE PROC [[65266,0],0] STATE NO PATH TO TARGET AT base/rml_base_frame.c:234
## /open-mpi/ompi/blob/v4.1.x/orte/mca/rml/base/rml_base_frame.c:234
    ORTE_ACTIVATE_PROC_STATE(peer, ORTE_PROC_STATE_NO_PATH_TO_TARGET);
} else if (ORTE_ERR_ADDRESSEE_UNKNOWN == status) {
    ORTE_ACTIVATE_PROC_STATE(peer, ORTE_PROC_STATE_PEER_UNKNOWN);
} else {
    ORTE_ACTIVATE_PROC_STATE(peer, ORTE_PROC_STATE_UNABLE_TO_SEND_MSG);
}
The PATH looks fine, I believe, so the libraries should be available as required?
/home/mpitestuser/.local/bin:/home/mpitestuser/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin:/opt/ohpc/pub/mpi/ucx-ohpc/1.14.0/bin:/opt/ohpc/pub/compiler/gcc/12.2.0/bin
If I understand it correctly, it can't find the peer node, but I really can't see why this fails, because name resolution, ping and ssh (as the user executing the MPI job) work flawlessly on and to each of the nodes...
Cheers
Hello, I just realised I had the wrong netmask for restricting the network (--mca oob_tcp_if_include "10.10.90.0/16"). Changing it to --mca oob_tcp_if_include "10.10.90.0/24" did not solve the problem, although it now properly restricts the network traffic as intended:
[cn22:58351] oob:tcp: component_available called
[cn22:58351] [[62077,0],1] oob:tcp: Searching for include address+prefix: 10.10.90.0 / 24
[cn22:58351] oob:tcp: Found match: 10.10.90.122 (enp7s0)
[cn22:58351] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[cn22:58351] [[62077,0],1] oob:tcp:init rejecting interface lo (not in include list)
[cn22:58351] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[cn22:58351] [[62077,0],1] oob:tcp:init rejecting interface eno1 (not in include list)
[cn22:58351] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[cn22:58351] [[62077,0],1] oob:tcp:init rejecting interface enp193s0f0 (not in include list)
[cn22:58351] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
[cn22:58351] [[62077,0],1] oob:tcp:init adding 10.10.90.122 to our list of V4 connections
[cn22:58351] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
[cn22:58351] [[62077,0],1] oob:tcp:init rejecting interface enp1s0f0 (not in include list)
[cn22:58351] WORKING INTERFACE 6 KERNEL INDEX 9 FAMILY: V4
[cn22:58351] [[62077,0],1] oob:tcp:init rejecting interface enp33s0f0 (not in include list)
[cn22:58351] [[62077,0],1] TCP STARTUP
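Why the netmask matters can be shown with plain CIDR arithmetic. This is a minimal sketch using Python's standard ipaddress module. It only models prefix matching and is not Open MPI's own interface-selection code; the helper name matching_addresses is hypothetical. The address list is the one cn22 reported in the earlier, unrestricted log.

```python
import ipaddress


def matching_addresses(spec, addresses):
    """Return the addresses that fall inside the network described by spec."""
    # strict=False accepts a spec with host bits set, such as 10.10.90.0/16,
    # and normalises it to the enclosing network (here 10.10.0.0/16).
    network = ipaddress.ip_network(spec, strict=False)
    return [a for a in addresses if ipaddress.ip_address(a) in network]


# Addresses cn22 listed in the unrestricted run.
cn22_addresses = [
    "10.10.4.122",
    "10.10.60.122",
    "10.10.90.122",
    "10.10.91.122",
    "10.10.80.122",
]

wide = matching_addresses("10.10.90.0/16", cn22_addresses)
narrow = matching_addresses("10.10.90.0/24", cn22_addresses)
print("/16 matches:", wide)
print("/24 matches:", narrow)
```

With a /16 prefix only the first two octets are compared, so all five addresses match and nothing is filtered out. With /24, only 10.10.90.122 remains, which agrees with the "Found match: 10.10.90.122 (enp7s0)" line in the log.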
It then still fails like this:
[cn22:58351] [[62077,0],1] TCP STARTUP
[cn22:58351] [[62077,0],1] attempting to bind to IPv4 port 0
[cn22:58351] [[62077,0],1] assigned IPv4 port 54033
[cn22:58351] mca:oob:select: Adding component to end
[cn22:58351] mca:oob:select: Found 1 active transports
[cn22:58351] [[62077,0],1]: get transports
[cn22:58351] [[62077,0],1]:get transports for component tcp
[cn22:58351] mca: base: components_register: registering framework odls components
[cn22:58351] mca: base: components_register: found loaded component default
[cn22:58351] mca: base: components_register: component default register function successful
[cn22:58351] mca: base: components_register: found loaded component pspawn
[cn22:58351] mca: base: components_register: component pspawn has no register or open function
[cn22:58351] mca: base: components_open: opening odls components
[cn22:58351] mca: base: components_open: found loaded component default
[cn22:58351] mca: base: components_open: component default open function successful
[cn22:58351] mca: base: components_open: found loaded component pspawn
[cn22:58351] mca: base: components_open: component pspawn open function successful
[cn22:58351] mca:base:select: Auto-selecting odls components
[cn22:58351] mca:base:select:( odls) Querying component [default]
[cn22:58351] mca:base:select:( odls) Query of component [default] set priority to 10
[cn22:58351] mca:base:select:( odls) Querying component [pspawn]
[cn22:58351] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[cn22:58351] mca:base:select:( odls) Selected component [default]
[cn22:58351] mca: base: close: component pspawn closed
[cn22:58351] mca: base: close: unloading component pspawn
[cn22:58351] [[62077,0],1] OOB_SEND: rml_oob_send.c:265
[cn22:58351] [[62077,0],1] ext3x:client get on proc [[62077,0],1] key (null)
[cn22:58351] [[62077,0],1] oob:base:send to target [[62077,0],0] - attempt 0
[cn22:58351] [[62077,0],1] oob:base:send unknown peer [[62077,0],0]
[cn22:58351] [[62077,0],1] ext3x:client get on proc [[62077,0],0] key opal.puri
[cn22:58351] [[62077,0],1] oob:tcp:send_nb to peer [[62077,0],0]:63 seq = -1
[cn22:58351] [[62077,0],1]:[oob_tcp.c:188] processing send to peer [[62077,0],0]:63 seq_num = -1 hop [[62077,0],0] unknown
[cn22:58351] [[62077,0],1]:[oob_tcp.c:191] post no route to [[62077,0],0]
[cn22:58351] [[62077,0],1] OOB_SEND: rml_oob_send.c:265
[cn22:58351] [[62077,0],1] tcp:no route called for peer [[62077,0],0]
[cn22:58351] [[62077,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:58351] [[62077,0],1] oob:base:send to target [[62077,0],0] - attempt 0
( # More of the same... )
[cn22:58351] [[62077,0],1]:[oob_tcp.c:191] post no route to [[62077,0],0]
[cn22:58351] [[62077,0],1] oob:base:send to target [[62077,0],0] - attempt 3
[cn22:58351] [[62077,0],1] ACTIVATE PROC [[62077,0],0] STATE NO PATH TO TARGET AT base/rml_base_frame.c:234
[cn22:58351] psquash: flex128 finalize
[cn22:58351] mca: base: close: component ext3x closed
[cn22:58351] mca: base: close: unloading component ext3x
[cn22:58351] mca: base: close: component rsh closed
[cn22:58351] mca: base: close: unloading component rsh
[cn22:58351] mca: base: close: component default closed
[cn22:58351] mca: base: close: unloading component default
[cn22:58351] mca: base: close: unloading component direct
[cn22:58351] [[62077,0],1] TCP SHUTDOWN
[cn22:58351] no hnp or not active
[cn22:58351] [[62077,0],1] TCP SHUTDOWN done
[cn22:58351] mca: base: close: component tcp closed
[cn22:58351] mca: base: close: unloading component tcp
[cn22:58351] mca: base: close: component orted closed
[cn22:58351] mca: base: close: unloading component orted
[cn22:58351] mca: base: close: component weighted closed
[cn22:58351] mca: base: close: unloading component weighted
[cn22:58351] mca: base: close: unloading component linux_ipv6
[cn22:58351] mca: base: close: unloading component posix_ipv4
[cn22:58351] mca: base: close: component dlopen closed
[cn22:58351] mca: base: close: unloading component dlopen
Wait a minute - you have an error on your cmd line:
mpirun -N 1 -n 2 -host cn22 /usr/bin/hostname -x UCX_NET_DEVICES=enp7s0 -x UCX_TLS=tcp,sm,self --mca routed direct --mca pml ucx > result
You put the application (/usr/bin/hostname) right in the middle of the command, which means that the rest of the cmd line is ignored. So all those -x and --mca options are being passed to hostname and not being interpreted by OMPI.
Fix your cmd line and try it again.
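The effect of the ordering can be illustrated with a simplified model. This is only a sketch under stated assumptions: the OPTION_ARITY table and split_command function are hypothetical, cover only the options used in this thread, and are not mpirun's real parser. The model treats the first token that is not a known launcher option as the application and hands everything after it to that application.

```python
import shlex

# Hypothetical table: number of values each launcher option consumes.
OPTION_ARITY = {"-N": 1, "-n": 1, "-host": 1, "-x": 1, "--mca": 2}


def split_command(argv):
    """Split argv (without the launcher name) into launcher options and app argv."""
    i = 0
    while i < len(argv) and argv[i] in OPTION_ARITY:
        # Skip the option itself plus the values it consumes.
        i += 1 + OPTION_ARITY[argv[i]]
    # The first token that is not a known option starts the application.
    return argv[:i], argv[i:]


original = shlex.split(
    "mpirun -N 1 -n 2 -host cn22 /usr/bin/hostname "
    "-x UCX_NET_DEVICES=enp7s0 -x UCX_TLS=tcp,sm,self "
    "--mca routed direct --mca pml ucx"
)
reordered = shlex.split(
    "mpirun -N 1 -n 2 -host cn22 "
    "--mca routed direct --mca pml ucx "
    "-x UCX_NET_DEVICES=enp7s0 -x UCX_TLS=tcp,sm,self "
    "/usr/bin/hostname"
)

opts_before, app_before = split_command(original[1:])
opts_after, app_after = split_command(reordered[1:])
print("original app argv: ", app_before)
print("reordered app argv:", app_after)
```

In this model the original ordering leaves only -N, -n and -host with the launcher, while the -x and --mca tokens end up in the argv of /usr/bin/hostname. With the application placed last, its argv is just the executable and all options stay with the launcher.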
Thank you very much for this!
I corrected it as documented below, but the problem still seems to be fundamentally the same. I ran the same command in two variants: direct and radix. Both fail...
I added some comments where something looks striking.
At this point, it appears to me that the OOB component is actually in trouble?
Maybe I misread it, but this section puzzles me the most:
[cn22:63069] [[28544,0],1]:[oob_tcp.c:188] processing send to peer [[28544,0],0]:10 seq_num = -1 hop [[28544,0],0] unknown
[cn22:63069] [[28544,0],1]:[oob_tcp.c:191] post no route to [[28544,0],0]
I am not sure whether these are actual L3 routes, or whether this happens at L7 (processes), because on the CLI routing is perfectly fine, and ORTE is apparently using just the single network specified in the command...
This seems strange to me...
Here are the details:
/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun -N 1 -n 2 -host cn22 --mca oob_tcp_if_include "10.10.90.0/24" --mca oob_base_verbose "100" --mca pml "ucx" -x UCX_NET_DEVICES=enp7s0 -x UCX_TLS=tcp,sm,self /usr/bin/hostname > result
execve("/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun",
["/opt/ohpc/pub/mpi/openmpi4-gnu12"...,
"-N", "1", "-n", "2", "-host", "cn22",
"--mca", "oob_tcp_if_include", "10.10.90.0/24",
"--mca", "oob_base_verbose", "100",
"--mca", "pml", "ucx",
"-x", "UCX_NET_DEVICES=enp7s0",
"-x", "UCX_TLS=tcp,sm,self",
"/usr/bin/hostname"], 0x7ffd96605130 /* 56 vars */) = 0
[ etc. ]
socket(AF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, NETLINK_ROUTE) = 19
bind(19, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 0
getsockname(19, {sa_family=AF_NETLINK, nl_pid=359190, nl_groups=00000000}, [12]) = 0
sendto(19, [{nlmsg_len=20, nlmsg_type=RTM_GETLINK, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1708684694, nlmsg_pid=0}, {ifi_family=AF_UNSPEC, ...}], 20, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 20
recvmsg(19, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=1388, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1708684694, nlmsg_pid=359190}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_LOOPBACK, ifi_index=if_nametoindex("lo"), ifi_flags=IFF_UP|IFF_LOOPBACK|IFF_RUNNING|IFF_LOWER_UP, ifi_change=0}, [[{nla_len=7, nla_type=IFLA_IFNAME}, "lo"], [{nla_len=8, nla_type=IFLA_TXQLEN}, 1000], [{nla_len=5, nla_type=IFLA_OPERSTATE}, 0], [{nla_len=5, nla_type=IFLA_LINKMODE}, 0], [{nla_len=8, nla_type=IFLA_MTU}, 65536], [{nla_len=8, nla_type=IFLA_MIN_MTU}, 0], [{nla_len=8, nla_type=IFLA_MAX_MTU}, 0], [{nla_len=8, nla_type=IFLA_GROUP}, 0], [{nla_len=8, nla_type=IFLA_PROMISCUITY}, 0], [{nla_len=8, nla_type=0x3d /* IFLA_??? */}, "\x00\x00\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_TX_QUEUES}, 1], [{nla_len=8, nla_type=IFLA_GSO_MAX_SEGS}, 65535], [{nla_len=8, nla_type=IFLA_GSO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=IFLA_GRO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=0x3f /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x40 /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x3b /* IFLA_??? */}, "\xf8\xff\x07\x00"], [{nla_len=8, nla_type=0x3c /* IFLA_??? 
*/}, "\xff\xff\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_RX_QUEUES}, 1], [{nla_len=5, nla_type=IFLA_CARRIER}, 1], [{nla_len=12, nla_type=IFLA_QDISC}, "noqueue"], [{nla_len=8, nla_type=IFLA_CARRIER_CHANGES}, 0], [{nla_len=8, nla_type=IFLA_CARRIER_UP_COUNT}, 0], [{nla_len=8, nla_type=IFLA_CARRIER_DOWN_COUNT}, 0], [{nla_len=5, nla_type=IFLA_PROTO_DOWN}, 0], [{nla_len=36, nla_type=IFLA_MAP}, {mem_start=0, mem_end=0, base_addr=0, irq=0, dma=0, port=0}], [{nla_len=10, nla_type=IFLA_ADDRESS}, 00:00:00:00:00:00], [{nla_len=10, nla_type=IFLA_BROADCAST}, 00:00:00:00:00:00], [{nla_len=196, nla_type=IFLA_STATS64}, {rx_packets=92358, tx_packets=92358, rx_bytes=8296122, tx_bytes=8296122, rx_errors=0, tx_errors=0, rx_dropped=0, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=100, nla_type=IFLA_STATS}, {rx_packets=92358, tx_packets=92358, rx_bytes=8296122, tx_bytes=8296122, rx_errors=0, tx_errors=0, rx_dropped=0, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=12, nla_type=IFLA_XDP}, [{nla_len=5, nla_type=IFLA_XDP_ATTACHED}, XDP_ATTACHED_NONE]], [{nla_len=776, nla_type=IFLA_AF_SPEC}, [[{nla_len=136, nla_type=AF_INET}, [{nla_len=132, nla_type=IFLA_INET_CONF}, [[IPV4_DEVCONF_FORWARDING-1] = 0, [IPV4_DEVCONF_MC_FORWARDING-1] = 0, [IPV4_DEVCONF_PROXY_ARP-1] = 0, [IPV4_DEVCONF_ACCEPT_REDIRECTS-1] = 1, [IPV4_DEVCONF_SECURE_REDIRECTS-1] = 1, [IPV4_DEVCONF_SEND_REDIRECTS-1] = 1, [IPV4_DEVCONF_SHARED_MEDIA-1] = 1, [IPV4_DEVCONF_RP_FILTER-1] = 1, 
[IPV4_DEVCONF_ACCEPT_SOURCE_ROUTE-1] = 0, [IPV4_DEVCONF_BOOTP_RELAY-1] = 0, [IPV4_DEVCONF_LOG_MARTIANS-1] = 0, [IPV4_DEVCONF_TAG-1] = 0, [IPV4_DEVCONF_ARPFILTER-1] = 0, [IPV4_DEVCONF_MEDIUM_ID-1] = 0, [IPV4_DEVCONF_NOXFRM-1] = 1, [IPV4_DEVCONF_NOPOLICY-1] = 1, [IPV4_DEVCONF_FORCE_IGMP_VERSION-1] = 0, [IPV4_DEVCONF_ARP_ANNOUNCE-1] = 0, [IPV4_DEVCONF_ARP_IGNORE-1] = 0, [IPV4_DEVCONF_PROMOTE_SECONDARIES-1] = 1, [IPV4_DEVCONF_ARP_ACCEPT-1] = 0, [IPV4_DEVCONF_ARP_NOTIFY-1] = 0, [IPV4_DEVCONF_ACCEPT_LOCAL-1] = 0, [IPV4_DEVCONF_SRC_VMARK-1] = 0, [IPV4_DEVCONF_PROXY_ARP_PVLAN-1] = 0, [IPV4_DEVCONF_ROUTE_LOCALNET-1] = 0, [IPV4_DEVCONF_IGMPV2_UNSOLICITED_REPORT_INTERVAL-1] = 10000, [IPV4_DEVCONF_IGMPV3_UNSOLICITED_REPORT_INTERVAL-1] = 1000, [IPV4_DEVCONF_IGNORE_ROUTES_WITH_LINKDOWN-1] = 0, [IPV4_DEVCONF_DROP_UNICAST_IN_L2_MULTICAST-1] = 0, [IPV4_DEVCONF_DROP_GRATUITOUS_ARP-1] = 0, [IPV4_DEVCONF_BC_FORWARDING-1] = 0]]], [{nla_len=636, nla_type=AF_INET6}, [[{nla_len=8, nla_type=IFLA_INET6_FLAGS}, IF_READY], [{nla_len=20, nla_type=IFLA_INET6_CACHEINFO}, {max_reasm_len=65535, tstamp=76376014, reachable_time=24780, retrans_time=1000}], [{nla_len=216, nla_type=IFLA_INET6_CONF}, [[DEVCONF_FORWARDING] = 0, [DEVCONF_HOPLIMIT] = 64, [DEVCONF_MTU6] = 65536, [DEVCONF_ACCEPT_RA] = 0, [DEVCONF_ACCEPT_REDIRECTS] = 1, [DEVCONF_AUTOCONF] = 1, [DEVCONF_DAD_TRANSMITS] = 1, [DEVCONF_RTR_SOLICITS] = -1, [DEVCONF_RTR_SOLICIT_INTERVAL] = 4000, [DEVCONF_RTR_SOLICIT_DELAY] = 1000, [DEVCONF_USE_TEMPADDR] = 0, [DEVCONF_TEMP_VALID_LFT] = 604800, [DEVCONF_TEMP_PREFERED_LFT] = 86400, [DEVCONF_REGEN_MAX_RETRY] = 3, [DEVCONF_MAX_DESYNC_FACTOR] = 600, [DEVCONF_MAX_ADDRESSES] = 16, [DEVCONF_FORCE_MLD_VERSION] = 0, [DEVCONF_ACCEPT_RA_DEFRTR] = 1, [DEVCONF_ACCEPT_RA_PINFO] = 1, [DEVCONF_ACCEPT_RA_RTR_PREF] = 1, [DEVCONF_RTR_PROBE_INTERVAL] = 60000, [DEVCONF_ACCEPT_RA_RT_INFO_MAX_PLEN] = 0, [DEVCONF_PROXY_NDP] = 0, [DEVCONF_OPTIMISTIC_DAD] = 0, [DEVCONF_ACCEPT_SOURCE_ROUTE] = 0, [DEVCONF_MC_FORWARDING] = 0, 
[DEVCONF_DISABLE_IPV6] = 0, [DEVCONF_ACCEPT_DAD] = -1, [DEVCONF_FORCE_TLLAO] = 0, [DEVCONF_NDISC_NOTIFY] = 0, [DEVCONF_MLDV1_UNSOLICITED_REPORT_INTERVAL] = 10000, [DEVCONF_MLDV2_UNSOLICITED_REPORT_INTERVAL] = 1000, ...]], [{nla_len=300, nla_type=IFLA_INET6_STATS}, [[IPSTATS_MIB_NUM] = 37, [IPSTATS_MIB_INPKTS] = 3, [IPSTATS_MIB_INOCTETS] = 147, [IPSTATS_MIB_INDELIVERS] = 3, [IPSTATS_MIB_OUTFORWDATAGRAMS] = 0, [IPSTATS_MIB_OUTPKTS] = 3, [IPSTATS_MIB_OUTOCTETS] = 147, [IPSTATS_MIB_INHDRERRORS] = 0, [IPSTATS_MIB_INTOOBIGERRORS] = 0, [IPSTATS_MIB_INNOROUTES] = 0, [IPSTATS_MIB_INADDRERRORS] = 0, [IPSTATS_MIB_INUNKNOWNPROTOS] = 0, [IPSTATS_MIB_INTRUNCATEDPKTS] = 0, [IPSTATS_MIB_INDISCARDS] = 0, [IPSTATS_MIB_OUTDISCARDS] = 0, [IPSTATS_MIB_OUTNOROUTES] = 0, [IPSTATS_MIB_REASMTIMEOUT] = 0, [IPSTATS_MIB_REASMREQDS] = 0, [IPSTATS_MIB_REASMOKS] = 0, [IPSTATS_MIB_REASMFAILS] = 0, [IPSTATS_MIB_FRAGOKS] = 0, [IPSTATS_MIB_FRAGFAILS] = 0, [IPSTATS_MIB_FRAGCREATES] = 0, [IPSTATS_MIB_INMCASTPKTS] = 0, [IPSTATS_MIB_OUTMCASTPKTS] = 0, [IPSTATS_MIB_INBCASTPKTS] = 0, [IPSTATS_MIB_OUTBCASTPKTS] = 0, [IPSTATS_MIB_INMCASTOCTETS] = 0, [IPSTATS_MIB_OUTMCASTOCTETS] = 0, [IPSTATS_MIB_INBCASTOCTETS] = 0, [IPSTATS_MIB_OUTBCASTOCTETS] = 0, [IPSTATS_MIB_CSUMERRORS] = 0, ...]], [{nla_len=60, nla_type=IFLA_INET6_ICMP6STATS}, [[ICMP6_MIB_NUM] = 7, [ICMP6_MIB_INMSGS] = 0, [ICMP6_MIB_INERRORS] = 0, [ICMP6_MIB_OUTMSGS] = 0, [ICMP6_MIB_OUTERRORS] = 0, [ICMP6_MIB_CSUMERRORS] = 0, [6 /* ICMP6_MIB_??? 
*/] = 0]], [{nla_len=20, nla_type=IFLA_INET6_TOKEN}, inet_pton(AF_INET6, "::")], [{nla_len=5, nla_type=IFLA_INET6_ADDR_GEN_MODE}, IN6_ADDR_GEN_MODE_NONE]]]]], ...]], [{nlmsg_len=1432, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1708684694, nlmsg_pid=359190}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_ETHER, ifi_index=if_nametoindex("enp1s0"), ifi_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST|IFF_LOWER_UP, ifi_change=0}, [[{nla_len=11, nla_type=IFLA_IFNAME}, "enp1s0"], [{nla_len=8, nla_type=IFLA_TXQLEN}, 1000], [{nla_len=5, nla_type=IFLA_OPERSTATE}, 6], [{nla_len=5, nla_type=IFLA_LINKMODE}, 0], [{nla_len=8, nla_type=IFLA_MTU}, 1500], [{nla_len=8, nla_type=IFLA_MIN_MTU}, 68], [{nla_len=8, nla_type=IFLA_MAX_MTU}, 65535], [{nla_len=8, nla_type=IFLA_GROUP}, 0], [{nla_len=8, nla_type=IFLA_PROMISCUITY}, 0], [{nla_len=8, nla_type=0x3d /* IFLA_??? */}, "\x00\x00\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_TX_QUEUES}, 1], [{nla_len=8, nla_type=IFLA_GSO_MAX_SEGS}, 65535], [{nla_len=8, nla_type=IFLA_GSO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=IFLA_GRO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=0x3f /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x40 /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x3b /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x3c /* IFLA_??? 
*/}, "\xff\xff\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_RX_QUEUES}, 1], [{nla_len=5, nla_type=IFLA_CARRIER}, 1], [{nla_len=13, nla_type=IFLA_QDISC}, "fq_codel"], [{nla_len=8, nla_type=IFLA_CARRIER_CHANGES}, 2], [{nla_len=8, nla_type=IFLA_CARRIER_UP_COUNT}, 1], [{nla_len=8, nla_type=IFLA_CARRIER_DOWN_COUNT}, 1], [{nla_len=5, nla_type=IFLA_PROTO_DOWN}, 0], [{nla_len=36, nla_type=IFLA_MAP}, {mem_start=0, mem_end=0, base_addr=0, irq=0, dma=0, port=0}], [{nla_len=10, nla_type=IFLA_ADDRESS}, 52:54:00:04:96:11], [{nla_len=10, nla_type=IFLA_BROADCAST}, ff:ff:ff:ff:ff:ff], [{nla_len=196, nla_type=IFLA_STATS64}, {rx_packets=3355897, tx_packets=2327858, rx_bytes=2231387175, tx_bytes=464981524, rx_errors=0, tx_errors=0, rx_dropped=1153810, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=100, nla_type=IFLA_STATS}, {rx_packets=3355897, tx_packets=2327858, rx_bytes=2231387175, tx_bytes=464981524, rx_errors=0, tx_errors=0, rx_dropped=1153810, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=12, nla_type=IFLA_XDP}, [{nla_len=5, nla_type=IFLA_XDP_ATTACHED}, XDP_ATTACHED_NONE]], [{nla_len=10, nla_type=IFLA_PERM_ADDRESS}, 52:54:00:04:96:11], ...]]], iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 2820
recvmsg(19, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=1428, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1708684694, nlmsg_pid=359190}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_ETHER, ifi_index=if_nametoindex("enp7s0"), ifi_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST|IFF_LOWER_UP, ifi_change=0}, [[{nla_len=11, nla_type=IFLA_IFNAME}, "enp7s0"], [{nla_len=8, nla_type=IFLA_TXQLEN}, 1000], [{nla_len=5, nla_type=IFLA_OPERSTATE}, 6], [{nla_len=5, nla_type=IFLA_LINKMODE}, 0], [{nla_len=8, nla_type=IFLA_MTU}, 9000], [{nla_len=8, nla_type=IFLA_MIN_MTU}, 68], [{nla_len=8, nla_type=IFLA_MAX_MTU}, 9702], [{nla_len=8, nla_type=IFLA_GROUP}, 0], [{nla_len=8, nla_type=IFLA_PROMISCUITY}, 0], [{nla_len=8, nla_type=0x3d /* IFLA_??? */}, "\x00\x00\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_TX_QUEUES}, 16], [{nla_len=8, nla_type=IFLA_GSO_MAX_SEGS}, 65535], [{nla_len=8, nla_type=IFLA_GSO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=IFLA_GRO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=0x3f /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x40 /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x3b /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x3c /* IFLA_??? 
*/}, "\xff\xff\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_RX_QUEUES}, 16], [{nla_len=5, nla_type=IFLA_CARRIER}, 1], [{nla_len=7, nla_type=IFLA_QDISC}, "mq"], [{nla_len=8, nla_type=IFLA_CARRIER_CHANGES}, 3], [{nla_len=8, nla_type=IFLA_CARRIER_UP_COUNT}, 2], [{nla_len=8, nla_type=IFLA_CARRIER_DOWN_COUNT}, 1], [{nla_len=5, nla_type=IFLA_PROTO_DOWN}, 0], [{nla_len=36, nla_type=IFLA_MAP}, {mem_start=0, mem_end=0, base_addr=0, irq=0, dma=0, port=0}], [{nla_len=10, nla_type=IFLA_ADDRESS}, 24:6e:96:37:91:a0], [{nla_len=10, nla_type=IFLA_BROADCAST}, ff:ff:ff:ff:ff:ff], [{nla_len=196, nla_type=IFLA_STATS64}, {rx_packets=29880428, tx_packets=91010231, rx_bytes=5677654305, tx_bytes=139188405615, rx_errors=0, tx_errors=0, rx_dropped=0, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=100, nla_type=IFLA_STATS}, {rx_packets=29880428, tx_packets=91010231, rx_bytes=1382687009, tx_bytes=1749452143, rx_errors=0, tx_errors=0, rx_dropped=0, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=12, nla_type=IFLA_XDP}, [{nla_len=5, nla_type=IFLA_XDP_ATTACHED}, XDP_ATTACHED_NONE]], [{nla_len=10, nla_type=IFLA_PERM_ADDRESS}, 24:6e:96:37:91:a0], ...]], [{nlmsg_len=1428, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1708684694, nlmsg_pid=359190}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_ETHER, ifi_index=if_nametoindex("enp8s0"), ifi_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST|IFF_LOWER_UP, ifi_change=0}, [[{nla_len=11, nla_type=IFLA_IFNAME}, 
"enp8s0"], [{nla_len=8, nla_type=IFLA_TXQLEN}, 1000], [{nla_len=5, nla_type=IFLA_OPERSTATE}, 6], [{nla_len=5, nla_type=IFLA_LINKMODE}, 0], [{nla_len=8, nla_type=IFLA_MTU}, 9000], [{nla_len=8, nla_type=IFLA_MIN_MTU}, 68], [{nla_len=8, nla_type=IFLA_MAX_MTU}, 9702], [{nla_len=8, nla_type=IFLA_GROUP}, 0], [{nla_len=8, nla_type=IFLA_PROMISCUITY}, 0], [{nla_len=8, nla_type=0x3d /* IFLA_??? */}, "\x00\x00\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_TX_QUEUES}, 16], [{nla_len=8, nla_type=IFLA_GSO_MAX_SEGS}, 65535], [{nla_len=8, nla_type=IFLA_GSO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=IFLA_GRO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=0x3f /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x40 /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x3b /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x3c /* IFLA_??? */}, "\xff\xff\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_RX_QUEUES}, 16], [{nla_len=5, nla_type=IFLA_CARRIER}, 1], [{nla_len=7, nla_type=IFLA_QDISC}, "mq"], [{nla_len=8, nla_type=IFLA_CARRIER_CHANGES}, 3], [{nla_len=8, nla_type=IFLA_CARRIER_UP_COUNT}, 2], [{nla_len=8, nla_type=IFLA_CARRIER_DOWN_COUNT}, 1], [{nla_len=5, nla_type=IFLA_PROTO_DOWN}, 0], [{nla_len=36, nla_type=IFLA_MAP}, {mem_start=0, mem_end=0, base_addr=0, irq=0, dma=0, port=0}], [{nla_len=10, nla_type=IFLA_ADDRESS}, 24:6e:96:37:91:b0], [{nla_len=10, nla_type=IFLA_BROADCAST}, ff:ff:ff:ff:ff:ff], [{nla_len=196, nla_type=IFLA_STATS64}, {rx_packets=3064344, tx_packets=2705253, rx_bytes=235395248, tx_bytes=200285367, rx_errors=0, tx_errors=0, rx_dropped=0, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=100, nla_type=IFLA_STATS}, {rx_packets=3064344, tx_packets=2705253, rx_bytes=235395248, tx_bytes=200285367, 
rx_errors=0, tx_errors=0, rx_dropped=0, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=12, nla_type=IFLA_XDP}, [{nla_len=5, nla_type=IFLA_XDP_ATTACHED}, XDP_ATTACHED_NONE]], [{nla_len=10, nla_type=IFLA_PERM_ADDRESS}, 24:6e:96:37:91:b0], ...]]], iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 2856
close(19) = 0
socket(AF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, NETLINK_ROUTE) = 19
bind(19, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 0
getsockname(19, {sa_family=AF_NETLINK, nl_pid=359190, nl_groups=00000000}, [12]) = 0
sendto(19, [{nlmsg_len=20, nlmsg_type=RTM_GETLINK, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1708684694, nlmsg_pid=0}, {ifi_family=AF_UNSPEC, ...}], 20, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 20
recvmsg(19, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=1388, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1708684694, nlmsg_pid=359190}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_LOOPBACK, ifi_index=if_nametoindex("lo"), ifi_flags=IFF_UP|IFF_LOOPBACK|IFF_RUNNING|IFF_LOWER_UP, ifi_change=0}, [[{nla_len=7, nla_type=IFLA_IFNAME}, "lo"], [{nla_len=8, nla_type=IFLA_TXQLEN}, 1000], [{nla_len=5, nla_type=IFLA_OPERSTATE}, 0], [{nla_len=5, nla_type=IFLA_LINKMODE}, 0], [{nla_len=8, nla_type=IFLA_MTU}, 65536], [{nla_len=8, nla_type=IFLA_MIN_MTU}, 0], [{nla_len=8, nla_type=IFLA_MAX_MTU}, 0], [{nla_len=8, nla_type=IFLA_GROUP}, 0], [{nla_len=8, nla_type=IFLA_PROMISCUITY}, 0], [{nla_len=8, nla_type=0x3d /* IFLA_??? */}, "\x00\x00\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_TX_QUEUES}, 1], [{nla_len=8, nla_type=IFLA_GSO_MAX_SEGS}, 65535], [{nla_len=8, nla_type=IFLA_GSO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=IFLA_GRO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=0x3f /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x40 /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x3b /* IFLA_??? */}, "\xf8\xff\x07\x00"], [{nla_len=8, nla_type=0x3c /* IFLA_??? 
*/}, "\xff\xff\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_RX_QUEUES}, 1], [{nla_len=5, nla_type=IFLA_CARRIER}, 1], [{nla_len=12, nla_type=IFLA_QDISC}, "noqueue"], [{nla_len=8, nla_type=IFLA_CARRIER_CHANGES}, 0], [{nla_len=8, nla_type=IFLA_CARRIER_UP_COUNT}, 0], [{nla_len=8, nla_type=IFLA_CARRIER_DOWN_COUNT}, 0], [{nla_len=5, nla_type=IFLA_PROTO_DOWN}, 0], [{nla_len=36, nla_type=IFLA_MAP}, {mem_start=0, mem_end=0, base_addr=0, irq=0, dma=0, port=0}], [{nla_len=10, nla_type=IFLA_ADDRESS}, 00:00:00:00:00:00], [{nla_len=10, nla_type=IFLA_BROADCAST}, 00:00:00:00:00:00], [{nla_len=196, nla_type=IFLA_STATS64}, {rx_packets=92358, tx_packets=92358, rx_bytes=8296122, tx_bytes=8296122, rx_errors=0, tx_errors=0, rx_dropped=0, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=100, nla_type=IFLA_STATS}, {rx_packets=92358, tx_packets=92358, rx_bytes=8296122, tx_bytes=8296122, rx_errors=0, tx_errors=0, rx_dropped=0, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=12, nla_type=IFLA_XDP}, [{nla_len=5, nla_type=IFLA_XDP_ATTACHED}, XDP_ATTACHED_NONE]], [{nla_len=776, nla_type=IFLA_AF_SPEC}, [[{nla_len=136, nla_type=AF_INET}, [{nla_len=132, nla_type=IFLA_INET_CONF}, [[IPV4_DEVCONF_FORWARDING-1] = 0, [IPV4_DEVCONF_MC_FORWARDING-1] = 0, [IPV4_DEVCONF_PROXY_ARP-1] = 0, [IPV4_DEVCONF_ACCEPT_REDIRECTS-1] = 1, [IPV4_DEVCONF_SECURE_REDIRECTS-1] = 1, [IPV4_DEVCONF_SEND_REDIRECTS-1] = 1, [IPV4_DEVCONF_SHARED_MEDIA-1] = 1, [IPV4_DEVCONF_RP_FILTER-1] = 1, 
[IPV4_DEVCONF_ACCEPT_SOURCE_ROUTE-1] = 0, [IPV4_DEVCONF_BOOTP_RELAY-1] = 0, [IPV4_DEVCONF_LOG_MARTIANS-1] = 0, [IPV4_DEVCONF_TAG-1] = 0, [IPV4_DEVCONF_ARPFILTER-1] = 0, [IPV4_DEVCONF_MEDIUM_ID-1] = 0, [IPV4_DEVCONF_NOXFRM-1] = 1, [IPV4_DEVCONF_NOPOLICY-1] = 1, [IPV4_DEVCONF_FORCE_IGMP_VERSION-1] = 0, [IPV4_DEVCONF_ARP_ANNOUNCE-1] = 0, [IPV4_DEVCONF_ARP_IGNORE-1] = 0, [IPV4_DEVCONF_PROMOTE_SECONDARIES-1] = 1, [IPV4_DEVCONF_ARP_ACCEPT-1] = 0, [IPV4_DEVCONF_ARP_NOTIFY-1] = 0, [IPV4_DEVCONF_ACCEPT_LOCAL-1] = 0, [IPV4_DEVCONF_SRC_VMARK-1] = 0, [IPV4_DEVCONF_PROXY_ARP_PVLAN-1] = 0, [IPV4_DEVCONF_ROUTE_LOCALNET-1] = 0, [IPV4_DEVCONF_IGMPV2_UNSOLICITED_REPORT_INTERVAL-1] = 10000, [IPV4_DEVCONF_IGMPV3_UNSOLICITED_REPORT_INTERVAL-1] = 1000, [IPV4_DEVCONF_IGNORE_ROUTES_WITH_LINKDOWN-1] = 0, [IPV4_DEVCONF_DROP_UNICAST_IN_L2_MULTICAST-1] = 0, [IPV4_DEVCONF_DROP_GRATUITOUS_ARP-1] = 0, [IPV4_DEVCONF_BC_FORWARDING-1] = 0]]], [{nla_len=636, nla_type=AF_INET6}, [[{nla_len=8, nla_type=IFLA_INET6_FLAGS}, IF_READY], [{nla_len=20, nla_type=IFLA_INET6_CACHEINFO}, {max_reasm_len=65535, tstamp=76376014, reachable_time=24780, retrans_time=1000}], [{nla_len=216, nla_type=IFLA_INET6_CONF}, [[DEVCONF_FORWARDING] = 0, [DEVCONF_HOPLIMIT] = 64, [DEVCONF_MTU6] = 65536, [DEVCONF_ACCEPT_RA] = 0, [DEVCONF_ACCEPT_REDIRECTS] = 1, [DEVCONF_AUTOCONF] = 1, [DEVCONF_DAD_TRANSMITS] = 1, [DEVCONF_RTR_SOLICITS] = -1, [DEVCONF_RTR_SOLICIT_INTERVAL] = 4000, [DEVCONF_RTR_SOLICIT_DELAY] = 1000, [DEVCONF_USE_TEMPADDR] = 0, [DEVCONF_TEMP_VALID_LFT] = 604800, [DEVCONF_TEMP_PREFERED_LFT] = 86400, [DEVCONF_REGEN_MAX_RETRY] = 3, [DEVCONF_MAX_DESYNC_FACTOR] = 600, [DEVCONF_MAX_ADDRESSES] = 16, [DEVCONF_FORCE_MLD_VERSION] = 0, [DEVCONF_ACCEPT_RA_DEFRTR] = 1, [DEVCONF_ACCEPT_RA_PINFO] = 1, [DEVCONF_ACCEPT_RA_RTR_PREF] = 1, [DEVCONF_RTR_PROBE_INTERVAL] = 60000, [DEVCONF_ACCEPT_RA_RT_INFO_MAX_PLEN] = 0, [DEVCONF_PROXY_NDP] = 0, [DEVCONF_OPTIMISTIC_DAD] = 0, [DEVCONF_ACCEPT_SOURCE_ROUTE] = 0, [DEVCONF_MC_FORWARDING] = 0, 
[DEVCONF_DISABLE_IPV6] = 0, [DEVCONF_ACCEPT_DAD] = -1, [DEVCONF_FORCE_TLLAO] = 0, [DEVCONF_NDISC_NOTIFY] = 0, [DEVCONF_MLDV1_UNSOLICITED_REPORT_INTERVAL] = 10000, [DEVCONF_MLDV2_UNSOLICITED_REPORT_INTERVAL] = 1000, ...]], [{nla_len=300, nla_type=IFLA_INET6_STATS}, [[IPSTATS_MIB_NUM] = 37, [IPSTATS_MIB_INPKTS] = 3, [IPSTATS_MIB_INOCTETS] = 147, [IPSTATS_MIB_INDELIVERS] = 3, [IPSTATS_MIB_OUTFORWDATAGRAMS] = 0, [IPSTATS_MIB_OUTPKTS] = 3, [IPSTATS_MIB_OUTOCTETS] = 147, [IPSTATS_MIB_INHDRERRORS] = 0, [IPSTATS_MIB_INTOOBIGERRORS] = 0, [IPSTATS_MIB_INNOROUTES] = 0, [IPSTATS_MIB_INADDRERRORS] = 0, [IPSTATS_MIB_INUNKNOWNPROTOS] = 0, [IPSTATS_MIB_INTRUNCATEDPKTS] = 0, [IPSTATS_MIB_INDISCARDS] = 0, [IPSTATS_MIB_OUTDISCARDS] = 0, [IPSTATS_MIB_OUTNOROUTES] = 0, [IPSTATS_MIB_REASMTIMEOUT] = 0, [IPSTATS_MIB_REASMREQDS] = 0, [IPSTATS_MIB_REASMOKS] = 0, [IPSTATS_MIB_REASMFAILS] = 0, [IPSTATS_MIB_FRAGOKS] = 0, [IPSTATS_MIB_FRAGFAILS] = 0, [IPSTATS_MIB_FRAGCREATES] = 0, [IPSTATS_MIB_INMCASTPKTS] = 0, [IPSTATS_MIB_OUTMCASTPKTS] = 0, [IPSTATS_MIB_INBCASTPKTS] = 0, [IPSTATS_MIB_OUTBCASTPKTS] = 0, [IPSTATS_MIB_INMCASTOCTETS] = 0, [IPSTATS_MIB_OUTMCASTOCTETS] = 0, [IPSTATS_MIB_INBCASTOCTETS] = 0, [IPSTATS_MIB_OUTBCASTOCTETS] = 0, [IPSTATS_MIB_CSUMERRORS] = 0, ...]], [{nla_len=60, nla_type=IFLA_INET6_ICMP6STATS}, [[ICMP6_MIB_NUM] = 7, [ICMP6_MIB_INMSGS] = 0, [ICMP6_MIB_INERRORS] = 0, [ICMP6_MIB_OUTMSGS] = 0, [ICMP6_MIB_OUTERRORS] = 0, [ICMP6_MIB_CSUMERRORS] = 0, [6 /* ICMP6_MIB_??? 
*/] = 0]], [{nla_len=20, nla_type=IFLA_INET6_TOKEN}, inet_pton(AF_INET6, "::")], [{nla_len=5, nla_type=IFLA_INET6_ADDR_GEN_MODE}, IN6_ADDR_GEN_MODE_NONE]]]]], ...]], [{nlmsg_len=1432, nlmsg_type=RTM_NEWLINK, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1708684694, nlmsg_pid=359190}, {ifi_family=AF_UNSPEC, ifi_type=ARPHRD_ETHER, ifi_index=if_nametoindex("enp1s0"), ifi_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST|IFF_LOWER_UP, ifi_change=0}, [[{nla_len=11, nla_type=IFLA_IFNAME}, "enp1s0"], [{nla_len=8, nla_type=IFLA_TXQLEN}, 1000], [{nla_len=5, nla_type=IFLA_OPERSTATE}, 6], [{nla_len=5, nla_type=IFLA_LINKMODE}, 0], [{nla_len=8, nla_type=IFLA_MTU}, 1500], [{nla_len=8, nla_type=IFLA_MIN_MTU}, 68], [{nla_len=8, nla_type=IFLA_MAX_MTU}, 65535], [{nla_len=8, nla_type=IFLA_GROUP}, 0], [{nla_len=8, nla_type=IFLA_PROMISCUITY}, 0], [{nla_len=8, nla_type=0x3d /* IFLA_??? */}, "\x00\x00\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_TX_QUEUES}, 1], [{nla_len=8, nla_type=IFLA_GSO_MAX_SEGS}, 65535], [{nla_len=8, nla_type=IFLA_GSO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=IFLA_GRO_MAX_SIZE}, 65536], [{nla_len=8, nla_type=0x3f /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x40 /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x3b /* IFLA_??? */}, "\x00\x00\x01\x00"], [{nla_len=8, nla_type=0x3c /* IFLA_??? 
*/}, "\xff\xff\x00\x00"], [{nla_len=8, nla_type=IFLA_NUM_RX_QUEUES}, 1], [{nla_len=5, nla_type=IFLA_CARRIER}, 1], [{nla_len=13, nla_type=IFLA_QDISC}, "fq_codel"], [{nla_len=8, nla_type=IFLA_CARRIER_CHANGES}, 2], [{nla_len=8, nla_type=IFLA_CARRIER_UP_COUNT}, 1], [{nla_len=8, nla_type=IFLA_CARRIER_DOWN_COUNT}, 1], [{nla_len=5, nla_type=IFLA_PROTO_DOWN}, 0], [{nla_len=36, nla_type=IFLA_MAP}, {mem_start=0, mem_end=0, base_addr=0, irq=0, dma=0, port=0}], [{nla_len=10, nla_type=IFLA_ADDRESS}, 52:54:00:04:96:11], [{nla_len=10, nla_type=IFLA_BROADCAST}, ff:ff:ff:ff:ff:ff], [{nla_len=196, nla_type=IFLA_STATS64}, {rx_packets=3355910, tx_packets=2327885, rx_bytes=2231388033, tx_bytes=465000502, rx_errors=0, tx_errors=0, rx_dropped=1153810, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=100, nla_type=IFLA_STATS}, {rx_packets=3355910, tx_packets=2327885, rx_bytes=2231388033, tx_bytes=465000502, rx_errors=0, tx_errors=0, rx_dropped=1153810, tx_dropped=0, multicast=0, collisions=0, rx_length_errors=0, rx_over_errors=0, rx_crc_errors=0, rx_frame_errors=0, rx_fifo_errors=0, rx_missed_errors=0, tx_aborted_errors=0, tx_carrier_errors=0, tx_fifo_errors=0, tx_heartbeat_errors=0, tx_window_errors=0, rx_compressed=0, tx_compressed=0, rx_nohandler=0}], [{nla_len=12, nla_type=IFLA_XDP}, [{nla_len=5, nla_type=IFLA_XDP_ATTACHED}, XDP_ATTACHED_NONE]], [{nla_len=10, nla_type=IFLA_PERM_ADDRESS}, 52:54:00:04:96:11], ...]]], iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 2820
close(19) = 0
write(2, "[mgmt01:359190] [[28544,0],0]: "..., 57[mgmt01:359190] [[28544,0],0]: parent -1 num_children 1
) = 57
write(2, "[mgmt01:359190] [[28544,0],0]: "..., 41[mgmt01:359190] [[28544,0],0]: child 1
) = 41
write(2, "[mgmt01:359190] [[28544,0],0]: "..., 56[mgmt01:359190] [[28544,0],0]: parent 0 num_children 1
) = 56
write(2, "[mgmt01:359190] [[28544,0],0]: "..., 41[mgmt01:359190] [[28544,0],0]: child 1
) = 41
getuid() = 2001
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 19
connect(19, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(19) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 19
connect(19, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(19) = 0
newfstatat(AT_FDCWD, "/etc/nsswitch.conf", {st_mode=S_IFREG|0644, st_size=2973, ...}, 0) = 0
openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = 19
newfstatat(19, "", {st_mode=S_IFREG|0644, st_size=1542, ...}, AT_EMPTY_PATH) = 0
lseek(19, 0, SEEK_SET) = 0
read(19, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1542
close(19) = 0
write(12, "\1\0\0\0\0\0\0\0", 8) = 8
futex(0x1c0bb30, FUTEX_WAKE_PRIVATE, 1) = 1
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
## reformatted for better readability
write(2, "[mgmt01:359190] [[28544,0],0] p"..., 911[mgmt01:359190] [[28544,0],0] plm:rsh: final template argv:
/usr/bin/ssh <template> PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin:$PATH ;
export PATH ; LD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:${LD_LIBRARY_PATH:-} ;
export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:${DYLD_LIBRARY_PATH:-} ;
export DYLD_LIBRARY_PATH ;
/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/orted
-mca ess "env"
-mca ess_base_jobid "1870659584"
-mca ess_base_vpid "<template>"
-mca ess_base_num_procs "2"
-mca orte_node_regex "mgmt[2:1],cn[2:22]@0(2)"
-mca orte_hnp_uri "1870659584.0;tcp://10.10.90.100:39333"
--mca oob_tcp_if_include "10.10.90.0/24"
--mca oob_base_verbose "100"
--mca pml "ucx"
-mca plm "rsh"
--tree-spawn
-mca routed "radix"
-mca orte_parent_uri "1870659584.0;tcp://10.10.90.100:39333"
-mca rmaps_ppr_n_pernode "1"
-mca pmix "^s1,s2,cray,isolated"
) = 911
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f2ca440aa10) = 359193
setpgid(359193, 359193) = 0
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, -1
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
[cn22:63069] mca: base: components_register: registering framework dl components
[cn22:63069] mca: base: components_register: found loaded component dlopen
[cn22:63069] mca: base: components_register: component dlopen register function successful
[cn22:63069] mca: base: components_open: opening dl components
[cn22:63069] mca: base: components_open: found loaded component dlopen
[cn22:63069] mca: base: components_open: component dlopen open function successful
[cn22:63069] mca:base:select: Auto-selecting dl components
[cn22:63069] mca:base:select:( dl) Querying component [dlopen]
[cn22:63069] mca:base:select:( dl) Query of component [dlopen] set priority to 80
[cn22:63069] mca:base:select:( dl) Selected component [dlopen]
[cn22:63069] mca: base: components_register: registering framework if components
[cn22:63069] mca: base: components_register: found loaded component linux_ipv6
[cn22:63069] mca: base: components_register: component linux_ipv6 has no register or open function
[cn22:63069] mca: base: components_register: found loaded component posix_ipv4
[cn22:63069] mca: base: components_register: component posix_ipv4 has no register or open function
[cn22:63069] mca: base: components_open: opening if components
[cn22:63069] mca: base: components_open: found loaded component linux_ipv6
[cn22:63069] mca: base: components_open: component linux_ipv6 open function successful
[cn22:63069] mca: base: components_open: found loaded component posix_ipv4
[cn22:63069] found interface lo
[cn22:63069] found interface eno1
[cn22:63069] found interface enp193s0f0
[cn22:63069] found interface enp7s0
[cn22:63069] found interface enp1s0f0
[cn22:63069] found interface enp33s0f0
[cn22:63069] mca: base: components_open: component posix_ipv4 open function successful
[cn22:63069] mca: base: components_register: registering framework reachable components
[cn22:63069] mca: base: components_register: found loaded component weighted
[cn22:63069] mca: base: components_register: component weighted register function successful
[cn22:63069] mca: base: components_open: opening reachable components
[cn22:63069] mca: base: components_open: found loaded component weighted
[cn22:63069] mca: base: components_open: component weighted open function successful
[cn22:63069] mca:base:select: Auto-selecting reachable components
[cn22:63069] mca:base:select:(reachable) Querying component [weighted]
[cn22:63069] mca:base:select:(reachable) Query of component [weighted] set priority to 1
[cn22:63069] mca:base:select:(reachable) Selected component [weighted]
[cn22:63069] mca: base: components_register: registering framework state components
[cn22:63069] mca: base: components_register: found loaded component tool
[cn22:63069] mca: base: components_register: component tool has no register or open function
[cn22:63069] mca: base: components_register: found loaded component orted
[cn22:63069] mca: base: components_register: component orted has no register or open function
[cn22:63069] mca: base: components_register: found loaded component hnp
[cn22:63069] mca: base: components_register: component hnp has no register or open function
[cn22:63069] mca: base: components_register: found loaded component app
[cn22:63069] mca: base: components_register: component app has no register or open function
[cn22:63069] mca: base: components_register: found loaded component novm
[cn22:63069] mca: base: components_register: component novm has no register or open function
[cn22:63069] mca: base: components_open: opening state components
[cn22:63069] mca: base: components_open: found loaded component tool
[cn22:63069] mca: base: components_open: component tool open function successful
[cn22:63069] mca: base: components_open: found loaded component orted
[cn22:63069] mca: base: components_open: component orted open function successful
[cn22:63069] mca: base: components_open: found loaded component hnp
[cn22:63069] mca: base: components_open: component hnp open function successful
[cn22:63069] mca: base: components_open: found loaded component app
[cn22:63069] mca: base: components_open: component app open function successful
[cn22:63069] mca: base: components_open: found loaded component novm
[cn22:63069] mca: base: components_open: component novm open function successful
[cn22:63069] mca:base:select: Auto-selecting state components
[cn22:63069] mca:base:select:(state) Querying component [tool]
[cn22:63069] mca:base:select:(state) Querying component [orted]
[cn22:63069] mca:base:select:(state) Query of component [orted] set priority to 100
[cn22:63069] mca:base:select:(state) Querying component [hnp]
[cn22:63069] mca:base:select:(state) Querying component [app]
[cn22:63069] mca:base:select:(state) Querying component [novm]
[cn22:63069] mca:base:select:(state) Selected component [orted]
[cn22:63069] mca: base: close: component tool closed
[cn22:63069] mca: base: close: unloading component tool
[cn22:63069] mca: base: close: component hnp closed
[cn22:63069] mca: base: close: unloading component hnp
[cn22:63069] mca: base: close: component app closed
[cn22:63069] mca: base: close: unloading component app
[cn22:63069] mca: base: close: component novm closed
[cn22:63069] mca: base: close: unloading component novm
[cn22:63069] ORTE_JOB_STATE_MACHINE:
[cn22:63069] State: LOCAL LAUNCH COMPLETE cbfunc: DEFINED
[cn22:63069] State: FORCED EXIT cbfunc: DEFINED
[cn22:63069] State: DAEMONS TERMINATED cbfunc: DEFINED
[cn22:63069] ORTE_PROC_STATE_MACHINE:
[cn22:63069] State: RUNNING cbfunc: DEFINED
[cn22:63069] State: SYNC REGISTERED cbfunc: DEFINED
[cn22:63069] State: IOF COMPLETE cbfunc: DEFINED
[cn22:63069] State: WAITPID FIRED cbfunc: DEFINED
[cn22:63069] State: NORMALLY TERMINATED cbfunc: DEFINED
[cn22:63069] mca: base: components_register: registering framework plm components
[cn22:63069] mca: base: components_register: found loaded component rsh
[cn22:63069] mca: base: components_register: component rsh register function successful
[cn22:63069] mca: base: components_open: opening plm components
[cn22:63069] mca: base: components_open: found loaded component rsh
[cn22:63069] mca: base: components_open: component rsh open function successful
[cn22:63069] mca:base:select: Auto-selecting plm components
[cn22:63069] mca:base:select:( plm) Querying component [rsh]
[cn22:63069] mca:base:select:( plm) Query of component [rsh] set priority to 10
[cn22:63069] mca:base:select:( plm) Selected component [rsh]
[cn22:63069] mca: base: components_register: registering framework pmix components
[cn22:63069] mca: base: components_register: found loaded component flux
[cn22:63069] mca: base: components_register: component flux register function successful
[cn22:63069] mca: base: components_register: found loaded component ext3x
[cn22:63069] mca: base: components_register: component ext3x register function successful
[cn22:63069] mca: base: components_open: opening pmix components
[cn22:63069] mca: base: components_open: found loaded component flux
[cn22:63069] mca: base: components_open: found loaded component ext3x
[cn22:63069] mca: base: components_open: component ext3x open function successful
[cn22:63069] mca:base:select: Auto-selecting pmix components
[cn22:63069] mca:base:select:( pmix) Querying component [flux]
[cn22:63069] mca:base:select:( pmix) Querying component [ext3x]
[cn22:63069] mca:base:select:( pmix) Query of component [ext3x] set priority to 5
[cn22:63069] mca:base:select:( pmix) Selected component [ext3x]
[cn22:63069] mca: base: close: unloading component flux
[cn22:63069] psquash: flex128 init
[cn22:63069] psquash: native init
[cn22:63069] psquash: flex128 init
[cn22:63069] PMIX server errreg_cbfunc - error handler registered status=0, reference=1
[cn22:63069] mca: base: components_register: registering framework routed components
[cn22:63069] mca: base: components_register: found loaded component radix
[cn22:63069] mca: base: components_register: component radix register function successful
[cn22:63069] mca: base: components_open: opening routed components
[cn22:63069] mca: base: components_open: found loaded component radix
[cn22:63069] orte_routed_base_select: Initializing routed component radix
[cn22:63069] [[28544,0],1]: Final routed priorities
[cn22:63069] Component: radix Priority: 70
[cn22:63069] mca: base: components_register: registering framework oob components
[cn22:63069] mca: base: components_register: found loaded component tcp
[cn22:63069] mca: base: components_register: component tcp register function successful
[cn22:63069] mca: base: components_open: opening oob components
[cn22:63069] mca: base: components_open: found loaded component tcp
[cn22:63069] mca: base: components_open: component tcp open function successful
[cn22:63069] mca:oob:select: checking available component tcp
[cn22:63069] mca:oob:select: Querying component [tcp]
[cn22:63069] oob:tcp: component_available called
[cn22:63069] [[28544,0],1] oob:tcp: Searching for include address+prefix: 10.10.90.0 / 24
[cn22:63069] oob:tcp: Found match: 10.10.90.122 (enp7s0)
[cn22:63069] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[cn22:63069] [[28544,0],1] oob:tcp:init rejecting interface lo (not in include list)
[cn22:63069] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
[cn22:63069] [[28544,0],1] oob:tcp:init rejecting interface eno1 (not in include list)
[cn22:63069] WORKING INTERFACE 3 KERNEL INDEX 3 FAMILY: V4
[cn22:63069] [[28544,0],1] oob:tcp:init rejecting interface enp193s0f0 (not in include list)
[cn22:63069] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
[cn22:63069] [[28544,0],1] oob:tcp:init adding 10.10.90.122 to our list of V4 connections
[cn22:63069] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
[cn22:63069] [[28544,0],1] oob:tcp:init rejecting interface enp1s0f0 (not in include list)
[cn22:63069] WORKING INTERFACE 6 KERNEL INDEX 9 FAMILY: V4
[cn22:63069] [[28544,0],1] oob:tcp:init rejecting interface enp33s0f0 (not in include list)
[cn22:63069] [[28544,0],1] TCP STARTUP
[cn22:63069] [[28544,0],1] attempting to bind to IPv4 port 0
[cn22:63069] [[28544,0],1] assigned IPv4 port 52207
[cn22:63069] mca:oob:select: Adding component to end
[cn22:63069] mca:oob:select: Found 1 active transports
[cn22:63069] [[28544,0],1]: get transports
[cn22:63069] [[28544,0],1]:get transports for component tcp
[cn22:63069] mca: base: components_register: registering framework odls components
[cn22:63069] mca: base: components_register: found loaded component default
[cn22:63069] mca: base: components_register: component default register function successful
[cn22:63069] mca: base: components_register: found loaded component pspawn
[cn22:63069] mca: base: components_register: component pspawn has no register or open function
[cn22:63069] mca: base: components_open: opening odls components
[cn22:63069] mca: base: components_open: found loaded component default
[cn22:63069] mca: base: components_open: component default open function successful
[cn22:63069] mca: base: components_open: found loaded component pspawn
[cn22:63069] mca: base: components_open: component pspawn open function successful
[cn22:63069] mca:base:select: Auto-selecting odls components
[cn22:63069] mca:base:select:( odls) Querying component [default]
[cn22:63069] mca:base:select:( odls) Query of component [default] set priority to 10
[cn22:63069] mca:base:select:( odls) Querying component [pspawn]
[cn22:63069] mca:base:select:( odls) Query of component [pspawn] set priority to 1
[cn22:63069] mca:base:select:( odls) Selected component [default]
At this point it seems to fail (the following log line is a direct continuation from the previous log line):
[cn22:63069] mca: base: close: component pspawn closed
[cn22:63069] mca: base: close: unloading component pspawn
[cn22:63069] [[28544,0],1]: parent 0 num_children 0
[cn22:63069] [[28544,0],1] OOB_SEND: rml_oob_send.c:265
[cn22:63069] [[28544,0],1] ext3x:client get on proc [[28544,0],1] key (null)
[cn22:63069] [[28544,0],1] oob:base:send to target [[28544,0],0] - attempt 0
[cn22:63069] [[28544,0],1] oob:base:send unknown peer [[28544,0],0]
[cn22:63069] [[28544,0],1] ext3x:client get on proc [[28544,0],0] key opal.puri
[cn22:63069] [[28544,0],1] oob:tcp:send_nb to peer [[28544,0],0]:63 seq = -1
[cn22:63069] [[28544,0],1]:[oob_tcp.c:188] processing send to peer [[28544,0],0]:63 seq_num = -1 hop [[28544,0],0] unknown
[cn22:63069] [[28544,0],1]:[oob_tcp.c:191] post no route to [[28544,0],0]
[cn22:63069] [[28544,0],1] OOB_SEND: rml_oob_send.c:265
[cn22:63069] [[28544,0],1] tcp:no route called for peer [[28544,0],0]
[cn22:63069] [[28544,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:63069] [[28544,0],1] oob:base:send to target [[28544,0],0] - attempt 0
[cn22:63069] [[28544,0],1] oob:base:send unknown peer [[28544,0],0]
[cn22:63069] [[28544,0],1] ext3x:client get on proc [[28544,0],0] key opal.puri
[cn22:63069] [[28544,0],1] oob:tcp:send_nb to peer [[28544,0],0]:10 seq = -1
[cn22:63069] [[28544,0],1]:[oob_tcp.c:188] processing send to peer [[28544,0],0]:10 seq_num = -1 hop [[28544,0],0] unknown
[cn22:63069] [[28544,0],1]:[oob_tcp.c:191] post no route to [[28544,0],0]
[cn22:63069] [[28544,0],1] oob:base:send to target [[28544,0],0] - attempt 1
[cn22:63069] [[28544,0],1] oob:base:send known transport for peer [[28544,0],0]
[cn22:63069] [[28544,0],1] oob:tcp:send_nb to peer [[28544,0],0]:63 seq = -1
[cn22:63069] [[28544,0],1]:[oob_tcp.c:188] processing send to peer [[28544,0],0]:63 seq_num = -1 hop [[28544,0],0] unknown
[cn22:63069] [[28544,0],1]:[oob_tcp.c:191] post no route to [[28544,0],0]
[cn22:63069] [[28544,0],1] tcp:no route called for peer [[28544,0],0]
[cn22:63069] [[28544,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:63069] [[28544,0],1] tcp:no route called for peer [[28544,0],0]
[cn22:63069] [[28544,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:63069] [[28544,0],1] oob:base:send to target [[28544,0],0] - attempt 1
[cn22:63069] [[28544,0],1] oob:base:send unknown peer [[28544,0],0]
[cn22:63069] [[28544,0],1] ext3x:client get on proc [[28544,0],0] key opal.puri
[cn22:63069] [[28544,0],1] oob:tcp:send_nb to peer [[28544,0],0]:10 seq = -1
[cn22:63069] [[28544,0],1]:[oob_tcp.c:188] processing send to peer [[28544,0],0]:10 seq_num = -1 hop [[28544,0],0] unknown
[cn22:63069] [[28544,0],1]:[oob_tcp.c:191] post no route to [[28544,0],0]
[cn22:63069] [[28544,0],1] oob:base:send to target [[28544,0],0] - attempt 2
[cn22:63069] [[28544,0],1] oob:base:send known transport for peer [[28544,0],0]
[cn22:63069] [[28544,0],1] oob:tcp:send_nb to peer [[28544,0],0]:63 seq = -1
[cn22:63069] [[28544,0],1]:[oob_tcp.c:188] processing send to peer [[28544,0],0]:63 seq_num = -1 hop [[28544,0],0] unknown
[cn22:63069] [[28544,0],1]:[oob_tcp.c:191] post no route to [[28544,0],0]
[cn22:63069] [[28544,0],1] tcp:no route called for peer [[28544,0],0]
[cn22:63069] [[28544,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:63069] [[28544,0],1] tcp:no route called for peer [[28544,0],0]
[cn22:63069] [[28544,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:63069] [[28544,0],1] oob:base:send to target [[28544,0],0] - attempt 2
[cn22:63069] [[28544,0],1] oob:base:send unknown peer [[28544,0],0]
[cn22:63069] [[28544,0],1] ext3x:client get on proc [[28544,0],0] key opal.puri
[cn22:63069] [[28544,0],1] oob:tcp:send_nb to peer [[28544,0],0]:10 seq = -1
[cn22:63069] [[28544,0],1]:[oob_tcp.c:188] processing send to peer [[28544,0],0]:10 seq_num = -1 hop [[28544,0],0] unknown
[cn22:63069] [[28544,0],1]:[oob_tcp.c:191] post no route to [[28544,0],0]
[cn22:63069] [[28544,0],1] oob:base:send to target [[28544,0],0] - attempt 3
[cn22:63069] [[28544,0],1] ACTIVATE PROC [[28544,0],0] STATE NO PATH TO TARGET AT base/rml_base_frame.c:234
[cn22:63069] psquash: flex128 finalize
[cn22:63069] mca: base: close: component ext3x closed
[cn22:63069] mca: base: close: unloading component ext3x
[cn22:63069] mca: base: close: component rsh closed
[cn22:63069] mca: base: close: unloading component rsh
[cn22:63069] mca: base: close: component default closed
[cn22:63069] mca: base: close: unloading component default
[cn22:63069] mca: base: close: unloading component radix
[cn22:63069] [[28544,0],1] TCP SHUTDOWN
[cn22:63069] no hnp or not active
[cn22:63069] [[28544,0],1] TCP SHUTDOWN done
[cn22:63069] mca: base: close: component tcp closed
[cn22:63069] mca: base: close: unloading component tcp
[cn22:63069] mca: base: close: component orted closed
[cn22:63069] mca: base: close: unloading component orted
[cn22:63069] mca: base: close: component weighted closed
[cn22:63069] mca: base: close: unloading component weighted
[cn22:63069] mca: base: close: unloading component linux_ipv6
[cn22:63069] mca: base: close: unloading component posix_ipv4
[cn22:63069] mca: base: close: component dlopen closed
[cn22:63069] mca: base: close: unloading component dlopen
) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=359193, si_uid=2001, si_status=1, si_utime=5, si_stime=1} ---
write(4, "\21", 1) = 1
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, -1) = 1 ([{fd=3, revents=POLLIN}])
read(3, "\21", 1024) = 1
read(3, 0x7f2ca444c360, 1024) = -1 EAGAIN (Resource temporarily unavailable)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], WNOHANG, NULL) = 359193
wait4(-1, 0x7ffc3d7ebd44, WNOHANG, NULL) = -1 ECHILD (No child processes)
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, 0) = 0 (Timeout)
write(2, "[mgmt01:359190] [[28544,0],0] A"..., 105[mgmt01:359190] [[28544,0],0] ACTIVATE PROC [[28544,0],1] STATE FAILED TO START AT plm_rsh_module.c:318
) = 105
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, 0) = 0 (Timeout)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/share/openmpi/help-errmgr-base.txt", O_RDONLY) = 19
ioctl(19, TCGETS, 0x7ffc3d7eba90) = -1 ENOTTY (Inappropriate ioctl for device)
newfstatat(19, "", {st_mode=S_IFREG|0644, st_size=4147, ...}, AT_EMPTY_PATH) = 0
read(19, "# -*- text -*-\n#\n# Copyright (c)"..., 8192) = 4147
read(19, "", 4096) = 0
close(19) = 0
write(2, "--------------------------------"..., 1137--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
[ etc. : the typical error message, ]
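The daemon log above repeats the same routing failure several times before giving up. A saved orted log can be condensed with a small helper along these lines. This is only a sketch: the function name oob_summary is made up for illustration, and it assumes the log text arrives on stdin.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): summarise OOB routing
# failures found in an orted debug log read from stdin.
oob_summary() {
    awk '
        /post no route to/        { no_route++ }    # one line per failed send
        /STATE NO PATH TO TARGET/ { final = $0 }    # state reported when the daemon gives up
        END {
            printf "no-route events: %d\n", no_route
            if (final != "") print "final state: " final
        }
    '
}
```

Feeding it the cn22 output shown above would print the number of "post no route" lines followed by the NO PATH TO TARGET line, which makes it easier to compare runs with different MCA settings.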
I'm afraid you are misunderstanding the error message - this has nothing to do with the network. The OOB is complaining that it was never given the connection information for calling back to mpirun. Hence, it has no way of connecting back. The question is why wasn't it given the info?
You might look at the output from --mca plm_base_verbose 5 and see what the ssh command line looks like - it should be given there.
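To check whether the daemon command line actually carries the callback address, the orte_hnp_uri argument can be pulled out of the captured verbose output. A minimal sketch, assuming the output was saved to a file or is piped in; extract_hnp_uri is a hypothetical helper name, not an Open MPI tool:

```shell
#!/bin/sh
# Hypothetical helper: print the value of each -mca orte_hnp_uri "..."
# argument found in the text on stdin (for example the
# "plm:rsh: final template argv" output of mpirun).
extract_hnp_uri() {
    sed -n 's/.*-mca orte_hnp_uri "\([^"]*\)".*/\1/p'
}
```

An empty result would suggest the argument is missing from the ssh command line. A value such as a job id followed by a tcp:// address and port is what the daemon needs in order to call back.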
Thank you again for your time and help!
OOB is complaining that it was never given the connection information
Ahh, I understand. I would have preferred to debug an L2 problem ;-)
Following your suggestion, I repeated the mpirun with more verbose logging and also added some "-x" options:
/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun
--nolocal
--mca plm_base_verbose 5
--mca oob_tcp_if_include "10.10.90.0/24"
--mca oob "tcp"
--mca btl "tcp,vader,self,sm"
--mca plm "rsh"
--mca routed "direct"
--mca pml "ucx"
-x UCX_TLS=tcp,vader,sm,self
-x UCX_TCP_AF_PRIO=inet
-x UCX_NET_DEVICES=enp7s0
-x UCX_SHM_DEVICES=enp7s0
-x UCX_ACC_DEVICES=enp7s0
-x UCX_SELF_DEVICES=enp7s0
-x UCX_PROTOS=all
-x UCX_SOCKADDR_TLS_PRIORITY=tcp,sockcm
-x UCX_WARN_INVALID_CONFIG=y
-x UCX_ADDRESS_DEBUG_INFO=y
--host cn22
-np 2
/usr/bin/hostname
> result
The resulting SSH setup:
write(2, "[cn21:77506] [[9774,0],0] plm"..., 1138[cn21:77506] [[9774,0],0] plm:rsh: final template argv:
/usr/bin/ssh <template>
PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:${LD_LIBRARY_PATH:-} ; export LD_LIBRARY_PATH ;
DYLD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:${DYLD_LIBRARY_PATH:-} ; export DYLD_LIBRARY_PATH ;
/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/orted
-mca ess "env"
-mca ess_base_jobid "640548864"
-mca ess_base_vpid "<template>"
-mca ess_base_num_procs "2"
-mca orte_node_regex "cn[2:21-22]@0(2)"
-mca orte_hnp_uri "640548864.0;tcp://10.10.90.121:45851"
--mca plm_base_verbose "5"
--mca oob_tcp_if_include "10.10.90.0/24"
--mca oob "tcp"
--mca btl "tcp,vader,self,sm"
--mca pml "ucx"
-mca plm "rsh"
--tree-spawn
-mca routed "direct"
-mca orte_parent_uri "640548864.0;tcp://10.10.90.121:45851"
-mca hwloc_base_report_bindings "1"
-mca orte_display_alloc "1"
-mca rmaps_base_no_schedule_local "1"
-mca pmix "^s1,s2,cray,isolated"
) = 1138
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7ff529d04a10) = 77509
setpgid(77509, 77509) = 0
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, -1tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
Restriction of network works:
write(2, "[cn21:77506] mca: base: compo"..., 85[cn21:77506] mca: base: components_register: registering framework oob components
write(2, "[cn21:77506] mca: base: compo"..., 75[cn21:77506] mca: base: components_register: found loaded component tcp
write(2, "[cn21:77506] mca: base: compo"..., 91[cn21:77506] mca: base: components_register: component tcp register function successful
write(2, "[cn21:77506] mca: base: compo"..., 67[cn21:77506] mca: base: components_open: opening oob components
write(2, "[cn21:77506] mca: base: compo"..., 71[cn21:77506] mca: base: components_open: found loaded component tcp
write(2, "[cn21:77506] mca: base: compo"..., 83[cn21:77506] mca: base: components_open: component tcp open function successful
write(2, "[cn21:77506] mca:oob:select: "..., 65[cn21:77506] mca:oob:select: checking available component tcp
write(2, "[cn21:77506] mca:oob:select: "..., 57[cn21:77506] mca:oob:select: Querying component [tcp]
write(2, "[cn21:77506] oob:tcp: compone"..., 52[cn21:77506] oob:tcp: component_available called
write(2, "[cn21:77506] [[9774,0],0] oob"..., 92[cn21:77506] [[9774,0],0] oob:tcp: Searching for include address+prefix: 10.10.90.0 / 24
write(2, "[cn21:77506] oob:tcp: Found m"..., 60[cn21:77506] oob:tcp: Found match: 10.10.90.121 (enp7s0)
write(2, "[cn21:77506] WORKING INTERFAC"..., 62[cn21:77506] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
write(2, "[cn21:77506] [[9774,0],0] oob"..., 87[cn21:77506] [[9774,0],0] oob:tcp:init rejecting interface lo (not in include list)
write(2, "[cn21:77506] [[9774,0],0] oob"..., 89[cn21:77506] [[9774,0],0] oob:tcp:init rejecting interface eno1 (not in include list)
write(2, "[cn21:77506] [[9774,0],0] oob"..., 95[cn21:77506] [[9774,0],0] oob:tcp:init rejecting interface enp193s0f0 (not in include list)
write(2, "[cn21:77506] [[9774,0],0] oob"..., 92[cn21:77506] [[9774,0],0] oob:tcp:init adding 10.10.90.121 to our list of V4 connections
write(2, "[cn21:77506] [[9774,0],0] oob"..., 93[cn21:77506] [[9774,0],0] oob:tcp:init rejecting interface enp1s0f0 (not in include list)
write(2, "[cn21:77506] [[9774,0],0] oob"..., 94[cn21:77506] [[9774,0],0] oob:tcp:init rejecting interface enp33s0f0 (not in include list)
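The include filter in the log above accepts an interface only if its address falls inside 10.10.90.0/24. That comparison can be sketched in plain shell arithmetic. This illustrates the CIDR test only and is not Open MPI's implementation; ip_to_int and in_subnet are names invented here.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address into a single integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if address $1 lies inside network $2 with prefix length $3.
in_subnet() {
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "$2")
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    [ $(( addr & mask )) -eq $(( net & mask )) ]
}
```

For example, `in_subnet 10.10.90.121 10.10.90.0 24` succeeds, while an address on one of the other networks of these multihomed hosts, such as 10.10.4.121, fails the test and would be rejected by the filter.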
Loading of mca components all successful
[cn22:70820] mca: base: components_register: registering framework plm components
[cn22:70820] mca: base: components_register: found loaded component rsh
[cn22:70820] mca: base: components_register: component rsh register function successful
[cn22:70820] mca: base: components_open: opening plm components
[cn22:70820] mca: base: components_open: found loaded component rsh
[cn22:70820] mca: base: components_open: component rsh open function successful
[cn22:70820] mca:base:select: Auto-selecting plm components
[cn22:70820] mca:base:select:( plm) Querying component [rsh]
[cn22:70820] mca:base:select:( plm) Query of component [rsh] set priority to 10
[cn22:70820] mca:base:select:( plm) Selected component [rsh]
[cn22:70820] mca: base: components_register: registering framework routed components
[cn22:70820] mca: base: components_register: found loaded component direct
[cn22:70820] mca: base: components_register: component direct has no register or open function
[cn22:70820] mca: base: components_open: opening routed components
[cn22:70820] mca: base: components_open: found loaded component direct
[cn22:70820] orte_routed_base_select: Initializing routed component direct
[cn22:70820] [[9774,0],1]: Final routed priorities
[cn22:70820] Component: direct Priority: 0
[cn22:70820] mca: base: components_register: registering framework oob components
[cn22:70820] mca: base: components_register: found loaded component tcp
[cn22:70820] mca: base: components_register: component tcp register function successful
[cn22:70820] mca: base: components_open: opening oob components
[cn22:70820] mca: base: components_open: found loaded component tcp
[cn22:70820] mca: base: components_open: component tcp open function successful
[cn22:70820] mca:oob:select: checking available component tcp
[cn22:70820] mca:oob:select: Querying component [tcp]
[cn22:70820] oob:tcp: component_available called
TCP session successful:
[cn22:70820] [[9774,0],1] TCP STARTUP
[cn22:70820] [[9774,0],1] attempting to bind to IPv4 port 0
[cn22:70820] [[9774,0],1] assigned IPv4 port 60373
[cn22:70820] mca:oob:select: Adding component to end
[cn22:70820] mca:oob:select: Found 1 active transports
[cn22:70820] [[9774,0],1]: get transports
[cn22:70820] [[9774,0],1]:get transports for component tcp
[cn22:70820] [[9774,0],1] OOB_SEND: rml_oob_send.c:265
[cn22:70820] [[9774,0],1] oob:base:send to target [[9774,0],0] - attempt 0
[cn22:70820] [[9774,0],1] oob:base:send unknown peer [[9774,0],0]
[cn22:70820] [[9774,0],1] oob:tcp:send_nb to peer [[9774,0],0]:63 seq = -1
[cn22:70820] [[9774,0],1]:[oob_tcp.c:188] processing send to peer [[9774,0],0]:63 seq_num = -1 hop [[9774,0],0] unknown
[cn22:70820] [[9774,0],1]:[oob_tcp.c:191] post no route to [[9774,0],0]
[cn22:70820] [[9774,0],1] OOB_SEND: rml_oob_send.c:265
[cn22:70820] [[9774,0],1] tcp:no route called for peer [[9774,0],0]
[cn22:70820] [[9774,0],1] OOB_SEND: oob_tcp_component.c:1123
[cn22:70820] [[9774,0],1] oob:base:send to target [[9774,0],0] - attempt 0
[ etc. ]
[cn22:70820] [[9774,0],1] oob:base:send to target [[9774,0],0] - attempt 3
[cn22:70820] mca: base: close: component rsh closed
[cn22:70820] mca: base: close: unloading component rsh
[cn22:70820] mca: base: close: unloading component direct
[cn22:70820] [[9774,0],1] TCP SHUTDOWN
[cn22:70820] no hnp or not active
[cn22:70820] [[9774,0],1] TCP SHUTDOWN done
[cn22:70820] mca: base: close: component tcp closed
[cn22:70820] mca: base: close: unloading component tcp
Then the first errors again:
) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=77509, si_uid=2001, si_status=1, si_utime=1, si_stime=0} ---
write(4, "\21", 1) = 1
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, -1) = 1 ([{fd=3, revents=POLLIN}])
read(3, "\21", 1024) = 1
read(3, 0x7ff529f4f360, 1024) = -1 EAGAIN (Resource temporarily unavailable)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], WNOHANG, NULL) = 77509
wait4(-1, 0x7ffde776f8c4, WNOHANG, NULL) = -1 ECHILD (No child processes)
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, 0) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, 0) = 0 (Timeout)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/share/openmpi/help-errmgr-base.txt", O_RDONLY) = 19
ioctl(19, TCGETS, 0x7ffde776f610) = -1 ENOTTY (Inappropriate ioctl for device)
newfstatat(19, "", {st_mode=S_IFREG|0644, st_size=4147, ...}, AT_EMPTY_PATH) = 0
read(19, "# -*- text -*-\n#\n# Copyright (c)"..., 8192) = 4147
read(19, "", 8192) = 0
close(19) = 0
write(2, "--------------------------------"..., 1137--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
) = 1137
1) I wonder whether the "<template>" placeholder in the ssh setup line
-mca ess_base_vpid "<template>"
is correct. I assume the number/ID one would expect here is filled in upon execution?
2) I also executed that ssh line from cn21 to cn22 step by step including the env settings, and was able to launch ORTED without any error message.
tcp 0 0 0.0.0.0:48261 0.0.0.0:* LISTEN 2001 591373 71120/orted
tcp 0 0 127.0.0.1:52737 0.0.0.0:* LISTEN 2001 591372 71120/orted
3) I double-checked the actual user-facing error message and its suggestions for resolving the problem, and compared them to the configure options used to build OpenMPI:
Configure command line:
'--prefix=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5'
'--disable-static'
'--enable-builtin-atomics'
'--with-sge'
'--enable-mpi-cxx'
'--with-hwloc=/opt/ohpc/pub/libs/hwloc'
'--with-pmix=/opt/ohpc/admin/pmix'
'--with-libevent=external'
'--with-libfabric=/opt/ohpc/pub/mpi/libfabric/1.18.0'
'--with-ucx=/opt/ohpc/pub/mpi/ucx-ohpc/1.14.0'
'--without-verbs'
'--with-tm=/opt/pbs/'
The library paths are all present and accessible on both nodes.
4) And just to make sure: this OpenMPI is installed together with Slurm, but I disabled Slurm intentionally (all daemons shut down, Slurm's PAM lock for SSH is also disabled, so passwordless logins work flawlessly) to keep the debugging environment simple. Also: I executed the mpirun directly on the compute nodes, not on the login node. I hope there is no remote chance left that a "dead" Slurm is still blocking access?
5)
The OOB is complaining that it was never given the connection information for calling back to mpirun. Hence, it has no way of connecting back. The question is why wasn't it given the info?
I believe I cranked up all logging to its maximum in another session afterwards, but could not obtain any additional info. Could you please give me an idea where/what to look for in the output of the mpirun execution? Everything after "TCP STARTUP" is possibly too late?
If you have any idea, I would be tremendously grateful, because currently I have the impression that the problem is in OpenMPI's internal process communication, which I unfortunately have little idea how to look into beyond examining the logs...
Thank you.
First you should make sure there is no firewall between the hosts.
For example, on cn21
$ nc -l 45851
and then from another terminal, on cn22
$ echo hello | nc 10.10.90.121 45851
hello
should be displayed on the first terminal.
Then you can run
$ ifconfig -a
$ netstat -nr
on cn22
to check whether there is routing between the two nodes or not.
Then if you want to use strace, what you really want is to strace the orted daemon spawned on cn22.
Create a script orted.sh like this
#!/bin/sh
strace -f /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/orted "$@"
make it executable and then from cn21
$ /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun --mca orte_launch_agent `pwd`/orted.sh --mca host cn22 --mca oob_tcp_if_include 10.10.90.0/24 -np 1 hostname
Hello @ggouaillardet, thank you very much for your proposal!
1) Routes are enabled for each of the interfaces of both multihomed hosts. The one used for messaging does not have the highest priority, but ping still works well.
2) No firewall is installed.
3) The netcat test succeeds in both directions.
I will now test your proposal with strace, which sounds very promising.
At this occasion: from "OpenMPI_easybuild_tech_talks_01_OpenMPI_part2" i gathered, "ethernet-only" networks (as of 2020)
Should i therefore better focus the strace on
--mca pml "obi" --mca btl "tcp,vader,sm,self"
?
Best
pml is used by the MPI application only, and hostname is obviously not one, so long story short, it does not matter here.
Btw, there was a typo, it should be --mca pml ob1
Hello @ggouaillardet
@typo: ha, incomplete cognitive spillover ;-)
I hope i understood you right.
The wrapper:
orted_wrapper.sh
#!/bin/sh
time=$(date +%y%m%d_%H%M%S)
echo -e "\n[$(date +%y%m%d_%H%M%S)] $0 launched on $(hostname -s)\n\n"| tee -a orted_sh_${time}.log
strace -f /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/orted "$@" >> orted_sh_${time}.log
exit 0
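One thing to keep in mind with a wrapper like the one above: strace writes its trace to stderr by default, so redirecting only stdout with ">>" would not put the trace into the log file, and a trailing "exit 0" hides the daemon's exit status. A possible variant is sketched below. The ORTED and LOGDIR variables and the function names are assumptions made for this example; -f and -o are standard strace options.

```shell
#!/bin/sh
# Sketch of a launch-agent wrapper; the paths below are assumptions.
ORTED=${ORTED:-/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/orted}
LOGDIR=${LOGDIR:-$HOME}

# Build a per-host, per-run log file name: <dir>/orted_trace_<host>_<timestamp>.log
trace_log_name() {
    printf '%s/orted_trace_%s_%s.log\n' "$1" "$2" "$3"
}

# Run orted under strace and write the trace to the log file via -o.
run_traced() {
    log=$(trace_log_name "$LOGDIR" "$(hostname -s)" "$(date +%y%m%d_%H%M%S)")
    strace -f -o "$log" "$ORTED" "$@"
}

# mpirun always passes arguments to the launch agent; with none, do nothing.
if [ "$#" -gt 0 ]; then
    run_traced "$@"
fi
```

Because the last command is strace itself, the wrapper returns the exit status of orted instead of always reporting success.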
Because of different behaviour, i tested 2 syntax variants.
1) mpitestuser@cn21: mpirun --mca orte_launch_agent "/home/mpitestuser/orted_wrapper.sh" --mca oob_tcp_if_include "10.10.90.0/24" -np 1 --mca host cn22 hostname
2) mpitestuser@cn21: mpirun --mca orte_launch_agent "/home/mpitestuser/orted_wrapper.sh" --mca oob_tcp_if_include "10.10.90.0/24" -np 1 --host cn22 hostname
[mpitestuser@cn21:tty0]()[~]$
date; strace /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun --mca orte_launch_agent "/home/mpitestuser/orted_wrapper.sh" --mca oob_tcp_if_include "10.10.90.0/24" -np 1 --mca host cn22 /usr/bin/hostname; date
Sun Feb 25 07:18:42 PM CET 2024
execve("/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun", ["/opt/ohpc/pub/mpi/openmpi4-gnu12"..., "--mca", "orte_launch_agent", "/home/mpitestuser/orted_wrapper.sh", "--mca", "oob_tcp_if_include", "10.10.90.0/24", "-np", "1", "--mca", "host", "cn22", "/usr/bin/hostname"], 0x7ffe60fe02b0 /* 39 vars */) = 0
brk(NULL)
[ etc. ]
chdir("/home/mpitestuser") = 0
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=23, events=POLLIN}], 6, 0) = 0 (Timeout)
chdir("/home/mpitestuser") = 0
write(2, "[cn21:84550] [[17066,0],0] od"..., 71[cn21:84550] [[17066,0],0] odls:launch spawning child [[17066,1],0]
) = 71
write(2, "[cn21:84550] \n Data for app_c"..., 5483[cn21:84550]
Data for app_context: index 0 app: /usr/bin/hostname
Num procs: 1 FirstRank: 0
Argv[0]: /usr/bin/hostname
Env[0]: OMPI_MCA_orte_launch_agent=/home/mpitestuser/orted_wrapper.sh
Env[1]: OMPI_MCA_oob_tcp_if_include=10.10.90.0/24
Env[2]: OMPI_MCA_host=cn22
Env[3]: OMPI_MCA_pmix=^s1,s2,cray,isolated
Env[4]: PMIX_MCA_mca_base_component_show_load_errors=1
Env[5]: PMIX_DEBUG=100
Env[6]: OMPI_COMMAND=hostname
Env[7]: OMPI_MCA_orte_precondition_transports=51fe5876dec26368-9d75d092c8cfa2bd
Env[8]: SHELL=/bin/bash
Env[9]: GREP_COLOR=7;31;43
Env[10]: HISTCONTROL=ignoredups
Env[11]: HISTSIZE=
Env[12]: HOSTNAME=cn21
Env[13]: HISTTIMEFORMAT=[%F %T]
Env[14]: PWD=/home/mpitestuser
Env[15]: LOGNAME=mpitestuser
Env[16]: XDG_SESSION_TYPE=tty
Env[17]: MOTD_SHOWN=pam
Env[18]: HOME=/home/mpitestuser
Env[19]: LANG=en_US.UTF-8
Env[20]: HISTFILE=/home/mpitestuser/.bash_history_hf
Env[21]: LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.m4a=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.oga=01;36:*.opus=01;36:*.spx=01;36:*.xspf=01;36::di=96:su=30;41:sg=30;41
Env[22]: SSH_CONNECTION=10.10.90.100 38490 10.10.90.121 22
Env[23]: SLRMDEFENVRS=/usr/local/bin/slurm/slurmd/slrmdefenvvars
Env[24]: XDG_SESSION_CLASS=user
Env[25]: SELINUX_ROLE_REQUESTED=
Env[26]: TERM=xterm-256color
Env[27]: LESSOPEN=||/usr/bin/lesspipe.sh %s
Env[28]: USER=mpitestuser
Env[29]: SELINUX_USE_CURRENT_RANGE=
Env[30]: SHLVL=1
Env[31]: XDG_SESSION_ID=596
Env[32]: XDG_RUNTIME_DIR=/run/user/2001
Env[33]: S_COLORS=auto
Env[34]: PS1=\n\n\[\[\033[38;5;11m\][\u@\H:\[\]\[\033[38;5;190m\]tty\l\[\[\033[38;5;11m\]\[\033[38;5;11m\]\]]($(date +%y%m%d_%H%M%S))[\w]\[\033[38;5;81m\]$\[\033[0m\]\n \[\033[38;5;220m\]\[\033[48;5;24m\] \!.\#: \[\033[0m\]\[\]
Env[35]: SSH_CLIENT=10.10.90.100 38490 22
Env[36]: DEBUGINFOD_URLS=https://debuginfod.centos.org/
Env[37]: which_declare=declare -f
Env[38]: XDG_DATA_DIRS=/home/mpitestuser/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share:/usr/share
Env[39]: PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin:/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin:/home/mpitestuser/.local/bin:/home/mpitestuser/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin:/opt/ohpc/pub/mpi/ucx-ohpc/1.14.0/bin:/opt/ohpc/pub/compiler/gcc/12.2.0/bin
Env[40]: SELINUX_LEVEL_REQUESTED=
Env[41]: HISTFILESIZE=
Env[42]: DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/2001/bus
Env[43]: MAIL=/var/spool/mail/mpitestuser
Env[44]: SSH_TTY=/dev/pts/0
Env[45]: BASH_FUNC_which%%=() { ( alias;
eval ${which_declare} ) | /usr/bin/which --tty-only --read-alias --read-functions --show-tilde --show-dot $@
}
Env[46]: _=/usr/bin/strace
Env[47]: IPATH_NO_BACKTRACE=1
Env[48]: HFI_NO_BACKTRACE=1
Env[49]: LD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib
Env[50]: OMPI_MCA_orte_local_daemon_uri=1118437376.0;tcp://10.10.90.121:44861
Env[51]: OMPI_MCA_orte_hnp_uri=1118437376.0;tcp://10.10.90.121:44861
Env[52]: OMPI_MCA_mpi_oversubscribe=0
Env[53]: OMPI_MCA_orte_app_num=0
Env[54]: OMPI_UNIVERSE_SIZE=96
Env[55]: OMPI_MCA_orte_num_nodes=1
Env[56]: OMPI_MCA_shmem_RUNTIME_QUERY_hint=mmap
Env[57]: OMPI_MCA_orte_bound_at_launch=1
Env[58]: OMPI_MCA_ess=^singleton
Env[59]: OMPI_MCA_orte_ess_num_procs=1
Env[60]: OMPI_COMM_WORLD_SIZE=1
Env[61]: OMPI_COMM_WORLD_LOCAL_SIZE=1
Env[62]: OMPI_MCA_orte_tmpdir_base=/tmp
Env[63]: OMPI_MCA_orte_top_session_dir=/tmp/ompi.cn21.2001
Env[64]: OMPI_MCA_orte_jobfam_session_dir=/tmp/ompi.cn21.2001/pid.84550
Env[65]: OMPI_NUM_APP_CTX=1
Env[66]: OMPI_FIRST_RANKS=0
Env[67]: OMPI_APP_CTX_NUM_PROCS=1
Env[68]: OMPI_MCA_initial_wdir=/home/mpitestuser
Env[69]: OMPI_MCA_orte_launch=1
Working dir: /home/mpitestuser
Prefix: /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5
Used on node: TRUE
ORTE_ATTR: GLOBAL Data type: OPAL_STRING Key: APP-PREFIX-DIR Value: /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5
ORTE_ATTR: LOCAL Data type: OPAL_INT32 Key: APP-MAX-RESTARTS Value: 0
) = 5483
pipe([25, 26]) = 0
[ etc. ]
openat(AT_FDCWD, "/tmp/ompi.cn21.2001", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 9
newfstatat(9, "", {st_mode=S_IFDIR|0700, st_size=180, ...}, AT_EMPTY_PATH) = 0
getdents64(9, 0x8d07e0 /* 9 entries */, 32768) = 272
close(9) = 0
munmap(0x7fbdeabf1000, 38280) = 0
munmap(0x7fbdeafcf000, 16912) = 0
munmap(0x7fbdeac2d000, 16912) = 0
munmap(0x7fbdeac28000, 16976) = 0
munmap(0x7fbdeac1c000, 48408) = 0
munmap(0x7fbdeac17000, 16912) = 0
write(2, "[cn21:84550] mca: base: close"..., 60[cn21:84550] mca: base: close: component weighted closed
) = 60
write(2, "[cn21:84550] mca: base: close"..., 63[cn21:84550] mca: base: close: unloading component weighted
) = 63
munmap(0x7fbdeac0d000, 16792) = 0
close(5) = 0
rt_sigaction(SIGCHLD, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fbdeac89db0}, NULL, 8) = 0
close(3) = 0
close(4) = 0
munmap(0x7fbdeac12000, 17048) = 0
write(2, "[cn21:84550] mca: base: close"..., 65[cn21:84550] mca: base: close: unloading component posix_ipv4
) = 65
munmap(0x7fbdeab4c000, 21072) = 0
write(2, "[cn21:84550] mca: base: close"..., 58[cn21:84550] mca: base: close: component dlopen closed
) = 58
write(2, "[cn21:84550] mca: base: close"..., 61[cn21:84550] mca: base: close: unloading component dlopen
) = 61
newfstatat(AT_FDCWD, "/tmp/ompi.cn21.2001/pid.84550", 0x7fff09af80f0, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/ompi.cn21.2001/pid.84550", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/ompi.cn21.2001/pid.84550", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/ompi.cn21.2001", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
newfstatat(3, "", {st_mode=S_IFDIR|0700, st_size=180, ...}, AT_EMPTY_PATH) = 0
getdents64(3, 0x8d07e0 /* 9 entries */, 32768) = 272
close(3) = 0
newfstatat(AT_FDCWD, "/tmp/ompi.cn21.2001", {st_mode=S_IFDIR|0700, st_size=180, ...}, 0) = 0
openat(AT_FDCWD, "/tmp/ompi.cn21.2001", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
newfstatat(3, "", {st_mode=S_IFDIR|0700, st_size=180, ...}, AT_EMPTY_PATH) = 0
getdents64(3, 0x8d07e0 /* 9 entries */, 32768) = 272
newfstatat(AT_FDCWD, "/tmp/ompi.cn21.2001/jf.9774", {st_mode=S_IFDIR|0700, st_size=120, ...}, 0) = 0
newfstatat(AT_FDCWD, "/tmp/ompi.cn21.2001/pid.69893", {st_mode=S_IFDIR|0700, st_size=140, ...}, 0) = 0
newfstatat(AT_FDCWD, "/tmp/ompi.cn21.2001/jf.31671", {st_mode=S_IFDIR|0700, st_size=120, ...}, 0) = 0
newfstatat(AT_FDCWD, "/tmp/ompi.cn21.2001/jf.31261", {st_mode=S_IFDIR|0700, st_size=120, ...}, 0) = 0
newfstatat(AT_FDCWD, "/tmp/ompi.cn21.2001/jf.31483", {st_mode=S_IFDIR|0700, st_size=120, ...}, 0) = 0
newfstatat(AT_FDCWD, "/tmp/ompi.cn21.2001/jf.31431", {st_mode=S_IFDIR|0700, st_size=120, ...}, 0) = 0
newfstatat(AT_FDCWD, "/tmp/ompi.cn21.2001/jf.29941", {st_mode=S_IFDIR|0700, st_size=100, ...}, 0) = 0
getdents64(3, 0x8d07e0 /* 0 entries */, 32768) = 0
close(3) = 0
openat(AT_FDCWD, "/tmp/ompi.cn21.2001", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
newfstatat(3, "", {st_mode=S_IFDIR|0700, st_size=180, ...}, AT_EMPTY_PATH) = 0
getdents64(3, 0x8d07e0 /* 9 entries */, 32768) = 272
close(3) = 0
exit_group(0) = ?
+++ exited with 0 +++
Sun Feb 25 07:18:43 PM CET 2024
[mpitestuser@cn21:tty0]()[~]$
STDOUT went only to the CLI of the submitting user on cn21. No logfile orted_sh_${time}.log was generated, so the wrapper was apparently not executed.
[mpitestuser@cn21:tty0]()[~]$
date; strace /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun --mca orte_launch_agent "/home/mpitestuser/orted_wrapper.sh" --mca oob_tcp_if_include "10.10.90.0/24" -np 1 --host cn22 /usr/bin/hostname; date
Sun Feb 25 07:20:31 PM CET 2024
execve("/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun", ["/opt/ohpc/pub/mpi/openmpi4-gnu12"..., "--mca", "orte_launch_agent", "/home/mpitestuser/orted_wrapper.sh", "--mca", "oob_tcp_if_include", "10.10.90.0/24", "-np", "1", "--host", "cn22", "/usr/bin/hostname"], 0x7ffee456e278 /* 39 vars */) = 0
brk(NULL) = 0x1d2f000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe425f4e70) = -1 EINVAL (Invalid argument)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2d9d8b1000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
[ etc. ]
futex(0x1d67d58, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x1d67d08, FUTEX_WAKE_PRIVATE, 1) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
write(2, "[cn21:84646] [[16970,0],0] pl"..., 865[cn21:84646] [[16970,0],0] plm:rsh: final template argv:
/usr/bin/ssh <template>
PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:${LD_LIBRARY_PATH:-} ; export LD_LIBRARY_PATH ;
DYLD_LIBRARY_PATH=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib:${DYLD_LIBRARY_PATH:-} ; export DYLD_LIBRARY_PATH ;
/home/mpitestuser/orted_wrapper.sh
-mca ess "env"
-mca ess_base_jobid "1112145920"
-mca ess_base_vpid "<template>"
-mca ess_base_num_procs "2"
-mca orte_node_regex "cn[2:21-22]@0(2)"
-mca orte_hnp_uri "1112145920.0;tcp://10.10.90.121:58463"
--mca orte_launch_agent "/home/mpitestuser/orted_wrapper.sh"
--mca oob_tcp_if_include "10.10.90.0/24"
-mca plm "rsh"
--tree-spawn
-mca routed "radix"
-mca orte_parent_uri "1112145920.0;tcp://10.10.90.121:58463"
-mca pmix "^s1,s2,cray,isolated"
) = 865
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f2d9d39ea10) = 84649
setpgid(84649, 84649) = 0
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, -1tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
[240225_192033] bash launched on cn22
execve("/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/orted",
["/opt/ohpc/pub/mpi/openmpi4-gnu12"...,
"-mca", "ess", "env",
"-mca", "ess_base_jobid", "1112145920",
"-mca", "ess_base_vpid", "1",
"-mca", "ess_base_num_procs", "2",
"-mca", "orte_node_regex", "cn[2:21-22]@0(2)",
"-mca", "orte_hnp_uri", "1112145920.0;tcp://10.10.90.121:"...,
"--mca", "orte_launch_agent", "/home/mpitestuser/orted_wrapper.sh",
"--mca", "oob_tcp_if_include", "10.10.90.0/24",
"-mca", "plm", "rsh",
"--tree-spawn",
"-mca", "routed", "radix", ...], 0x7ffd7bad58f0 /* 35 vars */) = 0
brk(NULL) = 0x10cc000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffecf1fe750) = -1 EINVAL (Invalid argument)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2cc55c2000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/glibc-hwcaps/x86-64-v3/libopen-rte.so.40", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/glibc-hwcaps/x86-64-v3", 0x7ffecf1fd980, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/glibc-hwcaps/x86-64-v2/libopen-rte.so.40", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/glibc-hwcaps/x86-64-v2", 0x7ffecf1fd980, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/tls/x86_64/x86_64/libopen-rte.so.40", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/tls/x86_64/x86_64", 0x7ffecf1fd980, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/tls/x86_64/libopen-rte.so.40", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/tls/x86_64", 0x7ffecf1fd980, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/tls/x86_64/libopen-rte.so.40", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/tls/x86_64", 0x7ffecf1fd980, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/tls/libopen-rte.so.40", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/tls", 0x7ffecf1fd980, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/x86_64/x86_64/libopen-rte.so.40", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/x86_64/x86_64", 0x7ffecf1fd980, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/x86_64/libopen-rte.so.40", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/x86_64", 0x7ffecf1fd980, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/x86_64/libopen-rte.so.40", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/x86_64", 0x7ffecf1fd980, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/lib/libopen-rte.so.40", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\273\1\0\0\0\0\0"..., 832) = 832
newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=834408, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 773144, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f2cc5505000
mmap(0x7f2cc551f000, 512000, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a000) = 0x7f2cc551f000
mmap(0x7f2cc559c000, 122880, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x97000) = 0x7f2cc559c000
mmap(0x7f2cc55ba000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xb4000) = 0x7f2cc55ba000
mmap(0x7f2cc55c0000, 7192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f2cc55c0000
close(3) = 0
[ etc. ]
write(2, "[cn22:77707] mca: base: close"..., 60[cn22:77707] mca: base: close: component weighted closed
) = 60
write(2, "[cn22:77707] mca: base: close"..., 63[cn22:77707] mca: base: close: unloading component weighted
) = 63
munmap(0x7f2cc50aa000, 16792) = 0
close(5) = 0
rt_sigaction(SIGCHLD, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f2cc5106db0}, NULL, 8) = 0
close(3) = 0
close(4) = 0
munmap(0x7f2cc544c000, 17048) = 0
write(2, "[cn22:77707] mca: base: close"..., 65[cn22:77707] mca: base: close: unloading component posix_ipv4
) = 65
munmap(0x7f2cc507e000, 25888) = 0
munmap(0x7f2cc506c000, 21072) = 0
munmap(0x7f2cc45f0000, 25296) = 0
munmap(0x7f2cc45e8000, 29368) = 0
munmap(0x7f2cc45dd000, 42200) = 0
munmap(0x7f2cc45d6000, 25256) = 0
munmap(0x7f2cc45cf000, 25136) = 0
munmap(0x7f2cc45c7000, 29232) = 0
write(2, "[cn22:77707] mca: base: close"..., 58[cn22:77707] mca: base: close: component dlopen closed
) = 58
write(2, "[cn22:77707] mca: base: close"..., 61[cn22:77707] mca: base: close: unloading component dlopen
) = 61
newfstatat(AT_FDCWD, "/tmp/ompi.cn22.2001/jf.16970", 0x7ffecf1fe320, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/ompi.cn22.2001/jf.16970", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/ompi.cn22.2001/jf.16970", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/ompi.cn22.2001", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
newfstatat(3, "", {st_mode=S_IFDIR|0700, st_size=60, ...}, AT_EMPTY_PATH) = 0
getdents64(3, 0x12b2080 /* 3 entries */, 32768) = 80
close(3) = 0
newfstatat(AT_FDCWD, "/tmp/ompi.cn22.2001", {st_mode=S_IFDIR|0700, st_size=60, ...}, 0) = 0
openat(AT_FDCWD, "/tmp/ompi.cn22.2001", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
newfstatat(3, "", {st_mode=S_IFDIR|0700, st_size=60, ...}, AT_EMPTY_PATH) = 0
getdents64(3, 0x12b2080 /* 3 entries */, 32768) = 80
newfstatat(AT_FDCWD, "/tmp/ompi.cn22.2001/jf.9774", {st_mode=S_IFDIR|0700, st_size=120, ...}, 0) = 0
getdents64(3, 0x12b2080 /* 0 entries */, 32768) = 0
close(3) = 0
openat(AT_FDCWD, "/tmp/ompi.cn22.2001", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
newfstatat(3, "", {st_mode=S_IFDIR|0700, st_size=60, ...}, AT_EMPTY_PATH) = 0
getdents64(3, 0x12b2080 /* 3 entries */, 32768) = 80
close(3) = 0
exit_group(1) = ?
+++ exited with 1 +++
) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=84649, si_uid=2001, si_status=0, si_utime=15, si_stime=63} ---
write(4, "\21", 1) = 1
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, -1) = 1 ([{fd=3, revents=POLLIN}])
read(3, "\21", 1024) = 1
read(3, 0x7f2d9d5e9360, 1024) = -1 EAGAIN (Resource temporarily unavailable)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 84649
wait4(-1, 0x7ffe425f4ce4, WNOHANG, NULL) = -1 ECHILD (No child processes)
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, 0) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, 0) = 0 (Timeout)
poll([{fd=5, events=POLLIN}, {fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=18, events=POLLIN}], 4, -1
++++++ CLI STDOUT stuck here
^C ++++++ Cancelled on CLI
strace: Process 84646 detached
<detached ...>
[cn21:84646] [[16970,0],0] OOB_SEND: rml_oob_send.c:265
[cn21:84646] [[16970,0],0] OOB_SEND: rml_oob_send.c:265
[cn21:84646] [[16970,0],0] oob:base:send to target [[16970,0],1] - attempt 0
[cn21:84646] [[16970,0],0] oob:base:send unknown peer [[16970,0],1]
[cn21:84646] [[16970,0],0] ext3x:client get on proc [[16970,0],1] key opal.puri
[cn21:84646] [[16970,0],0] oob:tcp:send_nb to peer [[16970,0],1]:15 seq = -1
[cn21:84646] [[16970,0],0]:[oob_tcp.c:188] processing send to peer [[16970,0],1]:15 seq_num = -1 hop [[16970,0],1] unknown
[cn21:84646] [[16970,0],0]:[oob_tcp.c:191] post no route to [[16970,0],1]
[cn21:84646] [[16970,0],0] oob:base:send to target [[16970,0],1] - attempt 0
[cn21:84646] [[16970,0],0] oob:base:send known transport for peer [[16970,0],1]
[cn21:84646] [[16970,0],0] oob:tcp:send_nb to peer [[16970,0],1]:15 seq = -1
[cn21:84646] [[16970,0],0]:[oob_tcp.c:188] processing send to peer [[16970,0],1]:15 seq_num = -1 hop [[16970,0],1] unknown
[cn21:84646] [[16970,0],0]:[oob_tcp.c:191] post no route to [[16970,0],1]
[cn21:84646] [[16970,0],0] tcp:no route called for peer [[16970,0],1]
[cn21:84646] [[16970,0],0] OOB_SEND: oob_tcp_component.c:1123
[cn21:84646] [[16970,0],0] tcp:no route called for peer [[16970,0],1]
[cn21:84646] [[16970,0],0] OOB_SEND: oob_tcp_component.c:1123
[cn21:84646] [[16970,0],0] oob:base:send to target [[16970,0],1] - attempt 1
[cn21:84646] [[16970,0],0] oob:base:send unknown peer [[16970,0],1]
[cn21:84646] [[16970,0],0] ext3x:client get on proc [[16970,0],1] key opal.puri
[cn21:84646] [[16970,0],0] oob:tcp:send_nb to peer [[16970,0],1]:15 seq = -1
[cn21:84646] [[16970,0],0]:[oob_tcp.c:188] processing send to peer [[16970,0],1]:15 seq_num = -1 hop [[16970,0],1] unknown
[cn21:84646] [[16970,0],0]:[oob_tcp.c:191] post no route to [[16970,0],1]
[cn21:84646] [[16970,0],0] oob:base:send to target [[16970,0],1] - attempt 1
[cn21:84646] [[16970,0],0] oob:base:send known transport for peer [[16970,0],1]
[cn21:84646] [[16970,0],0] oob:tcp:send_nb to peer [[16970,0],1]:15 seq = -1
[cn21:84646] [[16970,0],0]:[oob_tcp.c:188] processing send to peer [[16970,0],1]:15 seq_num = -1 hop [[16970,0],1] unknown
[cn21:84646] [[16970,0],0]:[oob_tcp.c:191] post no route to [[16970,0],1]
[cn21:84646] [[16970,0],0] tcp:no route called for peer [[16970,0],1]
[cn21:84646] [[16970,0],0] OOB_SEND: oob_tcp_component.c:1123
[cn21:84646] [[16970,0],0] tcp:no route called for peer [[16970,0],1]
[cn21:84646] [[16970,0],0] OOB_SEND: oob_tcp_component.c:1123
[cn21:84646] [[16970,0],0] oob:base:send to target [[16970,0],1] - attempt 2
[cn21:84646] [[16970,0],0] oob:base:send unknown peer [[16970,0],1]
[cn21:84646] [[16970,0],0] ext3x:client get on proc [[16970,0],1] key opal.puri
[cn21:84646] [[16970,0],0] oob:tcp:send_nb to peer [[16970,0],1]:15 seq = -1
[cn21:84646] [[16970,0],0]:[oob_tcp.c:188] processing send to peer [[16970,0],1]:15 seq_num = -1 hop [[16970,0],1] unknown
[cn21:84646] [[16970,0],0]:[oob_tcp.c:191] post no route to [[16970,0],1]
[cn21:84646] [[16970,0],0] oob:base:send to target [[16970,0],1] - attempt 2
[cn21:84646] [[16970,0],0] oob:base:send known transport for peer [[16970,0],1]
[cn21:84646] [[16970,0],0] oob:tcp:send_nb to peer [[16970,0],1]:15 seq = -1
[cn21:84646] [[16970,0],0]:[oob_tcp.c:188] processing send to peer [[16970,0],1]:15 seq_num = -1 hop [[16970,0],1] unknown
[cn21:84646] [[16970,0],0]:[oob_tcp.c:191] post no route to [[16970,0],1]
[cn21:84646] [[16970,0],0] tcp:no route called for peer [[16970,0],1]
[cn21:84646] [[16970,0],0] OOB_SEND: oob_tcp_component.c:1123
[cn21:84646] [[16970,0],0] tcp:no route called for peer [[16970,0],1]
[cn21:84646] [[16970,0],0] OOB_SEND: oob_tcp_component.c:1123
[cn21:84646] [[16970,0],0] oob:base:send to target [[16970,0],1] - attempt 3
[cn21:84646] [[16970,0],0] ACTIVATE PROC [[16970,0],1] STATE NO PATH TO TARGET AT base/rml_base_frame.c:234
[cn21:84646] [[16970,0],0] ACTIVATE JOB NULL STATE DAEMONS TERMINATED AT errmgr_default_hnp.c:756
[cn21:84646] [[16970,0],0] oob:base:send to target [[16970,0],1] - attempt 3
[cn21:84646] [[16970,0],0] ACTIVATE PROC [[16970,0],1] STATE NO PATH TO TARGET AT base/rml_base_frame.c:234
[cn21:84646] [[16970,0],0] ACTIVATE JOB NULL STATE DAEMONS TERMINATED AT errmgr_default_hnp.c:756
[cn21:84646] psquash: flex128 finalize
[cn21:84646] mca: base: close: component ext3x closed
[cn21:84646] mca: base: close: unloading component ext3x
[cn21:84646] mca: base: close: component default closed
[cn21:84646] mca: base: close: unloading component default
[cn21:84646] mca: base: close: unloading component radix
[cn21:84646] mca: base: close: unloading component direct
[cn21:84646] mca: base: close: unloading component binomial
[cn21:84646] mca: base: close: component rsh closed
[cn21:84646] mca: base: close: unloading component rsh
[cn21:84646] mca: base: close: component hnp closed
[cn21:84646] mca: base: close: unloading component hnp
[cn21:84646] [[16970,0],0] TCP SHUTDOWN
[cn21:84646] [[16970,0],0] TCP SHUTDOWN done
[cn21:84646] mca: base: close: component tcp closed
[cn21:84646] mca: base: close: unloading component tcp
[cn21:84646] mca: base: close: component weighted closed
[cn21:84646] mca: base: close: unloading component weighted
[cn21:84646] mca: base: close: unloading component posix_ipv4
[cn21:84646] mca: base: close: component dlopen closed
[cn21:84646] mca: base: close: unloading component dlopen
[mpitestuser@cn21:tty0]()[~]$
This generated the following logfile:
[mpitestuser@cn22:tty0]()[~]$
cat orted_sh_240225_192033.log
[240225_192033] bash launched on cn22
[mpitestuser@cn22:tty0]()[~]$
"Test 2" got stuck near its end, and I had to cancel the command. At that point I could no longer find any process belonging to the submitting user on cn22.
Because the preview of this editor errored when I pasted more log content, I attached a more complete log to this comment. If the complete (big) log would help, I can upload it any time.
Thank you for having a look!
Best
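The retry pattern in the OOB log above (send attempts 0 to 3, each followed by "no route", ending in NO PATH TO TARGET) can be pulled out of a saved log with a few greps. This is only a sketch: the function name summarize_oob_failures is made up here, and the patterns are copied from the log lines in this thread, so they may need adjusting for other Open MPI versions.

```shell
#!/bin/sh
# Summarize OOB routing failures in a saved mpirun log.
# summarize_oob_failures is a hypothetical helper; the patterns are
# taken from the verbose output quoted in this thread.
summarize_oob_failures() {
    log=$1
    # grep -c prints the number of matching lines (0 if none);
    # "|| true" keeps the zero-match case from counting as an error.
    no_route=$(grep -c 'tcp:no route called for peer' "$log" || true)
    no_path=$(grep -c 'STATE NO PATH TO TARGET' "$log" || true)
    # Highest retry counter seen in "oob:base:send to target ... - attempt N".
    last_attempt=$(sed -n 's/.*oob:base:send to target .* - attempt \([0-9][0-9]*\)$/\1/p' "$log" \
        | sort -n | tail -n 1)
    echo "no_route=$no_route no_path=$no_path last_attempt=${last_attempt:-none}"
}

# Example (assumes the mpirun output was saved to mpirun.log):
#   summarize_oob_failures mpirun.log
```

A non-zero no_path count together with a last_attempt of 3 matches the failure shown above, where mpirun gives up on the daemon on cn22.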
Sorry for my mistake in the command line; the second test was the right one. The output is still messy, so let's try this:
orted_wrapper.sh
#!/bin/sh
export OMPI_MCA_oob_base_verbose=100
exec strace -f -o orted.strace -s 512 -- /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/orted "$@"
and on cn21, simply run
/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun --mca orte_launch_agent "/home/mpitestuser/orted_wrapper.sh" --mca oob_tcp_if_include "10.10.90.0/24" -np 1 --host cn22 /usr/bin/hostname
(please do not strace /.../mpirun)
Then you can compress and upload the orted.strace log file.
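If it helps to tell "the wrapper never ran" apart from "the wrapper ran but orted exited", the wrapper can also write one line per invocation, like the launch log shown earlier in this thread. A minimal sketch, assuming a writable directory on the compute node; log_launch, its message text, and the directory argument are hypothetical, and only the timestamp and file name format are modelled on the log above.

```shell
#!/bin/sh
# Hypothetical helper for an orted wrapper: record each invocation.
# The stamp format mirrors the earlier log line "[240225_192033] ...".
log_launch() {
    dir=$1
    stamp=$(date +%y%m%d_%H%M%S)
    host=$(uname -n)              # node name, e.g. cn22
    logfile="$dir/orted_sh_$stamp.log"
    echo "[$stamp] wrapper launched on $host" >> "$logfile"
    # Print the path so the caller can report where the line went.
    echo "$logfile"
}

# In the wrapper, before the exec line, one could call:
#   log_launch "$HOME" >/dev/null
```

Because the line is written before exec, a missing log file means ssh never reached the wrapper, while a log file without orted output points at orted itself.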
Thank you, and sorry for the cluttered output. I attached the strace according to your post. If I can check anything else, I'll be happy to do this any time. Best
issue12359_2.zip
Thanks, I suspect something fishy that involves PMIx (e.g. opal.puri is not set for mpirun).
Can you please run
/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun --mca oob_base_verbose 100 --mca pmix_base_verbose 100 --mca oob_tcp_if_include "10.10.90.0/24" -np 1 --host cn22 /usr/bin/hostname
and compress and share the output?
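One way to capture and compress that output in a single step is sketched below. The verbose MCA messages in this thread are written to stderr (the strace shows write(2, ...)), so both streams are merged into one file. capture_and_compress is a made-up helper name, and gzip is assumed to be installed.

```shell
#!/bin/sh
# Run a command, save stdout and stderr to one file, then gzip it.
# capture_and_compress is a hypothetical helper name.
capture_and_compress() {
    out=$1
    shift
    status=0
    # Merge stderr into stdout so the verbose MCA output is kept.
    "$@" > "$out" 2>&1 || status=$?
    gzip -f "$out"                # leaves "$out.gz"
    # Hand the command's own exit status back to the caller.
    return "$status"
}

# Example (output file name is arbitrary):
#   capture_and_compress mpirun_verbose.txt \
#       /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5/bin/mpirun \
#       --mca oob_base_verbose 100 --mca pmix_base_verbose 100 \
#       --mca oob_tcp_if_include "10.10.90.0/24" -np 1 --host cn22 /usr/bin/hostname
```

If mpirun hangs as in the earlier test, interrupting it with Ctrl-C stops the whole script, so the file may then need to be compressed by hand.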
I attached the strace for your command. In case it matters to you, here are some details about PMIx:
srun --mpi=list
MPI plugin types are...
cray_shasta
pmix
none
pmi2
specific pmix plugin versions available: pmix_v4
/opt/ohpc/admin/pmix/bin/pmix_info
Package: PMIx abuild@ip-172-31-13-34 Distribution
PMIX: 4.2.6
PMIX repo revision: gitf20e0d5d
PMIX release date: Sep 09, 2023
PMIX Standard: 4.2
PMIX Standard ABI: Stable (0.0), Provisional (0.0)
Prefix: /opt/ohpc/admin/pmix
Configured architecture: pmix.arch
Configure host: ip-172-31-13-34
Configured by: abuild
Configured on: Sun Sep 10 16:30:23 UTC 2023
Configure host: ip-172-31-13-34
Configure command line: '--prefix=/opt/ohpc/admin/pmix'
'--with-hwloc=/opt/ohpc/pub/libs/hwloc'
Built by: abuild
Built on: Sun Sep 10 16:31:56 UTC 2023
Built host: ip-172-31-13-34
C compiler: gcc
C compiler absolute: /usr/bin/gcc
C compiler family name: GNU
C compiler version: "11" "." "3" "." "1"
Internal debug support: no
dl support: yes
Symbol vis. support: yes
Manpages built: yes
MCA bfrops: v12 (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA bfrops: v20 (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA bfrops: v21 (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA bfrops: v3 (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA bfrops: v4 (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA bfrops: v41 (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA gds: hash (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA gds: ds12 (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA gds: ds21 (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA pcompress: zlib (MCA v2.1.0, API v2.0.0, Component v4.2.6)
MCA pdl: pdlopen (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA pfexec: linux (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA pif: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v4.2.6)
MCA pif: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v4.2.6)
MCA pinstalldirs: env (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA pinstalldirs: config (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA plog: default (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA plog: stdfd (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA plog: syslog (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA pmdl: ompi (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA pmdl: oshmem (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA pnet: opa (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA preg: compress (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA preg: native (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA preg: raw (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA prm: slurm (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA prm: default (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA psec: native (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA psec: none (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA psensor: file (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA psensor: heartbeat (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA pshmem: mmap (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA psquash: flex128 (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA psquash: native (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA pstat: linux (MCA v2.1.0, API v1.0.0, Component v4.2.6)
MCA ptl: client (MCA v2.1.0, API v2.0.0, Component v4.2.6)
MCA ptl: server (MCA v2.1.0, API v2.0.0, Component v4.2.6)
MCA ptl: tool (MCA v2.1.0, API v2.0.0, Component v4.2.6)
issue12359_3.txt.zip Thank you again.
I do not need strace for now.
But it seems you forgot to pass --mca pmix_base_verbose 100 to the mpirun command line.
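To avoid dropping a flag from a long command line, the same MCA parameters can be set through the environment, the way the wrapper script above does with OMPI_MCA_oob_base_verbose. A small sketch follows; set_mca_env is a hypothetical helper, while the OMPI_MCA_ prefix is the standard Open MPI convention for MCA parameters in the environment.

```shell
#!/bin/sh
# Export each name=value pair as an OMPI_MCA_<name> environment variable.
# set_mca_env is a hypothetical helper name.
set_mca_env() {
    for pair in "$@"; do
        name=${pair%%=*}          # text before the first "="
        value=${pair#*=}          # text after the first "="
        export "OMPI_MCA_$name=$value"
    done
}

set_mca_env oob_base_verbose=100 pmix_base_verbose=100

# mpirun then picks the parameters up without extra --mca flags, e.g.:
#   mpirun --mca oob_tcp_if_include "10.10.90.0/24" -np 1 --host cn22 /usr/bin/hostname
```

Running `env | grep OMPI_MCA_` before mpirun shows what will actually be passed.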
Oh, damn, here it comes: issue12359_4.txt.zip
I am running out of ideas for today...
what if you
export PMIX_MCA_gds=hash
and try again?
I don't see any earth-shattering differences, unfortunately :-/ I don't know whether it matters, but OHPC provides a PMI library for Slurm as a separate package, and I have not installed it so far, because it seems to overwrite native PMI files.
So currently my system has installed:
openmpi4-pmix-gnu12-ohpc.x86_64 4.1.5-300.ohpc.5.2 @OpenHPC
pmix-ohpc.x86_64 4.2.6-300.ohpc.3.1 @OpenHPC
But not:
slurm-libpmi-ohpc.x86_64 22.05.11-302.ohpc.1.1 OpenHPC-updates
I attached the output for what you asked for in your last comment. issue12359_5.txt.zip
Should you have an idea of what could be tried in the meantime, I would definitely give it a try.
You do not need slurm-libpmi-ohpc; it should only be useful if you do something like srun --mpi=pmi2 ...
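Whether a given plugin type is offered at all can be checked against the `srun --mpi=list` output posted earlier. The sketch below reads that listing on stdin and looks for an exact plugin name; mpi_plugin_listed is a hypothetical helper, and the `2>&1` in the usage line is there on the assumption that some Slurm versions print the list on stderr.

```shell
#!/bin/sh
# Succeed if stdin (the output of "srun --mpi=list") contains a line that
# is exactly the given plugin name. mpi_plugin_listed is a made-up name.
mpi_plugin_listed() {
    awk -v want="$1" '
        { gsub(/^[ \t]+|[ \t]+$/, "")      # trim surrounding whitespace
          if ($0 == want) found = 1 }
        END { exit found ? 0 : 1 }'
}

# Example:
#   srun --mpi=list 2>&1 | mpi_plugin_listed pmix && echo "pmix is available"
```

Matching whole lines keeps the "specific pmix plugin versions available: pmix_v4" line from counting as a hit for pmix.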
I am running out of ideas, and pmix was not built with --enable-debug, so we miss useful traces. There was an issue with PMIx > 4.2.2 (on the Open MPI side) and the fix is only available in Open MPI 4.1.6.
At this stage, I would recommend you build Open MPI 4.1.6 with the same options (but in your $HOME directory) and see if this fixes the issue.
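As a starting point for such a private build, the sketch below prints (it does not run) a configure line that reuses the options from the "Configure command line" reported for the OHPC 4.1.5 package and only changes --prefix. print_configure_cmd and the default prefix under $HOME are assumptions, and the dependency paths are only valid on a host where the OHPC packages are installed.

```shell
#!/bin/sh
# Print a configure command for a private Open MPI build.
# Options are copied from the OHPC 4.1.5 "Configure command line";
# print_configure_cmd and the default prefix are hypothetical.
print_configure_cmd() {
    prefix=${1:-$HOME/openmpi-4.1.6}
    set -- \
        "--prefix=$prefix" \
        --disable-static \
        --enable-builtin-atomics \
        --with-sge \
        --enable-mpi-cxx \
        --with-hwloc=/opt/ohpc/pub/libs/hwloc \
        --with-pmix=/opt/ohpc/admin/pmix \
        --with-libevent=external \
        --with-libfabric=/opt/ohpc/pub/mpi/libfabric/1.18.0 \
        --with-ucx=/opt/ohpc/pub/mpi/ucx-ohpc/1.14.0 \
        --without-verbs \
        --with-tm=/opt/pbs/
    # "$*" joins the options with single spaces.
    echo "./configure $*"
}

# Example: run inside the unpacked Open MPI 4.1.6 source tree,
# review the printed line, then execute it followed by make and make install.
#   print_configure_cmd
```

Keeping --with-pmix pointed at the same external PMIx is what makes the comparison with the 4.1.5 package meaningful.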
Ahh...
something like srun --mpi=pmi2 ...
Does this refer to specifying a particular PMI, or "pmi2" in particular, or would "slurm-libpmi-ohpc" be mandatory for all MPI applications as soon as they are launched through the scheduler? I could not find explicit information on the OHPC side about how to deal with the "slurm-libpmi-ohpc" package, but I came across warnings elsewhere about native PMI files being overwritten by their Slurm versions...
In our environment, it would have to be Slurm all the way.
build Open MPI 4.1.6 with the same options
That is a good idea, and I will go this route, unless installing "slurm-libpmi-ohpc" is mandatory anyway, in which case I would try that first to see whether it changes the picture.
I just wanted to give an update: I had to choose a different approach, because I give priority to sticking with OHPC packages.
My current approach is:
Using "openmpi4-gnu12-ohpc.x86_64" works immediately. pmix-ohpc.x86_64 allows me to use pmix3.
The implied downside of not being able to srun MPI applications directly is tolerable for the time being, but I will keep testing "openmpi4-pmix-gnu<X>-ohpc" and switch to it as soon as it works.
So I would close the ticket now?
Hello,
I have a problem with a very basic mpirun invocation of Open MPI 4.1.5 over Ethernet, which looks like a routing problem. However, the underlying network shows no access or routing problems outside of mpirun.
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
OpenMPI provided by OHPC:
Package: Open MPI abuild@ip-172-31-13-34 Distribution
Open MPI: 4.1.5
Open MPI repo revision: v4.1.5
Open MPI release date: Feb 23, 2023
Open RTE: 4.1.5
Open RTE repo revision: v4.1.5
Open RTE release date: Feb 23, 2023
OPAL: 4.1.5
OPAL repo revision: v4.1.5
OPAL release date: Feb 23, 2023
MPI API: 3.1.0
Ident string: 4.1.5
Prefix: /opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5
Configured architecture: x86_64-pc-linux-gnu
Configure host: ip-172-31-13-34
Configured by: abuild
Configured on: Thu Aug 3 14:25:40 UTC 2023
Configure host: ip-172-31-13-34
Configure command line: '--prefix=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5' '--disable-static' '--enable-builtin-atomics' '--with-sge' '--enable-mpi-cxx' '--with-hwloc=/opt/ohpc/pub/libs/hwloc' '--with-pmix=/opt/ohpc/admin/pmix' '--with-libevent=external' '--with-libfabric=/opt/ohpc/pub/mpi/libfabric/1.18.0' '--with-ucx=/opt/ohpc/pub/mpi/ucx-ohpc/1.14.0' '--without-verbs' '--with-tm=/opt/pbs/'
Built by: abuild
How Open MPI was installed
Installed via "openmpi4-pmix-gnu12-ohpc.x86_64" provided by OHPC 3.0, on Rocky 9.x.
System environment:
Details of the problem
The core problem appears to be (sampled from the screen log):
Use case 1: not specifying the network to use:
Use case 2: specifying the network to use:
Because network routing and name resolution work on the CLI, I guess the problem is at the application level.
Any help would be great, since I am totally stuck at this point.