Open: BhattaraiRajat opened this issue 1 year ago
Looks like the remote host cannot open a TCP socket back to st27. When you say:
If no add-hostfile option is given, both processes run without error.
are you running with both hosts in your hostfile? Or just the one?
I am running with just one host in the hostfile. I meant to say that there have been no problems running multiple prun commands without the add-hostfile option.
I understood that last part. My point was just that if there is only one host in the hostfile, then you won't detect that your remote host cannot communicate back to you.
Try putting both hosts in that hostfile and see if you can start the DVM.
I wanted to use the DVM expansion feature: starting the DVM with one node and then adding another node with the add-hostfile option on prun. st27 is the node added via the add-hostfile option on one of the prun commands.
If I put both hosts in the hostfile, I can start the DVM fine.
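For reference, a minimal sketch of the intended expansion workflow, assembled from the commands that appear later in this issue (prte --report-uri dvm.uri --hostfile hostfile and prun --dvm-uri file:dvm.uri --add-hostfile add_hostfile ...); file names and hostnames are the reporter's, and this is only an illustration, not the reporter's actual driver script.

# Sketch only: start the DVM on the initial hostfile, then launch a job that
# also adds the node listed in add_hostfile. Assumes prte/prun are on PATH
# and that "hostfile"/"add_hostfile" exist as described in this issue.
import subprocess
import time

# 1. Start the DVM with the initial hostfile (st26) and write its URI to dvm.uri.
dvm = subprocess.Popen(["prte", "--report-uri", "dvm.uri", "--hostfile", "hostfile"])
time.sleep(5)  # crude wait for "DVM ready"; a real script would watch prte's output

# 2. Launch a job through the DVM, expanding it with the node in add_hostfile (st27).
subprocess.run(["prun", "--dvm-uri", "file:dvm.uri",
                "--add-hostfile", "add_hostfile", "-np", "2", "hostname"])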
I understand what you want to do - I'm just trying to check that PRRTE itself is behaving correctly. With both hosts in the hostfile, it starts - which means that the daemon can indeed communicate back.
So the question becomes: why can't it do so when started by add-hostfile? You were able to do it before, so what has changed?
If you run the two prun commands sequentially on the cmd line, does that work?
If you run the two prun commands sequentially on the cmd line, does that work?
Yes. Running sequentially works.
I'm not familiar with Python's pool module. I gather that it basically fork/exec's both prun processes? Isn't there an inherent race condition here? Whichever prun connects first to the DVM is going to execute first, so you may or may not get the other node involved in both jobs.
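For reference, a minimal sketch of the parallel launch being discussed, reusing the run() helper and prun_commands list from the reproduction steps at the end of this issue; Pool(2) forks two workers, so both prun processes start at essentially the same time and the order in which they reach the DVM is up to the scheduler.

import subprocess
from multiprocessing import Pool

prun_commands = [
    "prun --display allocation --dvm-uri file:dvm.uri --map-by ppr:2:node -n 2 hostname > out0",
    "prun --display allocation --dvm-uri file:dvm.uri --add-hostfile add_hostfile --map-by ppr:2:node -n 2 hostname > out1",
]

def run(x):
    # Each pool worker forks, execs its prun via a shell, and waits for it.
    process = subprocess.Popen(x, stdout=subprocess.PIPE, shell=True)
    output, error = process.communicate()

if __name__ == "__main__":
    with Pool(2) as p:  # two workers, so both pruns are in flight concurrently
        p.map(run, prun_commands)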
Let's see if the problem really is in PRRTE and not in how you are trying to do this. Add --prtemca pmix_server_verbose 5 --prtemca state_base_verbose 5 --leave-session-attached to the prte cmd line and see what it says.
Oh yeah - also add --prtemca plm_base_verbose 5 to the prte cmd line.
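For clarity, a sketch of the DVM launch with all of the verbosity flags suggested above combined; --report-uri dvm.uri and --hostfile hostfile are taken from the reproduction steps at the end of this issue, so this is an assumed launch line rather than the reporter's exact command.

# Sketch: the prte launch with the suggested debug options added.
import subprocess

prte_cmd = [
    "prte", "--report-uri", "dvm.uri", "--hostfile", "hostfile",
    "--leave-session-attached",
    "--prtemca", "pmix_server_verbose", "5",
    "--prtemca", "state_base_verbose", "5",
    "--prtemca", "plm_base_verbose", "5",
]
subprocess.Popen(prte_cmd)  # the DVM keeps running; prun jobs attach via dvm.uri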
Yes. The project I am working on uses the fork method to start the processes via Python's multiprocessing module.
The following is the log after it receives the prun commands.
[st-master:2558116] [prte-st-master-2558116@0,0] TOOL CONNECTION REQUEST RECVD
[st-master:2558116] [prte-st-master-2558116@0,0] PROCESSING TOOL CONNECTION
[st-master:2558116] [prte-st-master-2558116@0,0] TOOL CONNECTION REQUEST RECVD
[st-master:2558116] [prte-st-master-2558116@0,0] LAUNCHER CONNECTION FROM UID 6127 GID 6127 NSPACE prun.st-master.2558144
[st-master:2558116] [prte-st-master-2558116@0,0] PROCESSING TOOL CONNECTION
[st-master:2558116] [prte-st-master-2558116@0,0] LAUNCHER CONNECTION FROM UID 6127 GID 6127 NSPACE prun.st-master.2558143
[st-master:2558116] [prte-st-master-2558116@0,0] spawn upcalled on behalf of proc prun.st-master.2558144:0 with 5 job infos
[st-master:2558116] [prte-st-master-2558116@0,0] spawn called from proc [prun.st-master.2558144,0] with 1 apps
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive processing msg
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive job launch command from [prte-st-master-2558116@0,0]
[st-master:2558116] [prte-st-master-2558116@0,0] spawn upcalled on behalf of proc prun.st-master.2558143:0 with 5 job infos
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive adding hosts
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive calling spawn
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489563] ACTIVATE JOB [INVALID] STATE PENDING INIT AT plm_ssh_module.c:910
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489586] ACTIVATING JOB [INVALID] STATE PENDING INIT PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive done processing commands
[st-master:2558116] [prte-st-master-2558116@0,0] spawn called from proc [prun.st-master.2558143,0] with 1 apps
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive processing msg
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive job launch command from [prte-st-master-2558116@0,0]
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive adding hosts
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive calling spawn
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489835] ACTIVATE JOB [INVALID] STATE PENDING INIT AT plm_ssh_module.c:910
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489847] ACTIVATING JOB [INVALID] STATE PENDING INIT PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive done processing commands
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:setup_job
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489889] ACTIVATE JOB prte-st-master-2558116@1 STATE INIT_COMPLETE AT base/plm_base_launch_support.c:695
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489902] ACTIVATING JOB prte-st-master-2558116@1 STATE INIT_COMPLETE PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:setup_job
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489925] ACTIVATE JOB prte-st-master-2558116@2 STATE INIT_COMPLETE AT base/plm_base_launch_support.c:695
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489936] ACTIVATING JOB prte-st-master-2558116@2 STATE INIT_COMPLETE PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489947] ACTIVATE JOB prte-st-master-2558116@1 STATE PENDING ALLOCATION AT state_dvm.c:257
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489959] ACTIVATING JOB prte-st-master-2558116@1 STATE PENDING ALLOCATION PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489971] ACTIVATE JOB prte-st-master-2558116@2 STATE PENDING ALLOCATION AT state_dvm.c:257
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489981] ACTIVATING JOB prte-st-master-2558116@2 STATE PENDING ALLOCATION PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489993] ACTIVATE JOB prte-st-master-2558116@1 STATE ALLOCATION COMPLETE AT base/ras_base_allocate.c:745
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490003] ACTIVATING JOB prte-st-master-2558116@1 STATE ALLOCATION COMPLETE PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490014] ACTIVATE JOB prte-st-master-2558116@2 STATE ALLOCATION COMPLETE AT base/ras_base_allocate.c:745
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490024] ACTIVATING JOB prte-st-master-2558116@2 STATE ALLOCATION COMPLETE PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490035] ACTIVATE JOB prte-st-master-2558116@1 STATE PENDING DAEMON LAUNCH AT base/plm_base_launch_support.c:201
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490048] ACTIVATING JOB prte-st-master-2558116@1 STATE PENDING DAEMON LAUNCH PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490060] ACTIVATE JOB prte-st-master-2558116@2 STATE PENDING DAEMON LAUNCH AT base/plm_base_launch_support.c:201
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490071] ACTIVATING JOB prte-st-master-2558116@2 STATE PENDING DAEMON LAUNCH PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:setup_vm
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:setup_vm no new daemons required
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490103] ACTIVATE JOB prte-st-master-2558116@1 STATE ALL DAEMONS REPORTED AT plm_ssh_module.c:1056
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490113] ACTIVATING JOB prte-st-master-2558116@1 STATE ALL DAEMONS REPORTED PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:setup_vm
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:setup_vm add new daemon [prte-st-master-2558116@0,2]
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:setup_vm assigning new daemon [prte-st-master-2558116@0,2] to node st27
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh: launching vm
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh: local shell: 0 (bash)
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh: assuming same remote shell as local shell
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh: remote shell: 0 (bash)
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh: final template argv:
/usr/bin/ssh <template> PRTE_PREFIX=/home/rbhattara/dyn-wf/install/prrte;export PRTE_PREFIX;LD_LIBRARY_PATH=/home/rbhattara/dyn-wf/install/prrte/lib:/home/rbhattara/dyn-wf/install/pmix/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/home/rbhattara/dyn-wf/install/prrte/lib:/home/rbhattara/dyn-wf/install/pmix/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/home/rbhattara/dyn-wf/install/prrte/bin/prted --leave-session-attached --prtemca ess "env" --prtemca ess_base_nspace "prte-st-master-2558116@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "3" --prtemca prte_hnp_uri "prte-st-master-2558116@0.0;tcp://10.15.3.34,172.16.0.254,192.168.0.254:51773:24,23,23" --prtemca pmix_server_verbose "5" --prtemca state_base_verbose "5" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh"
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh:launch daemon already exists on node st-master
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh:launch daemon already exists on node st26
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh: adding node st27 to launch list
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh: activating launch event
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:setting slots for node st27 by core
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490490] ACTIVATE JOB prte-st-master-2558116@1 STATE VM READY AT base/plm_base_launch_support.c:177
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490504] ACTIVATING JOB prte-st-master-2558116@1 STATE VM READY PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh: recording launch of daemon [prte-st-master-2558116@0,2]
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.491026] ACTIVATE JOB prte-st-master-2558116@1 STATE PENDING MAPPING AT state_dvm.c:244
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.491045] ACTIVATING JOB prte-st-master-2558116@1 STATE PENDING MAPPING PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh st27 PRTE_PREFIX=/home/rbhattara/dyn-wf/install/prrte;export PRTE_PREFIX;LD_LIBRARY_PATH=/home/rbhattara/dyn-wf/install/prrte/lib:/home/rbhattara/dyn-wf/install/pmix/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/home/rbhattara/dyn-wf/install/prrte/lib:/home/rbhattara/dyn-wf/install/pmix/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/home/rbhattara/dyn-wf/install/prrte/bin/prted --leave-session-attached --prtemca ess "env" --prtemca ess_base_nspace "prte-st-master-2558116@0" --prtemca ess_base_vpid 2 --prtemca ess_base_num_procs "3" --prtemca prte_hnp_uri "prte-st-master-2558116@0.0;tcp://10.15.3.34,172.16.0.254,192.168.0.254:51773:24,23,23" --prtemca pmix_server_verbose "5" --prtemca state_base_verbose "5" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh"]
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.491335] ACTIVATE JOB prte-st-master-2558116@1 STATE MAP COMPLETE AT base/rmaps_base_map_job.c:904
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.491354] ACTIVATING JOB prte-st-master-2558116@1 STATE MAP COMPLETE PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.491370] ACTIVATE JOB prte-st-master-2558116@1 STATE PENDING FINAL SYSTEM PREP AT base/plm_base_launch_support.c:275
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.491383] ACTIVATING JOB prte-st-master-2558116@1 STATE PENDING FINAL SYSTEM PREP PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] complete_setup on job prte-st-master-2558116@1
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.491402] ACTIVATE JOB prte-st-master-2558116@1 STATE PENDING APP LAUNCH AT base/plm_base_launch_support.c:736
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.491414] ACTIVATING JOB prte-st-master-2558116@1 STATE PENDING APP LAUNCH PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:launch_apps for job prte-st-master-2558116@1
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.491643] ACTIVATE JOB prte-st-master-2558116@1 STATE SENDING LAUNCH MSG AT base/odls_base_default_fns.c:146
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.491667] ACTIVATING JOB prte-st-master-2558116@1 STATE SENDING LAUNCH MSG PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:send launch msg for job prte-st-master-2558116@1
[st-master:2558116] [prte-st-master-2558116@0,0] register nspace for prte-st-master-2558116@1
[st-master:2558116] UUID: ipv6://00:00:10:87:fe:80:00:00:00:00:00:00:24:8a:07:03:00:aa:82:be OSNAME: ib0 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2558116] UUID: ipv4://f4:03:43:fe:80:f0 OSNAME: enp22s0f0np0 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2558116] UUID: ipv4://f4:03:43:fe:80:f1 OSNAME: enp22s0f1np1 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2558116] UUID: ipv6://00:00:18:87:fe:80:00:00:00:00:00:00:24:8a:07:03:00:aa:82:c0 OSNAME: ib1 TYPE: NETWORK MIND: 12 MAXD: 12
[st-master:2558116] UUID: fab://248a:0703:00aa:82be::248a:0703:00aa:82be OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2558116] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2558116] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2558116] UUID: fab://248a:0703:00aa:82c0::248a:0703:00aa:82be OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st-master:2558116] UUID: ipv6://00:00:10:87:fe:80:00:00:00:00:00:00:24:8a:07:03:00:aa:82:be OSNAME: ib0 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2558116] UUID: ipv4://f4:03:43:fe:80:f0 OSNAME: enp22s0f0np0 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2558116] UUID: ipv4://f4:03:43:fe:80:f1 OSNAME: enp22s0f1np1 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2558116] UUID: ipv6://00:00:18:87:fe:80:00:00:00:00:00:00:24:8a:07:03:00:aa:82:c0 OSNAME: ib1 TYPE: NETWORK MIND: 12 MAXD: 12
[st-master:2558116] UUID: fab://248a:0703:00aa:82be::248a:0703:00aa:82be OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2558116] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2558116] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2558116] UUID: fab://248a:0703:00aa:82c0::248a:0703:00aa:82be OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.494709] ACTIVATE PROC [prte-st-master-2558116@0,2] STATE NO PATH TO TARGET AT rml/rml.c:123
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.494736] ACTIVATING PROC [prte-st-master-2558116@0,2] STATE NO PATH TO TARGET PRI 0
[st-master:2558116] UNSUPPORTED DAEMON ERROR STATE: NO PATH TO TARGET
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.494765] ACTIVATE JOB NULL STATE DAEMONS TERMINATED AT errmgr_dvm.c:342
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.494777] ACTIVATING JOB NULL STATE DAEMONS TERMINATED PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive stop comm
[st26:869495] [prte-st-master-2558116@0,1] register nspace for prte-st-master-2558116@1
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st-master:2558116] [prte-st-master-2558116@0,0] Finalizing PMIX server
[rbhattara@st-master add-host-debug]$ [st26:869495] [prte-st-master-2558116@0,1] register nspace for prte-st-master-2558116@1
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st26:869495] [prte-st-master-2558116@0,1] [1690324866.576490] ACTIVATE PROC [prte-st-master-2558116@0,0] STATE LIFELINE LOST AT oob_tcp_component.c:881
[st26:869495] [prte-st-master-2558116@0,1] [1690324866.576513] ACTIVATING PROC [prte-st-master-2558116@0,0] STATE LIFELINE LOST PRI 0
[st26:869495] [prte-st-master-2558116@0,1] plm:base:receive stop comm
[st26:869495] [prte-st-master-2558116@0,1] Finalizing PMIX server
[st27:800778] [prte-st-master-2558116@0,2] plm:ssh_lookup on agent ssh : rsh path NULL
[st27:800778] [prte-st-master-2558116@0,2] plm:ssh_setup on agent ssh : rsh path NULL
[st27:800778] [prte-st-master-2558116@0,2] plm:base:receive start comm
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: st27
Remote host: 192.168.0.254
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
[st27:800778] [prte-st-master-2558116@0,2] [1690324866.972653] ACTIVATE PROC [prte-st-master-2558116@0,0] STATE FAILED TO CONNECT AT oob_tcp_component.c:1022
[st27:800778] [prte-st-master-2558116@0,2] [1690324866.972669] ACTIVATING PROC [prte-st-master-2558116@0,0] STATE FAILED TO CONNECT PRI 0
[st27:800778] [prte-st-master-2558116@0,2] plm:base:receive stop comm
[st27:800778] [prte-st-master-2558116@0,2] Finalizing PMIX server
Yes. The project I am working on uses the fork method to start the processes via Python's multiprocessing module.
Understood - but as implemented, your test will yield non-deterministic results. Is that what you want?
Try adding --prtemca oob_base_verbose 5 --prtemca rml_base_verbose 5 to the prte cmd line. It looks like we simply cannot create a socket connection back to prte for some reason.
This is the log with --prtemca oob_base_verbose 5 --prtemca rml_base_verbose 5
[st-master:2560223] [prte-st-master-2560223@0,0] TOOL CONNECTION REQUEST RECVD
[st-master:2560223] [prte-st-master-2560223@0,0] PROCESSING TOOL CONNECTION
[st-master:2560223] [prte-st-master-2560223@0,0] TOOL CONNECTION REQUEST RECVD
[st-master:2560223] [prte-st-master-2560223@0,0] LAUNCHER CONNECTION FROM UID 6127 GID 6127 NSPACE prun.st-master.2560245
[st-master:2560223] [prte-st-master-2560223@0,0] PROCESSING TOOL CONNECTION
[st-master:2560223] [prte-st-master-2560223@0,0] LAUNCHER CONNECTION FROM UID 6127 GID 6127 NSPACE prun.st-master.2560246
[st-master:2560223] [prte-st-master-2560223@0,0] spawn upcalled on behalf of proc prun.st-master.2560245:0 with 5 job infos
[st-master:2560223] [prte-st-master-2560223@0,0] spawn called from proc [prun.st-master.2560245,0] with 1 apps
[st-master:2560223] RML-SEND(0:5): prted/pmix/pmix_server_dyn.c:spawn:178
[st-master:2560223] [prte-st-master-2560223@0,0] rml_send_buffer to peer 0 at tag 5
[st-master:2560223] [prte-st-master-2560223@0,0] rml_send_buffer_to_self at tag 5
[st-master:2560223] [prte-st-master-2560223@0,0] message received from [prte-st-master-2560223@0,0] for tag 5
[st-master:2560223] [prte-st-master-2560223@0,0] spawn upcalled on behalf of proc prun.st-master.2560246:0 with 5 job infos
[st-master:2560223] [prte-st-master-2560223@0,0] message received 465 bytes from [prte-st-master-2560223@0,0] for tag 5 called callback
[st-master:2560223] [prte-st-master-2560223@0,0] message tag 5 on released
[st-master:2560223] [prte-st-master-2560223@0,0] spawn called from proc [prun.st-master.2560246,0] with 1 apps
[st-master:2560223] RML-SEND(0:5): prted/pmix/pmix_server_dyn.c:spawn:178
[st-master:2560223] [prte-st-master-2560223@0,0] rml_send_buffer to peer 0 at tag 5
[st-master:2560223] [prte-st-master-2560223@0,0] rml_send_buffer_to_self at tag 5
[st-master:2560223] [prte-st-master-2560223@0,0] message received from [prte-st-master-2560223@0,0] for tag 5
[st-master:2560223] [prte-st-master-2560223@0,0] message received 408 bytes from [prte-st-master-2560223@0,0] for tag 5 called callback
[st-master:2560223] [prte-st-master-2560223@0,0] message tag 5 on released
[st-master:2560223] RML-SEND(0:15): grpcomm_direct.c:xcast:99
[st-master:2560223] [prte-st-master-2560223@0,0] rml_send_buffer to peer 0 at tag 15
[st-master:2560223] [prte-st-master-2560223@0,0] rml_send_buffer_to_self at tag 15
[st-master:2560223] [prte-st-master-2560223@0,0] message received from [prte-st-master-2560223@0,0] for tag 15
[st-master:2560223] RML-SEND(1:15): grpcomm_direct.c:xcast_recv:681
[st-master:2560223] [prte-st-master-2560223@0,0] rml_send_buffer to peer 1 at tag 15
[st-master:2560223] [prte-st-master-2560223@0,0] OOB_SEND: rml/rml_send.c:89
[st-master:2560223] RML-SEND(1:15): grpcomm_direct.c:xcast_recv:681
[st-master:2560223] [prte-st-master-2560223@0,0] rml_send_buffer to peer 1 at tag 15
[st-master:2560223] [prte-st-master-2560223@0,0] OOB_SEND: rml/rml_send.c:89
[st-master:2560223] RML-SEND(2:15): grpcomm_direct.c:xcast_recv:681
[st-master:2560223] [prte-st-master-2560223@0,0] rml_send_buffer to peer 2 at tag 15
[st-master:2560223] [prte-st-master-2560223@0,0] OOB_SEND: rml/rml_send.c:89
[st-master:2560223] [prte-st-master-2560223@0,0] Message posted at grpcomm_direct.c:702 for tag 1
[st-master:2560223] [prte-st-master-2560223@0,0] message received 775 bytes from [prte-st-master-2560223@0,0] for tag 15 called callback
[st-master:2560223] [prte-st-master-2560223@0,0] message tag 15 on released
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send to target [prte-st-master-2560223@0,1] - attempt 0
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send known transport for peer [prte-st-master-2560223@0,1]
[st-master:2560223] [prte-st-master-2560223@0,0] oob:tcp:send_nb to peer [prte-st-master-2560223@0,1]:15 seq = -1
[st-master:2560223] [prte-st-master-2560223@0,0]:[oob_tcp.c:190] processing send to peer [prte-st-master-2560223@0,1]:15 seq_num = -1 via [prte-st-master-2560223@0,1]
[st-master:2560223] [prte-st-master-2560223@0,0] tcp:send_nb: already connected to [prte-st-master-2560223@0,1] - queueing for send
[st-master:2560223] [prte-st-master-2560223@0,0]:[oob_tcp.c:197] queue send to [prte-st-master-2560223@0,1]
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send to target [prte-st-master-2560223@0,1] - attempt 0
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send known transport for peer [prte-st-master-2560223@0,1]
[st-master:2560223] [prte-st-master-2560223@0,0] oob:tcp:send_nb to peer [prte-st-master-2560223@0,1]:15 seq = -1
[st-master:2560223] [prte-st-master-2560223@0,0]:[oob_tcp.c:190] processing send to peer [prte-st-master-2560223@0,1]:15 seq_num = -1 via [prte-st-master-2560223@0,1]
[st-master:2560223] [prte-st-master-2560223@0,0] tcp:send_nb: already connected to [prte-st-master-2560223@0,1] - queueing for send
[st-master:2560223] [prte-st-master-2560223@0,0]:[oob_tcp.c:197] queue send to [prte-st-master-2560223@0,1]
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send to target [prte-st-master-2560223@0,2] - attempt 0
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send unknown peer [prte-st-master-2560223@0,2]
[st-master:2560223] [prte-st-master-2560223@0,0] oob:tcp:send_nb to peer [prte-st-master-2560223@0,2]:15 seq = -1
[st-master:2560223] [prte-st-master-2560223@0,0]:[oob_tcp.c:180] processing send to peer [prte-st-master-2560223@0,2]:15 seq_num = -1 hop [prte-st-master-2560223@0,2] unknown
[st-master:2560223] [prte-st-master-2560223@0,0]:[oob_tcp.c:181] post no route to [prte-st-master-2560223@0,2]
[st-master:2560223] [prte-st-master-2560223@0,0] message received from [prte-st-master-2560223@0,0] for tag 1
[st-master:2560223] [prte-st-master-2560223@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[st-master:2560223] [prte-st-master-2560223@0,0] prted_cmd: received add_local_procs
[st-master:2560223] [prte-st-master-2560223@0,0] register nspace for prte-st-master-2560223@2
[st-master:2560223] UUID: ipv6://00:00:10:87:fe:80:00:00:00:00:00:00:24:8a:07:03:00:aa:82:be OSNAME: ib0 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2560223] UUID: ipv4://f4:03:43:fe:80:f0 OSNAME: enp22s0f0np0 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2560223] UUID: ipv4://f4:03:43:fe:80:f1 OSNAME: enp22s0f1np1 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2560223] UUID: ipv6://00:00:18:87:fe:80:00:00:00:00:00:00:24:8a:07:03:00:aa:82:c0 OSNAME: ib1 TYPE: NETWORK MIND: 12 MAXD: 12
[st-master:2560223] UUID: fab://248a:0703:00aa:82be::248a:0703:00aa:82be OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2560223] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2560223] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2560223] UUID: fab://248a:0703:00aa:82c0::248a:0703:00aa:82be OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st-master:2560223] UUID: ipv6://00:00:10:87:fe:80:00:00:00:00:00:00:24:8a:07:03:00:aa:82:be OSNAME: ib0 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2560223] UUID: ipv4://f4:03:43:fe:80:f0 OSNAME: enp22s0f0np0 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2560223] UUID: ipv4://f4:03:43:fe:80:f1 OSNAME: enp22s0f1np1 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2560223] UUID: ipv6://00:00:18:87:fe:80:00:00:00:00:00:00:24:8a:07:03:00:aa:82:c0 OSNAME: ib1 TYPE: NETWORK MIND: 12 MAXD: 12
[st-master:2560223] UUID: fab://248a:0703:00aa:82be::248a:0703:00aa:82be OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2560223] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2560223] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2560223] UUID: fab://248a:0703:00aa:82c0::248a:0703:00aa:82be OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st-master:2560223] [prte-st-master-2560223@0,0] message received 724 bytes from [prte-st-master-2560223@0,0] for tag 1 called callback
[st-master:2560223] [prte-st-master-2560223@0,0] message tag 1 on released
[st-master:2560223] [prte-st-master-2560223@0,0] OOB_SEND: oob_tcp_component.c:914
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send to target [prte-st-master-2560223@0,2] - attempt 1
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send known transport for peer [prte-st-master-2560223@0,2]
[st-master:2560223] [prte-st-master-2560223@0,0] oob:tcp:send_nb to peer [prte-st-master-2560223@0,2]:15 seq = -1
[st-master:2560223] [prte-st-master-2560223@0,0]:[oob_tcp.c:180] processing send to peer [prte-st-master-2560223@0,2]:15 seq_num = -1 hop [prte-st-master-2560223@0,2] unknown
[st-master:2560223] [prte-st-master-2560223@0,0]:[oob_tcp.c:181] post no route to [prte-st-master-2560223@0,2]
[st-master:2560223] [prte-st-master-2560223@0,0] OOB_SEND: oob_tcp_component.c:914
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send to target [prte-st-master-2560223@0,2] - attempt 2
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send known transport for peer [prte-st-master-2560223@0,2]
[st-master:2560223] [prte-st-master-2560223@0,0] oob:tcp:send_nb to peer [prte-st-master-2560223@0,2]:15 seq = -1
[st-master:2560223] [prte-st-master-2560223@0,0]:[oob_tcp.c:180] processing send to peer [prte-st-master-2560223@0,2]:15 seq_num = -1 hop [prte-st-master-2560223@0,2] unknown
[st-master:2560223] [prte-st-master-2560223@0,0]:[oob_tcp.c:181] post no route to [prte-st-master-2560223@0,2]
[st-master:2560223] [prte-st-master-2560223@0,0] OOB_SEND: oob_tcp_component.c:914
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send to target [prte-st-master-2560223@0,2] - attempt 3
[st-master:2560223] [prte-st-master-2560223@0,0]-[prte-st-master-2560223@0,2] Send message complete at base/oob_base_stubs.c:61
[st-master:2560223] [prte-st-master-2560223@0,0] UNABLE TO SEND MESSAGE TO [prte-st-master-2560223@0,2] TAG 15: No OOB path to target
[st-master:2560223] UNSUPPORTED DAEMON ERROR STATE: NO PATH TO TARGET
[st-master:2560223] oob:tcp:send_handler SENDING MSG
[st-master:2560223] [prte-st-master-2560223@0,0] MESSAGE SEND COMPLETE TO [prte-st-master-2560223@0,1] OF 775 BYTES ON SOCKET 21
[st-master:2560223] [prte-st-master-2560223@0,0]-[prte-st-master-2560223@0,1] Send message complete at oob_tcp_sendrecv.c:253
[st-master:2560223] oob:tcp:send_handler SENDING MSG
[st-master:2560223] [prte-st-master-2560223@0,0] MESSAGE SEND COMPLETE TO [prte-st-master-2560223@0,1] OF 775 BYTES ON SOCKET 21
[st-master:2560223] [prte-st-master-2560223@0,0]-[prte-st-master-2560223@0,1] Send message complete at oob_tcp_sendrecv.c:253
[st-master:2560223] RML-CANCEL(15): base/grpcomm_base_frame.c:prte_grpcomm_base_close:82
[st26:870061] [prte-st-master-2560223@0,1] Message posted at oob_tcp_sendrecv.c:522 for tag 15
[st26:870061] [prte-st-master-2560223@0,1] message received from [prte-st-master-2560223@0,0] for tag 15
[st26:870061] [prte-st-master-2560223@0,1] Message posted at grpcomm_direct.c:702 for tag 1
[st26:870061] [prte-st-master-2560223@0,1] message received 775 bytes from [prte-st-master-2560223@0,0] for tag 15 called callback
[st26:870061] [prte-st-master-2560223@0,1] message tag 15 on released
[st26:870061] [prte-st-master-2560223@0,1] message received from [prte-st-master-2560223@0,1] for tag 1
[st26:870061] [prte-st-master-2560223@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[st26:870061] [prte-st-master-2560223@0,1] prted_cmd: received add_local_procs
[st-master:2560223] RML-CANCEL(5): base/plm_base_receive.c:prte_plm_base_comm_stop:102
[st-master:2560223] RML-CANCEL(10): base/plm_base_receive.c:prte_plm_base_comm_stop:104
[st-master:2560223] RML-CANCEL(12): base/plm_base_receive.c:prte_plm_base_comm_stop:105
[st-master:2560223] RML-CANCEL(62): base/plm_base_receive.c:prte_plm_base_comm_stop:106
[st26:870061] [prte-st-master-2560223@0,1] register nspace for prte-st-master-2560223@2
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st-master:2560223] [prte-st-master-2560223@0,0] TCP SHUTDOWN
[st-master:2560223] [prte-st-master-2560223@0,0] TCP SHUTDOWN done
[st-master:2560223] [prte-st-master-2560223@0,0] CLOSING SOCKET 21
[st-master:2560223] [prte-st-master-2560223@0,0] Finalizing PMIX server
[rbhattara@st-master add-host-debug]$ [st26:870061] [prte-st-master-2560223@0,1] message received 724 bytes from [prte-st-master-2560223@0,1] for tag 1 called callback
[st26:870061] [prte-st-master-2560223@0,1] message tag 1 on released
[st26:870061] [prte-st-master-2560223@0,1] Message posted at oob_tcp_sendrecv.c:522 for tag 15
[st26:870061] [prte-st-master-2560223@0,1] message received from [prte-st-master-2560223@0,0] for tag 15
[st26:870061] [prte-st-master-2560223@0,1] Message posted at grpcomm_direct.c:702 for tag 1
[st26:870061] [prte-st-master-2560223@0,1] message received 775 bytes from [prte-st-master-2560223@0,0] for tag 15 called callback
[st26:870061] [prte-st-master-2560223@0,1] message tag 15 on released
[st26:870061] [prte-st-master-2560223@0,1] message received from [prte-st-master-2560223@0,1] for tag 1
[st26:870061] [prte-st-master-2560223@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[st26:870061] [prte-st-master-2560223@0,1] prted_cmd: received add_local_procs
[st26:870061] [prte-st-master-2560223@0,1] register nspace for prte-st-master-2560223@2
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st26:870061] [prte-st-master-2560223@0,1] message received 724 bytes from [prte-st-master-2560223@0,1] for tag 1 called callback
[st26:870061] [prte-st-master-2560223@0,1] message tag 1 on released
[st26:870061] [prte-st-master-2560223@0,1]-[prte-st-master-2560223@0,0] prte_oob_tcp_msg_recv: peer closed connection
[st26:870061] [prte-st-master-2560223@0,1]:errmgr_prted.c(365) updating exit status to 1
[st26:870061] RML-CANCEL(15): base/grpcomm_base_frame.c:prte_grpcomm_base_close:82
[st26:870061] RML-CANCEL(3): iof_prted.c:finalize:295
[st26:870061] RML-CANCEL(5): base/plm_base_receive.c:prte_plm_base_comm_stop:102
[st26:870061] [prte-st-master-2560223@0,1] TCP SHUTDOWN
[st26:870061] no hnp or not active
[st26:870061] [prte-st-master-2560223@0,1] TCP SHUTDOWN done
[st26:870061] [prte-st-master-2560223@0,1] Finalizing PMIX server
Daemon was launched on st27 - beginning to initialize
[st27:801326] mca:oob:select: checking available component tcp
[st27:801326] mca:oob:select: Querying component [tcp]
[st27:801326] oob:tcp: component_available called
[st27:801326] [prte-st-master-2560223@0,2] TCP STARTUP
[st27:801326] [prte-st-master-2560223@0,2] attempting to bind to IPv4 port 0
[st27:801326] mca:oob:select: Adding component to end
[st27:801326] mca:oob:select: Found 1 active transports
[st27:801326] RML-RECV(27): runtime/prte_data_server.c:prte_data_server_init:150
[st27:801326] RML-RECV(50): prted/pmix/pmix_server.c:pmix_server_start:886
[st27:801326] RML-RECV(51): prted/pmix/pmix_server.c:pmix_server_start:890
[st27:801326] RML-RECV(6): prted/pmix/pmix_server.c:pmix_server_start:894
[st27:801326] RML-RECV(28): prted/pmix/pmix_server.c:pmix_server_start:898
[st27:801326] RML-RECV(59): prted/pmix/pmix_server.c:pmix_server_start:902
[st27:801326] RML-RECV(24): prted/pmix/pmix_server.c:pmix_server_start:906
[st27:801326] RML-RECV(15): grpcomm_direct.c:init:74
[st27:801326] RML-RECV(33): grpcomm_direct.c:init:76
[st27:801326] RML-RECV(31): grpcomm_direct.c:init:79
[st27:801326] RML-RECV(5): base/plm_base_receive.c:prte_plm_base_comm_start:79
[st27:801326] RML-RECV(3): iof_prted.c:init:98
[st27:801326] RML-RECV(21): filem_raw_module.c:raw_init:113
[st27:801326] RML-RECV(1): prted.c:main:449
[st27:801326] RML-RECV(10): prted.c:main:504
[st27:801326] RML-SEND(0:10): prted.c:main:715
[st27:801326] [prte-st-master-2560223@0,2] rml_send_buffer to peer 0 at tag 10
[st27:801326] [prte-st-master-2560223@0,2] OOB_SEND: rml/rml_send.c:89
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 27 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 50 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 51 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 6 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 28 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 59 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 24 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 15 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 33 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 31 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 5 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 3 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 21 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 1 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 10 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] oob:base:send to target [prte-st-master-2560223@0,0] - attempt 0
[st27:801326] [prte-st-master-2560223@0,2] oob:base:send unknown peer [prte-st-master-2560223@0,0]
[st27:801326] [prte-st-master-2560223@0,2]:set_addr processing uri prte-st-master-2560223@0.0;tcp://10.15.3.34,172.16.0.254,192.168.0.254:35289:24,23,23
[st27:801326] [prte-st-master-2560223@0,2]:set_addr checking if peer [prte-st-master-2560223@0,0] is reachable via component tcp
[st27:801326] [prte-st-master-2560223@0,2] oob:tcp: working peer [prte-st-master-2560223@0,0] address tcp://10.15.3.34,172.16.0.254,192.168.0.254:35289:24,23,23
[st27:801326] [prte-st-master-2560223@0,2]: peer [prte-st-master-2560223@0,0] is reachable via component tcp
[st27:801326] [prte-st-master-2560223@0,2] oob:tcp:send_nb to peer [prte-st-master-2560223@0,0]:10 seq = -1
[st27:801326] [prte-st-master-2560223@0,2]:[oob_tcp.c:190] processing send to peer [prte-st-master-2560223@0,0]:10 seq_num = -1 via [prte-st-master-2560223@0,0]
[st27:801326] [prte-st-master-2560223@0,2]:[oob_tcp.c:204] queue pending to [prte-st-master-2560223@0,0]
[st27:801326] [prte-st-master-2560223@0,2] tcp:send_nb: initiating connection to [prte-st-master-2560223@0,0]
[st27:801326] [prte-st-master-2560223@0,2]:[oob_tcp.c:216] connect to [prte-st-master-2560223@0,0]
[st27:801326] [prte-st-master-2560223@0,2] oob:tcp:peer creating socket to [prte-st-master-2560223@0,0]
[st27:801326] [prte-st-master-2560223@0,2]:[oob_tcp_connection.c:1068] connect to [prte-st-master-2560223@0,0]
[st27:801326] [prte-st-master-2560223@0,2] oob:tcp:peer creating socket to [prte-st-master-2560223@0,0]
[st27:801326] [prte-st-master-2560223@0,2]:[oob_tcp_connection.c:1068] connect to [prte-st-master-2560223@0,0]
[st27:801326] [prte-st-master-2560223@0,2] oob:tcp:peer creating socket to [prte-st-master-2560223@0,0]
[st27:801326] [prte-st-master-2560223@0,2]:[oob_tcp_connection.c:1068] connect to [prte-st-master-2560223@0,0]
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: st27
Remote host: 192.168.0.254
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
[st27:801326] [prte-st-master-2560223@0,2]:errmgr_prted.c(365) updating exit status to 1
[st27:801326] RML-CANCEL(15): base/grpcomm_base_frame.c:prte_grpcomm_base_close:82
Just for clarification - is this that Slurm environment again? If so, that could well be the problem.
I am running under a Slurm allocation, but with prrte and pmix built with the --with-slurm=no option.
Sounds suspicious - try running prte --display alloc --prtemca ras_base_verbose 10 --hostfile hostfile and let's see what it thinks it got.
This is the result of prte --display alloc --prtemca ras_base_verbose 10 --hostfile hostfile after it receives the prun commands.
[st-master:2612696] [prte-st-master-2612696@0,0] ras:base:allocate
[st-master:2612696] [prte-st-master-2612696@0,0] ras:base:allocate allocation already read
[st-master:2612696] [prte-st-master-2612696@0,0] ras:base:add_hosts checking add-hostfile /home/rbhattara/add-host-debug/add_hostfile
[st-master:2612696] [prte-st-master-2612696@0,0] ras:base:node_insert inserting 1 nodes
[st-master:2612696] [prte-st-master-2612696@0,0] ras:base:node_insert node st27 slots 64
[st-master:2612696] [prte-st-master-2612696@0,0] hostfile: checking hostfile /home/rbhattara/add-host-debug/hostfile for nodes
[st-master:2612696] [prte-st-master-2612696@0,0] hostfile: node st26 is being included - keep all is FALSE
[st-master:2612696] [prte-st-master-2612696@0,0] hostfile: adding node st26 slots 64
[st-master:2612696] [prte-st-master-2612696@0,0] ras:base:allocate
[st-master:2612696] [prte-st-master-2612696@0,0] ras:base:allocate allocation already read
[st-master:2612696] [prte-st-master-2612696@0,0] hostfile: filtering nodes through hostfile /home/rbhattara/add-host-debug/add_hostfile
[st-master:2612696] [prte-st-master-2612696@0,0] hostfile: node st27 is being included - keep all is FALSE
[st-master:2612696] UNSUPPORTED DAEMON ERROR STATE: NO PATH TO TARGET
[st-master:2612696] UNSUPPORTED DAEMON ERROR STATE: NO PATH TO TARGET
Sigh - I want to see the output when it immediately starts up, please.
Question: is prte itself running on the node in your initial hostfile? Or is it running on a login node that isn't included in the hostfile?
Sigh - I want to see the output when it immediately starts up, please.
I am sorry. Here is the full output.
[st-master add-host-debug]$ prte --display alloc --prtemca ras_base_verbose 10 --hostfile hostfile --report-uri dvm.uri
[st-master:2817884] mca: base: component_find: searching NULL for ras components
[st-master:2817884] mca: base: find_dyn_components: checking NULL for ras components
[st-master:2817884] pmix:mca: base: components_register: registering framework ras components
[st-master:2817884] pmix:mca: base: components_register: found loaded component simulator
[st-master:2817884] pmix:mca: base: components_register: component simulator register function successful
[st-master:2817884] pmix:mca: base: components_register: found loaded component pbs
[st-master:2817884] pmix:mca: base: components_register: component pbs register function successful
[st-master:2817884] mca: base: components_open: opening ras components
[st-master:2817884] mca: base: components_open: found loaded component simulator
[st-master:2817884] mca: base: components_open: found loaded component pbs
[st-master:2817884] mca: base: components_open: component pbs open function successful
[st-master:2817884] mca:base:select: Auto-selecting ras components
[st-master:2817884] mca:base:select:( ras) Querying component [simulator]
[st-master:2817884] mca:base:select:( ras) Querying component [pbs]
[st-master:2817884] mca:base:select:( ras) No component selected!
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:allocate
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:allocate nothing found in module - proceeding to hostfile
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:allocate adding hostfile hostfile
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: checking hostfile hostfile for nodes
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: node st26 is being included - keep all is FALSE
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: adding node st26 slots 64
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:node_insert inserting 1 nodes
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:node_insert node st26 slots 64
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: checking hostfile hostfile for nodes
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: node st26 is being included - keep all is FALSE
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: adding node st26 slots 64
DVM ready
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:allocate
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:allocate allocation already read
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: checking hostfile /home/rbhattara/add-host-debug/hostfile for nodes
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: node st26 is being included - keep all is FALSE
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: adding node st26 slots 64
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:add_hosts checking add-hostfile /home/rbhattara/add-host-debug/add_hostfile
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:node_insert inserting 1 nodes
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:node_insert node st27 slots 64
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:allocate
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:allocate allocation already read
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: filtering nodes through hostfile /home/rbhattara/add-host-debug/add_hostfile
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: node st27 is being included - keep all is FALSE
[st-master:2817884] UNSUPPORTED DAEMON ERROR STATE: NO PATH TO TARGET
[st-master:2817884] UNSUPPORTED DAEMON ERROR STATE: NO PATH TO TARGET
Question: is prte itself running on the node in your initial hostfile? Or is it running on a login node that isn't included in the hostfile?
prte is running on the login node, st-master. The hostfile contains st26.
I found another strange issue while running prun with the add-hostfile option.
Sometimes prun seems to run one more copy of the program than specified with the -np option.
[st-master add-host-debug]$ prun --dvm-uri file:dvm.uri --add-hostfile add_hostfile -np 2 hostname
st26
st26
[st-master add-host-debug]$ prun --dvm-uri file:dvm.uri --add-hostfile add_hostfile -np 2 hostname
st26
st26
st26
Okay, that confirms the setup - no Slurm interactions. I'm afraid this will take a while to track down. It's some kind of race condition, though the precise nature of it remains hard to see. Unfortunately, I'm pretty occupied right now, which will further delay things.
I'd suggest you run those prun cmds sequentially for now, as that seems to be working.
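A minimal sketch of that sequential workaround, reusing the prun_commands list from the reproduction steps at the end of this issue:

import subprocess

prun_commands = [
    "prun --display allocation --dvm-uri file:dvm.uri --map-by ppr:2:node -n 2 hostname > out0",
    "prun --display allocation --dvm-uri file:dvm.uri --add-hostfile add_hostfile --map-by ppr:2:node -n 2 hostname > out1",
]

# Run the commands one after another instead of via Pool(2): the prun that
# adds st27 only starts once the previous launch has completed.
for cmd in prun_commands:
    subprocess.run(cmd, shell=True, check=False)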
Sometimes prun seems to run one more copy of the program than specified with the -np option.
No ideas - I can try to reproduce, but don't know if/when I'll be able to do so.
Background information
Working on a project, one part of which runs multiple prun commands in parallel from multiple processes to launch multiple tasks, some of these commands with the --add-hostfile option to extend an existing DVM.
What version of the PMIx Reference RTE (PRRTE) are you using? (e.g., v2.0, v3.0, git master @ hash, etc.)
55536ef714d30f681456c405c1e0857449de83c0
What version of PMIx are you using? (e.g., v4.2.0, git branch name and hash, etc.)
openpmix/openpmix@bde8038fd9963057d86fd00864dfebf819b232a0
Please describe the system on which you are running
Network type:
Details of the problem
Steps to reproduce
1. Create hostfile with one node and add_hostfile with another node.
2. Use hostfile to start the DVM: prte --report-uri dvm.uri --hostfile hostfile
3. Run two prun commands in parallel, one with the add-hostfile option and one without, as follows.

import subprocess
from multiprocessing import Pool

def run(x):
    print(x)
    # Fork/exec the prun command via a shell and wait for it to complete.
    process = subprocess.Popen(x, stdout=subprocess.PIPE, shell=True)
    output, error = process.communicate()

prun_commands = ["prun --display allocation --dvm-uri file:dvm.uri --map-by ppr:2:node -n 2 hostname > out0",
                 "prun --display allocation --dvm-uri file:dvm.uri --add-hostfile add_hostfile --map-by ppr:2:node -n 2 hostname > out1"]

# Launch both prun commands concurrently through a two-worker pool.
with Pool(2) as p:
    p.map(run, prun_commands)
Both prun commands fail with:
[st-master:2320852] PMIx_Spawn failed (-25): UNREACHABLE
[st-master:2320851] PMIx_Spawn failed (-25): UNREACHABLE
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: st27
Remote host: 192.168.0.254
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------