open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

mpirun V5.0.3 hangs running hello_world in two nodes #12645

Open sdonoso opened 1 week ago

sdonoso commented 1 week ago

When I try to run the following command, the process hangs and does not finish:

rene@puente:~/nccl-tests$ mpirun -hostfile hostfile -np 2 --mca pml ucx  --map-by ppr:1:node ./hello_world
Hello world from rank 1 out of 2 processors

Environment: I have two nodes with the following specs:

OS:Ubuntu 22.04

mpirun (Open MPI) 5.0.3

MLNX_OFED_LINUX-24.04-0.6.6.0:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

I can ssh between the nodes, and I can ping between them.

Output requested under "For problems launching MPI or OpenSHMEM applications":

rene@puente:~/nccl-tests$ mpirun --map-by ppr:1:node --prtemca plm_base_verbose 100 --prtemca rmaps_base_verbose 100 --display alloc hostname
[puente:18852] mca: base: component_find: searching NULL for plm components
[puente:18852] mca: base: find_dyn_components: checking NULL for plm components
[puente:18852] pmix:mca: base: components_register: registering framework plm components
[puente:18852] pmix:mca: base: components_register: found loaded component slurm
[puente:18852] pmix:mca: base: components_register: component slurm register function successful
[puente:18852] pmix:mca: base: components_register: found loaded component ssh
[puente:18852] pmix:mca: base: components_register: component ssh register function successful
[puente:18852] mca: base: components_open: opening plm components
[puente:18852] mca: base: components_open: found loaded component slurm
[puente:18852] mca: base: components_open: component slurm open function successful
[puente:18852] mca: base: components_open: found loaded component ssh
[puente:18852] mca: base: components_open: component ssh open function successful
[puente:18852] mca:base:select: Auto-selecting plm components
[puente:18852] mca:base:select:(  plm) Querying component [slurm]
[puente:18852] mca:base:select:(  plm) Querying component [ssh]
[puente:18852] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[puente:18852] mca:base:select:(  plm) Query of component [ssh] set priority to 10
[puente:18852] mca:base:select:(  plm) Selected component [ssh]
[puente:18852] mca: base: close: component slurm closed
[puente:18852] mca: base: close: unloading component slurm
[puente:18852] [prterun-puente-18852@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[puente:18852] [prterun-puente-18852@0,0] plm:base:receive start comm
[puente:18852] mca: base: component_find: searching NULL for rmaps components
[puente:18852] mca: base: find_dyn_components: checking NULL for rmaps components
[puente:18852] pmix:mca: base: components_register: registering framework rmaps components
[puente:18852] pmix:mca: base: components_register: found loaded component ppr
[puente:18852] pmix:mca: base: components_register: component ppr register function successful
[puente:18852] pmix:mca: base: components_register: found loaded component rank_file
[puente:18852] pmix:mca: base: components_register: component rank_file has no register or open function
[puente:18852] pmix:mca: base: components_register: found loaded component round_robin
[puente:18852] pmix:mca: base: components_register: component round_robin register function successful
[puente:18852] pmix:mca: base: components_register: found loaded component seq
[puente:18852] pmix:mca: base: components_register: component seq register function successful
[puente:18852] mca: base: components_open: opening rmaps components
[puente:18852] mca: base: components_open: found loaded component ppr
[puente:18852] mca: base: components_open: component ppr open function successful
[puente:18852] mca: base: components_open: found loaded component rank_file
[puente:18852] mca: base: components_open: found loaded component round_robin
[puente:18852] mca: base: components_open: component round_robin open function successful
[puente:18852] mca: base: components_open: found loaded component seq
[puente:18852] mca: base: components_open: component seq open function successful
[puente:18852] mca:rmaps:select: checking available component ppr
[puente:18852] mca:rmaps:select: Querying component [ppr]
[puente:18852] mca:rmaps:select: checking available component rank_file
[puente:18852] mca:rmaps:select: Querying component [rank_file]
[puente:18852] mca:rmaps:select: checking available component round_robin
[puente:18852] mca:rmaps:select: Querying component [round_robin]
[puente:18852] mca:rmaps:select: checking available component seq
[puente:18852] mca:rmaps:select: Querying component [seq]
[puente:18852] [prterun-puente-18852@0,0]: Final mapper priorities
[puente:18852]  Mapper: rank_file Priority: 100
[puente:18852]  Mapper: ppr Priority: 90
[puente:18852]  Mapper: seq Priority: 60
[puente:18852]  Mapper: round_robin Priority: 10
[puente:18852] [prterun-puente-18852@0,0] plm:base:setup_vm
[puente:18852] [prterun-puente-18852@0,0] plm:base:setup_vm creating map
[puente:18852] [prterun-puente-18852@0,0] setup:vm: working unmanaged allocation
[puente:18852] [prterun-puente-18852@0,0] using default hostfile /usr/local/openmpi/etc/prte-default-hostfile

[puente:18852] [prterun-puente-18852@0,0] plm:base:setup_vm only HNP in allocation
======================   ALLOCATED NODES   ======================
    puente: slots=1 max_slots=0 slots_inuse=0 state=UP
[puente:18852] [prterun-puente-18852@0,0] plm:base:setting slots for node puente by core
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED
    aliases: puente
=================================================================

======================   ALLOCATED NODES   ======================
    puente: slots=128 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: puente
=================================================================
[puente:18852] [prterun-puente-18852@0,0] rmaps:base set policy with ppr:1:node
[puente:18852] [prterun-puente-18852@0,0] rmaps:base policy ppr modifiers 1:node provided
[puente:18852] [prterun-puente-18852@0,0] plm:base:receive processing msg
[puente:18852] [prterun-puente-18852@0,0] plm:base:receive job launch command from [prterun-puente-18852@0,0]
[puente:18852] [prterun-puente-18852@0,0] plm:base:receive adding hosts
[puente:18852] [prterun-puente-18852@0,0] plm:base:receive calling spawn
[puente:18852] [prterun-puente-18852@0,0] plm:base:receive done processing commands
[puente:18852] [prterun-puente-18852@0,0] plm:base:setup_job
[puente:18852] [prterun-puente-18852@0,0] plm:base:setup_vm
[puente:18852] [prterun-puente-18852@0,0] plm:base:setup_vm no new daemons required
[puente:18852] mca:rmaps: mapping job prterun-puente-18852@1
[puente:18852] mca:rmaps: setting mapping policies for job prterun-puente-18852@1 inherit TRUE hwtcpus FALSE
[puente:18852] [prterun-puente-18852@0,0] using known nodes
[puente:18852] [prterun-puente-18852@0,0] Starting with 1 nodes in list
[puente:18852] [prterun-puente-18852@0,0] Filtering thru apps
[puente:18852] [prterun-puente-18852@0,0] Retained 1 nodes in list
[puente:18852] [prterun-puente-18852@0,0] node puente has 128 slots available
[puente:18852] AVAILABLE NODES FOR MAPPING:

[puente:18852]     node: puente daemon: 0 slots_available: 128
======================   ALLOCATED NODES   ======================
    puente: slots=128 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
[puente:18852] setdefaultbinding[366] binding not given - using bycore
    aliases: puente
=================================================================
[puente:18852] mca:rmaps:rf: job prterun-puente-18852@1 not using rankfile policy
[puente:18852] mca:rmaps:ppr: mapping job prterun-puente-18852@1 with ppr 1:node
[puente:18852] mca:rmaps:ppr: job prterun-puente-18852@1 assigned policy BYNODE:SLOT
[puente:18852] [prterun-puente-18852@0,0] using known nodes
[puente:18852] [prterun-puente-18852@0,0] Starting with 1 nodes in list
[puente:18852] [prterun-puente-18852@0,0] Filtering thru apps
[puente:18852] [prterun-puente-18852@0,0] Retained 1 nodes in list
[puente:18852] [prterun-puente-18852@0,0] node puente has 128 slots available
[puente:18852] AVAILABLE NODES FOR MAPPING:
[puente:18852]     node: puente daemon: 0 slots_available: 128
[puente:18852] [prterun-puente-18852@0,0] get_avail_ncpus: node puente has 0 procs on it
[puente:18852] mca:rmaps: compute bindings for job prterun-puente-18852@1 with policy CORE:IF-SUPPORTED[1007]
[puente:18852] mca:rmaps: bind [prterun-puente-18852@1,INVALID] with policy CORE:IF-SUPPORTED
[puente:18852] [prterun-puente-18852@0,0] BOUND PROC [prterun-puente-18852@1,INVALID][puente] TO package[0][core:0]
[puente:18852] [prterun-puente-18852@0,0] complete_setup on job prterun-puente-18852@1
[puente:18852] [prterun-puente-18852@0,0] plm:base:launch_apps for job prterun-puente-18852@1
[puente:18852] [prterun-puente-18852@0,0] plm:base:send launch msg for job prterun-puente-18852@1
[puente:18852] [prterun-puente-18852@0,0] plm:base:launch wiring up iof for job prterun-puente-18852@1
puente
[puente:18852] [prterun-puente-18852@0,0] plm:base:prted_cmd sending prted_exit commands
[puente:18852] [prterun-puente-18852@0,0] plm:base:receive stop comm
[puente:18852] mca: base: close: component ssh closed
[puente:18852] mca: base: close: unloading component ssh
wenduwan commented 1 week ago

@sdonoso We have ingested multiple runtime fixes since 5.0.1. Can you reproduce on 5.0.3?

For example, we fixed this a while ago https://github.com/open-mpi/ompi/issues/12064

sdonoso commented 1 week ago

I am using version 5.0.3

rene@puente:~/nccl-tests$ mpirun --version
mpirun (Open MPI) 5.0.3
wenduwan commented 1 week ago

Thanks. I updated the issue title.

janjust commented 6 days ago

@sdonoso Any chance your LD_LIBRARY_PATH isn't propagated to the other node? What happens if you add your MPI libs to the path and forward it via -x LD_LIBRARY_PATH?

sdonoso commented 6 days ago

What do you mean by the LD_LIBRARY_PATH not being propagated?

rene@puente:~/nccl-tests$ mpirun -x  LD_LIBRARY_PATH='usr/local/openmpi/lib':$LD_LIBRARY_PATH -hostfile hostfile -np 2 --mca pml ucx  --map-by ppr:1:node ./hello_world
Hello world from rank 1 out of 2 processors

I get the same result.

rene@puente:~/nccl-tests$ mpirun -x  LD_LIBRARY_PATH='usr/local/openmpi/lib':$LD_LIBRARY_PATH --map-by ppr:1:node --prtemca plm_base_verbose 100 --prtemca rmaps_base_verbose 100 --display alloc hostname
[puente:218375] mca: base: component_find: searching NULL for plm components
[puente:218375] mca: base: find_dyn_components: checking NULL for plm components
[puente:218375] pmix:mca: base: components_register: registering framework plm components
[puente:218375] pmix:mca: base: components_register: found loaded component slurm
[puente:218375] pmix:mca: base: components_register: component slurm register function successful
[puente:218375] pmix:mca: base: components_register: found loaded component ssh
[puente:218375] pmix:mca: base: components_register: component ssh register function successful
[puente:218375] mca: base: components_open: opening plm components
[puente:218375] mca: base: components_open: found loaded component slurm
[puente:218375] mca: base: components_open: component slurm open function successful
[puente:218375] mca: base: components_open: found loaded component ssh
[puente:218375] mca: base: components_open: component ssh open function successful
[puente:218375] mca:base:select: Auto-selecting plm components
[puente:218375] mca:base:select:(  plm) Querying component [slurm]
[puente:218375] mca:base:select:(  plm) Querying component [ssh]
[puente:218375] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[puente:218375] mca:base:select:(  plm) Query of component [ssh] set priority to 10
[puente:218375] mca:base:select:(  plm) Selected component [ssh]
[puente:218375] mca: base: close: component slurm closed
[puente:218375] mca: base: close: unloading component slurm
[puente:218375] [prterun-puente-218375@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive start comm
[puente:218375] mca: base: component_find: searching NULL for rmaps components
[puente:218375] mca: base: find_dyn_components: checking NULL for rmaps components
[puente:218375] pmix:mca: base: components_register: registering framework rmaps components
[puente:218375] pmix:mca: base: components_register: found loaded component ppr
[puente:218375] pmix:mca: base: components_register: component ppr register function successful
[puente:218375] pmix:mca: base: components_register: found loaded component rank_file
[puente:218375] pmix:mca: base: components_register: component rank_file has no register or open function
[puente:218375] pmix:mca: base: components_register: found loaded component round_robin
[puente:218375] pmix:mca: base: components_register: component round_robin register function successful
[puente:218375] pmix:mca: base: components_register: found loaded component seq
[puente:218375] pmix:mca: base: components_register: component seq register function successful
[puente:218375] mca: base: components_open: opening rmaps components
[puente:218375] mca: base: components_open: found loaded component ppr
[puente:218375] mca: base: components_open: component ppr open function successful
[puente:218375] mca: base: components_open: found loaded component rank_file
[puente:218375] mca: base: components_open: found loaded component round_robin
[puente:218375] mca: base: components_open: component round_robin open function successful
[puente:218375] mca: base: components_open: found loaded component seq
[puente:218375] mca: base: components_open: component seq open function successful
[puente:218375] mca:rmaps:select: checking available component ppr
[puente:218375] mca:rmaps:select: Querying component [ppr]
[puente:218375] mca:rmaps:select: checking available component rank_file
[puente:218375] mca:rmaps:select: Querying component [rank_file]
[puente:218375] mca:rmaps:select: checking available component round_robin
[puente:218375] mca:rmaps:select: Querying component [round_robin]
[puente:218375] mca:rmaps:select: checking available component seq
[puente:218375] mca:rmaps:select: Querying component [seq]
[puente:218375] [prterun-puente-218375@0,0]: Final mapper priorities
[puente:218375]     Mapper: rank_file Priority: 100
[puente:218375]     Mapper: ppr Priority: 90
[puente:218375]     Mapper: seq Priority: 60
[puente:218375]     Mapper: round_robin Priority: 10
[puente:218375] [prterun-puente-218375@0,0] plm:base:setup_vm
[puente:218375] [prterun-puente-218375@0,0] plm:base:setup_vm creating map
[puente:218375] [prterun-puente-218375@0,0] setup:vm: working unmanaged allocation
[puente:218375] [prterun-puente-218375@0,0] using default hostfile /usr/local/openmpi/etc/prte-default-hostfile
[puente:218375] [prterun-puente-218375@0,0] plm:base:setup_vm only HNP in allocation
[puente:218375] [prterun-puente-218375@0,0] plm:base:setting slots for node puente by core

======================   ALLOCATED NODES   ======================
    puente: slots=1 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED
    aliases: puente
=================================================================

======================   ALLOCATED NODES   ======================
    puente: slots=128 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: puente
=================================================================
[puente:218375] [prterun-puente-218375@0,0] rmaps:base set policy with ppr:1:node
[puente:218375] [prterun-puente-218375@0,0] rmaps:base policy ppr modifiers 1:node provided
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive processing msg
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive job launch command from [prterun-puente-218375@0,0]
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive adding hosts
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive calling spawn
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive done processing commands
[puente:218375] [prterun-puente-218375@0,0] plm:base:setup_job
[puente:218375] [prterun-puente-218375@0,0] plm:base:setup_vm
[puente:218375] [prterun-puente-218375@0,0] plm:base:setup_vm no new daemons required
[puente:218375] mca:rmaps: mapping job prterun-puente-218375@1
[puente:218375] mca:rmaps: setting mapping policies for job prterun-puente-218375@1 inherit TRUE hwtcpus FALSE
[puente:218375] [prterun-puente-218375@0,0] using known nodes
[puente:218375] [prterun-puente-218375@0,0] Starting with 1 nodes in list
[puente:218375] [prterun-puente-218375@0,0] Filtering thru apps
[puente:218375] [prterun-puente-218375@0,0] Retained 1 nodes in list
[puente:218375] [prterun-puente-218375@0,0] node puente has 128 slots available
[puente:218375] AVAILABLE NODES FOR MAPPING:
[puente:218375]     node: puente daemon: 0 slots_available: 128

[puente:218375] setdefaultbinding[366] binding not given - using bycore
======================   ALLOCATED NODES   ======================
[puente:218375] mca:rmaps:rf: job prterun-puente-218375@1 not using rankfile policy
    puente: slots=128 max_slots=0 slots_inuse=0 state=UP
[puente:218375] mca:rmaps:ppr: mapping job prterun-puente-218375@1 with ppr 1:node
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
[puente:218375] mca:rmaps:ppr: job prterun-puente-218375@1 assigned policy BYNODE:SLOT
    aliases: puente
[puente:218375] [prterun-puente-218375@0,0] using known nodes
=================================================================
[puente:218375] [prterun-puente-218375@0,0] Starting with 1 nodes in list
[puente:218375] [prterun-puente-218375@0,0] Filtering thru apps
[puente:218375] [prterun-puente-218375@0,0] Retained 1 nodes in list
[puente:218375] [prterun-puente-218375@0,0] node puente has 128 slots available
[puente:218375] AVAILABLE NODES FOR MAPPING:
[puente:218375]     node: puente daemon: 0 slots_available: 128
[puente:218375] [prterun-puente-218375@0,0] get_avail_ncpus: node puente has 0 procs on it
[puente:218375] mca:rmaps: compute bindings for job prterun-puente-218375@1 with policy CORE:IF-SUPPORTED[1007]
[puente:218375] mca:rmaps: bind [prterun-puente-218375@1,INVALID] with policy CORE:IF-SUPPORTED
[puente:218375] [prterun-puente-218375@0,0] BOUND PROC [prterun-puente-218375@1,INVALID][puente] TO package[0][core:0]
[puente:218375] [prterun-puente-218375@0,0] complete_setup on job prterun-puente-218375@1
[puente:218375] [prterun-puente-218375@0,0] plm:base:launch_apps for job prterun-puente-218375@1
[puente:218375] [prterun-puente-218375@0,0] plm:base:send launch msg for job prterun-puente-218375@1
[puente:218375] [prterun-puente-218375@0,0] plm:base:launch wiring up iof for job prterun-puente-218375@1
puente
[puente:218375] [prterun-puente-218375@0,0] plm:base:prted_cmd sending prted_exit commands
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive stop comm
[puente:218375] mca: base: close: component ssh closed
[puente:218375] mca: base: close: unloading component ssh
janjust commented 6 days ago

Sometimes (I don't remember the exact circumstances), if LD_LIBRARY_PATH is not forwarded to the other nodes, hangs and other weird behavior are possible, so that is usually the first thing I try to rule out. Looks like your issue is something else.
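
When ruling out LD_LIBRARY_PATH problems like the one described above, one quick local check is whether every entry is an absolute path. The value passed with -x in the commands above begins with usr/local/openmpi/lib, with no leading slash. As far as I know, the Linux dynamic loader resolves a relative or empty entry against the current working directory, so such an entry may not point where intended. The helper below is a hypothetical sketch, not part of Open MPI:

```python
def non_absolute_entries(path_value: str) -> list[str]:
    """Return the entries of a colon-separated search path that are not absolute.

    Empty entries are reported too, since an empty entry is treated as the
    current working directory by the loader (assumption stated in the text).
    """
    return [entry for entry in path_value.split(":") if not entry.startswith("/")]


# Value modeled on the -x argument used in this thread.
suspect = non_absolute_entries("usr/local/openmpi/lib:/usr/local/cuda/lib64")
print(suspect)
```

If the list is non-empty, it is worth re-running with the corrected absolute path before drawing conclusions from the result.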

rhc54 commented 6 days ago

The problem is here: -np 2 --mca pml ucx --map-by ppr:1:node

You only have one node in your system, and you tell us to launch 1 process/node - but ask us to launch TWO procs. Logically impossible. We should have immediately error'd out, so that's the bug - but this cmd cannot succeed.
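
The arithmetic behind "logically impossible" can be sketched as follows. This is a deliberately simplified model that only looks at node count and processes per node; the real mapper also considers slots, oversubscription and binding, and the function names here are hypothetical:

```python
def ppr_capacity(num_nodes: int, procs_per_node: int) -> int:
    """Upper bound on processes that a ppr:N:node policy can place."""
    return num_nodes * procs_per_node


def can_satisfy(np_requested: int, num_nodes: int, procs_per_node: int) -> bool:
    """True if -np fits within what ppr:N:node allows on the given nodes."""
    return np_requested <= ppr_capacity(num_nodes, procs_per_node)


# One node in the allocation, ppr:1:node, -np 2: cannot be satisfied.
print(can_satisfy(np_requested=2, num_nodes=1, procs_per_node=1))
# Two nodes in the allocation, ppr:1:node, -np 2: fits exactly.
print(can_satisfy(np_requested=2, num_nodes=2, procs_per_node=1))
```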

janjust commented 6 days ago

He has a hostfile in his previous command; I assumed two nodes are listed in it.

rhc54 commented 6 days ago

Yeah, it's nearly impossible to triage this one. The cmds keep varying, some are inconsistent with the reported output, etc. Probably need to ask that the user be more careful in what they are reporting.

sdonoso commented 6 days ago

I have two nodes connected by InfiniBand, and I can also ssh between the nodes without a password prompt.

rhc54 commented 6 days ago

Your reported debug output shows only ONE node in your allocation:

======================   ALLOCATED NODES   ======================
    puente: slots=1 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED
    aliases: puente
=================================================================

Hence the confusion. I think you are perhaps not being careful in showing the results from what is probably a bunch of runs, and the output doesn't always match the posted cmd.
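
To check which nodes a given run actually had in its allocation, the "ALLOCATED NODES" block printed by --display alloc can be scanned mechanically. The following is a minimal sketch that assumes the line layout seen in the output above; the regular expression and function name are mine, not an Open MPI or PRRTE API:

```python
import re

# Matches lines such as:
#     puente: slots=1 max_slots=0 slots_inuse=0 state=UP
NODE_LINE = re.compile(
    r"^\s*(?P<name>\S+): slots=(?P<slots>\d+) max_slots=(?P<max_slots>\d+) "
    r"slots_inuse=(?P<inuse>\d+) state=(?P<state>\w+)\s*$"
)


def parse_allocated_nodes(text: str) -> list[dict]:
    """Collect name, slots and state for each node line in the output."""
    nodes = []
    for line in text.splitlines():
        match = NODE_LINE.match(line)
        if match:
            nodes.append(
                {
                    "name": match["name"],
                    "slots": int(match["slots"]),
                    "state": match["state"],
                }
            )
    return nodes


SAMPLE = """\
======================   ALLOCATED NODES   ======================
    puente: slots=1 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED
    aliases: puente
=================================================================
"""

nodes = parse_allocated_nodes(SAMPLE)
print(len(nodes), [n["name"] for n in nodes])
```

A single entry here means the run saw one node, whatever the hostfile on disk contains.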

sdonoso commented 6 days ago

Sorry, I forgot to pass the hostfile.

mpirun -x  LD_LIBRARY_PATH='usr/local/openmpi/lib':$LD_LIBRARY_PATH -hostfile nccl-tests/hostfile  --map-by ppr:1:node --prtemca plm_base_verbose 100 --prtemca rmaps_base_verbose 100 --display alloc hostname
[puente:220398] mca: base: component_find: searching NULL for plm components
[puente:220398] mca: base: find_dyn_components: checking NULL for plm components
[puente:220398] pmix:mca: base: components_register: registering framework plm components
[puente:220398] pmix:mca: base: components_register: found loaded component slurm
[puente:220398] pmix:mca: base: components_register: component slurm register function successful
[puente:220398] pmix:mca: base: components_register: found loaded component ssh
[puente:220398] pmix:mca: base: components_register: component ssh register function successful
[puente:220398] mca: base: components_open: opening plm components
[puente:220398] mca: base: components_open: found loaded component slurm
[puente:220398] mca: base: components_open: component slurm open function successful
[puente:220398] mca: base: components_open: found loaded component ssh
[puente:220398] mca: base: components_open: component ssh open function successful
[puente:220398] mca:base:select: Auto-selecting plm components
[puente:220398] mca:base:select:(  plm) Querying component [slurm]
[puente:220398] mca:base:select:(  plm) Querying component [ssh]
[puente:220398] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[puente:220398] mca:base:select:(  plm) Query of component [ssh] set priority to 10
[puente:220398] mca:base:select:(  plm) Selected component [ssh]
[puente:220398] mca: base: close: component slurm closed
[puente:220398] mca: base: close: unloading component slurm
[puente:220398] [prterun-puente-220398@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive start comm
[puente:220398] mca: base: component_find: searching NULL for rmaps components
[puente:220398] mca: base: find_dyn_components: checking NULL for rmaps components
[puente:220398] pmix:mca: base: components_register: registering framework rmaps components
[puente:220398] pmix:mca: base: components_register: found loaded component ppr
[puente:220398] pmix:mca: base: components_register: component ppr register function successful
[puente:220398] pmix:mca: base: components_register: found loaded component rank_file
[puente:220398] pmix:mca: base: components_register: component rank_file has no register or open function
[puente:220398] pmix:mca: base: components_register: found loaded component round_robin
[puente:220398] pmix:mca: base: components_register: component round_robin register function successful
[puente:220398] pmix:mca: base: components_register: found loaded component seq
[puente:220398] pmix:mca: base: components_register: component seq register function successful
[puente:220398] mca: base: components_open: opening rmaps components
[puente:220398] mca: base: components_open: found loaded component ppr
[puente:220398] mca: base: components_open: component ppr open function successful
[puente:220398] mca: base: components_open: found loaded component rank_file
[puente:220398] mca: base: components_open: found loaded component round_robin
[puente:220398] mca: base: components_open: component round_robin open function successful
[puente:220398] mca: base: components_open: found loaded component seq
[puente:220398] mca: base: components_open: component seq open function successful
[puente:220398] mca:rmaps:select: checking available component ppr
[puente:220398] mca:rmaps:select: Querying component [ppr]
[puente:220398] mca:rmaps:select: checking available component rank_file
[puente:220398] mca:rmaps:select: Querying component [rank_file]
[puente:220398] mca:rmaps:select: checking available component round_robin
[puente:220398] mca:rmaps:select: Querying component [round_robin]
[puente:220398] mca:rmaps:select: checking available component seq
[puente:220398] mca:rmaps:select: Querying component [seq]
[puente:220398] [prterun-puente-220398@0,0]: Final mapper priorities
[puente:220398]     Mapper: rank_file Priority: 100
[puente:220398]     Mapper: ppr Priority: 90
[puente:220398]     Mapper: seq Priority: 60
[puente:220398]     Mapper: round_robin Priority: 10
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_vm
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_vm creating map
[puente:220398] [prterun-puente-220398@0,0] setup:vm: working unmanaged allocation
[puente:220398] [prterun-puente-220398@0,0] using hostfile /home/rene/nccl-tests/hostfile
[puente:220398] [prterun-puente-220398@0,0] checking node 146.155.155.83
[puente:220398] [prterun-puente-220398@0,0] ignoring myself
[puente:220398] [prterun-puente-220398@0,0] checking node 146.155.155.84
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_vm add new daemon [prterun-puente-220398@0,1]
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_vm assigning new daemon [prterun-puente-220398@0,1] to node 146.155.155.84
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: launching vm

======================   ALLOCATED NODES   ======================
    puente: slots=8 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 146.155.155.83
    146.155.155.84: slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
    Flags: SLOTS_GIVEN
    aliases: NONE
=================================================================
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: local shell: 0 (bash)
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: assuming same remote shell as local shell
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: remote shell: 0 (bash)
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: final template argv:
    /usr/bin/ssh <template> PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-puente-220398@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-puente-220398@0.0;tcp://146.155.155.83:42011:28" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-puente-220398@0.0;tcp://146.155.155.83:42011:28"
[puente:220398] [prterun-puente-220398@0,0] plm:ssh:launch daemon 0 not a child of mine
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: adding node 146.155.155.84 to launch list
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: activating launch event
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: recording launch of daemon [prterun-puente-220398@0,1]
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh 146.155.155.84 PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-puente-220398@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-puente-220398@0.0;tcp://146.155.155.83:42011:28" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-puente-220398@0.0;tcp://146.155.155.83:42011:28"]
[puente:220398] [prterun-puente-220398@0,0] plm:base:orted_report_launch from daemon [prterun-puente-220398@0,1]
[puente:220398] [prterun-puente-220398@0,0] plm:base:orted_report_launch from daemon [prterun-puente-220398@0,1] on node kalila
[puente:220398] ALIASES FOR NODE kalila (kalila)
[puente:220398]     ALIAS: 146.155.155.84
[puente:220398] [prterun-puente-220398@0,0] RECEIVED TOPOLOGY SIG 2N:2S:16L3:128L2:128L1:128C:255H:0-254:0-255:x86_64:le FROM NODE kalila
[puente:220398] [prterun-puente-220398@0,0] NEW TOPOLOGY - ADDING SIGNATURE
[puente:220398] [prterun-puente-220398@0,0] plm:base:orted_report_launch completed for daemon [prterun-puente-220398@0,1] at contact prterun-puente-220398@0.1;tcp://146.155.155.84:42405:28
[puente:220398] [prterun-puente-220398@0,0] plm:base:orted_report_launch job prterun-puente-220398@0 recvd 2 of 2 reported daemons

======================   ALLOCATED NODES   ======================
    puente: slots=8 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 146.155.155.83
    kalila: slots=8 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 146.155.155.84
=================================================================
[puente:220398] [prterun-puente-220398@0,0] rmaps:base set policy with ppr:1:node
[puente:220398] [prterun-puente-220398@0,0] rmaps:base policy ppr modifiers 1:node provided
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive processing msg
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive job launch command from [prterun-puente-220398@0,0]
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive adding hosts
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive calling spawn
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive done processing commands
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_job
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_vm
[puente:220398] [prterun-puente-220398@0,0] plm_base:setup_vm NODE kalila WAS NOT ADDED
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_vm no new daemons required
[puente:220398] mca:rmaps: mapping job prterun-puente-220398@1
[puente:220398] mca:rmaps: setting mapping policies for job prterun-puente-220398@1 inherit TRUE hwtcpus FALSE
[puente:220398] [prterun-puente-220398@0,0] using hostfile /home/rene/nccl-tests/hostfile
[puente:220398] NODE puente DOESNT MATCH NODE 146.155.155.84
[puente:220398] [prterun-puente-220398@0,0] node puente has 8 slots available
[puente:220398] [prterun-puente-220398@0,0] node kalila has 8 slots available
[puente:220398] AVAILABLE NODES FOR MAPPING:
[puente:220398]     node: puente daemon: 0 slots_available: 8
[puente:220398]     node: kalila daemon: 1 slots_available: 8

======================   ALLOCATED NODES   ======================
    puente: slots=8 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 146.155.155.83
    kalila: slots=8 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 146.155.155.84
=================================================================
[puente:220398] setdefaultbinding[366] binding not given - using bycore
[puente:220398] mca:rmaps:rf: job prterun-puente-220398@1 not using rankfile policy
[puente:220398] mca:rmaps:ppr: mapping job prterun-puente-220398@1 with ppr 1:node
[puente:220398] mca:rmaps:ppr: job prterun-puente-220398@1 assigned policy BYNODE:SLOT
[puente:220398] [prterun-puente-220398@0,0] using hostfile /home/rene/nccl-tests/hostfile
[puente:220398] NODE puente DOESNT MATCH NODE 146.155.155.84
[puente:220398] [prterun-puente-220398@0,0] node puente has 8 slots available
[puente:220398] [prterun-puente-220398@0,0] node kalila has 8 slots available
[puente:220398] AVAILABLE NODES FOR MAPPING:
[puente:220398]     node: puente daemon: 0 slots_available: 8
[puente:220398]     node: kalila daemon: 1 slots_available: 8
[puente:220398] [prterun-puente-220398@0,0] get_avail_ncpus: node puente has 0 procs on it
[puente:220398] mca:rmaps: compute bindings for job prterun-puente-220398@1 with policy CORE:IF-SUPPORTED[1007]
[puente:220398] mca:rmaps: bind [prterun-puente-220398@1,INVALID] with policy CORE:IF-SUPPORTED
[puente:220398] [prterun-puente-220398@0,0] BOUND PROC [prterun-puente-220398@1,INVALID][puente] TO package[0][core:0]
[puente:220398] [prterun-puente-220398@0,0] get_avail_ncpus: node kalila has 0 procs on it
[puente:220398] mca:rmaps: compute bindings for job prterun-puente-220398@1 with policy CORE:IF-SUPPORTED[1007]
[puente:220398] mca:rmaps: bind [prterun-puente-220398@1,INVALID] with policy CORE:IF-SUPPORTED
[puente:220398] [prterun-puente-220398@0,0] BOUND PROC [prterun-puente-220398@1,INVALID][kalila] TO package[0][core:0]
[puente:220398] [prterun-puente-220398@0,0] complete_setup on job prterun-puente-220398@1
[puente:220398] [prterun-puente-220398@0,0] plm:base:launch_apps for job prterun-puente-220398@1
[puente:220398] [prterun-puente-220398@0,0] plm:base:send launch msg for job prterun-puente-220398@1
puente
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive processing msg
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive local launch complete command from [prterun-puente-220398@0,1]
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive got local launch complete for job prterun-puente-220398@1
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive got local launch complete for vpid 1
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive got local launch complete for vpid 1 state RUNNING
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive done processing commands
[puente:220398] [prterun-puente-220398@0,0] plm:base:launch wiring up iof for job prterun-puente-220398@1
kalila
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive processing msg
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive update proc state command from [prterun-puente-220398@0,1]
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive got update_proc_state for job prterun-puente-220398@1
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive got update_proc_state for vpid 1 pid 577327 state NORMALLY TERMINATED exit_code 0
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive done processing commands
[puente:220398] [prterun-puente-220398@0,0] plm:base:prted_cmd sending prted_exit commands
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive stop comm
[puente:220398] mca: base: close: component ssh closed
[puente:220398] mca: base: close: unloading component ssh
sdonoso commented 6 days ago

And here is the output of the same verbose run with hello_world:

rene@puente:~/nccl-tests$ mpirun -x  LD_LIBRARY_PATH='usr/local/openmpi/lib':$LD_LIBRARY_PATH -np 2 -hostfile hostfile  --map-by ppr:1:node --prtemca plm_base_verbose 100 --prtemca rmaps_base_verbose 100 --display alloc ./hello_world
[puente:222235] mca: base: component_find: searching NULL for plm components
[puente:222235] mca: base: find_dyn_components: checking NULL for plm components
[puente:222235] pmix:mca: base: components_register: registering framework plm components
[puente:222235] pmix:mca: base: components_register: found loaded component slurm
[puente:222235] pmix:mca: base: components_register: component slurm register function successful
[puente:222235] pmix:mca: base: components_register: found loaded component ssh
[puente:222235] pmix:mca: base: components_register: component ssh register function successful
[puente:222235] mca: base: components_open: opening plm components
[puente:222235] mca: base: components_open: found loaded component slurm
[puente:222235] mca: base: components_open: component slurm open function successful
[puente:222235] mca: base: components_open: found loaded component ssh
[puente:222235] mca: base: components_open: component ssh open function successful
[puente:222235] mca:base:select: Auto-selecting plm components
[puente:222235] mca:base:select:(  plm) Querying component [slurm]
[puente:222235] mca:base:select:(  plm) Querying component [ssh]
[puente:222235] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[puente:222235] mca:base:select:(  plm) Query of component [ssh] set priority to 10
[puente:222235] mca:base:select:(  plm) Selected component [ssh]
[puente:222235] mca: base: close: component slurm closed
[puente:222235] mca: base: close: unloading component slurm
[puente:222235] [prterun-puente-222235@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive start comm
[puente:222235] mca: base: component_find: searching NULL for rmaps components
[puente:222235] mca: base: find_dyn_components: checking NULL for rmaps components
[puente:222235] pmix:mca: base: components_register: registering framework rmaps components
[puente:222235] pmix:mca: base: components_register: found loaded component ppr
[puente:222235] pmix:mca: base: components_register: component ppr register function successful
[puente:222235] pmix:mca: base: components_register: found loaded component rank_file
[puente:222235] pmix:mca: base: components_register: component rank_file has no register or open function
[puente:222235] pmix:mca: base: components_register: found loaded component round_robin
[puente:222235] pmix:mca: base: components_register: component round_robin register function successful
[puente:222235] pmix:mca: base: components_register: found loaded component seq
[puente:222235] pmix:mca: base: components_register: component seq register function successful
[puente:222235] mca: base: components_open: opening rmaps components
[puente:222235] mca: base: components_open: found loaded component ppr
[puente:222235] mca: base: components_open: component ppr open function successful
[puente:222235] mca: base: components_open: found loaded component rank_file
[puente:222235] mca: base: components_open: found loaded component round_robin
[puente:222235] mca: base: components_open: component round_robin open function successful
[puente:222235] mca: base: components_open: found loaded component seq
[puente:222235] mca: base: components_open: component seq open function successful
[puente:222235] mca:rmaps:select: checking available component ppr
[puente:222235] mca:rmaps:select: Querying component [ppr]
[puente:222235] mca:rmaps:select: checking available component rank_file
[puente:222235] mca:rmaps:select: Querying component [rank_file]
[puente:222235] mca:rmaps:select: checking available component round_robin
[puente:222235] mca:rmaps:select: Querying component [round_robin]
[puente:222235] mca:rmaps:select: checking available component seq
[puente:222235] mca:rmaps:select: Querying component [seq]
[puente:222235] [prterun-puente-222235@0,0]: Final mapper priorities
[puente:222235]     Mapper: rank_file Priority: 100
[puente:222235]     Mapper: ppr Priority: 90
[puente:222235]     Mapper: seq Priority: 60
[puente:222235]     Mapper: round_robin Priority: 10
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_vm
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_vm creating map
[puente:222235] [prterun-puente-222235@0,0] setup:vm: working unmanaged allocation
[puente:222235] [prterun-puente-222235@0,0] using hostfile /home/rene/nccl-tests/hostfile
[puente:222235] [prterun-puente-222235@0,0] checking node 146.155.155.83
[puente:222235] [prterun-puente-222235@0,0] ignoring myself
[puente:222235] [prterun-puente-222235@0,0] checking node 146.155.155.84
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_vm add new daemon [prterun-puente-222235@0,1]
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_vm assigning new daemon [prterun-puente-222235@0,1] to node 146.155.155.84

======================   ALLOCATED NODES   ======================
    puente: slots=8 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 146.155.155.83
    146.155.155.84: slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
    Flags: SLOTS_GIVEN
    aliases: NONE
=================================================================
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: launching vm
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: local shell: 0 (bash)
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: assuming same remote shell as local shell
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: remote shell: 0 (bash)
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: final template argv:
    /usr/bin/ssh <template> PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-puente-222235@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-puente-222235@0.0;tcp://146.155.155.83:39027:28" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-puente-222235@0.0;tcp://146.155.155.83:39027:28"
[puente:222235] [prterun-puente-222235@0,0] plm:ssh:launch daemon 0 not a child of mine
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: adding node 146.155.155.84 to launch list
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: activating launch event
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: recording launch of daemon [prterun-puente-222235@0,1]
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh 146.155.155.84 PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-puente-222235@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-puente-222235@0.0;tcp://146.155.155.83:39027:28" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-puente-222235@0.0;tcp://146.155.155.83:39027:28"]
[puente:222235] [prterun-puente-222235@0,0] plm:base:orted_report_launch from daemon [prterun-puente-222235@0,1]
[puente:222235] [prterun-puente-222235@0,0] plm:base:orted_report_launch from daemon [prterun-puente-222235@0,1] on node kalila
[puente:222235] ALIASES FOR NODE kalila (kalila)
[puente:222235]     ALIAS: 146.155.155.84
[puente:222235] [prterun-puente-222235@0,0] RECEIVED TOPOLOGY SIG 2N:2S:16L3:128L2:128L1:128C:255H:0-254:0-255:x86_64:le FROM NODE kalila
[puente:222235] [prterun-puente-222235@0,0] NEW TOPOLOGY - ADDING SIGNATURE
[puente:222235] [prterun-puente-222235@0,0] plm:base:orted_report_launch completed for daemon [prterun-puente-222235@0,1] at contact prterun-puente-222235@0.1;tcp://146.155.155.84:33749:28
[puente:222235] [prterun-puente-222235@0,0] plm:base:orted_report_launch job prterun-puente-222235@0 recvd 2 of 2 reported daemons

======================   ALLOCATED NODES   ======================
    puente: slots=8 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 146.155.155.83
    kalila: slots=8 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 146.155.155.84
=================================================================
[puente:222235] [prterun-puente-222235@0,0] rmaps:base set policy with ppr:1:node
[puente:222235] [prterun-puente-222235@0,0] rmaps:base policy ppr modifiers 1:node provided
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive processing msg
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive job launch command from [prterun-puente-222235@0,0]
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive adding hosts
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive calling spawn
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive done processing commands
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_job
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_vm
[puente:222235] [prterun-puente-222235@0,0] plm_base:setup_vm NODE kalila WAS NOT ADDED
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_vm no new daemons required
[puente:222235] mca:rmaps: mapping job prterun-puente-222235@1
[puente:222235] mca:rmaps: setting mapping policies for job prterun-puente-222235@1 inherit TRUE hwtcpus FALSE
[puente:222235] setdefaultbinding[366] binding not given - using bycore
[puente:222235] mca:rmaps:rf: job prterun-puente-222235@1 not using rankfile policy
[puente:222235] mca:rmaps:ppr: mapping job prterun-puente-222235@1 with ppr 1:node
[puente:222235] mca:rmaps:ppr: job prterun-puente-222235@1 assigned policy BYNODE:SLOT
[puente:222235] [prterun-puente-222235@0,0] using hostfile /home/rene/nccl-tests/hostfile
[puente:222235] NODE puente DOESNT MATCH NODE 146.155.155.84
[puente:222235] [prterun-puente-222235@0,0] node puente has 8 slots available
[puente:222235] [prterun-puente-222235@0,0] node kalila has 8 slots available
[puente:222235] AVAILABLE NODES FOR MAPPING:
[puente:222235]     node: puente daemon: 0 slots_available: 8
[puente:222235]     node: kalila daemon: 1 slots_available: 8

======================   ALLOCATED NODES   ======================
    puente: slots=8 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 146.155.155.83
    kalila: slots=8 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: 146.155.155.84
=================================================================
[puente:222235] [prterun-puente-222235@0,0] get_avail_ncpus: node puente has 0 procs on it
[puente:222235] mca:rmaps: compute bindings for job prterun-puente-222235@1 with policy CORE:IF-SUPPORTED[1007]
[puente:222235] mca:rmaps: bind [prterun-puente-222235@1,INVALID] with policy CORE:IF-SUPPORTED
[puente:222235] [prterun-puente-222235@0,0] BOUND PROC [prterun-puente-222235@1,INVALID][puente] TO package[0][core:0]
[puente:222235] [prterun-puente-222235@0,0] get_avail_ncpus: node kalila has 0 procs on it
[puente:222235] mca:rmaps: compute bindings for job prterun-puente-222235@1 with policy CORE:IF-SUPPORTED[1007]
[puente:222235] mca:rmaps: bind [prterun-puente-222235@1,INVALID] with policy CORE:IF-SUPPORTED
[puente:222235] [prterun-puente-222235@0,0] BOUND PROC [prterun-puente-222235@1,INVALID][kalila] TO package[0][core:0]
[puente:222235] [prterun-puente-222235@0,0] complete_setup on job prterun-puente-222235@1
[puente:222235] [prterun-puente-222235@0,0] plm:base:launch_apps for job prterun-puente-222235@1
[puente:222235] [prterun-puente-222235@0,0] plm:base:send launch msg for job prterun-puente-222235@1
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive processing msg
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive local launch complete command from [prterun-puente-222235@0,1]
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive got local launch complete for job prterun-puente-222235@1
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive got local launch complete for vpid 1
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive got local launch complete for vpid 1 state RUNNING
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive done processing commands
[puente:222235] [prterun-puente-222235@0,0] plm:base:launch wiring up iof for job prterun-puente-222235@1
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive processing msg
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive registered command from [prterun-puente-222235@0,1]
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive got registered for job prterun-puente-222235@1
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive got registered for vpid 1
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive done processing commands
[puente:222235] [prterun-puente-222235@0,0] plm:base:launch prterun-puente-222235@1 registered
Hello world from rank 1 out of 2 processors
janjust commented 1 day ago

I'm not really sure how to debug this further: I cannot reproduce it locally or on any of our other machines.