Open sdonoso opened 1 week ago
@sdonoso We have ingested multiple runtime fixes since 5.0.1. Can you reproduce on 5.0.3?
For example, we fixed this a while ago: https://github.com/open-mpi/ompi/issues/12064
I am using version 5.0.3
rene@puente:~/nccl-tests$ mpirun --version
mpirun (Open MPI) 5.0.3
Thanks. I updated the issue title.
@sdonoso Any chance your LD_LIBRARY_PATH isn't propagated to the other node? Could you add your MPI libs to the path and forward it via -x LD_LIBRARY_PATH?
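A minimal sketch of what that suggestion amounts to, assuming Open MPI is installed under /usr/local/openmpi (the prefix that appears in the commands later in this thread). The helper names `prepend_lib_dir` and `mpirun_env_args` are hypothetical and only build the argument list; they do not call mpirun. The sketch rejects a non-absolute directory, on the assumption that a relative entry would be resolved against each process's working directory rather than the install prefix.

```python
def prepend_lib_dir(lib_dir: str, current: str) -> str:
    """Return an LD_LIBRARY_PATH value with lib_dir placed first."""
    # Assumption: only absolute directories are wanted here.
    if not lib_dir.startswith("/"):
        raise ValueError(f"library directory is not absolute: {lib_dir!r}")
    # Avoid a trailing ':' when the existing value is empty.
    return f"{lib_dir}:{current}" if current else lib_dir


def mpirun_env_args(lib_dir: str, current: str) -> list[str]:
    """Build the '-x LD_LIBRARY_PATH=...' pair to pass to mpirun."""
    return ["-x", f"LD_LIBRARY_PATH={prepend_lib_dir(lib_dir, current)}"]


# Hypothetical existing value; on a real system read os.environ instead.
print(mpirun_env_args("/usr/local/openmpi/lib", "/opt/other/lib"))
```

The resulting pair would go in front of the remaining mpirun options (hostfile, mapping policy, executable).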
What do you mean by the LD_LIBRARY_PATH not being propagated?
rene@puente:~/nccl-tests$ mpirun -x LD_LIBRARY_PATH='usr/local/openmpi/lib':$LD_LIBRARY_PATH -hostfile hostfile -np 2 --mca pml ucx --map-by ppr:1:node ./hello_world
Hello world from rank 1 out of 2 processors
I have the same result
rene@puente:~/nccl-tests$ mpirun -x LD_LIBRARY_PATH='usr/local/openmpi/lib':$LD_LIBRARY_PATH --map-by ppr:1:node --prtemca plm_base_verbose 100 --prtemca rmaps_base_verbose 100 --display alloc hostname
[puente:218375] mca: base: component_find: searching NULL for plm components
[puente:218375] mca: base: find_dyn_components: checking NULL for plm components
[puente:218375] pmix:mca: base: components_register: registering framework plm components
[puente:218375] pmix:mca: base: components_register: found loaded component slurm
[puente:218375] pmix:mca: base: components_register: component slurm register function successful
[puente:218375] pmix:mca: base: components_register: found loaded component ssh
[puente:218375] pmix:mca: base: components_register: component ssh register function successful
[puente:218375] mca: base: components_open: opening plm components
[puente:218375] mca: base: components_open: found loaded component slurm
[puente:218375] mca: base: components_open: component slurm open function successful
[puente:218375] mca: base: components_open: found loaded component ssh
[puente:218375] mca: base: components_open: component ssh open function successful
[puente:218375] mca:base:select: Auto-selecting plm components
[puente:218375] mca:base:select:( plm) Querying component [slurm]
[puente:218375] mca:base:select:( plm) Querying component [ssh]
[puente:218375] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[puente:218375] mca:base:select:( plm) Query of component [ssh] set priority to 10
[puente:218375] mca:base:select:( plm) Selected component [ssh]
[puente:218375] mca: base: close: component slurm closed
[puente:218375] mca: base: close: unloading component slurm
[puente:218375] [prterun-puente-218375@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive start comm
[puente:218375] mca: base: component_find: searching NULL for rmaps components
[puente:218375] mca: base: find_dyn_components: checking NULL for rmaps components
[puente:218375] pmix:mca: base: components_register: registering framework rmaps components
[puente:218375] pmix:mca: base: components_register: found loaded component ppr
[puente:218375] pmix:mca: base: components_register: component ppr register function successful
[puente:218375] pmix:mca: base: components_register: found loaded component rank_file
[puente:218375] pmix:mca: base: components_register: component rank_file has no register or open function
[puente:218375] pmix:mca: base: components_register: found loaded component round_robin
[puente:218375] pmix:mca: base: components_register: component round_robin register function successful
[puente:218375] pmix:mca: base: components_register: found loaded component seq
[puente:218375] pmix:mca: base: components_register: component seq register function successful
[puente:218375] mca: base: components_open: opening rmaps components
[puente:218375] mca: base: components_open: found loaded component ppr
[puente:218375] mca: base: components_open: component ppr open function successful
[puente:218375] mca: base: components_open: found loaded component rank_file
[puente:218375] mca: base: components_open: found loaded component round_robin
[puente:218375] mca: base: components_open: component round_robin open function successful
[puente:218375] mca: base: components_open: found loaded component seq
[puente:218375] mca: base: components_open: component seq open function successful
[puente:218375] mca:rmaps:select: checking available component ppr
[puente:218375] mca:rmaps:select: Querying component [ppr]
[puente:218375] mca:rmaps:select: checking available component rank_file
[puente:218375] mca:rmaps:select: Querying component [rank_file]
[puente:218375] mca:rmaps:select: checking available component round_robin
[puente:218375] mca:rmaps:select: Querying component [round_robin]
[puente:218375] mca:rmaps:select: checking available component seq
[puente:218375] mca:rmaps:select: Querying component [seq]
[puente:218375] [prterun-puente-218375@0,0]: Final mapper priorities
[puente:218375] Mapper: rank_file Priority: 100
[puente:218375] Mapper: ppr Priority: 90
[puente:218375] Mapper: seq Priority: 60
[puente:218375] Mapper: round_robin Priority: 10
[puente:218375] [prterun-puente-218375@0,0] plm:base:setup_vm
[puente:218375] [prterun-puente-218375@0,0] plm:base:setup_vm creating map
[puente:218375] [prterun-puente-218375@0,0] setup:vm: working unmanaged allocation
[puente:218375] [prterun-puente-218375@0,0] using default hostfile /usr/local/openmpi/etc/prte-default-hostfile
[puente:218375] [prterun-puente-218375@0,0] plm:base:setup_vm only HNP in allocation
[puente:218375] [prterun-puente-218375@0,0] plm:base:setting slots for node puente by core
====================== ALLOCATED NODES ======================
puente: slots=1 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED
aliases: puente
=================================================================
====================== ALLOCATED NODES ======================
puente: slots=128 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
aliases: puente
=================================================================
[puente:218375] [prterun-puente-218375@0,0] rmaps:base set policy with ppr:1:node
[puente:218375] [prterun-puente-218375@0,0] rmaps:base policy ppr modifiers 1:node provided
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive processing msg
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive job launch command from [prterun-puente-218375@0,0]
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive adding hosts
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive calling spawn
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive done processing commands
[puente:218375] [prterun-puente-218375@0,0] plm:base:setup_job
[puente:218375] [prterun-puente-218375@0,0] plm:base:setup_vm
[puente:218375] [prterun-puente-218375@0,0] plm:base:setup_vm no new daemons required
[puente:218375] mca:rmaps: mapping job prterun-puente-218375@1
[puente:218375] mca:rmaps: setting mapping policies for job prterun-puente-218375@1 inherit TRUE hwtcpus FALSE
[puente:218375] [prterun-puente-218375@0,0] using known nodes
[puente:218375] [prterun-puente-218375@0,0] Starting with 1 nodes in list
[puente:218375] [prterun-puente-218375@0,0] Filtering thru apps
[puente:218375] [prterun-puente-218375@0,0] Retained 1 nodes in list
[puente:218375] [prterun-puente-218375@0,0] node puente has 128 slots available
[puente:218375] AVAILABLE NODES FOR MAPPING:
[puente:218375] node: puente daemon: 0 slots_available: 128
[puente:218375] setdefaultbinding[366] binding not given - using bycore
====================== ALLOCATED NODES ======================
[puente:218375] mca:rmaps:rf: job prterun-puente-218375@1 not using rankfile policy
puente: slots=128 max_slots=0 slots_inuse=0 state=UP
[puente:218375] mca:rmaps:ppr: mapping job prterun-puente-218375@1 with ppr 1:node
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
[puente:218375] mca:rmaps:ppr: job prterun-puente-218375@1 assigned policy BYNODE:SLOT
aliases: puente
[puente:218375] [prterun-puente-218375@0,0] using known nodes
=================================================================
[puente:218375] [prterun-puente-218375@0,0] Starting with 1 nodes in list
[puente:218375] [prterun-puente-218375@0,0] Filtering thru apps
[puente:218375] [prterun-puente-218375@0,0] Retained 1 nodes in list
[puente:218375] [prterun-puente-218375@0,0] node puente has 128 slots available
[puente:218375] AVAILABLE NODES FOR MAPPING:
[puente:218375] node: puente daemon: 0 slots_available: 128
[puente:218375] [prterun-puente-218375@0,0] get_avail_ncpus: node puente has 0 procs on it
[puente:218375] mca:rmaps: compute bindings for job prterun-puente-218375@1 with policy CORE:IF-SUPPORTED[1007]
[puente:218375] mca:rmaps: bind [prterun-puente-218375@1,INVALID] with policy CORE:IF-SUPPORTED
[puente:218375] [prterun-puente-218375@0,0] BOUND PROC [prterun-puente-218375@1,INVALID][puente] TO package[0][core:0]
[puente:218375] [prterun-puente-218375@0,0] complete_setup on job prterun-puente-218375@1
[puente:218375] [prterun-puente-218375@0,0] plm:base:launch_apps for job prterun-puente-218375@1
[puente:218375] [prterun-puente-218375@0,0] plm:base:send launch msg for job prterun-puente-218375@1
[puente:218375] [prterun-puente-218375@0,0] plm:base:launch wiring up iof for job prterun-puente-218375@1
puente
[puente:218375] [prterun-puente-218375@0,0] plm:base:prted_cmd sending prted_exit commands
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive stop comm
[puente:218375] mca: base: close: component ssh closed
[puente:218375] mca: base: close: unloading component ssh
Sometimes (I don't remember the exact circumstances), if LD_LIBRARY_PATH is not forwarded to the other nodes, hangs and other weird behavior are possible, so it is usually the first thing I try to rule out. Looks like your issue is something else.
The problem is here: -np 2 --mca pml ucx --map-by ppr:1:node
You only have one node in your system, and you tell us to launch 1 process/node - but ask us to launch TWO procs. Logically impossible. We should have immediately error'd out, so that's the bug - but this cmd cannot succeed.
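The arithmetic behind that statement can be sketched as follows. This is only an illustration of the constraint, not PRRTE's actual mapper logic; the function names and parameters are hypothetical.

```python
def max_procs_for_ppr(num_nodes: int, procs_per_node: int) -> int:
    """Upper bound on processes a ppr:N:node policy can place."""
    return num_nodes * procs_per_node


def request_is_satisfiable(np: int, num_nodes: int, procs_per_node: int) -> bool:
    """True if the -np request fits within what ppr:N:node allows."""
    return np <= max_procs_for_ppr(num_nodes, procs_per_node)


# One node in the allocation, ppr:1:node, -np 2.
print(request_is_satisfiable(np=2, num_nodes=1, procs_per_node=1))
# Two nodes in the allocation, same policy and process count.
print(request_is_satisfiable(np=2, num_nodes=2, procs_per_node=1))
```

With a single node the request cannot be met; with two nodes listed in the hostfile it can.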
He has a hostfile in his previous command; I assumed two nodes are listed in it.
Yeah, it's nearly impossible to triage this one. The cmds keep varying, some are inconsistent with the reported output, etc. Probably need to ask that the user be more careful in what they are reporting.
I have two nodes connected by InfiniBand, and I can also ssh between the nodes without a password prompt.
Your reported debug output shows only ONE node in your allocation:
====================== ALLOCATED NODES ======================
puente: slots=1 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED
aliases: puente
=================================================================
Hence the confusion. I think you are perhaps not being careful in showing the results from what is probably a bunch of runs, and the output doesn't always match the posted cmd.
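One way to check a posted log against the command that supposedly produced it is to count the nodes in the `--display alloc` block. The sketch below assumes the block format seen in this thread (a `name: slots=N max_slots=... slots_inuse=... state=...` line per node); the function name `parse_allocated_nodes` is hypothetical, and the format may differ in other versions.

```python
import re

# Matches the per-node line of the ALLOCATED NODES block as it appears
# in this thread; Flags/aliases lines and bracketed log lines do not match.
NODE_LINE = re.compile(
    r"^\s*(?P<name>\S+): slots=(?P<slots>\d+) max_slots=\d+ "
    r"slots_inuse=\d+ state=(?P<state>\w+)\s*$"
)


def parse_allocated_nodes(text: str) -> list[tuple[str, int, str]]:
    """Return (node name, slots, state) for every node line in text."""
    nodes = []
    for line in text.splitlines():
        match = NODE_LINE.match(line)
        if match:
            nodes.append(
                (match.group("name"), int(match.group("slots")), match.group("state"))
            )
    return nodes


sample = """\
====================== ALLOCATED NODES ======================
puente: slots=1 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED
aliases: puente
=================================================================
"""
nodes = parse_allocated_nodes(sample)
print(len(nodes), nodes)
```

A single entry here means the hostfile was not in effect for that run, whatever the posted command line says.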
Sorry, I forgot to pass the hostfile:
mpirun -x LD_LIBRARY_PATH='usr/local/openmpi/lib':$LD_LIBRARY_PATH -hostfile nccl-tests/hostfile --map-by ppr:1:node --prtemca plm_base_verbose 100 --prtemca rmaps_base_verbose 100 --display alloc hostname
[puente:220398] mca: base: component_find: searching NULL for plm components
[puente:220398] mca: base: find_dyn_components: checking NULL for plm components
[puente:220398] pmix:mca: base: components_register: registering framework plm components
[puente:220398] pmix:mca: base: components_register: found loaded component slurm
[puente:220398] pmix:mca: base: components_register: component slurm register function successful
[puente:220398] pmix:mca: base: components_register: found loaded component ssh
[puente:220398] pmix:mca: base: components_register: component ssh register function successful
[puente:220398] mca: base: components_open: opening plm components
[puente:220398] mca: base: components_open: found loaded component slurm
[puente:220398] mca: base: components_open: component slurm open function successful
[puente:220398] mca: base: components_open: found loaded component ssh
[puente:220398] mca: base: components_open: component ssh open function successful
[puente:220398] mca:base:select: Auto-selecting plm components
[puente:220398] mca:base:select:( plm) Querying component [slurm]
[puente:220398] mca:base:select:( plm) Querying component [ssh]
[puente:220398] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[puente:220398] mca:base:select:( plm) Query of component [ssh] set priority to 10
[puente:220398] mca:base:select:( plm) Selected component [ssh]
[puente:220398] mca: base: close: component slurm closed
[puente:220398] mca: base: close: unloading component slurm
[puente:220398] [prterun-puente-220398@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive start comm
[puente:220398] mca: base: component_find: searching NULL for rmaps components
[puente:220398] mca: base: find_dyn_components: checking NULL for rmaps components
[puente:220398] pmix:mca: base: components_register: registering framework rmaps components
[puente:220398] pmix:mca: base: components_register: found loaded component ppr
[puente:220398] pmix:mca: base: components_register: component ppr register function successful
[puente:220398] pmix:mca: base: components_register: found loaded component rank_file
[puente:220398] pmix:mca: base: components_register: component rank_file has no register or open function
[puente:220398] pmix:mca: base: components_register: found loaded component round_robin
[puente:220398] pmix:mca: base: components_register: component round_robin register function successful
[puente:220398] pmix:mca: base: components_register: found loaded component seq
[puente:220398] pmix:mca: base: components_register: component seq register function successful
[puente:220398] mca: base: components_open: opening rmaps components
[puente:220398] mca: base: components_open: found loaded component ppr
[puente:220398] mca: base: components_open: component ppr open function successful
[puente:220398] mca: base: components_open: found loaded component rank_file
[puente:220398] mca: base: components_open: found loaded component round_robin
[puente:220398] mca: base: components_open: component round_robin open function successful
[puente:220398] mca: base: components_open: found loaded component seq
[puente:220398] mca: base: components_open: component seq open function successful
[puente:220398] mca:rmaps:select: checking available component ppr
[puente:220398] mca:rmaps:select: Querying component [ppr]
[puente:220398] mca:rmaps:select: checking available component rank_file
[puente:220398] mca:rmaps:select: Querying component [rank_file]
[puente:220398] mca:rmaps:select: checking available component round_robin
[puente:220398] mca:rmaps:select: Querying component [round_robin]
[puente:220398] mca:rmaps:select: checking available component seq
[puente:220398] mca:rmaps:select: Querying component [seq]
[puente:220398] [prterun-puente-220398@0,0]: Final mapper priorities
[puente:220398] Mapper: rank_file Priority: 100
[puente:220398] Mapper: ppr Priority: 90
[puente:220398] Mapper: seq Priority: 60
[puente:220398] Mapper: round_robin Priority: 10
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_vm
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_vm creating map
[puente:220398] [prterun-puente-220398@0,0] setup:vm: working unmanaged allocation
[puente:220398] [prterun-puente-220398@0,0] using hostfile /home/rene/nccl-tests/hostfile
[puente:220398] [prterun-puente-220398@0,0] checking node 146.155.155.83
[puente:220398] [prterun-puente-220398@0,0] ignoring myself
[puente:220398] [prterun-puente-220398@0,0] checking node 146.155.155.84
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_vm add new daemon [prterun-puente-220398@0,1]
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_vm assigning new daemon [prterun-puente-220398@0,1] to node 146.155.155.84
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: launching vm
====================== ALLOCATED NODES ======================
puente: slots=8 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
aliases: 146.155.155.83
146.155.155.84: slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
Flags: SLOTS_GIVEN
aliases: NONE
=================================================================
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: local shell: 0 (bash)
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: assuming same remote shell as local shell
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: remote shell: 0 (bash)
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: final template argv:
/usr/bin/ssh <template> PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-puente-220398@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-puente-220398@0.0;tcp://146.155.155.83:42011:28" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-puente-220398@0.0;tcp://146.155.155.83:42011:28"
[puente:220398] [prterun-puente-220398@0,0] plm:ssh:launch daemon 0 not a child of mine
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: adding node 146.155.155.84 to launch list
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: activating launch event
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: recording launch of daemon [prterun-puente-220398@0,1]
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh 146.155.155.84 PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-puente-220398@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-puente-220398@0.0;tcp://146.155.155.83:42011:28" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-puente-220398@0.0;tcp://146.155.155.83:42011:28"]
[puente:220398] [prterun-puente-220398@0,0] plm:base:orted_report_launch from daemon [prterun-puente-220398@0,1]
[puente:220398] [prterun-puente-220398@0,0] plm:base:orted_report_launch from daemon [prterun-puente-220398@0,1] on node kalila
[puente:220398] ALIASES FOR NODE kalila (kalila)
[puente:220398] ALIAS: 146.155.155.84
[puente:220398] [prterun-puente-220398@0,0] RECEIVED TOPOLOGY SIG 2N:2S:16L3:128L2:128L1:128C:255H:0-254:0-255:x86_64:le FROM NODE kalila
[puente:220398] [prterun-puente-220398@0,0] NEW TOPOLOGY - ADDING SIGNATURE
[puente:220398] [prterun-puente-220398@0,0] plm:base:orted_report_launch completed for daemon [prterun-puente-220398@0,1] at contact prterun-puente-220398@0.1;tcp://146.155.155.84:42405:28
[puente:220398] [prterun-puente-220398@0,0] plm:base:orted_report_launch job prterun-puente-220398@0 recvd 2 of 2 reported daemons
====================== ALLOCATED NODES ======================
puente: slots=8 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
aliases: 146.155.155.83
kalila: slots=8 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
aliases: 146.155.155.84
=================================================================
[puente:220398] [prterun-puente-220398@0,0] rmaps:base set policy with ppr:1:node
[puente:220398] [prterun-puente-220398@0,0] rmaps:base policy ppr modifiers 1:node provided
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive processing msg
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive job launch command from [prterun-puente-220398@0,0]
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive adding hosts
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive calling spawn
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive done processing commands
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_job
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_vm
[puente:220398] [prterun-puente-220398@0,0] plm_base:setup_vm NODE kalila WAS NOT ADDED
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_vm no new daemons required
[puente:220398] mca:rmaps: mapping job prterun-puente-220398@1
[puente:220398] mca:rmaps: setting mapping policies for job prterun-puente-220398@1 inherit TRUE hwtcpus FALSE
[puente:220398] [prterun-puente-220398@0,0] using hostfile /home/rene/nccl-tests/hostfile
====================== ALLOCATED NODES ======================
puente: slots=8 max_slots=0 slots_inuse=0 state=UP
[puente:220398] NODE puente DOESNT MATCH NODE 146.155.155.84
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
[puente:220398] [prterun-puente-220398@0,0] node puente has 8 slots available
aliases: 146.155.155.83
[puente:220398] [prterun-puente-220398@0,0] node kalila has 8 slots available
kalila: slots=8 max_slots=0 slots_inuse=0 state=UP
[puente:220398] AVAILABLE NODES FOR MAPPING:
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
[puente:220398] node: puente daemon: 0 slots_available: 8
aliases: 146.155.155.84
[puente:220398] node: kalila daemon: 1 slots_available: 8
=================================================================
[puente:220398] setdefaultbinding[366] binding not given - using bycore
[puente:220398] mca:rmaps:rf: job prterun-puente-220398@1 not using rankfile policy
[puente:220398] mca:rmaps:ppr: mapping job prterun-puente-220398@1 with ppr 1:node
[puente:220398] mca:rmaps:ppr: job prterun-puente-220398@1 assigned policy BYNODE:SLOT
[puente:220398] [prterun-puente-220398@0,0] using hostfile /home/rene/nccl-tests/hostfile
[puente:220398] NODE puente DOESNT MATCH NODE 146.155.155.84
[puente:220398] [prterun-puente-220398@0,0] node puente has 8 slots available
[puente:220398] [prterun-puente-220398@0,0] node kalila has 8 slots available
[puente:220398] AVAILABLE NODES FOR MAPPING:
[puente:220398] node: puente daemon: 0 slots_available: 8
[puente:220398] node: kalila daemon: 1 slots_available: 8
[puente:220398] [prterun-puente-220398@0,0] get_avail_ncpus: node puente has 0 procs on it
[puente:220398] mca:rmaps: compute bindings for job prterun-puente-220398@1 with policy CORE:IF-SUPPORTED[1007]
[puente:220398] mca:rmaps: bind [prterun-puente-220398@1,INVALID] with policy CORE:IF-SUPPORTED
[puente:220398] [prterun-puente-220398@0,0] BOUND PROC [prterun-puente-220398@1,INVALID][puente] TO package[0][core:0]
[puente:220398] [prterun-puente-220398@0,0] get_avail_ncpus: node kalila has 0 procs on it
[puente:220398] mca:rmaps: compute bindings for job prterun-puente-220398@1 with policy CORE:IF-SUPPORTED[1007]
[puente:220398] mca:rmaps: bind [prterun-puente-220398@1,INVALID] with policy CORE:IF-SUPPORTED
[puente:220398] [prterun-puente-220398@0,0] BOUND PROC [prterun-puente-220398@1,INVALID][kalila] TO package[0][core:0]
[puente:220398] [prterun-puente-220398@0,0] complete_setup on job prterun-puente-220398@1
[puente:220398] [prterun-puente-220398@0,0] plm:base:launch_apps for job prterun-puente-220398@1
[puente:220398] [prterun-puente-220398@0,0] plm:base:send launch msg for job prterun-puente-220398@1
puente
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive processing msg
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive local launch complete command from [prterun-puente-220398@0,1]
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive got local launch complete for job prterun-puente-220398@1
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive got local launch complete for vpid 1
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive got local launch complete for vpid 1 state RUNNING
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive done processing commands
[puente:220398] [prterun-puente-220398@0,0] plm:base:launch wiring up iof for job prterun-puente-220398@1
kalila
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive processing msg
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive update proc state command from [prterun-puente-220398@0,1]
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive got update_proc_state for job prterun-puente-220398@1
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive got update_proc_state for vpid 1 pid 577327 state NORMALLY TERMINATED exit_code 0
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive done processing commands
[puente:220398] [prterun-puente-220398@0,0] plm:base:prted_cmd sending prted_exit commands
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive stop comm
[puente:220398] mca: base: close: component ssh closed
[puente:220398] mca: base: close: unloading component ssh
And here is the output of hello_world:
rene@puente:~/nccl-tests$ mpirun -x LD_LIBRARY_PATH='usr/local/openmpi/lib':$LD_LIBRARY_PATH -np 2 -hostfile hostfile --map-by ppr:1:node --prtemca plm_base_verbose 100 --prtemca rmaps_base_verbose 100 --display alloc ./hello_world
[puente:222235] mca: base: component_find: searching NULL for plm components
[puente:222235] mca: base: find_dyn_components: checking NULL for plm components
[puente:222235] pmix:mca: base: components_register: registering framework plm components
[puente:222235] pmix:mca: base: components_register: found loaded component slurm
[puente:222235] pmix:mca: base: components_register: component slurm register function successful
[puente:222235] pmix:mca: base: components_register: found loaded component ssh
[puente:222235] pmix:mca: base: components_register: component ssh register function successful
[puente:222235] mca: base: components_open: opening plm components
[puente:222235] mca: base: components_open: found loaded component slurm
[puente:222235] mca: base: components_open: component slurm open function successful
[puente:222235] mca: base: components_open: found loaded component ssh
[puente:222235] mca: base: components_open: component ssh open function successful
[puente:222235] mca:base:select: Auto-selecting plm components
[puente:222235] mca:base:select:( plm) Querying component [slurm]
[puente:222235] mca:base:select:( plm) Querying component [ssh]
[puente:222235] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[puente:222235] mca:base:select:( plm) Query of component [ssh] set priority to 10
[puente:222235] mca:base:select:( plm) Selected component [ssh]
[puente:222235] mca: base: close: component slurm closed
[puente:222235] mca: base: close: unloading component slurm
[puente:222235] [prterun-puente-222235@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive start comm
[puente:222235] mca: base: component_find: searching NULL for rmaps components
[puente:222235] mca: base: find_dyn_components: checking NULL for rmaps components
[puente:222235] pmix:mca: base: components_register: registering framework rmaps components
[puente:222235] pmix:mca: base: components_register: found loaded component ppr
[puente:222235] pmix:mca: base: components_register: component ppr register function successful
[puente:222235] pmix:mca: base: components_register: found loaded component rank_file
[puente:222235] pmix:mca: base: components_register: component rank_file has no register or open function
[puente:222235] pmix:mca: base: components_register: found loaded component round_robin
[puente:222235] pmix:mca: base: components_register: component round_robin register function successful
[puente:222235] pmix:mca: base: components_register: found loaded component seq
[puente:222235] pmix:mca: base: components_register: component seq register function successful
[puente:222235] mca: base: components_open: opening rmaps components
[puente:222235] mca: base: components_open: found loaded component ppr
[puente:222235] mca: base: components_open: component ppr open function successful
[puente:222235] mca: base: components_open: found loaded component rank_file
[puente:222235] mca: base: components_open: found loaded component round_robin
[puente:222235] mca: base: components_open: component round_robin open function successful
[puente:222235] mca: base: components_open: found loaded component seq
[puente:222235] mca: base: components_open: component seq open function successful
[puente:222235] mca:rmaps:select: checking available component ppr
[puente:222235] mca:rmaps:select: Querying component [ppr]
[puente:222235] mca:rmaps:select: checking available component rank_file
[puente:222235] mca:rmaps:select: Querying component [rank_file]
[puente:222235] mca:rmaps:select: checking available component round_robin
[puente:222235] mca:rmaps:select: Querying component [round_robin]
[puente:222235] mca:rmaps:select: checking available component seq
[puente:222235] mca:rmaps:select: Querying component [seq]
[puente:222235] [prterun-puente-222235@0,0]: Final mapper priorities
[puente:222235] Mapper: rank_file Priority: 100
[puente:222235] Mapper: ppr Priority: 90
[puente:222235] Mapper: seq Priority: 60
[puente:222235] Mapper: round_robin Priority: 10
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_vm
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_vm creating map
[puente:222235] [prterun-puente-222235@0,0] setup:vm: working unmanaged allocation
[puente:222235] [prterun-puente-222235@0,0] using hostfile /home/rene/nccl-tests/hostfile
[puente:222235] [prterun-puente-222235@0,0] checking node 146.155.155.83
====================== ALLOCATED NODES ======================
[puente:222235] [prterun-puente-222235@0,0] ignoring myself
puente: slots=8 max_slots=0 slots_inuse=0 state=UP
[puente:222235] [prterun-puente-222235@0,0] checking node 146.155.155.84
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_vm add new daemon [prterun-puente-222235@0,1]
aliases: 146.155.155.83
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_vm assigning new daemon [prterun-puente-222235@0,1] to node 146.155.155.84
146.155.155.84: slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
Flags: SLOTS_GIVEN
aliases: NONE
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: launching vm
=================================================================
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: local shell: 0 (bash)
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: assuming same remote shell as local shell
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: remote shell: 0 (bash)
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: final template argv:
/usr/bin/ssh <template> PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-puente-222235@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-puente-222235@0.0;tcp://146.155.155.83:39027:28" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-puente-222235@0.0;tcp://146.155.155.83:39027:28"
[puente:222235] [prterun-puente-222235@0,0] plm:ssh:launch daemon 0 not a child of mine
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: adding node 146.155.155.84 to launch list
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: activating launch event
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: recording launch of daemon [prterun-puente-222235@0,1]
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh 146.155.155.84 PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-puente-222235@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-puente-222235@0.0;tcp://146.155.155.83:39027:28" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-puente-222235@0.0;tcp://146.155.155.83:39027:28"]
[puente:222235] [prterun-puente-222235@0,0] plm:base:orted_report_launch from daemon [prterun-puente-222235@0,1]
[puente:222235] [prterun-puente-222235@0,0] plm:base:orted_report_launch from daemon [prterun-puente-222235@0,1] on node kalila
[puente:222235] ALIASES FOR NODE kalila (kalila)
[puente:222235] ALIAS: 146.155.155.84
[puente:222235] [prterun-puente-222235@0,0] RECEIVED TOPOLOGY SIG 2N:2S:16L3:128L2:128L1:128C:255H:0-254:0-255:x86_64:le FROM NODE kalila
[puente:222235] [prterun-puente-222235@0,0] NEW TOPOLOGY - ADDING SIGNATURE
[puente:222235] [prterun-puente-222235@0,0] plm:base:orted_report_launch completed for daemon [prterun-puente-222235@0,1] at contact prterun-puente-222235@0.1;tcp://146.155.155.84:33749:28
[puente:222235] [prterun-puente-222235@0,0] plm:base:orted_report_launch job prterun-puente-222235@0 recvd 2 of 2 reported daemons
====================== ALLOCATED NODES ======================
puente: slots=8 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
aliases: 146.155.155.83
kalila: slots=8 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
aliases: 146.155.155.84
=================================================================
[puente:222235] [prterun-puente-222235@0,0] rmaps:base set policy with ppr:1:node
[puente:222235] [prterun-puente-222235@0,0] rmaps:base policy ppr modifiers 1:node provided
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive processing msg
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive job launch command from [prterun-puente-222235@0,0]
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive adding hosts
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive calling spawn
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive done processing commands
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_job
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_vm
[puente:222235] [prterun-puente-222235@0,0] plm_base:setup_vm NODE kalila WAS NOT ADDED
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_vm no new daemons required
[puente:222235] mca:rmaps: mapping job prterun-puente-222235@1
[puente:222235] mca:rmaps: setting mapping policies for job prterun-puente-222235@1 inherit TRUE hwtcpus FALSE
[puente:222235] setdefaultbinding[366] binding not given - using bycore
[puente:222235] mca:rmaps:rf: job prterun-puente-222235@1 not using rankfile policy
[puente:222235] mca:rmaps:ppr: mapping job prterun-puente-222235@1 with ppr 1:node
[puente:222235] mca:rmaps:ppr: job prterun-puente-222235@1 assigned policy BYNODE:SLOT
[puente:222235] [prterun-puente-222235@0,0] using hostfile /home/rene/nccl-tests/hostfile
[puente:222235] NODE puente DOESNT MATCH NODE 146.155.155.84
[puente:222235] [prterun-puente-222235@0,0] node puente has 8 slots available
====================== ALLOCATED NODES ======================
[puente:222235] [prterun-puente-222235@0,0] node kalila has 8 slots available
puente: slots=8 max_slots=0 slots_inuse=0 state=UP
[puente:222235] AVAILABLE NODES FOR MAPPING:
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
[puente:222235] node: puente daemon: 0 slots_available: 8
aliases: 146.155.155.83
[puente:222235] node: kalila daemon: 1 slots_available: 8
kalila: slots=8 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
aliases: 146.155.155.84
[puente:222235] [prterun-puente-222235@0,0] get_avail_ncpus: node puente has 0 procs on it
=================================================================
[puente:222235] mca:rmaps: compute bindings for job prterun-puente-222235@1 with policy CORE:IF-SUPPORTED[1007]
[puente:222235] mca:rmaps: bind [prterun-puente-222235@1,INVALID] with policy CORE:IF-SUPPORTED
[puente:222235] [prterun-puente-222235@0,0] BOUND PROC [prterun-puente-222235@1,INVALID][puente] TO package[0][core:0]
[puente:222235] [prterun-puente-222235@0,0] get_avail_ncpus: node kalila has 0 procs on it
[puente:222235] mca:rmaps: compute bindings for job prterun-puente-222235@1 with policy CORE:IF-SUPPORTED[1007]
[puente:222235] mca:rmaps: bind [prterun-puente-222235@1,INVALID] with policy CORE:IF-SUPPORTED
[puente:222235] [prterun-puente-222235@0,0] BOUND PROC [prterun-puente-222235@1,INVALID][kalila] TO package[0][core:0]
[puente:222235] [prterun-puente-222235@0,0] complete_setup on job prterun-puente-222235@1
[puente:222235] [prterun-puente-222235@0,0] plm:base:launch_apps for job prterun-puente-222235@1
[puente:222235] [prterun-puente-222235@0,0] plm:base:send launch msg for job prterun-puente-222235@1
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive processing msg
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive local launch complete command from [prterun-puente-222235@0,1]
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive got local launch complete for job prterun-puente-222235@1
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive got local launch complete for vpid 1
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive got local launch complete for vpid 1 state RUNNING
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive done processing commands
[puente:222235] [prterun-puente-222235@0,0] plm:base:launch wiring up iof for job prterun-puente-222235@1
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive processing msg
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive registered command from [prterun-puente-222235@0,1]
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive got registered for job prterun-puente-222235@1
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive got registered for vpid 1
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive done processing commands
[puente:222235] [prterun-puente-222235@0,0] plm:base:launch prterun-puente-222235@1 registered
Hello world from rank 1 out of 2 processors
I'm not really sure how to debug this further - I cannot reproduce this locally or on any of our other machines
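One way to narrow this down is to check which ranks actually reported. The log above shows only the rank 1 line, so rank 0 never printed. Below is a minimal sketch in Python, assuming the captured `mpirun` output is available as a string and that `hello_world` prints exactly the line format seen above. The helper name `missing_ranks` is hypothetical and is not part of Open MPI.

```python
import re

# Matches the line format printed by hello_world in the log above (assumed stable).
RANK_LINE = re.compile(r"Hello world from rank (\d+) out of (\d+) processors")


def missing_ranks(output: str) -> list[int]:
    """Return the ranks that never printed their hello line.

    The world size is taken from the "out of N" part of the lines that did
    appear, so an output with no hello lines at all yields an empty list.
    """
    seen = set()
    world_size = 0
    for match in RANK_LINE.finditer(output):
        seen.add(int(match.group(1)))
        world_size = max(world_size, int(match.group(2)))
    return [rank for rank in range(world_size) if rank not in seen]


# The only application line present in the captured output above.
sample = "Hello world from rank 1 out of 2 processors\n"
print(missing_ranks(sample))  # prints [0]
```

In this run the missing rank is 0, which the mapping lines above bind to puente, the node where `mpirun` was started.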
When I try to run the following command, the process hangs and does not finish.
The environment: I have two nodes with the following specs.
I can ssh between the nodes. I can ping between the nodes.
Output requested under "For problems launching MPI or OpenSHMEM applications":