Open HenrikBengtsson opened 1 year ago
@HenrikBengtsson This seems to run the `$cmd` serially on each of the hosts instead of in parallel.
Do you know of an easy way to run these in parallel? I tried running the `$cmd`
(which I replaced with my Python job) in the background, but it looks like it gets killed immediately (presumably when qrsh exits?).
> This seems to run the `$cmd` serially on each of the hosts instead of in parallel

Oh... yes, you're right.

> Do you know of an easy way to run these in parallel? ...

We can use standard shell tools for this, i.e. `&` and `wait`. Calling `somecmd &` will run `somecmd` in the background, and `wait` will wait for all such tasks to complete. Here is an updated version:
```sh
#!/usr/bin/env bash
#$ -S /bin/bash
#$ -cwd
#$ -j y

echo "Call: $0 ..."
echo "Script name: $(basename "${BASH_SOURCE[0]}")"
echo "Arguments: $*"
echo "PPID: ${PPID}"

module load CBI r
Rscript demo_pe_mpi_qrsh.R

#' Reads PE_HOSTFILE and returns an array of hostnames, where each
#' hostname is repeated the number of times given by the second column.
#' For example,
#'
#'   opt88 3 short.q@opt88 UNDEFINED
#'   iq242 2 short.q@iq242 UNDEFINED
#'   opt116 1 short.q@opt116 UNDEFINED
#'
#' returns the array (opt88 opt88 opt88 iq242 iq242 opt116)
read_pe_hostfile_expanded() {
  local -a hosts rows args
  local row kk
  [[ -n "$PE_HOSTFILE" ]] || { >&2 echo "ERROR: Environment variable 'PE_HOSTFILE' is not set"; exit 1; }
  [[ -f "$PE_HOSTFILE" ]] || { >&2 echo "ERROR: No such file: ${PE_HOSTFILE}"; exit 1; }

  ## Parse the PE_HOSTFILE file
  mapfile -t rows < "$PE_HOSTFILE"
  for row in "${rows[@]}"; do
    read -r -a args <<< "${row}"
    # shellcheck disable=SC2034
    for kk in $(seq "${args[1]}"); do
      hosts+=("${args[0]}")
    done
  done
  echo "${hosts[@]}"
}

read -r -a hosts < <(read_pe_hostfile_expanded)
#echo "hosts=${hosts[*]}"
#echo "nhosts=${#hosts[@]}"

cmd='echo "begin"; hostname; date; echo "done"'

echo "Launching ${#hosts[@]} parallel tasks ..."
echo " - task: $cmd"
for host in "${hosts[@]}"; do
  echo "- launch: qrsh -inherit -nostdin -V ${host} \"$cmd\" &"
  qrsh -inherit -nostdin -V "${host}" "$cmd" &
done
echo "Launching ${#hosts[@]} parallel tasks ... done"

## Wait for all tasks to complete
echo "Waiting for ${#hosts[@]} parallel tasks to complete ..."
wait
echo "Waiting for ${#hosts[@]} parallel tasks to complete ... done"

## End-of-job summary, if running as a job. This is useful for debugging
## and usage purposes, e.g. "did my job exceed its memory request?"
[[ -n "$JOB_ID" ]] && qstat -j "$JOB_ID"

echo "Call: $0 ... done"
```
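As a sanity check, the hostname-expansion logic can be exercised outside of SGE by running the same parsing against a mock `PE_HOSTFILE`. The file contents below are the example rows from the function's documentation; everything else (the temp file, variable names) is illustrative:

```sh
# Mock PE_HOSTFILE using the example rows above, so the expansion
# logic can be tested without an SGE allocation
hostfile=$(mktemp)
printf '%s\n' \
  'opt88 3 short.q@opt88 UNDEFINED' \
  'iq242 2 short.q@iq242 UNDEFINED' \
  'opt116 1 short.q@opt116 UNDEFINED' > "$hostfile"

# Same expansion as read_pe_hostfile_expanded(): repeat each hostname
# by its slot count (column 2)
hosts=()
while read -r name slots _; do
  for _ in $(seq "$slots"); do
    hosts+=("$name")
  done
done < "$hostfile"

echo "${hosts[@]}"   # opt88 opt88 opt88 iq242 iq242 opt116
rm -f "$hostfile"
```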
It's probably useful to put all that into a new shell function `qrsh_run` to make it neater. I'll do that next.

Here's the version with a `qrsh_run` function, to better clarify how it works:
```sh
#!/usr/bin/env bash
#$ -S /bin/bash
#$ -cwd
#$ -j y

#-----------------------------------------------------------------
# SGE utility functions
#-----------------------------------------------------------------
sge_debug() {
  ${SGE_DEBUG:-false} && >&2 echo "$@"
}

#' Reads PE_HOSTFILE and returns an array of hostnames, where each
#' hostname is repeated the number of times given by the second column.
#' For example,
#'
#'   opt88 3 short.q@opt88 UNDEFINED
#'   iq242 2 short.q@iq242 UNDEFINED
#'   opt116 1 short.q@opt116 UNDEFINED
#'
#' returns the array (opt88 opt88 opt88 iq242 iq242 opt116)
read_pe_hostfile_expanded() {
  local -a hosts rows args
  local row kk
  [[ -n "$PE_HOSTFILE" ]] || { >&2 echo "ERROR: Environment variable 'PE_HOSTFILE' is not set"; exit 1; }
  [[ -f "$PE_HOSTFILE" ]] || { >&2 echo "ERROR: No such file: ${PE_HOSTFILE}"; exit 1; }

  ## Parse the PE_HOSTFILE file
  mapfile -t rows < "$PE_HOSTFILE"
  for row in "${rows[@]}"; do
    read -r -a args <<< "${row}"
    # shellcheck disable=SC2034
    for kk in $(seq "${args[1]}"); do
      hosts+=("${args[0]}")
    done
  done
  echo "${hosts[@]}"
}

#' Calls a command on parallel workers allotted by SGE
#'
#' This function identifies the parallel workers that SGE has
#' given to the current job by parsing the file given by the
#' 'PE_HOSTFILE' environment variable. It then uses:
#'
#'   qrsh -inherit -nostdin -V <worker-hostname> <command>
#'
#' to launch the <command> on each parallel worker.
#'
#' Example:
#'   qrsh_run 'echo "begin"; hostname; date; echo "done"'
qrsh_run() {
  local -a hosts
  local host
  read -r -a hosts < <(read_pe_hostfile_expanded)

  ## Nothing to do?
  [[ ${#hosts[@]} -eq 0 ]] && return 0

  sge_debug "Launching ${#hosts[@]} parallel tasks ..."
  sge_debug " - task: $*"
  for host in "${hosts[@]}"; do
    sge_debug "- launch: qrsh -inherit -nostdin -V ${host} \"$*\" &"
    qrsh -inherit -nostdin -V "${host}" "$@" &
  done
  sge_debug "Launching ${#hosts[@]} parallel tasks ... done"

  ## Wait for all tasks to complete
  sge_debug "Waiting for ${#hosts[@]} parallel tasks to complete ..."
  wait
  sge_debug "Waiting for ${#hosts[@]} parallel tasks to complete ... done"
}

#-----------------------------------------------------------------
# Main script
#-----------------------------------------------------------------
echo "Call: $0 ..."
echo "Script name: $(basename "${BASH_SOURCE[0]}")"
echo "Arguments: $*"
echo "PPID: ${PPID}"

## Launch command on all parallel workers allotted by SGE
qrsh_run 'echo "begin"; hostname; date; echo "done"'

## Launch another set of parallel tasks after the above have completed
qrsh_run 'echo "begin 2nd round"; hostname; date; echo "done"'

## End-of-job summary, if running as a job. This is useful for debugging
## and usage purposes, e.g. "did my job exceed its memory request?"
[[ -n "$JOB_ID" ]] && qstat -j "$JOB_ID"

echo "Call: $0 ... done"
```
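The `&`/`wait` pattern that `qrsh_run` relies on can also be sketched without SGE. One refinement worth noting: a bare `wait` discards the tasks' exit statuses, whereas waiting on each recorded PID lets the caller detect failures. A minimal sketch, with `sleep` subshells standing in for the `qrsh -inherit ...` calls:

```sh
# Launch background tasks (stand-ins for 'qrsh -inherit ...' calls),
# recording each task's PID via $!
pids=()
for task in 1 2 3; do
  ( sleep 0.1; echo "task ${task} done" ) &
  pids+=("$!")
done

# Wait for each task individually, counting failures instead of
# silently ignoring them as a bare 'wait' would
failed=0
for pid in "${pids[@]}"; do
  wait "$pid" || failed=$((failed + 1))
done
echo "failed=${failed}"
```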
A user said in an email:

> Coincidentally, a few weeks ago, I figured out how to launch multi-host subprocesses using `qrsh` instead of `mpirun`. Here's an example - it would be nice to be able to simplify it more: