I have created a minimal example that reproduces a persistent error causing pipeline crashes on SGE. The error is associated with the generated nxf_kill function in .command.run. I have attached two files that reproduce it consistently on my cluster: a test.nf file and a nextflow.config file. The error always points to line 43 of the script:
children[$PP]+=" $P"
in
nxf_kill() {
    declare -a children
    while read P PP;do
        children[$PP]+=" $P"
    done < <(ps -e -o pid= -o ppid=)
    kill_all() {
        [[ $1 != $$ ]] && kill $1 2>/dev/null || true
        for i in ${children[$1]:=}; do kill_all $i; done
    }
    kill_all $1
}
Expected behavior and actual behavior
The workflow consists of a single process that takes 15 seconds to complete (essentially a sleep 15 followed by the creation of a dummy file). I schedule 500 instances of this process with Nextflow, each with a time limit of '10s' * task.attempt. Expected: the first execution exceeds the limit and is retried (exit 140), and the task then completes on either the second or third attempt. Actual: a task occasionally exits with status 1 instead, which crashes the workflow.
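To make the expected retry behaviour concrete, here is a rough sketch of the time-limit arithmetic. It only compares the 15 second task duration with 10s * attempt; simulate_attempts is a made-up helper name, and real scheduler overhead is ignored.

```shell
#!/usr/bin/env bash
# simulate_attempts is a hypothetical helper: it only compares the task
# duration with limit = base * attempt and ignores scheduler overhead.
simulate_attempts() {
    local duration=$1 base=$2 attempt limit
    for attempt in 1 2 3; do
        limit=$((base * attempt))
        if (( duration > limit )); then
            echo "attempt $attempt: limit ${limit}s, exceeded"
        else
            echo "attempt $attempt: limit ${limit}s, completes"
            return 0
        fi
    done
    return 1
}

# 15 s task, 10 s base limit, as in the description above.
simulate_attempts 15 10
```

Under these assumptions the first attempt exceeds its limit and the second one fits, so an exit status of 1 is not part of the expected path.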
Steps to reproduce the problem
Use the latest Nextflow version 24.04.2.
Copy the two provided files into the same directory (any location works).
Run: nextflow run test.nf
Program output (.command.log content)
Signal 12 (USR2) caught by ps (procps-ng version 3.3.10)
/var/spool/gridscheduler/execd/node2d21/job_scripts/44417159: line 43: 1 0: syntax error in expression (error token is "0")
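The error message itself can be reproduced outside SGE. This is a minimal sketch, assuming that the interrupted ps produced a record with three fields. The sample input below is invented, not captured from the cluster. With read P PP, everything after the first field ends up in PP, and bash evaluates the subscript of an indexed array as an arithmetic expression, so a value like "1 0" is rejected.

```shell
#!/usr/bin/env bash
# Same read loop as nxf_kill, but fed a hand-written list instead of ps output.
# The second input line is an invented stand-in for a corrupted ps record.
build_children() {
    declare -a children
    while read P PP; do
        children[$PP]+=" $P"
    done < <(printf '%s\n' '100 1' '200 1 0')
    echo "children of 1:${children[1]}"
}

# Run in a subshell and capture stderr, so the failure cannot abort this script.
repro_output=$( (build_children) 2>&1 )
repro_status=$?
echo "$repro_output"
echo "exit status: $repro_status"
```

On my reading, the non-zero status here corresponds to the exit status 1 seen in the failing tasks, but that link is an inference from the log rather than something I have confirmed on the cluster.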
Environment
Nextflow version: 24.04.2
Java version: openjdk version "17.0.6" 2023-01-17 LTS
Operating system: Linux
Bash version: GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
I asked ChatGPT about the error. Sorry if this is completely off, but I am including its suggestion in case it helps:
nxf_kill() {
    declare -A children
    while read -r P PP; do
        # Check if P and PP are integers
        if [[ $P =~ ^[0-9]+$ && $PP =~ ^[0-9]+$ ]]; then
            children[$PP]+=" $P"
        fi
    done < <(ps -e -o pid= -o ppid=)
    kill_all() {
        local pid=$1
        if [[ $pid != $$ ]]; then
            kill "$pid" 2>/dev/null || true
        fi
        for child in ${children[$pid]:=}; do
            kill_all "$child"
        done
    }
    kill_all "$1"
}
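Use declare -A so that children is an associative array: its subscripts are then treated as plain strings instead of being evaluated as arithmetic expressions, which is where the "syntax error in expression" message comes from.
Check that P and PP are integers before assigning to the array, so that malformed lines of ps output (for example one that leaves PP set to "1 0") are skipped.
Use local for the pid variable in the kill_all function so that each recursive call keeps its own value.
Add the -r option to read so that backslashes are not interpreted as escape characters.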
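The effect of the integer check can be tried without ps. A small sketch, assuming the same kind of corrupted record as in the log above: list_children is a hypothetical helper (not part of Nextflow) that reads "pid ppid" records from stdin, and the sample input is invented.

```shell
#!/usr/bin/env bash
# Build the parent -> children map from stdin instead of ps, so the filter
# can be exercised with known input. list_children is a hypothetical name.
list_children() {
    local parent=$1 P PP
    declare -A children
    while read -r P PP; do
        # Skip any record whose fields are not plain integers.
        if [[ $P =~ ^[0-9]+$ && $PP =~ ^[0-9]+$ ]]; then
            children[$PP]+=" $P"
        fi
    done
    echo "${children[$parent]:-}"
}

# Invented sample input: two valid records around one corrupted record.
result=$(printf '%s\n' '100 1' '200 1 0' '300 1' | list_children 1)
echo "children of 1:$result"
```

The corrupted record is dropped, so its process would not be killed by kill_all. Whether that trade-off is acceptable is a design question for the maintainers.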
Additional context
nextflow_issue.zip