nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Error in nxf_kill #5078

Open olivierlabayle opened 1 week ago

olivierlabayle commented 1 week ago

Bug report

I have put together a minimal example of a persistent error that crashes pipelines on SGE. It is associated with the nxf_kill function that Nextflow generates in .command.run. Two attached files, test.nf and nextflow.config, reproduce it consistently on my cluster. The error always points to line 43 of the script:

children[$PP]+=" $P"

in

nxf_kill() {
    declare -a children
    while read P PP;do
        children[$PP]+=" $P"
    done < <(ps -e -o pid= -o ppid=)

    kill_all() {
        [[ $1 != $$ ]] && kill $1 2>/dev/null || true
        for i in ${children[$1]:=}; do kill_all $i; done
    }

    kill_all $1
}

Expected behavior and actual behavior

The workflow consists of a single process that takes 15 seconds to complete (basically a sleep 15 and the creation of a dummy file). I schedule 500 of these processes with Nextflow, with a time limit of '10s' * task.attempt. This limit should cause a retry (exit 140) on the first execution of each process, which should then complete on either the second or third attempt. Instead, an exit status 1 is occasionally thrown, which crashes the workflow.

Steps to reproduce the problem

Program output (.command.log content)

Signal 12 (USR2) caught by ps (procps-ng version 3.3.10)
/var/spool/gridscheduler/execd/node2d21/job_scripts/44417159: line 43: 1 0: syntax error in expression (error token is "0")
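
One possible reading of this log, not confirmed anywhere in the report: a line reaching the read loop had more than two fields, so read assigned the remainder "1 0" to PP, and bash then failed to evaluate "1 0" as an arithmetic array subscript. The sketch below reproduces that message in isolation. build_children is a hypothetical helper, not part of Nextflow; it runs the same loop as nxf_kill but reads from stdin instead of ps so the input can be controlled.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Nextflow): the same read loop as nxf_kill,
# fed from stdin instead of `ps -e -o pid= -o ppid=`.
build_children() {
    declare -a children
    while read P PP; do
        children[$PP]+=" $P"
    done
    echo "children of 1:${children[1]}"
}

# Two fields per line: PP is a plain integer, so the subscript is valid.
good=$(printf '10 1\n11 1\n' | build_children 2>&1)
echo "$good"

# Three fields: read stores the remainder "1 0" in PP, and bash then tries
# to evaluate "1 0" as an arithmetic expression for the array index.
bad=$(printf '10 1 0\n' | build_children 2>&1) || true
echo "$bad"
```

The first call prints the child list for parent 1. The second should fail with the same "syntax error in expression" message as in the log, which is at least consistent with malformed ps output being the trigger.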

Environment

Additional context

nextflow_issue.zip

I asked ChatGPT about the error. Sorry if this is completely stupid, but I am including its suggestion in case it helps:

nxf_kill() {
    declare -A children

    while read -r P PP; do
        # Check if P and PP are integers
        if [[ $P =~ ^[0-9]+$ && $PP =~ ^[0-9]+$ ]]; then
            children[$PP]+=" $P"
        fi
    done < <(ps -e -o pid= -o ppid=)

    kill_all() {
        local pid=$1
        if [[ $pid != $$ ]]; then
            kill "$pid" 2>/dev/null || true
        fi
        for child in ${children[$pid]:=}; do
            kill_all "$child"
        done
    }

    kill_all "$1"
}
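
To check what the integer guard in this suggestion actually does, here is a minimal sketch that feeds the guarded loop a mix of valid and malformed lines. It assumes the malformed lines look like the ones hinted at in the log, which is a guess. build_children_guarded is a hypothetical helper that reads stdin instead of ps, and it uses an indexed array (declare -a) as in the original function, so it does not depend on bash 4 associative arrays.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the guarded loop from the suggestion above,
# reading stdin instead of `ps -e -o pid= -o ppid=`.
build_children_guarded() {
    declare -a children
    while read -r P PP; do
        # Keep only lines where both fields are plain integers.
        if [[ $P =~ ^[0-9]+$ && $PP =~ ^[0-9]+$ ]]; then
            children[$PP]+=" $P"
        fi
    done
    echo "children of 1:${children[1]:-}"
}

# Mix of valid lines, a stray message line, and a three-field line.
result=$(printf '10 1\nSignal 12 (USR2) caught by ps\n11 1 0\n12 1\n' \
    | build_children_guarded 2>&1)
echo "$result"
```

With this input the two malformed lines are skipped, and only PIDs 10 and 12 are recorded under parent 1. The guard avoids the arithmetic error, but it also silently drops any process whose line was corrupted, so those processes would not be killed.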