nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

K8s error message on process error #687

Closed: wikiselev closed this issue 6 years ago

wikiselev commented 6 years ago

Hi Paolo, here is another nextflow.log file on K8s showing a case where the process had enough resources but still failed. Not sure what the reason was though... Here are all the files from the work directory:

.command.sh:

#!/bin/bash -euo pipefail
picard MarkDuplicates \
    INPUT=22028_2#118_1Aligned.sortedByCoord.out.bam \
    OUTPUT=22028_2#118_1Aligned.sortedByCoord.out.markDups.bam \
    METRICS_FILE=22028_2#118_1Aligned.sortedByCoord.out.markDups_metrics.txt \
    REMOVE_DUPLICATES=false \
    ASSUME_SORTED=true \
    PROGRAM_RECORD_ID='null' \
    VALIDATION_STRINGENCY=LENIENT
samtools index 22028_2#118_1Aligned.sortedByCoord.out.markDups.bam

.command.run:

#!/bin/bash
# NEXTFLOW TASK: markDuplicates (22028_2#118_1AlignedByCoord.out)
set -e
set -u
NXF_DEBUG=${NXF_DEBUG:=0}; [[ $NXF_DEBUG > 1 ]] && set -x

nxf_env() {
    echo '============= task environment ============='
    env | sort | sed "s/\(.*\)AWS\(.*\)=\(.\{6\}\).*/\1AWS\2=\3xxxxxxxxxxxxx/"
    echo '============= task output =================='
}

nxf_kill() {
    declare -a ALL_CHILD
    while read P PP;do
        ALL_CHILD[$PP]+=" $P"
    done < <(ps -e -o pid= -o ppid=)

    walk() {
        [[ $1 != $$ ]] && kill $1 2>/dev/null || true
        for i in ${ALL_CHILD[$1]:=}; do walk $i; done
    }

    walk $1
}

nxf_mktemp() {
    local base=${1:-/tmp}
    if [[ $base == /dev/shm && ! -d $base ]]; then base=/tmp; fi
    if [[ $(uname) = Darwin ]]; then mktemp -d $base/nxf.XXXXXXXXXX
    else TMPDIR="$base" mktemp -d -t nxf.XXXXXXXXXX
    fi
}

on_exit() {
  exit_status=${ret:=$?}
  printf $exit_status > /mnt/gluster/ubuntu/work/eb/853c8010b8e173b23d8d15489d1a31/.exitcode
  set +u
  [[ "$tee1" ]] && kill $tee1 2>/dev/null
  [[ "$tee2" ]] && kill $tee2 2>/dev/null
  [[ "$ctmp" ]] && rm -rf $ctmp || true
  exit $exit_status
}

on_term() {
    set +e
    [[ "$pid" ]] && nxf_kill $pid
}

trap on_exit EXIT
trap on_term TERM INT USR1 USR2

NXF_SCRATCH=''
[[ $NXF_DEBUG > 0 ]] && nxf_env
touch /mnt/gluster/ubuntu/work/eb/853c8010b8e173b23d8d15489d1a31/.command.begin
# task environment
export PATH="/mnt/gluster/projects/wikiselev/rnaseq/bin:$PATH"

[[ $NXF_SCRATCH ]] && echo "nxf-scratch-dir $HOSTNAME:$NXF_SCRATCH" && cd $NXF_SCRATCH
# stage input files
rm -f 22028_2#118_1Aligned.sortedByCoord.out.bam
ln -s /mnt/gluster/ubuntu/work/78/4152dc659102abb54fbf8deb0274c1/22028_2#118_1Aligned.sortedByCoord.out.bam 22028_2#118_1Aligned.sortedByCoord.out.bam

set +e
ctmp=$(nxf_mktemp /dev/shm)
cout=$ctmp/.command.out; mkfifo $cout
cerr=$ctmp/.command.err; mkfifo $cerr
tee .command.out < $cout &
tee1=$!
tee .command.err < $cerr >&2 &
tee2=$!
(
/bin/bash /mnt/gluster/ubuntu/work/eb/853c8010b8e173b23d8d15489d1a31/.command.stub
) >$cout 2>$cerr &
pid=$!
wait $pid || ret=$?
wait $tee1 $tee2

.command.stub:

#!/bin/bash
set -e
set -u
NXF_DEBUG=${NXF_DEBUG:=0}; [[ $NXF_DEBUG > 2 ]] && set -x

nxf_kill() {
    declare -a ALL_CHILD
    while read P PP;do
        ALL_CHILD[$PP]+=" $P"
    done < <(ps -e -o pid= -o ppid=)

    walk() {
        [[ $1 != $$ ]] && kill $1 2>/dev/null || true
        for i in ${ALL_CHILD[$1]:=}; do walk $i; done
    }

    walk $1
}

nxf_tree() {
    declare -a ALL_CHILD
    while read P PP;do
        ALL_CHILD[$PP]+=" $P"
    done < <(ps -e -o pid= -o ppid=)

    stat() {
        local x_ps=$(ps -o pid= -o state= -o pcpu= -o pmem= -o vsz= -o rss= $1)
        local x_io=$(cat /proc/$1/io 2> /dev/null | sed 's/^.*:\s*//' | tr '\n' ' ')
        local x_vm=$(cat /proc/$1/status 2> /dev/null | egrep 'VmPeak|VmHWM' | sed 's/^.*:\s*//' | sed 's/[\sa-zA-Z]*$//' | tr '\n' ' ')
        [[ ! $x_ps ]] && return 0

        printf "$x_ps"
        if [[ $x_vm ]]; then printf " $x_vm"; else printf " 0 0"; fi
        if [[ $x_io ]]; then printf " $x_io"; fi
        printf "\n"
    }

    walk() {
        stat $1
        for i in ${ALL_CHILD[$1]:=}; do walk $i; done
    }

    walk $1
}

nxf_pstat() {
    local data=$(nxf_tree $1)
    local tot=''
    if [[ "$data" ]]; then
      tot=$(awk '{ t3+=($3*10); t4+=($4*10); t5+=$5; t6+=$6; t7+=$7; t8+=$8; t9+=$9; t10+=$10; t11+=$11; t12+=$12; t13+=$13; t14+=$14 } END { printf "%d 0 %.0f %.0f %.0f %.0f %.0f %.0f %.0f %.0f %.0f %.0f %.0f %.0f\n", NR,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14 }' <<< "$data")
      printf "$tot\n" || true
    fi
}

nxf_sleep() {
  if [[ $1 < 0 ]]; then sleep 5;
  elif [[ $1 < 10 ]]; then sleep 0.1 2>/dev/null || sleep 1;
  elif [[ $1 < 130 ]]; then sleep 1;
  else sleep 5; fi
}

nxf_date() {
    local ts=$(date +%s%3N); [[ $ts == *3N ]] && date +%s000 || echo $ts
}

nxf_trace() {
  local pid=$1; local trg=$2;
  local tot;
  local count=0;
  declare -a max=(); for i in {0..13}; do max[i]=0; done
  while [[ true ]]; do
    tot=$(nxf_pstat $pid)
    [[ ! $tot ]] && break
    IFS=' ' read -a val <<< "$tot"; unset IFS
    for i in {0..13}; do
      [ ${val[i]} -gt ${max[i]} ] && max[i]=${val[i]}
    done
    echo "pid state %cpu %mem vmem rss peak_vmem peak_rss rchar wchar syscr syscw read_bytes write_bytes" > $trg
    echo "${max[@]}" >> $trg
    nxf_sleep $count
    count=$((count+1))
  done
}

trap 'exit ${ret:=$?}' EXIT
touch .command.trace
start_millis=$(nxf_date)
(
/bin/bash -euo pipefail /mnt/gluster/ubuntu/work/eb/853c8010b8e173b23d8d15489d1a31/.command.sh
) &
pid=$!
nxf_trace "$pid" .command.trace &
mon=$!
wait $pid || ret=$?
end_millis=$(nxf_date)
nxf_kill $mon || wait $mon
echo $((end_millis-start_millis)) >> .command.trace

No more files there...

pditommaso commented 6 years ago

In the log file I only see these lines:

May-04 06:05:04.091 [main] DEBUG nextflow.cli.Launcher - $> ./nextflow kuberun wikiselev/rnaseq --genome GRCh37 --reads /mnt/gluster/data/*{1,2}.fastq.gz -v testpvc:/mnt/gluster/
May-04 06:05:04.306 [main] DEBUG nextflow.scm.RepositoryProvider - Request [credentials -:-] -> https://api.github.com/repos/wikiselev/rnaseq/contents/nextflow.config
May-04 06:05:05.174 [main] DEBUG nextflow.scm.RepositoryProvider - Request [credentials -:-] -> https://api.github.com/repos/wikiselev/rnaseq/contents/main.nf
May-04 06:05:05.664 [main] DEBUG nextflow.k8s.K8sDriverLauncher - Kubernetes workDir=/mnt/gluster/ubuntu/work; projectDir=/mnt/gluster/projects; volumeClaims=[testpvc:[mountPath:/mnt/gluster]]

Is that right?

wikiselev commented 6 years ago

Sorry Paolo, I've updated the link above.

pditommaso commented 6 years ago

This looks similar to #676. Here the response is:

 {
     "kind": "Pod",
     "apiVersion": "v1",
     "metadata": {
         "name": "nf-eb853c8010b8e173b23d8d15489d1a31",
         "namespace": "default",
         "selfLink": "/api/v1/namespaces/default/pods/nf-eb853c8010b8e173b23d8d15489d1a31/status",
         "uid": "5ccdeb23-4f69-11e8-89b1-fa163e31bb09",
         "resourceVersion": "2847182",
         "creationTimestamp": "2018-05-04T07:04:18Z",
         "labels": {
             "app": "nextflow",
             "processName": "markDuplicates",
             "runName": "grave-jones",
             "sessionId": "uuid-f51cd941-c21c-447b-86ca-eaebafa5ad9b",
             "taskName": "markDuplicates_22028_2_118_1AlignedByCoord.out"
         }
     },
     "spec": {
         "volumes": [
             {
                 "name": "vol-11",
                 "persistentVolumeClaim": {
                     "claimName": "testpvc"
                 }
             },
             {
                 "name": "default-token-29mnj",
                 "secret": {
                     "secretName": "default-token-29mnj",
                     "defaultMode": 420
                 }
             }
         ],
         "containers": [
             {
                 "name": "nf-eb853c8010b8e173b23d8d15489d1a31",
                 "image": "nfcore/rnaseq:latest",
                 "command": [
                     "/bin/bash",
                     "-ue",
                     ".command.run"
                 ],
                 "workingDir": "/mnt/gluster/ubuntu/work/eb/853c8010b8e173b23d8d15489d1a31",
                 "resources": {
                     "limits": {
                         "cpu": "2",
                         "memory": "16Gi"
                     },
                     "requests": {
                         "cpu": "2",
                         "memory": "16Gi"
                     }
                 },
                 "volumeMounts": [
                     {
                         "name": "vol-11",
                         "mountPath": "/mnt/gluster"
                     },
                     {
                         "name": "default-token-29mnj",
                         "readOnly": true,
                         "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount"
                     }
                 ],
                 "terminationMessagePath": "/dev/termination-log",
                 "terminationMessagePolicy": "File",
                 "imagePullPolicy": "Always"
             }
         ],
         "restartPolicy": "Never",
         "terminationGracePeriodSeconds": 30,
         "dnsPolicy": "ClusterFirst",
         "serviceAccountName": "default",
         "serviceAccount": "default",
         "nodeName": "vlad-k8s-test-k8s-node-nf-3",
         "securityContext": {

         },
         "schedulerName": "default-scheduler"
     },
     "status": {
         "phase": "Pending",
         "conditions": [
             {
                 "type": "PodScheduled",
                 "status": "True",
                 "lastProbeTime": null,
                 "lastTransitionTime": "2018-05-04T07:04:18Z"
             }
         ],
         "qosClass": "Guaranteed"
     }
 }

You need to investigate why this pod is not running. What's the output of the following command?

kubectl describe pod nf-eb853c8010b8e173b23d8d15489d1a31
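
If the pod stays in Pending, the Events section of that describe output usually explains why (scheduling, image pull, or volume mount problems). A couple of other standard kubectl commands that may help with the same pod (a hedged sketch, using the pod name from the response above):

kubectl get pod nf-eb853c8010b8e173b23d8d15489d1a31 -o yaml
kubectl logs nf-eb853c8010b8e173b23d8d15489d1a31
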
pditommaso commented 6 years ago

Any feedback on this?

wikiselev commented 6 years ago

Unfortunately, the pod got deleted by NF, so I need to do more investigation. Is there any way to prevent NF from deleting the slave pods?

pditommaso commented 6 years ago

Yes, put this in the config file:

k8s {
  cleanup = false
}

But you will need to use a snapshot I've just uploaded; prefix the NF command line with the NXF_VER=0.29.1-SNAPSHOT variable as shown below:

NXF_VER=0.29.1-SNAPSHOT nextflow kuberun .. etc
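
If it is more convenient, the version can also be exported once for the shell session instead of prefixing every command (plain bash, nothing Nextflow-specific; the kuberun arguments below are simply the ones from the earlier run):

export NXF_VER=0.29.1-SNAPSHOT
./nextflow kuberun wikiselev/rnaseq --genome GRCh37 --reads '/mnt/gluster/data/*_{1,2}.fastq.gz' -v testpvc:/mnt/gluster
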
pditommaso commented 6 years ago

Any feedback regarding the latest snapshot?

wikiselev commented 6 years ago

Sorry, long weekend, working on it now..

wikiselev commented 6 years ago

Hi Paolo, I've added to config:

k8s {
  cleanup = false
}

But when I run it with that specific version, everything fails right at the very beginning with exit status 127:

ubuntu@k8s-master:~/kubespray$ NXF_VER=0.29.1-SNAPSHOT ./nextflow kuberun wikiselev/rnaseq --genome GRCh37 --reads '/mnt/gluster/data/*_{1,2}.fastq.gz' -v testpvc:/mnt/gluster
Pod started: deadly-stone
N E X T F L O W  ~  version 0.29.1-SNAPSHOT
Launching `wikiselev/rnaseq` [deadly-stone] - revision: b76ef8d373 [master]
===================================
 nfcore/rnaseq  ~  version 1.5dev
===================================
Run Name       : deadly-stone
Reads          : /mnt/gluster/data/*_{1,2}.fastq.gz
Data Type      : Paired-End
Genome         : GRCh37
Strandedness   : None
Trim R1        : 0
Trim R2        : 0
Trim 3' R1     : 0
Trim 3' R2     : 0
Aligner        : STAR
STAR Index     : s3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/
GTF Annotation : s3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf
BED Annotation : s3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.bed
Save Reference : No
Save Trimmed   : No
Save Intermeds : No
Max Memory     : 60 GB
Max CPUs       : 7
Max Time       : 10d
Output dir     : ./results
Working dir    : /mnt/gluster/ubuntu/work
Container      : nfcore/rnaseq:latest
Pipeline Release: master
Current home   : /root
Current path   : /mnt/gluster/ubuntu
R libraries    : false
Script dir     : /mnt/gluster/projects/wikiselev/rnaseq
Config Profile : standard
=========================================
[warm up] executor > k8s
[warm up] executor > local
[ba/0eaa12] Submitted process > workflow_summary_mqc
[13/3964fb] Submitted process > trim_galore (22028_2#11)
[20/e65b2c] Submitted process > get_software_versions
[43/b99e2a] Submitted process > trim_galore (22028_2#111)
[83/a28882] Submitted process > fastqc (22028_2#11)
[d2/2442b3] Submitted process > trim_galore (22028_2#118)
[20/e65b2c] NOTE: Process `get_software_versions` terminated with an error exit status (127) -- Error is ignored
[53/c7ca4b] Submitted process > fastqc (22028_2#111)
[1d/32a39e] Submitted process > fastqc (22028_2#118)
ERROR ~ Error executing process > 'trim_galore (22028_2#11)'

Caused by:
  Process `trim_galore (22028_2#11)` terminated with an error exit status (127)

Command executed:

  trim_galore --paired --fastqc --gzip     22028_2#11_1.fastq.gz 22028_2#11_2.fastq.gz

Command exit status:
  127

Command output:
  (empty)

Work dir:
  /mnt/gluster/ubuntu/work/13/3964fb4406d8a260537506329b7b28

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details
[83/a28882] NOTE: Process `fastqc (22028_2#11)` terminated with an error exit status (127) -- Error is ignored
[53/c7ca4b] NOTE: Process `fastqc (22028_2#111)` terminated with an error exit status (127) -- Error is ignored
[1d/32a39e] NOTE: Process `fastqc (22028_2#118)` terminated with an error exit status (127) -- Error is ignored
[nfcore/rnaseq] Pipeline Complete
WARN: To render the execution DAG in the required format it is required to install Graphviz -- See http://www.graphviz.org for more info.
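
Exit status 127 from bash means "command not found", i.e. something the task (or its wrapper) tried to execute was not on the PATH inside the container. One way to rule out a missing tool in the image itself (a hedged check, assuming Docker is available on a node; the image name is the one from the pod spec earlier in this thread):

docker run --rm nfcore/rnaseq:latest bash -c 'which trim_galore fastqc'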

If I run it without specifying the NF version, it works fine and starts doing its work:

ubuntu@k8s-master:~/kubespray$ ./nextflow kuberun wikiselev/rnaseq --genome GRCh37 --reads '/mnt/gluster/data/*_{1,2}.fastq.gz' -v testpvc:/mnt/gluster
Pod started: romantic-stone
N E X T F L O W  ~  version 0.28.0
Launching `wikiselev/rnaseq` [romantic-stone] - revision: b76ef8d373 [master]
===================================
 nfcore/rnaseq  ~  version 1.5dev
===================================
Run Name       : romantic-stone
Reads          : /mnt/gluster/data/*_{1,2}.fastq.gz
Data Type      : Paired-End
Genome         : GRCh37
Strandedness   : None
Trim R1        : 0
Trim R2        : 0
Trim 3' R1     : 0
Trim 3' R2     : 0
Aligner        : STAR
STAR Index     : s3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/
GTF Annotation : s3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf
BED Annotation : s3://ngi-igenomes/igenomes//Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.bed
Save Reference : No
Save Trimmed   : No
Save Intermeds : No
Max Memory     : 60 GB
Max CPUs       : 7
Max Time       : 10d
Output dir     : ./results
Working dir    : /mnt/gluster/ubuntu/work
Container      : nfcore/rnaseq:latest
Pipeline Release: master
Current home   : /root
Current path   : /mnt/gluster/ubuntu
R libraries    : false
Script dir     : /mnt/gluster/projects/wikiselev/rnaseq
Config Profile : standard
=========================================
[warm up] executor > k8s
[warm up] executor > local
[04/1888b3] Submitted process > fastqc (22028_2#118)
[c2/98b3dd] Submitted process > workflow_summary_mqc
[cc/268c9b] Submitted process > fastqc (22028_2#111)
[27/e91060] Submitted process > fastqc (22028_2#11)
[83/400e59] Submitted process > trim_galore (22028_2#118)
[32/44cc09] Submitted process > trim_galore (22028_2#11)
[4a/37c4c2] Submitted process > trim_galore (22028_2#111)
[7d/0b971f] Submitted process > get_software_versions
[7d/0b971f] NOTE: Process `get_software_versions` terminated with an error exit status (127) -- Error is ignored
pditommaso commented 6 years ago

I think it was caused by a broken build, therefore I'm closing this issue.

pditommaso commented 6 years ago

I'm re-opening this because it was actually reporting a strange error that seems to have been caused by the container being in a Pending state:

     "status": {
         "phase": "Pending",
         "conditions": [
             {
                 "type": "PodScheduled",
                 "status": "True",
                 "lastProbeTime": null,
                 "lastTransitionTime": "2018-05-04T07:04:18Z"
             }
         ],
         "qosClass": "Guaranteed"
     }
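
A quick way to spot task pods stuck in this state while a run is in progress (a hedged sketch; the app=nextflow label comes from the pod metadata shown earlier in this thread):

kubectl get pods -l app=nextflow --field-selector=status.phase=Pending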

Is this still happening with version 0.30.0-SNAPSHOT ?

wikiselev commented 6 years ago

Rerunning now with 0.30.0-SNAPSHOT...

wikiselev commented 6 years ago

Hi Paolo, I just reran it and it failed, but there were no pod issues anymore, only pipeline-specific issues. Actually, I just realised that I get all sorts of strange errors because I was trying to align mouse data to the human genome ;-) I downloaded these data from the Human Cell Atlas and didn't check that they were mouse! So, I think you can close this one for now. We are going to run human data on the human genome using K8s secrets (#651) in half an hour; I will update you on the progress.

pditommaso commented 6 years ago

👍