Is it possible that your job was preempted by a higher priority user?
Thanks for your reply!
How could that happen, as I am the only user running the pipeline in my own allocated space? What I don't understand is how a "PENDING" job can be considered "COMPLETED" by the NF engine, as shown in the log file.
Ah, well if you're the only user then it's definitely not preemption.
I should have taken a closer look at your log -- I see that the task logs generated by Nextflow are not being found. If you're using a shared filesystem then I suspect the issue is related to timestamps getting out of sync between the filesystem, the Nextflow job, and the task jobs. I've encountered similar errors while building code with make on a shared filesystem.
However I'll have to defer to others because I don't remember if there is any troubleshooting advice for Nextflow in this situation.
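If it really is filesystem latency, one setting that might be worth experimenting with is the executor's exitReadTimeout, which controls how long Nextflow waits for the task's .exitcode file before giving up. A minimal sketch for nextflow.config, with a purely illustrative value:
executor {
    exitReadTimeout = '30 min'   // default is 270 sec; raising it gives a slow shared filesystem more time to sync
}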
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@hukai916 Hi, I have run into the exact same problem today. Have you found a solution or identified the cause?
Sadly it is hard to replicate and I don't have a solution yet. However, I configure the pipeline to use "short" queues to minimize the pending time (our cluster prioritizes "short" queues), which seems to reduce the incidence.
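For reference, a minimal sketch of that workaround in nextflow.config (the queue name "short" is specific to our cluster; adjust it to whatever your scheduler calls its short / high-priority queue):
process {
    queue = 'short'   // cluster-specific queue name; less pending time seems to reduce the incidence
}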
Hello,
I am also having this problem. The bot marked this issue as 'stale', but since we are still having this problem, the issue is neither 'stale' nor 'completed'.
Similar issue over here.
I'm also having a similar issue, also on LSF. Is it possible to reopen this issue, or find some way to debug it? I also see that the command finishes successfully in .command.out.
Here is my log output:
Jan-20 07:33:45.479 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_RNASPLICE:RNASPLICE:ALIGN_STAR:STAR_ALIGN (H021-PUC89K_metastasis_2)'
Caused by:
Process `NFCORE_RNASPLICE:RNASPLICE:ALIGN_STAR:STAR_ALIGN (H021-PUC89K_metastasis_2)` terminated for an unknown reason -- Likely it has been terminated by the external system
Command executed:
STAR \
--genomeDir star \
--readFilesIn input1/H021-PUC89K_metastasis_2_1_val_1.fq.gz input2/H021-PUC89K_metastasis_2_2_val_2.fq.gz \
--runThreadN 12 \
--outFileNamePrefix H021-PUC89K_metastasis_2. \
\
--sjdbGTFfile gencode.v43.annotation.gtf \
--outSAMattrRGline 'ID:H021-PUC89K_metastasis_2' 'SM:H021-PUC89K_metastasis_2' \
--quantMode TranscriptomeSAM --twopassMode Basic --outSAMtype BAM Unsorted --readFilesCommand gunzip -c --runRNGseed 0 --outFilterMultimapNmax 20 --alignSJDBoverhangMin 1 --outSAMattributes NH HI AS NM MD --quantTranscriptomeBan Singleend
if [ -f H021-PUC89K_metastasis_2.Unmapped.out.mate1 ]; then
mv H021-PUC89K_metastasis_2.Unmapped.out.mate1 H021-PUC89K_metastasis_2.unmapped_1.fastq
gzip H021-PUC89K_metastasis_2.unmapped_1.fastq
fi
if [ -f H021-PUC89K_metastasis_2.Unmapped.out.mate2 ]; then
mv H021-PUC89K_metastasis_2.Unmapped.out.mate2 H021-PUC89K_metastasis_2.unmapped_2.fastq
gzip H021-PUC89K_metastasis_2.unmapped_2.fastq
fi
cat <<-END_VERSIONS > versions.yml
"NFCORE_RNASPLICE:RNASPLICE:ALIGN_STAR:STAR_ALIGN":
star: $(STAR --version | sed -e "s/STAR_//g")
samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
gawk: $(echo $(gawk --version 2>&1) | sed 's/^.*GNU Awk //; s/, .*$//')
END_VERSIONS
Command exit status:
-
I also notice that the job sometimes stays in the queue with STAT UNKWN.
The lack of an exit status suggests that the job was terminated abruptly by your scheduler. Usually schedulers at least give you an exit code if you exceeded your resource limits. Maybe the node was shut down? You'll have to inspect the job logs and status from the scheduler for any underlying cause. If you can find a root cause then we can try to handle it better in Nextflow; otherwise there isn't much that we can do.
I was having a nearly identical problem using Nextflow with Slurm. Most jobs were not starting and no error was being reported, except for a small number of instances where there was a 255 error code. The jobs that did start were often killed with no error reported.
We found that the Slurm configuration file was not synced across nodes. Updating the permissions and re-synchronizing the configuration file resolved the issue.
Same issue here with LSF. Our system admin says that LSF tries to kill jobs that exceed their resources using different strategies, and if the OS is not responsive enough, it eventually resorts to an aggressive kill: the exit code reported in the .command.log file is 9, suggesting the use of kill -9 by LSF, which doesn't leave Nextflow enough time to capture the exit code (no .exitcode file is produced).
Sadly this breaks our error-handling strategy (see here), which relies on interpreting the exit codes to retry with escalated resources where relevant, so we end up ignoring the failed jobs even though they could succeed if given more resources.
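For context, a simplified sketch of the kind of strategy I mean (not our exact config; the exit codes, retry count, and memory values are illustrative):
process {
    // retry on resource-related exit codes, escalating memory on each attempt
    errorStrategy = { task.exitStatus in [130, 137, 140] ? 'retry' : 'finish' }
    maxRetries    = 3
    memory        = { 4.GB * task.attempt }
}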
Our environment is this:
OS: Ubuntu 22.04 / Linux 5.15.0-94-generic #104-Ubuntu SMP x86_64 GNU/Linux
Scheduler: IBM Spectrum LSF Version 10.1 Fix Pack 14
Nextflow: 23.10.1-5891
Singularity: 3.11.4
Just expanding on my case, so that there is material for thinking about a solution:
I run a Nextflow pipeline that has sometimes reported something like this when the task's resource usage exceeds the requested amount:
[71/c2120a] NOTE: Process `ABUNDANCE_ESTIMATION:BOWTIE2SAMTOOLS (48888_2#90)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Error is ignored
On a test run, it seems that a little over half of the processes that failed for excess memory use report this error and are ignored for further processing, i.e. not retried.
The rest report the usual exit status 140 (SIGUSR2), which leads to the execution being retried under our error-handling strategy.
looking for the execution report of the failed process example above in the pipeline trace, I can see this:
$ head -1 results/pipeline_info/execution_trace_2024-05-20_12-36-38.txt && grep '71/c2120a' results/pipeline_info/execution_trace_2024-05-20_12-36-38.txt
task_id hash native_id name status exit submit duration realtime %cpu peak_rss peak_vmem rchar wchar
652 71/c2120a 100777 ABUNDANCE_ESTIMATION:BOWTIE2SAMTOOLS (48888_2#90) FAILED - 2024-05-20 12:41:33.456 12m 22s 11m 35s - - - - -
The exit code is reported as -, which I would suspect corresponds to a null value of the $exitStatus Groovy variable as seen by the master Nextflow process.
but if you go to the task work folder, you see this:
$ cd work/71/c2120acc53e89f36034211b95ce7a9/
$ ls -ltrA
total 198036
-rw-r--r-- 1 pathpipe rvidata 369 May 20 12:41 .command.sh
-rw-r--r-- 1 pathpipe rvidata 12058 May 20 12:41 .command.run
-rw-r--r-- 1 pathpipe rvidata 0 May 20 12:42 .command.trace
-rw-r--r-- 1 pathpipe rvidata 0 May 20 12:42 .command.out
-rw-r--r-- 1 pathpipe rvidata 0 May 20 12:42 .command.err
-rw-r--r-- 1 pathpipe rvidata 0 May 20 12:42 .command.begin
lrwxrwxrwx 1 pathpipe rvidata 175 May 20 12:42 48888_2#90_1_kneaddata_paired_2.fastq -> /lustre/scratch127/pam/projects/rvidata/bait_capture/rvi_prod_bait_dilution_exp-2024_05_16-48888_2/work/37/35a37bae527ea07e72417a491798c9/48888_2#90_1_kneaddata_paired_2.fastq
lrwxrwxrwx 1 pathpipe rvidata 175 May 20 12:42 48888_2#90_1_kneaddata_paired_1.fastq -> /lustre/scratch127/pam/projects/rvidata/bait_capture/rvi_prod_bait_dilution_exp-2024_05_16-48888_2/work/37/35a37bae527ea07e72417a491798c9/48888_2#90_1_kneaddata_paired_1.fastq
-rw-r--r-- 1 pathpipe rvidata 202753086 May 20 12:48 48888_2#90.sorted.bam.tmp.0000.bam
-rw-r--r-- 1 pathpipe rvidata 3825 May 20 12:49 .command.log
$ less .command.log
[... skipping to the bottom ...]
------------------------------------------------------------
TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with signal termination: 9.
Resource usage summary:
CPU time : 1528.00 sec.
Max Memory : 4131 MB
Average Memory : 2577.74 MB
Total Requested Memory : 4096.00 MB
Delta Memory : -35.00 MB
Max Swap : -
Max Processes : 24
Max Threads : 49
Run time : 446 sec.
Turnaround time : 450 sec.
The output (if any) is above this job summary.
The actual exit status of the job was 9, but that has not been written into an .exitcode file, which is missing.
Now the question is: what is the behaviour of the Nextflow master process in this situation? If the $exitStatus Groovy variable is indeed taking a value of null, this is something we could handle.
However, as you can see here, we actually already handle a null value for this variable, and that is not solving the issue.
Instead, from the message in the log, terminated for an unknown reason -- Likely it has been terminated by the external system -- Error is ignored, it sounds like the Nextflow master process is deciding on its own to ignore the error, bypassing the retry strategy directives from our config.
@bentsherman is there a way to make Nextflow use the actual exit code, or at least not force-ignore the error?
any suggestion would be most helpful.
many thanks,
Florent
update:
I found that when the terminated for an unknown reason error occurs, i.e. when no .exitcode file is produced, the value of the $task.exitStatus variable is not set to null but to 2147483647 (the largest possible value of a signed 32-bit integer).
I'll see if I can rely on this to salvage our retry strategy (and how robust this hack can be).
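For anyone wanting to try the same workaround, this is roughly what it looks like as a sketch (2147483647 is Integer.MAX_VALUE, which appears to be the sentinel Nextflow assigns when no .exitcode file is found; the other codes and retry count are illustrative):
process {
    // treat the "no .exitcode file" sentinel like a resource-related failure and retry
    errorStrategy = { (task.exitStatus == 2147483647 || task.exitStatus in [130, 137, 140]) ? 'retry' : 'finish' }
    maxRetries    = 3
}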
It's not solving the issue, which I now think is really something to do with LSF / our HPC, but at least I may have a workaround to keep our Nextflow pipelines afloat.
Seqera support confirms that:
In instances where the exit code file isn't generated, we utilize "Integer.MAX_VALUE" as an indicator that no exit code was returned.
If the Nextflow devs ever want to build a handling mechanism for this, I personally think the solution would be that, when the .exitcode file is not found and the executor is lsf, Nextflow could look at the bottom of the .command.log file for the job summary footer appended by LSF on closing the job, as it contains the actual exit code; this could be parsed and fed to $task.exitStatus, or perhaps more robustly used to create an .exitcode file.
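To make the idea concrete, here is a rough Groovy sketch of that parsing step. It is purely hypothetical, not an existing Nextflow function; the "Exited with signal termination" pattern comes from the job summary shown above, while the other two patterns are my assumption about the usual LSF footer wording:
// Hypothetical helper: derive an exit status from the LSF job summary appended
// to .command.log when no .exitcode file was written.
Integer exitStatusFromLsfLog(File commandLog) {
    if( !commandLog.exists() )
        return null
    def text = commandLog.text
    if( text.contains('Successfully completed.') )                // assumed wording for a clean exit
        return 0
    def sig = (text =~ /Exited with signal termination: (\d+)/)   // matches the footer shown above
    if( sig.find() )
        return 128 + (sig.group(1) as int)                        // e.g. signal 9 -> 137, the usual shell convention
    def code = (text =~ /Exited with exit code (\d+)/)            // assumed wording for non-signal failures
    if( code.find() )
        return code.group(1) as int
    return null                                                   // footer absent or unrecognized
}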
Bug report
Expected behavior and actual behavior
I expect the NF engine to correctly capture the exit status of jobs submitted to the LSF scheduler, but it occasionally fails to do so.
Steps to reproduce the problem
I didn't have this problem using the same workflow over the past few months, but started to encounter it recently. This problem is hard to reproduce because it happens sporadically. I was able to capture one such case, traced the .nextflow.log file, and found the following facts:
The job (jobId: 5535024) was submitted at May-03 09:10:36.762; it was still PENDING 10 minutes later at May-03 09:20:40.795, but was considered COMPLETED only a few milliseconds later at May-03 09:20:40.803, and the NF engine couldn't locate its output files because they had not been generated at that time. Therefore, NF generated an error, terminated all other processes, and exited.
May-03 09:20:40.795 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exit status for process TaskHandler[jobId: 5535024; id: 25; name: SCATACPIPE:PREPROCESS_DEFAULT:ADD_BARCODE_TO_READS (11); status: RUNNING; exit: -; error: -; workDir: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642 started: 1651583530790; exited: -; ] -- exitStatusReadTimeoutMillis: 270000; delta: 270000 Current queue status:
May-03 09:20:40.803 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 5535024; id: 25; name: SCATACPIPE:PREPROCESS_DEFAULT:ADD_BARCODE_TO_READS (11); status: COMPLETED; exit: -; error: -; workDir: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642 started: 1651583530790; exited: -; ]
May-03 09:20:40.820 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'null' -- Cause: java.nio.file.NoSuchFileException: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642/.command.out
May-03 09:20:40.822 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642/.command.err
May-03 09:20:40.823 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642/.command.log
May-03 09:20:40.825 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'SCATACPIPE:PREPROCESS_DEFAULT:ADD_BARCODE_TO_READS (11)'