Is it possible that your job was preempted by a higher priority user?
Thanks for your reply!
How could that happen, as I am the only user running the pipeline in my own allocated space? What I don't understand is how a "PENDING" job can be considered "COMPLETED" by the NF engine, as shown in the log file.
Ah, well if you're the only user then it's definitely not preemption.
I should have taken a closer look at your log -- I see that the task logs generated by Nextflow are not being found. If you're using a shared filesystem then I suspect the issue is related to timestamps getting out of sync between the filesystem, the Nextflow job, and the task jobs. I've encountered similar errors while building code with make on a shared filesystem.
However I'll have to defer to others because I don't remember if there is any troubleshooting advice for Nextflow in this situation.
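If it really is filesystem latency, one setting that might be worth experimenting with is the executor's exitReadTimeout, which controls how long Nextflow waits for the task's .exitcode file before giving up. A minimal sketch for nextflow.config, with a purely illustrative value:
executor {
    exitReadTimeout = '30 min'   // default is 270 sec; raising it gives a slow shared filesystem more time to sync
}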
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@hukai916 Hi, I have run into the exact same problem today. Have you found a solution or identified the cause?
Sadly it is hard to replicate and I don't have a solution yet. However, I configure the pipeline to use "short" queues to minimize the pending time (our cluster prioritizes "short" queues), which seems to reduce the incidence.
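For reference, a minimal sketch of that workaround in nextflow.config (the queue name "short" is specific to our cluster; adjust it to whatever your scheduler calls its short / high-priority queue):
process {
    queue = 'short'   // cluster-specific queue name; less pending time seems to reduce the incidence
}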
Hello,
I am also having this problem. The bot marked this issue as 'stale', but since we are still having this problem, the issue is neither 'stale' nor 'completed'.
Similar issue over here.
I'm also having a similar issue, also on LSF. Is it possible to reopen this issue, or find some way to debug it? I also see that the command finishes successfully in .command.out.
Here is my log output:
Jan-20 07:33:45.479 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_RNASPLICE:RNASPLICE:ALIGN_STAR:STAR_ALIGN (H021-PUC89K_metastasis_2)'
Caused by:
Process `NFCORE_RNASPLICE:RNASPLICE:ALIGN_STAR:STAR_ALIGN (H021-PUC89K_metastasis_2)` terminated for an unknown reason -- Likely it has been terminated by the external system
Command executed:
STAR \
--genomeDir star \
--readFilesIn input1/H021-PUC89K_metastasis_2_1_val_1.fq.gz input2/H021-PUC89K_metastasis_2_2_val_2.fq.gz \
--runThreadN 12 \
--outFileNamePrefix H021-PUC89K_metastasis_2. \
\
--sjdbGTFfile gencode.v43.annotation.gtf \
--outSAMattrRGline 'ID:H021-PUC89K_metastasis_2' 'SM:H021-PUC89K_metastasis_2' \
--quantMode TranscriptomeSAM --twopassMode Basic --outSAMtype BAM Unsorted --readFilesCommand gunzip -c --runRNGseed 0 --outFilterMultimapNmax 20 --alignSJDBoverhangMin 1 --outSAMattributes NH HI AS NM MD --quantTranscriptomeBan Singleend
if [ -f H021-PUC89K_metastasis_2.Unmapped.out.mate1 ]; then
mv H021-PUC89K_metastasis_2.Unmapped.out.mate1 H021-PUC89K_metastasis_2.unmapped_1.fastq
gzip H021-PUC89K_metastasis_2.unmapped_1.fastq
fi
if [ -f H021-PUC89K_metastasis_2.Unmapped.out.mate2 ]; then
mv H021-PUC89K_metastasis_2.Unmapped.out.mate2 H021-PUC89K_metastasis_2.unmapped_2.fastq
gzip H021-PUC89K_metastasis_2.unmapped_2.fastq
fi
cat <<-END_VERSIONS > versions.yml
"NFCORE_RNASPLICE:RNASPLICE:ALIGN_STAR:STAR_ALIGN":
star: $(STAR --version | sed -e "s/STAR_//g")
samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
gawk: $(echo $(gawk --version 2>&1) | sed 's/^.*GNU Awk //; s/, .*$//')
END_VERSIONS
Command exit status:
-
I also notice that the job sometimes stays in the queue with STAT UNKWN.
The lack of an exit status suggests that the job was terminated abruptly by your scheduler. Usually schedulers at least give you an exit code if you exceeded your resource limits. Maybe the node was shut down? You'll have to inspect the job logs and status from the scheduler for any underlying cause. If you can find a root cause then we can try to handle it better in Nextflow; otherwise there isn't much that we can do.
I was having a nearly identical problem using Nextflow with Slurm. Most jobs were not starting and no error was being reported, except for a small number of instances where there was a 255 error code. The jobs that did start were often killed with no error reported.
We found that the Slurm configuration file was not synced across nodes. Updating the permissions and re-synchronizing the configuration file resolved the issue.
Same issue here with LSF. Our system admin says that LSF tries to kill jobs that exceed their resources using different strategies, and if the OS is not responsive enough, it eventually resorts to an aggressive kill: the exit code reported in the .command.log file is 9, suggesting the use of kill -9 by LSF, which doesn't leave Nextflow enough time to capture the exit code (no .exitcode file is produced).
Sadly this breaks our error-handling strategy (see here), which relies on interpreting the exit codes to retry with escalated resources where relevant, so we end up ignoring the failed jobs even though they could succeed if given more resources.
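For context, a simplified sketch of the kind of strategy I mean (not our exact config; the exit codes, retry count, and memory values are illustrative):
process {
    // retry on resource-related exit codes, escalating memory on each attempt
    errorStrategy = { task.exitStatus in [130, 137, 140] ? 'retry' : 'finish' }
    maxRetries    = 3
    memory        = { 4.GB * task.attempt }
}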
Our environment is this:
OS: Ubuntu 22.04 / Linux 5.15.0-94-generic #104-Ubuntu SMP x86_64 GNU/Linux
Scheduler: IBM Spectrum LSF Version 10.1 Fix Pack 14
Nextflow: 23.10.1-5891
Singularity: 3.11.4
Just expanding on my case, so that there is material for thinking about a solution:
I run a Nextflow pipeline that has sometimes reported something like this when the task's resource usage exceeds the requested amount:
[71/c2120a] NOTE: Process `ABUNDANCE_ESTIMATION:BOWTIE2SAMTOOLS (48888_2#90)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Error is ignored
On a test run, it seems that a little over half of the processes that failed for excess memory use report this error and are ignored for further processing, i.e. not retried.
The rest report the usual exit status 140 (SIGUSR2), which leads to the execution being retried under our error-handling strategy.
looking for the execution report of the failed process example above in the pipeline trace, I can see this:
$ head -1 results/pipeline_info/execution_trace_2024-05-20_12-36-38.txt && grep '71/c2120a' results/pipeline_info/execution_trace_2024-05-20_12-36-38.txt
task_id hash native_id name status exit submit duration realtime %cpu peak_rss peak_vmem rchar wchar
652 71/c2120a 100777 ABUNDANCE_ESTIMATION:BOWTIE2SAMTOOLS (48888_2#90) FAILED - 2024-05-20 12:41:33.456 12m 22s 11m 35s - - - - -
The exit code is reported as -, which I would suspect corresponds to a null value of the $exitStatus Groovy variable as seen by the master Nextflow process.
but if you go to the task work folder, you see this:
$ cd work/71/c2120acc53e89f36034211b95ce7a9/
$ ls -ltrA
total 198036
-rw-r--r-- 1 pathpipe rvidata 369 May 20 12:41 .command.sh
-rw-r--r-- 1 pathpipe rvidata 12058 May 20 12:41 .command.run
-rw-r--r-- 1 pathpipe rvidata 0 May 20 12:42 .command.trace
-rw-r--r-- 1 pathpipe rvidata 0 May 20 12:42 .command.out
-rw-r--r-- 1 pathpipe rvidata 0 May 20 12:42 .command.err
-rw-r--r-- 1 pathpipe rvidata 0 May 20 12:42 .command.begin
lrwxrwxrwx 1 pathpipe rvidata 175 May 20 12:42 48888_2#90_1_kneaddata_paired_2.fastq -> /lustre/scratch127/pam/projects/rvidata/bait_capture/rvi_prod_bait_dilution_exp-2024_05_16-48888_2/work/37/35a37bae527ea07e72417a491798c9/48888_2#90_1_kneaddata_paired_2.fastq
lrwxrwxrwx 1 pathpipe rvidata 175 May 20 12:42 48888_2#90_1_kneaddata_paired_1.fastq -> /lustre/scratch127/pam/projects/rvidata/bait_capture/rvi_prod_bait_dilution_exp-2024_05_16-48888_2/work/37/35a37bae527ea07e72417a491798c9/48888_2#90_1_kneaddata_paired_1.fastq
-rw-r--r-- 1 pathpipe rvidata 202753086 May 20 12:48 48888_2#90.sorted.bam.tmp.0000.bam
-rw-r--r-- 1 pathpipe rvidata 3825 May 20 12:49 .command.log
$ less .command.log
[... skipping to the bottom ...]
------------------------------------------------------------
TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with signal termination: 9.
Resource usage summary:
CPU time : 1528.00 sec.
Max Memory : 4131 MB
Average Memory : 2577.74 MB
Total Requested Memory : 4096.00 MB
Delta Memory : -35.00 MB
Max Swap : -
Max Processes : 24
Max Threads : 49
Run time : 446 sec.
Turnaround time : 450 sec.
The output (if any) is above this job summary.
The actual exit status of the job was 9, but that has not been written into an .exitcode file, which is missing.
Now the question is: what is the behaviour of the Nextflow master process in this situation? If the $exitStatus Groovy variable is indeed taking a value of null, this is something we could handle.
However, as you can see here, we actually already handle a null value for this variable, and that is not solving the issue.
Instead, from the message in the log, terminated for an unknown reason -- Likely it has been terminated by the external system -- Error is ignored, it sounds like the Nextflow master process is deciding on its own to ignore the error, bypassing the retry strategy directives from our config.
@bentsherman is there a way to make Nextflow use the actual exit code, or at least not force-ignore the error?
any suggestion would be most helpful.
many thanks,
Florent
update:
I found that when the terminated for an unknown reason error occurs, i.e. when no .exitcode file is produced, the value of the $task.exitStatus variable is not set to null but to 2147483647 (the largest possible value of a signed 32-bit integer).
I'll see if I can rely on this to salvage our retry strategy (and how robust this hack can be).
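For anyone wanting to try the same workaround, this is roughly what it looks like as a sketch (2147483647 is Integer.MAX_VALUE, which appears to be the sentinel Nextflow assigns when no .exitcode file is found; the other codes and retry count are illustrative):
process {
    // treat the "no .exitcode file" sentinel like a resource-related failure and retry
    errorStrategy = { (task.exitStatus == 2147483647 || task.exitStatus in [130, 137, 140]) ? 'retry' : 'finish' }
    maxRetries    = 3
}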
It's not solving the issue, which I now think is really something to do with LSF / our HPC, but at least I may have a workaround to keep our Nextflow pipelines afloat.
Seqera support confirms that:
In instances where the exit code file isn't generated, we utilize "Integer.MAX_VALUE" as an indicator that no exit code was returned.
If the Nextflow devs ever want to build a handling mechanism for this, I personally think the solution would be that, when the .exitcode file is not found and the executor is lsf, Nextflow could look at the bottom of the .command.log file for the job summary footer appended by LSF on closing the job, as it contains the actual exit code; this could be parsed and fed to $task.exitStatus, or perhaps more robustly used to create an .exitcode file.
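To make the idea concrete, here is a rough Groovy sketch of that parsing step. It is purely hypothetical, not an existing Nextflow function; the "Exited with signal termination" pattern comes from the job summary shown above, while the other two patterns are my assumption about the usual LSF footer wording:
// Hypothetical helper: derive an exit status from the LSF job summary appended
// to .command.log when no .exitcode file was written.
Integer exitStatusFromLsfLog(File commandLog) {
    if( !commandLog.exists() )
        return null
    def text = commandLog.text
    if( text.contains('Successfully completed.') )                // assumed wording for a clean exit
        return 0
    def sig = (text =~ /Exited with signal termination: (\d+)/)   // matches the footer shown above
    if( sig.find() )
        return 128 + (sig.group(1) as int)                        // e.g. signal 9 -> 137, the usual shell convention
    def code = (text =~ /Exited with exit code (\d+)/)            // assumed wording for non-signal failures
    if( code.find() )
        return code.group(1) as int
    return null                                                   // footer absent or unrecognized
}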
Bug report
Expected behavior and actual behavior
I expect the NF engine to correctly capture the exit status of jobs submitted to the LSF scheduler, but it occasionally fails to do so.
Steps to reproduce the problem
I didn't have this problem using the same workflow over the past few months, but started to encounter it recently. This problem is hard to reproduce because it happens sporadically. I was able to capture one such case, traced the .nextflow.log file, and found the following facts:
The job (jobId: 5535024) was submitted at May-03 09:10:36.762; it was still PENDING 10 minutes later at May-03 09:20:40.795, but was considered COMPLETED only a few milliseconds later at May-03 09:20:40.803, and the NF engine couldn't locate its output files because they had not been generated at that time. Therefore, NF generated an error, terminated all other processes, and exited.
May-03 09:20:40.795 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exit status for process TaskHandler[jobId: 5535024; id: 25; name: SCATACPIPE:PREPROCESS_DEFAULT:ADD_BARCODE_TO_READS (11); status: RUNNING; exit: -; error: -; workDir: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642 started: 1651583530790; exited: -; ] -- exitStatusReadTimeoutMillis: 270000; delta: 270000 Current queue status:
May-03 09:20:40.803 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 5535024; id: 25; name: SCATACPIPE:PREPROCESS_DEFAULT:ADD_BARCODE_TO_READS (11); status: COMPLETED; exit: -; error: -; workDir: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642 started: 1651583530790; exited: -; ]
May-03 09:20:40.820 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'null' -- Cause: java.nio.file.NoSuchFileException: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642/.command.out
May-03 09:20:40.822 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642/.command.err
May-03 09:20:40.823 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /project/umw_cole_haynes/Kai/scATACpipe/work/62/d7fada000d75bd3246b78bb41c0642/.command.log
May-03 09:20:40.825 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'SCATACPIPE:PREPROCESS_DEFAULT:ADD_BARCODE_TO_READS (11)'