Closed: tomas-pluskal closed this issue 6 years ago.
Interesting, please include the related .nextflow.log file.
Sure, here it is
Unfortunately that's not enough. Could you please run it again using the snapshot I've just uploaded, with the command below:
NXF_VER=18.11.0-SNAPSHOT nextflow run .. etc
Then upload the .nextflow.log file again. Thanks!
It shows this:
[73/fc15ed] Submitted process > trinityInchwormChrysalis
WARN: [LSF] queue status cannot be fetched
- cmd executed: bjobs -o JOBID STAT SUBMIT_TIME delimiter=',' -noheader
- exit status : 255
- output :
bjobs: illegal option -- o
Usage: bjobs [-A] [-a] [-aps| -l | -w ] [-d] [-p] [-s] [-r] [-W] [-X] [-x]
[-app application_profile_name] [-g job_group_name]
[-sla service_class_name] [-J job_name] [-Jd job_description] [-Lp ls_project_name]
[-m "host_name ..." | -m host_group | -m cluster_name]
[-N host_name | -N host_model | -N cpu_factor]
[-P project_name] [-q queue_name]
[-u user_name | -u user_group | -u all | -G user_group]
[jobId | "jobId[index_list]" ...]
bjobs [-a] [-aps] [-d] [-p] [-s] [-r] [-WL] [-WF] [-WP] [-X] [-x]
[-app application_profile_name] [-g job_group_name]
[-sla service_class_name] [-J job_name] [-Jd job_description] [-Lp ls_project_name]
[-m "host_name ..." | -m host_group | -m cluster_name]
[-P project_name] [-q queue_name]
[-u user_name | -u user_group | -u all | -G user_group]
[jobId | "jobId[index_list]" ...]
bjobs [-a] [-l | -w] [-d] [-p] [-s] [-ss] [-r] [-X] [-x]
[-app application_profile_name] [-g job_group_name]
[-sla service_class_name] [-J job_name] [-Jd job_description] [-Lp ls_project_name]
[-m "host_name ..." | -m host_group | -m cluster_name]
[-P project_name] [-q queue_name]
[-u user_name | -u user_group | -u all | -G user_group]
[jobId | "jobId[index_list]" ...]
bjobs [-h] [-V]
[-WL | -WP | -WF] output key:
X: The real run time has exceeded the configured run time
L: A run limit exists but job does not have a configured run time
E: A configured run time exists and has not been exceeded
I guess the -o option is not supported in our LSF version:
$ bjobs -V
Platform LSF 8.0.1.174931, Jun 13 2011
Copyright 1992-2011 Platform Computing Corporation
binary type: linux2.6-glibc2.3-x86_64
Yeah, exactly. Worth noting that in the latest NF version the command hasn't changed; it's just showing the failing command that was previously silently ignored.
Could you investigate with your sysadmins whether there's a legacy syntax that can replace the -o option?
What exactly do you need the -o option for?
There's 'bjobs -w' which outputs more details and also 'bjobs -l':
-w
Wide format. Displays job information without
truncating fields.
-l
Long format. Displays detailed information for
each job in a multiline format.
The -l option displays the following additional
information: project name, job command, current
working directory on the submission host, initial
checkpoint period, checkpoint directory, migration
threshold, pending and suspending reasons, job
status, resource usage, resource usage limits
information, runtime resource usage information on
the execution hosts, and job description
If the job was submitted with bsub -K, the -l
option displays Synchronous Execution.
Use bjobs -A -l to display detailed information
for job arrays including job array job limit (%
job_limit) if set.
Use bjobs -ss -l to display detailed information
for session scheduler jobs.
If JOB_IDLE is configured in the queue, use bjobs
-l to display job idle exception information.
If you submitted your job with the -U option to
use advance reservations created with the brsvadd
command, bjobs -l shows the reservation ID used by
the job.
To format the output with only the JOBID, STAT, and SUBMIT_TIME columns. For example:
6795348,RUN,Feb 17 13:26
6795349,RUN,Feb 17 13:26
6795351,PEND,Feb 17 13:26
6795353,PSUSP,Feb 17 13:26
6795354,EXIT,Feb 17 13:26
Then you should be able to use just plain 'bjobs' and parse the relevant columns... just a bit of extra work :)
$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5062199 pluskal RUN normal it-c01b14 it-c05b02 *Chrysalis Oct 28 13:57
it-c05b02
it-c05b02
it-c05b02
it-c05b02
it-c05b02
it-c05b02
it-c05b02
it-c05b02
it-c05b02
5062198 pluskal RUN normal tak it-c01b14 *sminer.nf Oct 28 13:57
The problem is not breaking existing deployments. I'm not sure the above format is consistent across different LSF versions.
Here's more info about bjobs output: https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_command_ref/bjobs.zz4category.description.1.html
I think it's fair to assume that the name and order of columns might change between LSF versions. But if you parse the first line and find the position of the JOBID, STAT, and SUBMIT_TIME columns, that should be fairly robust.
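For illustration, roughly something like this could work (a hypothetical Groovy sketch, not Nextflow's actual parser): read the column layout from the header line, then keep only the lines that look like job rows:
def parseQueueStatus(String output) {
    def lines = output.readLines().findAll { it }           // drop blank lines
    def header = lines.find { it.startsWith('JOBID') }
    if( !header ) return [:]                                 // no header -> report status as unknown
    def names   = header.trim().split(/\s+/) as List
    def idIdx   = names.indexOf('JOBID')
    def statIdx = names.indexOf('STAT')
    def result  = [:]
    for( line in lines ) {
        // keep only job rows: skip the header and the EXEC_HOST continuation lines
        if( line.is(header) || !Character.isDigit(line.charAt(0)) ) continue
        def cols = line.trim().split(/\s+/)
        // JOBID and STAT both precede JOB_NAME, so whitespace splitting is safe for them
        result[ cols[idIdx] ] = cols[statIdx]
    }
    return result
}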
@tdelhomme @mfoll @sb10 @micans Can I ask for your help in solving this issue with LSF? Could you please submit a couple of dummy jobs and copy here the exact output of the command bjobs -u $USER followed by bjobs -V? Thanks a lot.
With the snapshot NXF_VER=18.11.0-SNAPSHOT?
Any, or even without nextflow; just submit a couple of jobs and then check the job status with the above command.
two jobs ..
bc-3,cellgen/cellgeni, bjobs -u $USER
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
9025068 svd UNKWN long farm3-head5 bc-31-1-12 bash Oct 10 10:54
9958620 svd RUN long farm3-head5 bc-30-2-03 bash Oct 29 10:10
bc-3,cellgen/cellgeni, bjobs -V
IBM Platform LSF 9.1.3.0 build 238688, Jul 11 2014
Copyright IBM Corp. 1992, 2014. All rights reserved.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
binary type: linux2.6-glibc2.3-x86_64
[delhommet@hn ~]$ bjobs -u $USER
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
960417 delhomm RUN normal hn cn01 *-125_154) Oct 29 13:37
960419 delhomm RUN normal hn cn01 *-125_154) Oct 29 13:37
[delhommet@hn ~]$ bjobs -V
IBM Platform LSF 9.1.1.0 build 213132, Feb 27 2013
Copyright International Business Machines Corp, 1992-2013.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
binary type: linux2.6-glibc2.3-x86_64
[delhommet@hn ~]$ nextflow -version
N E X T F L O W
version 18.10.1 build 5003
last modified 24-10-2018 14:03 UTC (16:03 CEST)
cite doi:10.1038/nbt.3820
http://nextflow.io
@tomas-pluskal Could you please post the output of the command bjobs -w showing a few jobs?
Sure:
$ bjobs -w
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5085109 pluskal RUN normal it-c01b13 it-c01b11 /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild/AutoBuild_run_1_/TEMP0/RUN_FILE_4 Oct 31 13:00
5085585 pluskal RUN normal tak it-c05b12 phenix.autobuild autobuild_config.eff Oct 31 14:05
5085604 pluskal RUN normal it-c05b12 it-c05b12 /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild_not_in_place/AutoBuild_run_1_/TEMP0/RUN_FILE_1 Oct 31 14:44
5085611 pluskal RUN normal it-c05b12 it-c05b12 /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild_not_in_place/AutoBuild_run_1_/TEMP0/RUN_FILE_8 Oct 31 14:44
5085107 pluskal RUN normal it-c01b13 it-c01b07 /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild/AutoBuild_run_1_/TEMP0/RUN_FILE_2 Oct 31 13:00
5085607 pluskal RUN normal it-c05b12 it-c01b07 /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild_not_in_place/AutoBuild_run_1_/TEMP0/RUN_FILE_4 Oct 31 14:44
5085608 pluskal RUN normal it-c05b12 it-c01b05 /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild_not_in_place/AutoBuild_run_1_/TEMP0/RUN_FILE_5 Oct 31 14:44
5085609 pluskal RUN normal it-c05b12 it-c01b14 /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild_not_in_place/AutoBuild_run_1_/TEMP0/RUN_FILE_6 Oct 31 14:44
5085702 pluskal RUN normal tak it-c01b14 ./assemble.nf Oct 31 15:08
5085610 pluskal RUN normal it-c05b12 it-c05b07 /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild_not_in_place/AutoBuild_run_1_/TEMP0/RUN_FILE_7 Oct 31 14:44
5085108 pluskal RUN normal it-c01b13 it-c01b12 /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild/AutoBuild_run_1_/TEMP0/RUN_FILE_3 Oct 31 13:00
5085606 pluskal RUN normal it-c05b12 it-c01b12 /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild_not_in_place/AutoBuild_run_1_/TEMP0/RUN_FILE_3 Oct 31 14:44
5085034 pluskal RUN normal tak it-c01b13 phenix.autobuild autobuild_config.eff Oct 31 12:56
5085106 pluskal RUN normal it-c01b13 it-c01b13 /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild/AutoBuild_run_1_/TEMP0/RUN_FILE_1 Oct 31 13:00
5085605 pluskal RUN normal it-c05b12 it-c01b13 /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild_not_in_place/AutoBuild_run_1_/TEMP0/RUN_FILE_2 Oct 31 14:44
5085612 pluskal RUN normal it-c05b12 it-c01b13 /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild_not_in_place/AutoBuild_run_1_/TEMP0/RUN_FILE_9 Oct 31 14:44
5085703 pluskal RUN normal it-c01b14 it-r16u31:it-r16u31:it-r16u31:it-r16u31:it-r16u31:it-r16u31:it-r16u31:it-r16u31:it-r16u31:it-r16u31 nf-trinityInchwormChrysalis Oct 31 15:08
Thanks. I'm wondering if the JOBID can be truncated without the -w option?
No, I think the JOBID is always printed in full:
$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5085109 pluskal RUN normal it-c01b13 it-c01b11 *UN_FILE_4 Oct 31 13:00
5085585 pluskal RUN normal tak it-c05b12 *onfig.eff Oct 31 14:05
5085107 pluskal RUN normal it-c01b13 it-c01b07 *UN_FILE_2 Oct 31 13:00
5085702 pluskal RUN normal tak it-c01b14 *semble.nf Oct 31 15:08
5085610 pluskal RUN normal it-c05b12 it-c05b07 *UN_FILE_7 Oct 31 14:44
5085108 pluskal RUN normal it-c01b13 it-c01b12 *UN_FILE_3 Oct 31 13:00
5085034 pluskal RUN normal tak it-c01b13 *onfig.eff Oct 31 12:56
5085106 pluskal RUN normal it-c01b13 it-c01b13 *UN_FILE_1 Oct 31 13:00
5085703 pluskal RUN normal it-c01b14 it-r16u31 *Chrysalis Oct 31 15:08
it-r16u31
it-r16u31
it-r16u31
it-r16u31
it-r16u31
it-r16u31
it-r16u31
it-r16u31
it-r16u31
Not so sure; googling a bit, it looks like the job id can take up to 10 digits. I wouldn't be surprised if it's truncated when it takes more than 7 digits, as shown in the above table.
You're right. Apparently, this can be configured: https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_config_ref/lsf.conf.lsb_jobid_disp_length.5.html
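If I read that page correctly, it's an lsf.conf setting along the lines of:
LSB_JOBID_DISP_LENGTH=10
which widens the JOBID column in the bjobs display beyond the default 7 characters (hence the truncation we were guessing about above).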
gotcha!
I've uploaded a snapshot with this patch. You can try it by running:
NXF_VER=18.11.0-SNAPSHOT nextflow run .. etc
It still displays the same warning for me:
WARN: [LSF] queue status cannot be fetched
- cmd executed: bjobs -o JOBID STAT SUBMIT_TIME delimiter=',' -noheader
- exit status : 255
- output :
bjobs: illegal option -- o
Usage: bjobs [-A] [-a] [-aps| -l | -w ] [-d] [-p] [-s] [-r] [-W] [-X] [-x]
[-app application_profile_name] [-g job_group_name]
[-sla service_class_name] [-J job_name] [-Jd job_description] [-Lp ls_project_name]
[-m "host_name ..." | -m host_group | -m cluster_name]
[-N host_name | -N host_model | -N cpu_factor]
[-P project_name] [-q queue_name]
[-u user_name | -u user_group | -u all | -G user_group]
[jobId | "jobId[index_list]" ...]
bjobs [-a] [-aps] [-d] [-p] [-s] [-r] [-WL] [-WF] [-WP] [-X] [-x]
[-app application_profile_name] [-g job_group_name]
[-sla service_class_name] [-J job_name] [-Jd job_description] [-Lp ls_project_name]
[-m "host_name ..." | -m host_group | -m cluster_name]
[-P project_name] [-q queue_name]
[-u user_name | -u user_group | -u all | -G user_group]
[jobId | "jobId[index_list]" ...]
bjobs [-a] [-l | -w] [-d] [-p] [-s] [-ss] [-r] [-X] [-x]
[-app application_profile_name] [-g job_group_name]
[-sla service_class_name] [-J job_name] [-Jd job_description] [-Lp ls_project_name]
[-m "host_name ..." | -m host_group | -m cluster_name]
[-P project_name] [-q queue_name]
[-u user_name | -u user_group | -u all | -G user_group]
[jobId | "jobId[index_list]" ...]
bjobs [-h] [-V]
[-WL | -WP | -WF] output key:
X: The real run time has exceeded the configured run time
L: A run limit exists but job does not have a configured run time
E: A configured run time exists and has not been exceeded
Can you double check that you have the right version? The following command should print:
NXF_VER=18.11.0-SNAPSHOT nextflow info
Version: 18.11.0-SNAPSHOT build 5014
Modified: 31-10-2018 17:32 UTC (18:32 CEST)
Seems like it doesn't update:
$ NXF_VER=18.11.0-SNAPSHOT nextflow info
Version: 18.11.0-SNAPSHOT build 5011
Modified: 28-10-2018 17:45 UTC (13:45 EDT)
Why could that be?
If so, run:
NXF_VER=18.11.0-SNAPSHOT CAPSULE_RESET=1 nextflow info
then:
NXF_VER=18.11.0-SNAPSHOT nextflow run .. etc
Yes, that helped. It does not show the warning anymore.
Thanks for the troubleshooting!
Hi, I noticed that my nextflow jobs now fail with errors like this:
[warm up] executor > lsf
[af/9267bf] Submitted process > trinityInchwormChrysalis
ERROR ~ Error executing process > 'trinityInchwormChrysalis'
Caused by:
Process `trinityInchwormChrysalis` terminated for an unknown reason -- Likely it has been terminated by the external system
This is my nextflow version and nextflow config:
$ NXF_VER=18.11.0-SNAPSHOT nextflow info
Version: 18.11.0-SNAPSHOT build 5015
Modified: 12-11-2018 16:50 UTC (11:50 EDT)
System: Linux 3.13.0-151-generic
Runtime: Groovy 2.5.3 on Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12
Encoding: UTF-8 (UTF-8)
$ NXF_VER=18.11.0-SNAPSHOT nextflow config
process {
executor = 'lsf'
}
executor {
queueSize = 1000
}
env {
PATH = '/lab/solexa_weng/testtube/trinityrnaseq-Trinity-v2.8.4:/lab/solexa_weng/testtube/TransDecoder-TransDecoder-v5.0.2:/lab/solexa_weng/testtube/transrate-1.0.3-linux-x86_64:/lab/solexa_weng/testtube/signalp-4.1:/lab/solexa_weng/testtube/tmhmm-2.0c/bin:/lab/solexa_weng/testtube/ncbi-blast-2.7.1+/bin:/lab/solexa_weng/testtube/bowtie2-2.3.4.3-linux-x86_64:/lab/solexa_weng/testtube/nextflow-0.32.0:/usr/local/jdk1.8/bin:/home/pluskal/.gem/ruby/2.3.0/bin:/lab/solexa_weng/testtube/miniconda3/bin:/home/pluskal/bin:/nfs/lsf8/8.0/linux2.6-glibc2.3-x86_64/etc:/nfs/lsf8/8.0/linux2.6-glibc2.3-x86_64/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:::/usr/local/genome/bin'
}
report {
enabled = true
}
Strangely enough, the job that is reported as "terminated for an unknown reason" is still running on the LSF cluster:
$ bjobs -w
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5132674 pluskal RUN normal it-c01b07 it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy nf-trinityInchwormChrysalis Nov 13 00:16
I suppose there might still be some issues with handling the LSF queue...?
I have attached the .nextflow.log file
@pditommaso I noticed in the log file that the nextflow process crashed exactly 5 minutes after starting. I believe 5 min is the timer for checking the LSF queue, so it definitely has something to do with that:
Nov-13 00:21:33.104 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor lsf > tasks to be completed: 1 -- pending tasks are shown below
~> TaskHandler[jobId: 5132674; id: 1; name: trinityInchwormChrysalis; status: RUNNING; exit: -; error: -; workDir: /lab/weng_scratch/Tomas/nextflow_assemble/work/af/9267bf1472eb3edbb83fd42040de86 started: 1542086223123; exited: -; ]
Nov-13 00:23:19.147 [Task monitor] DEBUG nextflow.file.FileHelper - Can't list folder: /lab/weng_scratch/Tomas/nextflow_assemble/work/af/9267bf1472eb3edbb83fd42040de86 -- Exit status: 143
Nov-13 00:26:18.254 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exist status for process TaskHandler[jobId: 5132674; id: 1; name: trinityInchwormChrysalis; status: RUNNING; exit: -; error: -; workDir: /lab/weng_scratch/Tomas/nextflow_assemble/work/af/9267bf1472eb3edbb83fd42040de86 started: 1542086223123; exited: -; ] -- exitStatusReadTimeoutMillis: 270000; delta: 370103
Current queue status:
> (null)
Can you please try using the 18.11.0-edge release and attach the complete .nextflow.log file here?
I tried it: NXF_VER=18.11.0-edge nextflow assemble.nf
It produces the same error:
[warm up] executor > lsf
[21/283e00] Submitted process > trinityInchwormChrysalis
ERROR ~ Error executing process > 'trinityInchwormChrysalis'
Caused by:
Process `trinityInchwormChrysalis` terminated for an unknown reason -- Likely it has been terminated by the external system
The log file is attached: nextflow_log.txt
But is the actual task taking less than 5 mins, or is it still running?
It is still running.
What's the output of bjobs -w?
$ bjobs -w
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5132676 pluskal RUN normal tak it-c01b04 /lab/solexa_weng/testtube/transcriptome/check_strandedness.sh SRR2006304_1.fastq.gz SRR2006307_1_fixed.fastq.gz Nov 13 00:41
5137332 pluskal RUN normal it-c01b11 it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy nf-trinityInchwormChrysalis Nov 13 11:07
5137357 pluskal RUN normal it-c01b11 it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy nf-trinityInchwormChrysalis Nov 13 11:35
And if you instead use bjobs -w -q <the queue name you put in the config file>?
I did not specify any queue in the config file. All jobs run in the default queue, I guess.
Something strange is happening. Please execute one more time using this command:
NXF_VER=18.11.0-edge nextflow -trace nextflow.executor run assemble.nf
Then upload the log file again when it stops.
Strangely enough, when I started it with this command, it didn't seem to crash. I have attached the current log file (but nextflow is still running, as it should be).
OK, it did crash in the end, but it took a much longer time.
[warm up] executor > lsf
[ba/cbcdea] Submitted process > trinityInchwormChrysalis
ERROR ~ Error executing process > 'trinityInchwormChrysalis'
Caused by:
Process `trinityInchwormChrysalis` terminated for an unknown reason -- Likely it has been terminated by the external system
The log is attached below.
From the log it seems that bjobs -w returned no output but exited with exit code 0. That is a bit strange; I have never seen that happen when I run bjobs manually. Perhaps nextflow could double-check whether bjobs properly printed the header line (JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME).
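For instance, something along these lines (a hypothetical helper, not actual Nextflow code) could turn a header-less result into "status unknown" instead of "job disappeared":
static boolean hasBjobsHeader(String output) {
    // only trust the bjobs result if the expected header line is present
    return output != null && output.readLines().any { it.startsWith('JOBID') }
}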
It looks like there's a hiccup with your cluster. NF periodically runs bjobs to fetch task statuses, e.g.
Nov-13 12:41:59.838 [Task monitor] TRACE n.executor.AbstractGridExecutor - [LSF] getting queue status > cmd: bjobs -w
Nov-13 12:41:59.861 [Task monitor] TRACE n.executor.AbstractGridExecutor - [LSF] queue status > cmd exit: 0
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5132676 pluskal RUN normal tak it-c01b04 /lab/solexa_weng/testtube/transcriptome/check_strandedness.sh SRR2006304_1.fastq.gz SRR2006307_1_fixed.fastq.gz Nov 13 00:41
5137404 pluskal RUN normal it-c03b03 it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy nf-trinityInchwormChrysalis Nov 13 12:18
5137399 pluskal RUN normal tak it-c03b03 NXF_VER=18.11.0-edge nextflow -trace nextflow.executor run assemble.nf Nov 13 12:17
Some minutes later bjobs returns an empty result:
Nov-13 12:42:59.840 [Task monitor] TRACE n.executor.AbstractGridExecutor - [LSF] getting queue status > cmd: bjobs -w
Nov-13 12:42:59.861 [Task monitor] TRACE n.executor.AbstractGridExecutor - [LSF] queue status > cmd exit: 0
However the job is still running; indeed, later in the log it appears again:
Nov-13 12:44:59.853 [Task monitor] TRACE n.executor.AbstractGridExecutor - [LSF] getting queue status > cmd: bjobs -w
Nov-13 12:44:59.870 [Task monitor] TRACE n.executor.AbstractGridExecutor - [LSF] queue status > cmd exit: 0
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5132676 pluskal RUN normal tak it-c01b04 /lab/solexa_weng/testtube/transcriptome/check_strandedness.sh SRR2006304_1.fastq.gz SRR2006307_1_fixed.fastq.gz Nov 13 00:41
5137404 pluskal RUN normal it-c03b03 it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy nf-trinityInchwormChrysalis Nov 13 12:18
5137399 pluskal RUN normal tak it-c03b03 NXF_VER=18.11.0-edge nextflow -trace nextflow.executor run assemble.nf Nov 13 12:17
However, the empty result makes NF mark that job as being in an inconsistent state. It sets a timeout so that the job fails fast as soon as it exits (even successfully) later, because the .exitcode file has not been created yet, apparently (see the config sketch after the listing below):
Content of workDir: /lab/weng_scratch/Tomas/nextflow_assemble/work/ba/cbcdeac0e24debb012453375935c56
> total 204
> drwxrwsr-x 3 pluskal wenglab 306 Nov 13 12:18 .
> drwxrwsr-x 3 pluskal wenglab 48 Nov 13 12:18 ..
> -rw-rw-r-- 1 pluskal wenglab 0 Nov 13 12:18 .command.begin
> -rw-rw-r-- 1 pluskal wenglab 0 Nov 13 12:18 .command.err
> -rw-rw-r-- 1 pluskal wenglab 7626 Nov 13 13:04 .command.log
> -rw-rw-r-- 1 pluskal wenglab 7626 Nov 13 13:04 .command.out
> -rw-rw-r-- 1 pluskal wenglab 2921 Nov 13 12:18 .command.run
> -rw-rw-r-- 1 pluskal wenglab 247 Nov 13 12:18 .command.sh
> -rw-rw-r-- 1 pluskal wenglab 3500 Nov 13 12:18 .command.stub
> -rw-rw-r-- 1 pluskal wenglab 210 Nov 13 13:08 .command.trace
> lrwxrwxrwx 1 pluskal wenglab 53 Nov 13 12:18 samples.txt -> /lab/weng_scratch/Tomas/nextflow_assemble/samples.txt
> drwxrwsr-x 3 pluskal wenglab 705 Nov 13 13:04 trinity_out_dir
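As a possible stop-gap on the user side (this doesn't explain the empty bjobs output, just softens its effect), the grace period NF waits for the .exitcode file can be extended in nextflow.config via the executor's exitReadTimeout setting; the 270000 ms in the log above is its default (270 sec). A sketch with an arbitrary example value:
executor {
    queueSize = 1000
    // wait longer for the .exitcode file before declaring the job lost (default: 270 sec)
    exitReadTimeout = '15 min'
}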
I'm closing this issue and opening a separate one to track this problem.
Just wanted to mention that the bjobs command may occasionally print out additional messages like "LSF is processing your request. Please wait ...", as shown below. This is something nextflow should anticipate.
$ bjobs
LSF is processing your request. Please wait ...
LSF is processing your request. Please wait ...
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5157393 pluskal RUN normal it-c05b07 it-c05b10 *l_cmd_bd) Nov 14 13:00
5157493 pluskal RUN normal it-c05b07 it-c05b10 *l_cmd_cn) Nov 14 13:00
5157552 pluskal RUN normal it-c05b07 it-c05b10 *l_cmd_dl) Nov 14 13:00
5157610 pluskal RUN normal it-c05b07 it-c05b10 *l_cmd_fe) Nov 14 13:00
5157674 pluskal RUN normal it-c05b07 it-c05b10 *l_cmd_gf) Nov 14 13:00
5157710 pluskal RUN normal it-c05b07 it-c05b10 *l_cmd_fv) Nov 14 13:00
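A parser that only keeps lines whose first field is numeric, like the sketch further up in this thread, would drop these banner lines automatically; a hypothetical one-liner (not Nextflow's actual code) would be output.readLines().findAll { it && Character.isDigit(it.charAt(0)) }.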
This output is already parsed correctly. Thanks for pointing it out in any case.
PBS Pro gives me the same messages but a different exit status.
$ ./main.nf
N E X T F L O W ~ version 18.10.1
Launching './main.nf' [hungry_mclean] - revision: b639ebf99c
[warm up] executor > pbs
[8b/2793ea] Submitted process > P00_ReadDecompression (1)
WARN: [PBS] queue status cannot be fetched > exit status: 2
[1f/7b5751] Submitted process > P01_FullMapping (1)
WARN: [PBS] queue status cannot be fetched > exit status: 2
WARN: [PBS] queue status cannot be fetched > exit status: 2
WARN: [PBS] queue status cannot be fetched > exit status: 2
WARN: [PBS] queue status cannot be fetched > exit status: 2
WARN: [PBS] queue status cannot be fetched > exit status: 2
[13/71ad23] Submitted process > P02_FullMappingExtraction (1)
WARN: [PBS] queue status cannot be fetched > exit status: 2
WARN: [PBS] queue status cannot be fetched > exit status: 2
WARN: [PBS] queue status cannot be fetched > exit status: 2
[3d/29877d] Submitted process > P03_Bam2Fastq (1)
[cb/6789a0] Submitted process > P03_Bam2Fastq (2)
@pachiras Please report this in a different issue, including the .nextflow.log file.
Bug report
Hi! After upgrading to nextflow 18.10.1 from 0.32.0, I started seeing this message repeatedly in nextflow output:
All cluster jobs, however, seem to be working fine, and the nextflow pipeline is producing all files normally.
This is my nextflow.config file: