nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0
2.7k stars 621 forks

Queue status command fail on LSF version 8 #910

Closed tomas-pluskal closed 5 years ago

tomas-pluskal commented 5 years ago

Bug report

Hi! After upgrading to nextflow 18.10.1 from 0.32.0, I started seeing this message repeatedly in nextflow output:

WARN: [LSF] queue status cannot be fetched > exit status: 255
WARN: [LSF] queue status cannot be fetched > exit status: 255
WARN: [LSF] queue status cannot be fetched > exit status: 255
WARN: [LSF] queue status cannot be fetched > exit status: 255
WARN: [LSF] queue status cannot be fetched > exit status: 255
WARN: [LSF] queue status cannot be fetched > exit status: 255
WARN: [LSF] queue status cannot be fetched > exit status: 255

All cluster jobs, however, seem to be working fine, and the nextflow pipeline is producing all files normally.

This is my nextflow.config file:

process.executor = "lsf"
executor.queueSize = 1000

env.PATH = "/lab/solexa_weng/testtube/trinityrnaseq-Trinity-v2.8.4:/lab/solexa_weng/testtube/TransDecoder-TransDecoder-v5.0.2:/lab/solexa_weng/testtube/transrate-1.0.3-linux-x86_64:/lab/solexa_weng/testtube/signalp-4.1:/lab/solexa_weng/testtube/tmhmm-2.0c/bin:/lab/solexa_weng/testtube/ncbi-blast-2.7.1+/bin:/lab/solexa_weng/testtube/bowtie2-2.3.4.3-linux-x86_64:$PATH"

report.enabled = true
pditommaso commented 5 years ago

Interesting, please include the related .nextflow.log file.

tomas-pluskal commented 5 years ago

Sure, here it is

nextflowlog.txt

pditommaso commented 5 years ago

Unfortunately that's not enough. Could you please run it again with the snapshot I've just uploaded, using the command below:

NXF_VER=18.11.0-SNAPSHOT nextflow run .. etc

Then upload again the .nextflow.log file. Thanks!

tomas-pluskal commented 5 years ago

It shows this:

[73/fc15ed] Submitted process > trinityInchwormChrysalis
WARN: [LSF] queue status cannot be fetched
- cmd executed: bjobs -o JOBID STAT SUBMIT_TIME delimiter=',' -noheader
- exit status : 255
- output      :
  bjobs: illegal option -- o
  Usage: bjobs [-A] [-a] [-aps| -l | -w ] [-d] [-p] [-s] [-r] [-W] [-X] [-x]
               [-app application_profile_name] [-g job_group_name]
               [-sla service_class_name] [-J job_name] [-Jd job_description] [-Lp ls_project_name]
               [-m "host_name ..." | -m host_group | -m cluster_name] 
               [-N host_name | -N host_model | -N cpu_factor]
               [-P project_name] [-q queue_name]
               [-u user_name | -u user_group | -u all | -G user_group]
               [jobId | "jobId[index_list]" ...]

         bjobs [-a] [-aps] [-d] [-p] [-s] [-r] [-WL] [-WF] [-WP] [-X] [-x]
               [-app application_profile_name] [-g job_group_name]
               [-sla service_class_name] [-J job_name] [-Jd job_description] [-Lp ls_project_name]
               [-m "host_name ..." | -m host_group | -m cluster_name] 
               [-P project_name] [-q queue_name]
               [-u user_name | -u user_group | -u all | -G user_group]
               [jobId | "jobId[index_list]" ...]

         bjobs [-a] [-l | -w] [-d] [-p] [-s] [-ss] [-r] [-X] [-x]
               [-app application_profile_name] [-g job_group_name]
               [-sla service_class_name] [-J job_name] [-Jd job_description] [-Lp ls_project_name]
               [-m "host_name ..." | -m host_group | -m cluster_name] 
               [-P project_name] [-q queue_name]
               [-u user_name | -u user_group | -u all | -G user_group]
               [jobId | "jobId[index_list]" ...]

         bjobs [-h] [-V]

               [-WL | -WP | -WF] output key:
               X: The real run time has exceeded the configured run time
               L: A run limit exists but job does not have a configured run time
               E: A configured run time exists and has not been exceeded 

nextflowlog.txt

tomas-pluskal commented 5 years ago

I guess the -o option is not supported in our LSF version:

$ bjobs -V
Platform LSF 8.0.1.174931, Jun 13 2011
Copyright 1992-2011 Platform Computing Corporation

  binary type: linux2.6-glibc2.3-x86_64
pditommaso commented 5 years ago

Yeah, exactly. Worth noting that the command didn't change in the latest NF version; it's just showing the failing command that was previously silently ignored.

pditommaso commented 5 years ago

Could you investigate with your sysadmins whether there's a legacy syntax that could replace the -o option?

tomas-pluskal commented 5 years ago

What exactly do you need the -o option for?

tomas-pluskal commented 5 years ago

There's 'bjobs -w', which outputs more details, and also 'bjobs -l':

   -w
               Wide format. Displays job information without
               truncating fields.

   -l
               Long format. Displays detailed information for
               each job in a multiline format.

               The -l option displays the following additional
               information: project name, job command, current
               working directory on the submission host, initial
               checkpoint period, checkpoint directory, migration
               threshold, pending and suspending reasons, job
               status, resource usage, resource usage limits
               information, runtime resource usage information on
               the execution hosts, and job description

               If the job was submitted with bsub -K, the -l
               option displays Synchronous Execution.

               Use bjobs -A -l to display detailed information
               for job arrays including job array job limit (%
               job_limit) if set.

               Use bjobs -ss -l to display detailed information
               for session scheduler jobs.

               If JOB_IDLE is configured in the queue, use bjobs
               -l to display job idle exception information.

               If you submitted your job with the -U option to
               use advance reservations created with the brsvadd
               command, bjobs -l shows the reservation ID used by
               the job.
pditommaso commented 5 years ago

To restrict the output to just the JOBID, STAT, and SUBMIT_TIME columns. For example:

6795348,RUN,Feb 17 13:26
6795349,RUN,Feb 17 13:26
6795351,PEND,Feb 17 13:26
6795353,PSUSP,Feb 17 13:26
6795354,EXIT,Feb 17 13:26
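As a minimal sketch (hypothetical, not Nextflow's actual implementation), the comma-delimited output shown above maps to job states like this:

```python
# Hypothetical sketch: turn the comma-delimited output of
# `bjobs -o 'JOBID STAT SUBMIT_TIME' delimiter=',' -noheader`
# into a {job_id: status} mapping.
def parse_bjobs_delimited(output):
    status = {}
    for line in output.strip().splitlines():
        fields = line.split(",")
        if len(fields) >= 3:
            status[fields[0]] = fields[1]
    return status

sample = """6795348,RUN,Feb 17 13:26
6795349,RUN,Feb 17 13:26
6795351,PEND,Feb 17 13:26
6795353,PSUSP,Feb 17 13:26
6795354,EXIT,Feb 17 13:26"""
```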
tomas-pluskal commented 5 years ago

Then you should be able to use just plain 'bjobs' and parse the relevant columns... just a bit of extra work :)

$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
5062199 pluskal RUN   normal     it-c01b14   it-c05b02   *Chrysalis Oct 28 13:57
                                             it-c05b02
                                             it-c05b02
                                             it-c05b02
                                             it-c05b02
                                             it-c05b02
                                             it-c05b02
                                             it-c05b02
                                             it-c05b02
                                             it-c05b02
5062198 pluskal RUN   normal     tak         it-c01b14   *sminer.nf Oct 28 13:57
pditommaso commented 5 years ago

The problem is not breaking existing deployments. I'm not sure the above format is consistent across different LSF versions.

tomas-pluskal commented 5 years ago

Here's more info about bjobs output: https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_command_ref/bjobs.zz4category.description.1.html

I think it's fair to assume that the name and order of columns might change between LSF versions. But if you parse the first line and find the position of the JOBID, STAT, and SUBMIT_TIME columns, that should be fairly robust.
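The header-driven approach could be sketched like this (illustrative only, assuming whitespace-separated columns and that the continuation lines printed for multi-host jobs begin with whitespace):

```python
# Hypothetical sketch: locate the JOBID and STAT columns from the header row
# of plain `bjobs` output, so parsing tolerates column reordering between
# LSF versions. This relies on JOBID and STAT preceding any column (such as
# JOB_NAME) that may itself contain spaces.
def parse_bjobs_table(output):
    lines = [l for l in output.splitlines() if l.strip()]
    header = lines[0].split()
    jobid_col = header.index("JOBID")
    stat_col = header.index("STAT")
    status = {}
    for line in lines[1:]:
        if line[0].isspace():
            continue  # continuation line, e.g. additional EXEC_HOST entries
        fields = line.split()
        status[fields[jobid_col]] = fields[stat_col]
    return status

sample = """JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
5062199 pluskal RUN   normal     it-c01b14   it-c05b02   *Chrysalis Oct 28 13:57
                                             it-c05b02
5062198 pluskal RUN   normal     tak         it-c01b14   *sminer.nf Oct 28 13:57"""
```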

pditommaso commented 5 years ago

@tdelhomme @mfoll @sb10 @micans Can I ask for your help solving this issue with LSF? Could you please submit a couple of dummy jobs and paste here the exact output of the command bjobs -u $USER followed by bjobs -V? Thanks a lot.

tdelhomme commented 5 years ago

With the snapshot NXF_VER=18.11.0-SNAPSHOT?

pditommaso commented 5 years ago

Any version, or even without nextflow; just submit a couple of jobs and then check their status with the above commands.

micans commented 5 years ago

two jobs ..

bc-3,cellgen/cellgeni, bjobs -u $USER
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
9025068 svd     UNKWN long       farm3-head5 bc-31-1-12  bash       Oct 10 10:54
9958620 svd     RUN   long       farm3-head5 bc-30-2-03  bash       Oct 29 10:10
bc-3,cellgen/cellgeni, bjobs -V
IBM Platform LSF 9.1.3.0 build 238688, Jul 11 2014
Copyright IBM Corp. 1992, 2014. All rights reserved.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

  binary type: linux2.6-glibc2.3-x86_64
tdelhomme commented 5 years ago
[delhommet@hn ~]$  bjobs -u $USER
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
960417  delhomm RUN   normal     hn          cn01        *-125_154) Oct 29 13:37
960419  delhomm RUN   normal     hn          cn01        *-125_154) Oct 29 13:37

[delhommet@hn ~]$  bjobs -V
IBM Platform LSF 9.1.1.0 build 213132, Feb 27 2013
Copyright International Business Machines Corp, 1992-2013.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

  binary type: linux2.6-glibc2.3-x86_64

[delhommet@hn ~]$ nextflow -version

      N E X T F L O W
      version 18.10.1 build 5003
      last modified 24-10-2018 14:03 UTC (16:03 CEST)
      cite doi:10.1038/nbt.3820
      http://nextflow.io
pditommaso commented 5 years ago

@tomas-pluskal Could you please post the output of the command bjobs -w showing a few jobs?

tomas-pluskal commented 5 years ago

Sure:

$ bjobs -w
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
5085109 pluskal RUN   normal     it-c01b13   it-c01b11   /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild/AutoBuild_run_1_/TEMP0/RUN_FILE_4 Oct 31 13:00
5085585 pluskal RUN   normal     tak         it-c05b12   phenix.autobuild autobuild_config.eff Oct 31 14:05
5085604 pluskal RUN   normal     it-c05b12   it-c05b12   /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild_not_in_place/AutoBuild_run_1_/TEMP0/RUN_FILE_1 Oct 31 14:44
5085611 pluskal RUN   normal     it-c05b12   it-c05b12   /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild_not_in_place/AutoBuild_run_1_/TEMP0/RUN_FILE_8 Oct 31 14:44
5085107 pluskal RUN   normal     it-c01b13   it-c01b07   /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild/AutoBuild_run_1_/TEMP0/RUN_FILE_2 Oct 31 13:00
5085607 pluskal RUN   normal     it-c05b12   it-c01b07   /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild_not_in_place/AutoBuild_run_1_/TEMP0/RUN_FILE_4 Oct 31 14:44
5085608 pluskal RUN   normal     it-c05b12   it-c01b05   /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild_not_in_place/AutoBuild_run_1_/TEMP0/RUN_FILE_5 Oct 31 14:44
5085609 pluskal RUN   normal     it-c05b12   it-c01b14   /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild_not_in_place/AutoBuild_run_1_/TEMP0/RUN_FILE_6 Oct 31 14:44
5085702 pluskal RUN   normal     tak         it-c01b14   ./assemble.nf Oct 31 15:08
5085610 pluskal RUN   normal     it-c05b12   it-c05b07   /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild_not_in_place/AutoBuild_run_1_/TEMP0/RUN_FILE_7 Oct 31 14:44
5085108 pluskal RUN   normal     it-c01b13   it-c01b12   /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild/AutoBuild_run_1_/TEMP0/RUN_FILE_3 Oct 31 13:00
5085606 pluskal RUN   normal     it-c05b12   it-c01b12   /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild_not_in_place/AutoBuild_run_1_/TEMP0/RUN_FILE_3 Oct 31 14:44
5085034 pluskal RUN   normal     tak         it-c01b13   phenix.autobuild autobuild_config.eff Oct 31 12:56
5085106 pluskal RUN   normal     it-c01b13   it-c01b13   /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild/AutoBuild_run_1_/TEMP0/RUN_FILE_1 Oct 31 13:00
5085605 pluskal RUN   normal     it-c05b12   it-c01b13   /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild_not_in_place/AutoBuild_run_1_/TEMP0/RUN_FILE_2 Oct 31 14:44
5085612 pluskal RUN   normal     it-c05b12   it-c01b13   /lab/solexa_weng/xtal_data/Data_processing/Tomas_Pluskal/20181031_KLR1_autobuild_not_in_place/AutoBuild_run_1_/TEMP0/RUN_FILE_9 Oct 31 14:44
5085703 pluskal RUN   normal     it-c01b14   it-r16u31:it-r16u31:it-r16u31:it-r16u31:it-r16u31:it-r16u31:it-r16u31:it-r16u31:it-r16u31:it-r16u31 nf-trinityInchwormChrysalis Oct 31 15:08
pditommaso commented 5 years ago

Thanks. I'm wondering whether the JOBID can be truncated without the -w option.

tomas-pluskal commented 5 years ago

No, I think the JOBID is always printed in full:

$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
5085109 pluskal RUN   normal     it-c01b13   it-c01b11   *UN_FILE_4 Oct 31 13:00
5085585 pluskal RUN   normal     tak         it-c05b12   *onfig.eff Oct 31 14:05
5085107 pluskal RUN   normal     it-c01b13   it-c01b07   *UN_FILE_2 Oct 31 13:00
5085702 pluskal RUN   normal     tak         it-c01b14   *semble.nf Oct 31 15:08
5085610 pluskal RUN   normal     it-c05b12   it-c05b07   *UN_FILE_7 Oct 31 14:44
5085108 pluskal RUN   normal     it-c01b13   it-c01b12   *UN_FILE_3 Oct 31 13:00
5085034 pluskal RUN   normal     tak         it-c01b13   *onfig.eff Oct 31 12:56
5085106 pluskal RUN   normal     it-c01b13   it-c01b13   *UN_FILE_1 Oct 31 13:00
5085703 pluskal RUN   normal     it-c01b14   it-r16u31   *Chrysalis Oct 31 15:08
                                             it-r16u31
                                             it-r16u31
                                             it-r16u31
                                             it-r16u31
                                             it-r16u31
                                             it-r16u31
                                             it-r16u31
                                             it-r16u31
                                             it-r16u31
pditommaso commented 5 years ago

Not so sure; from a bit of googling, it looks like the job id can take up to 10 digits.

I wouldn't be surprised if it's truncated when it takes more than 7 digits, as shown in the above table.

tomas-pluskal commented 5 years ago

You're right. Apparently, this can be configured: https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_config_ref/lsf.conf.lsb_jobid_disp_length.5.html

pditommaso commented 5 years ago

gotcha!

pditommaso commented 5 years ago

I've uploaded a snapshot with this patch. You can try it by running:

NXF_VER=18.11.0-SNAPSHOT nextflow run .. etc
tomas-pluskal commented 5 years ago

It still displays the same warning for me:

WARN: [LSF] queue status cannot be fetched
- cmd executed: bjobs -o JOBID STAT SUBMIT_TIME delimiter=',' -noheader
- exit status : 255
- output      :
  bjobs: illegal option -- o
  Usage: bjobs [-A] [-a] [-aps| -l | -w ] [-d] [-p] [-s] [-r] [-W] [-X] [-x]
               [-app application_profile_name] [-g job_group_name]
               [-sla service_class_name] [-J job_name] [-Jd job_description] [-Lp ls_project_name]
               [-m "host_name ..." | -m host_group | -m cluster_name] 
               [-N host_name | -N host_model | -N cpu_factor]
               [-P project_name] [-q queue_name]
               [-u user_name | -u user_group | -u all | -G user_group]
               [jobId | "jobId[index_list]" ...]

         bjobs [-a] [-aps] [-d] [-p] [-s] [-r] [-WL] [-WF] [-WP] [-X] [-x]
               [-app application_profile_name] [-g job_group_name]
               [-sla service_class_name] [-J job_name] [-Jd job_description] [-Lp ls_project_name]
               [-m "host_name ..." | -m host_group | -m cluster_name] 
               [-P project_name] [-q queue_name]
               [-u user_name | -u user_group | -u all | -G user_group]
               [jobId | "jobId[index_list]" ...]

         bjobs [-a] [-l | -w] [-d] [-p] [-s] [-ss] [-r] [-X] [-x]
               [-app application_profile_name] [-g job_group_name]
               [-sla service_class_name] [-J job_name] [-Jd job_description] [-Lp ls_project_name]
               [-m "host_name ..." | -m host_group | -m cluster_name] 
               [-P project_name] [-q queue_name]
               [-u user_name | -u user_group | -u all | -G user_group]
               [jobId | "jobId[index_list]" ...]

         bjobs [-h] [-V]

               [-WL | -WP | -WF] output key:
               X: The real run time has exceeded the configured run time
               L: A run limit exists but job does not have a configured run time
               E: A configured run time exists and has not been exceeded 
pditommaso commented 5 years ago

Can you double-check that you have the right version? The following command should print:

NXF_VER=18.11.0-SNAPSHOT nextflow info
  Version: 18.11.0-SNAPSHOT build 5014
  Modified: 31-10-2018 17:32 UTC (18:32 CEST)
tomas-pluskal commented 5 years ago

Seems like it doesn't update:

$ NXF_VER=18.11.0-SNAPSHOT nextflow info
  Version: 18.11.0-SNAPSHOT build 5011
  Modified: 28-10-2018 17:45 UTC (13:45 EDT)

Why could that be?

pditommaso commented 5 years ago

If so run:

NXF_VER=18.11.0-SNAPSHOT CAPSULE_RESET=1 nextflow info

then:

NXF_VER=18.11.0-SNAPSHOT nextflow run .. etc
tomas-pluskal commented 5 years ago

Yes, that helped. It does not show the warning anymore.

Thanks for the troubleshooting!

tomas-pluskal commented 5 years ago

Hi, I noticed that my nextflow jobs now fail with errors like this:

[warm up] executor > lsf
[af/9267bf] Submitted process > trinityInchwormChrysalis
ERROR ~ Error executing process > 'trinityInchwormChrysalis'

Caused by:
 Process `trinityInchwormChrysalis` terminated for an unknown reason -- Likely it has been terminated by the external system

This is my nextflow version and nextflow config:

$ NXF_VER=18.11.0-SNAPSHOT nextflow info
  Version: 18.11.0-SNAPSHOT build 5015
  Modified: 12-11-2018 16:50 UTC (11:50 EDT)
  System: Linux 3.13.0-151-generic
  Runtime: Groovy 2.5.3 on Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12
  Encoding: UTF-8 (UTF-8)

$ NXF_VER=18.11.0-SNAPSHOT nextflow config
process {
   executor = 'lsf'
}

executor {
   queueSize = 1000
}

env {
   PATH = '/lab/solexa_weng/testtube/trinityrnaseq-Trinity-v2.8.4:/lab/solexa_weng/testtube/TransDecoder-TransDecoder-v5.0.2:/lab/solexa_weng/testtube/transrate-1.0.3-linux-x86_64:/lab/solexa_weng/testtube/signalp-4.1:/lab/solexa_weng/testtube/tmhmm-2.0c/bin:/lab/solexa_weng/testtube/ncbi-blast-2.7.1+/bin:/lab/solexa_weng/testtube/bowtie2-2.3.4.3-linux-x86_64:/lab/solexa_weng/testtube/nextflow-0.32.0:/usr/local/jdk1.8/bin:/home/pluskal/.gem/ruby/2.3.0/bin:/lab/solexa_weng/testtube/miniconda3/bin:/home/pluskal/bin:/nfs/lsf8/8.0/linux2.6-glibc2.3-x86_64/etc:/nfs/lsf8/8.0/linux2.6-glibc2.3-x86_64/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:::/usr/local/genome/bin'
}

report {
   enabled = true
}

Strangely enough, the job that is reported as "terminated for an unknown reason" is still running on the LSF cluster:

$ bjobs -w 
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
5132674 pluskal RUN   normal     it-c01b07   it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy nf-trinityInchwormChrysalis Nov 13 00:16

I suppose there might still be some issues with handling the LSF queue..?

I have attached the .nextflow.log file

nextflow_log.txt

tomas-pluskal commented 5 years ago

@pditommaso I noticed in the log file that the nextflow process crashed exactly 5 minutes after starting. I believe 5 min is the timer for checking the LSF queue, so it definitely has something to do with that:

Nov-13 00:21:33.104 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor lsf > tasks to be completed: 1 -- pending tasks are shown below
~> TaskHandler[jobId: 5132674; id: 1; name: trinityInchwormChrysalis; status: RUNNING; exit: -; error: -; workDir: /lab/weng_scratch/Tomas/nextflow_assemble/work/af/9267bf1472eb3edbb83fd42040de86 started: 1542086223123; exited: -; ]
Nov-13 00:23:19.147 [Task monitor] DEBUG nextflow.file.FileHelper - Can't list folder: /lab/weng_scratch/Tomas/nextflow_assemble/work/af/9267bf1472eb3edbb83fd42040de86 -- Exit status: 143
Nov-13 00:26:18.254 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exist status for process TaskHandler[jobId: 5132674; id: 1; name: trinityInchwormChrysalis; status: RUNNING; exit: -; error: -; workDir: /lab/weng_scratch/Tomas/nextflow_assemble/work/af/9267bf1472eb3edbb83fd42040de86 started: 1542086223123; exited: -; ] -- exitStatusReadTimeoutMillis: 270000; delta: 370103
Current queue status:
>   (null)
pditommaso commented 5 years ago

Can you please try using the 18.11.0-edge release and attaching here the complete .nextflow.log file?

tomas-pluskal commented 5 years ago

I tried it: NXF_VER=18.11.0-edge nextflow assemble.nf

It produces the same error:

[warm up] executor > lsf
[21/283e00] Submitted process > trinityInchwormChrysalis
ERROR ~ Error executing process > 'trinityInchwormChrysalis'

Caused by:
 Process `trinityInchwormChrysalis` terminated for an unknown reason -- Likely it has been terminated by the external system

The log file is attached: nextflow_log.txt

pditommaso commented 5 years ago

Is the actual task taking less than 5 mins, or is it still running?

tomas-pluskal commented 5 years ago

It is still running.

pditommaso commented 5 years ago

What's the output of bjobs -w?

tomas-pluskal commented 5 years ago
$ bjobs -w
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
5132676 pluskal RUN   normal     tak         it-c01b04   /lab/solexa_weng/testtube/transcriptome/check_strandedness.sh SRR2006304_1.fastq.gz SRR2006307_1_fixed.fastq.gz Nov 13 00:41
5137332 pluskal RUN   normal     it-c01b11   it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy nf-trinityInchwormChrysalis Nov 13 11:07
5137357 pluskal RUN   normal     it-c01b11   it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy nf-trinityInchwormChrysalis Nov 13 11:35
pditommaso commented 5 years ago

What if instead you use bjobs -w -q <the queue name you put in the config file>?

tomas-pluskal commented 5 years ago

I did not specify any queue in the config file. All jobs run in the default queue, I guess.

pditommaso commented 5 years ago

Something strange is happening. Please execute one more time using this command:

NXF_VER=18.11.0-edge nextflow -trace nextflow.executor run assemble.nf

Then upload again the log file when it stops.

tomas-pluskal commented 5 years ago

Strangely enough, when I started it with this command, it doesn't seem to crash. I have attached the current log file (but nextflow is still running, as it should be).

nextflow_log.txt

tomas-pluskal commented 5 years ago

OK, it did crash in the end, but it took much longer.

[warm up] executor > lsf
[ba/cbcdea] Submitted process > trinityInchwormChrysalis
ERROR ~ Error executing process > 'trinityInchwormChrysalis'

Caused by:
 Process `trinityInchwormChrysalis` terminated for an unknown reason -- Likely it has been terminated by the external system

The log is attached below.

nextflow_log.txt

From the log it seems that bjobs -w returned no output but exited with exit code 0. That is a bit strange; I have never seen that happen when I run bjobs manually. Perhaps nextflow could double-check whether bjobs properly printed the header line (JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME).
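One hypothetical way such a sanity check could look (a sketch, not the fix Nextflow actually applied):

```python
# Hypothetical sketch: treat a `bjobs` result as valid only if it contains
# the expected header line, so a transient empty reply (exit code 0, no
# output) can be retried instead of being taken as "no jobs in the queue".
def bjobs_output_valid(output):
    for line in output.splitlines():
        if line.split()[:2] == ["JOBID", "USER"]:
            return True
    return False
```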

pditommaso commented 5 years ago

It looks like there's a hiccup with your cluster. NF periodically runs bjobs to fetch task statuses, e.g.

Nov-13 12:41:59.838 [Task monitor] TRACE n.executor.AbstractGridExecutor - [LSF] getting queue status > cmd: bjobs -w
Nov-13 12:41:59.861 [Task monitor] TRACE n.executor.AbstractGridExecutor - [LSF] queue status > cmd exit: 0
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
5132676 pluskal RUN   normal     tak         it-c01b04   /lab/solexa_weng/testtube/transcriptome/check_strandedness.sh SRR2006304_1.fastq.gz SRR2006307_1_fixed.fastq.gz Nov 13 00:41
5137404 pluskal RUN   normal     it-c03b03   it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy nf-trinityInchwormChrysalis Nov 13 12:18
5137399 pluskal RUN   normal     tak         it-c03b03   NXF_VER=18.11.0-edge nextflow -trace nextflow.executor run assemble.nf Nov 13 12:17

A few minutes later, bjobs returns an empty result:

Nov-13 12:42:59.840 [Task monitor] TRACE n.executor.AbstractGridExecutor - [LSF] getting queue status > cmd: bjobs -w
Nov-13 12:42:59.861 [Task monitor] TRACE n.executor.AbstractGridExecutor - [LSF] queue status > cmd exit: 0

However, the job is still running; indeed, it appears again later in the log:

Nov-13 12:44:59.853 [Task monitor] TRACE n.executor.AbstractGridExecutor - [LSF] getting queue status > cmd: bjobs -w
Nov-13 12:44:59.870 [Task monitor] TRACE n.executor.AbstractGridExecutor - [LSF] queue status > cmd exit: 0
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
5132676 pluskal RUN   normal     tak         it-c01b04   /lab/solexa_weng/testtube/transcriptome/check_strandedness.sh SRR2006304_1.fastq.gz SRR2006307_1_fixed.fastq.gz Nov 13 00:41
5137404 pluskal RUN   normal     it-c03b03   it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy:it-bigboy nf-trinityInchwormChrysalis Nov 13 12:18
5137399 pluskal RUN   normal     tak         it-c03b03   NXF_VER=18.11.0-edge nextflow -trace nextflow.executor run assemble.nf Nov 13 12:17

However, the empty result makes NF mark that job as being in an inconsistent state. It sets a timeout so that the job fails fast as soon as it exits (successfully) later, because the .exitcode file has apparently not been created yet:

Content of workDir: /lab/weng_scratch/Tomas/nextflow_assemble/work/ba/cbcdeac0e24debb012453375935c56
> total 204
> drwxrwsr-x 3 pluskal wenglab  306 Nov 13 12:18 .
> drwxrwsr-x 3 pluskal wenglab   48 Nov 13 12:18 ..
> -rw-rw-r-- 1 pluskal wenglab    0 Nov 13 12:18 .command.begin
> -rw-rw-r-- 1 pluskal wenglab    0 Nov 13 12:18 .command.err
> -rw-rw-r-- 1 pluskal wenglab 7626 Nov 13 13:04 .command.log
> -rw-rw-r-- 1 pluskal wenglab 7626 Nov 13 13:04 .command.out
> -rw-rw-r-- 1 pluskal wenglab 2921 Nov 13 12:18 .command.run
> -rw-rw-r-- 1 pluskal wenglab  247 Nov 13 12:18 .command.sh
> -rw-rw-r-- 1 pluskal wenglab 3500 Nov 13 12:18 .command.stub
> -rw-rw-r-- 1 pluskal wenglab  210 Nov 13 13:08 .command.trace
> lrwxrwxrwx 1 pluskal wenglab   53 Nov 13 12:18 samples.txt -> /lab/weng_scratch/Tomas/nextflow_assemble/samples.txt
> drwxrwsr-x 3 pluskal wenglab  705 Nov 13 13:04 trinity_out_dir
pditommaso commented 5 years ago

I'm closing this issue and opening a separate one to track this problem.

tomas-pluskal commented 5 years ago

Just wanted to mention that the bjobs command may occasionally print additional messages like "LSF is processing your request. Please wait ...", as shown below. This is something nextflow should anticipate.

$ bjobs
LSF is processing your request. Please wait ...
LSF is processing your request. Please wait ...
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
5157393 pluskal RUN   normal     it-c05b07   it-c05b10   *l_cmd_bd) Nov 14 13:00
5157493 pluskal RUN   normal     it-c05b07   it-c05b10   *l_cmd_cn) Nov 14 13:00
5157552 pluskal RUN   normal     it-c05b07   it-c05b10   *l_cmd_dl) Nov 14 13:00
5157610 pluskal RUN   normal     it-c05b07   it-c05b10   *l_cmd_fe) Nov 14 13:00
5157674 pluskal RUN   normal     it-c05b07   it-c05b10   *l_cmd_gf) Nov 14 13:00
5157710 pluskal RUN   normal     it-c05b07   it-c05b10   *l_cmd_fv) Nov 14 13:00
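Such notice lines could be filtered out before parsing, e.g. (an illustrative sketch, not Nextflow's actual code):

```python
# Hypothetical sketch: drop LSF's transient "please wait" notices so they
# are not mistaken for job rows when parsing `bjobs` output.
def strip_lsf_notices(output):
    return "\n".join(
        line for line in output.splitlines()
        if not line.startswith("LSF is processing your request")
    )

raw = """LSF is processing your request. Please wait ...
LSF is processing your request. Please wait ...
JOBID   USER    STAT  QUEUE
5157393 pluskal RUN   normal"""
```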
pditommaso commented 5 years ago

This output is already parsed correctly. Thanks for pointing it out in any case.

pachiras commented 5 years ago

PBS Pro gives me the same messages but a different exit status.

$ ./main.nf

N E X T F L O W  ~  version 18.10.1
Launching './main.nf' [hungry_mclean] - revision: b639ebf99c
[warm up] executor > pbs
[8b/2793ea] Submitted process > P00_ReadDecompression (1)
WARN: [PBS] queue status cannot be fetched > exit status: 2

[1f/7b5751] Submitted process > P01_FullMapping (1)
WARN: [PBS] queue status cannot be fetched > exit status: 2
WARN: [PBS] queue status cannot be fetched > exit status: 2
WARN: [PBS] queue status cannot be fetched > exit status: 2
WARN: [PBS] queue status cannot be fetched > exit status: 2
WARN: [PBS] queue status cannot be fetched > exit status: 2

[13/71ad23] Submitted process > P02_FullMappingExtraction (1)
WARN: [PBS] queue status cannot be fetched > exit status: 2
WARN: [PBS] queue status cannot be fetched > exit status: 2
WARN: [PBS] queue status cannot be fetched > exit status: 2

[3d/29877d] Submitted process > P03_Bam2Fastq (1)
[cb/6789a0] Submitted process > P03_Bam2Fastq (2)
pditommaso commented 5 years ago

@pachiras Please report this in a different issue including the .nextflow.log file.