nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Grid queue status returns wrong data when a process changes the submission queue dynamically #412

Closed: sergpolly closed this issue 7 years ago

sergpolly commented 7 years ago

Consider this simplistic pipeline:

in1 = Channel.from( 25, )
in2 = Channel.from( 55, )

process proc1 {
    tag "sleep $ccc in proc1"
    executor 'lsf' 
    queue { task.attempt<=1 ? 'short' : 'long' }
    time { task.attempt<=1 ? '20m' : '5h' }
    maxRetries 2
    errorStrategy 'retry'
    input:
    val ccc from in1

    script:
    """
    echo 'will be sleeping for ${ccc} minutes'
    sleep ${ccc}m
    """
}

process proc2 {
    tag "sleep $ccc in proc2"
    executor 'lsf' 
    queue { task.attempt<=1 ? 'short' : 'long' }
    time { task.attempt<=1 ? '50m' : '5h' }
    maxRetries 2
    errorStrategy 'retry'
    input:
    val ccc from in2

    script:
    """
    echo 'will be sleeping for ${ccc} minutes'
    sleep ${ccc}m
    """
}

proc1 will fail after 20 minutes due to TIMELIMIT and will be re-submitted by nextflow to the long queue, while proc2 keeps running in the short queue for 30 more minutes. For some time the jobs corresponding to the re-submitted proc1 are not visible to nextflow, because nextflow inquires about the status of the submitted jobs using the command bjobs -o JOBID STAT SUBMIT_TIME delimiter=',' -noheader -q short, but the jobs for the re-submitted proc1 are now on a different queue: -q long.

Such a mixed-queue situation may lead to a terminated for an unknown reason error, especially if nextflow decides to inquire about the .exitcode of the jobs associated with the re-submitted proc1.
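
For illustration, here is a minimal Groovy sketch of this per-queue status query (not the actual executor code; it only reuses the bjobs format quoted above):

// Illustrative only: the grid status command is built for one queue at a time,
// so a task retried on a different queue does not appear in this query's output.
List<String> lsfStatusCommand(String queue) {
    ['bjobs', '-o', "JOBID STAT SUBMIT_TIME delimiter=','", '-noheader', '-q', queue]
}

println lsfStatusCommand('short').join(' ')  // what nextflow keeps polling while proc1 attempt 1 is in the short queue
println lsfStatusCommand('long').join(' ')   // where the re-submitted proc1 actually lives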

pditommaso commented 7 years ago

Interesting, what evidence do you have for this issue?

sergpolly commented 7 years ago

This particular case didn't fail, but I can show how the re-submitted process was invisible to nextflow:

Right before failing due to TIMELIMIT:

Jul-28 00:04:37.856 [Task monitor] TRACE n.executor.AbstractGridExecutor - JobId `4728290` active status: true
Jul-28 00:04:42.854 [Task monitor] TRACE n.executor.AbstractGridExecutor - Getting grid queue status: bjobs -o JOBID STAT SUBMIT_TIME delimiter=',' -noheader -q short
Jul-28 00:04:42.977 [Task monitor] TRACE n.executor.AbstractGridExecutor - LSF status result > exit: 0

TIMELIMIT failing ...

Jul-28 00:04:47.868 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 4728287; id: 3; name: proc1 (sleep 25 in proc1); status: COMPLETED; exit: 140; error: -; workDir: /home/sv49w/NXF_LSF_bug/work/b1/98aca7e3dd472ce9da2c1c9443f884 started: 1501213472774; exited: 2017-07-28T04:04:45.380667Z; ]
Jul-28 00:04:47.904 [Task monitor] WARN  nextflow.processor.TaskProcessor - Process `proc1 (sleep 25 in proc1)` terminated with an error exit status (140) -- Execution is retried (1)
Jul-28 00:04:47.943 [Task monitor] TRACE n.executor.AbstractGridExecutor - Getting grid queue status: bjobs -o JOBID STAT SUBMIT_TIME delimiter=',' -noheader -q short
Jul-28 00:04:48.149 [Task monitor] TRACE n.executor.AbstractGridExecutor - LSF status result > exit: 0

resubmission:

Jul-28 00:04:48.432 [Task submitter] DEBUG nextflow.executor.GridTaskHandler - Submitted process proc1 (sleep 25 in proc1) > lsf jobId: 4728293; workDir: /home/sv49w/NXF_LSF_bug/work/d8/cb1deb6e67b4d47cf9aa4998e5e7af
Jul-28 00:04:48.432 [Task submitter] INFO  nextflow.Session - [d8/cb1deb] Re-submitted process > proc1 (sleep 25 in proc1)
Jul-28 00:04:52.854 [Task monitor] TRACE n.executor.AbstractGridExecutor - Queue status:

invisibility:

Jul-28 00:04:52.865 [Task monitor] TRACE n.executor.AbstractGridExecutor - Queue status map does not contain jobId: `4728293`
Jul-28 00:04:57.855 [Task monitor] TRACE n.executor.AbstractGridExecutor - Queue status:

after some time nextflow starts inquiring about the long queue:

Jul-28 00:05:42.857 [Task monitor] TRACE n.executor.AbstractGridExecutor - JobId `4728289` active status: true
Jul-28 00:05:42.858 [Task monitor] TRACE n.executor.AbstractGridExecutor - Getting grid queue status: bjobs -o JOBID STAT SUBMIT_TIME delimiter=',' -noheader -q long
Jul-28 00:05:42.982 [Task monitor] TRACE n.executor.AbstractGridExecutor - LSF status result > exit: 0

sergpolly commented 7 years ago

My big pipeline failed again - see gitter. The difference was that the "big" pipeline also asked for the exit status of a job that was invisible to nextflow.

pditommaso commented 7 years ago

Can you upload the log file here?

sergpolly commented 7 years ago

nnn.txt

sergpolly commented 7 years ago

The exact simple pipeline that I ran is here, just in case: https://github.com/sergpolly/NXF_LSF_bug

pditommaso commented 7 years ago

Yes, you are right, there's a hole of about 30 seconds in which the executor doesn't see that the job was submitted to a different queue. Good spot! I'm going to patch it soon.

However I'm still not 100% sure that this is the reason that causes the failure in your big pipeline. Do you have the detailed log for that one?

sergpolly commented 7 years ago

Here is the piece of the log file starting from 10 minutes before the error, around 00:49 (log time): https://pastebin.com/krv21au3

sergpolly commented 7 years ago

I was posting some stuff on the nextflow gitter - there are some details there.

pditommaso commented 7 years ago

I need some coffee before entering gitter, sorry :) I will upload a new snapshot later today. Thanks for the effort.

pditommaso commented 7 years ago

I've uploaded a patch that should solve the problem. Please run the following to update your current version:

NXF_VER=0.25.3-SNAPSHOT CAPSULE_RESET=1 nextflow info

It should print:

  Version: 0.25.3-SNAPSHOT build 4503
  Modified: 28-07-2017 10:42 UTC (12:42 CEST)

Then launch your pipeline with the command

NXF_VER=0.25.3-SNAPSHOT nextflow -trace nextflow.executor.AbstractGridExecutor run .. etc

When done, please share with me the produced log file.

sergpolly commented 7 years ago

I re-ran the small example pipeline: updated_nnn.txt

nextflow is now much more thorough about checking the job status in the different queues, but there is still a small window of time when it couldn't find the right info.

pditommaso commented 7 years ago

Did it fail?

sergpolly commented 7 years ago

No, the small pipeline never failed.

This time, I think the delay is explained by a lag between job submission and the moment the job starts to show up in the output of bjobs.

pditommaso commented 7 years ago

It looks solved. I will wait for the test with your big pipeline. If it's OK I will upload a new release.

sergpolly commented 7 years ago

0.25.3 was unusual in the sense that it resubmitted my failing job 3 times instead of the 2 that I required. Because of that, my "big" pipeline is still running and I don't want to kill it just yet (I want the processed data). I'll resubmit it for testing purposes after it finishes.

sergpolly commented 7 years ago

The big pipeline would take ~8-10 hours to get to the point of the dynamic queue switch - it's not quick.

pditommaso commented 7 years ago

No hurry. I will wait for your feedback.

sergpolly commented 7 years ago

Just a couple of quick questions before I proceed: 1) just to make sure, does "resubmitted my failing job 3 times instead of 2" sound OK for 0.25.3? It was different in the previous version - it would kill everything instead. 2) Does nextflow allow for a lag between submission and the moment when a job starts to appear in the bjobs output? This lag can be up to ~1 min on our LSF cluster.

pditommaso commented 7 years ago

  1. It should. The wrong behaviour was the process xxx terminated for an unknown reason error.
  2. Yup

sergpolly commented 7 years ago

So, if I understood correctly, process xxx terminated for an unknown reason was so wrong that it deserves an extra retry by nextflow (correct me if I'm wrong). Sounds great to me as long as it is expected behaviour.

Proceeding with further tests. Thank you, Paolo

pditommaso commented 7 years ago

What was happening is that NF was querying the job status on the previous queue even when the job had been submitted to a different one. Since the job was not found, it was assumed to have terminated, hence NF started to wait for the .exitcode termination file. That file was not found because in reality the job was still running, but in that undefined condition NF forced the job to terminate, causing the pipeline to stop.
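
In pseudo-code, that decision path looks roughly like this (an illustrative Groovy sketch with made-up names, not the actual Nextflow code):

// Illustrative sketch of the polling decision described above (names are invented).
String resolveTaskState(Map queueStatus, String jobId, File exitFile, long missingSinceMillis, long graceMillis) {
    if( queueStatus.containsKey(jobId) )
        return 'RUNNING'                                 // job still visible on the queue being polled
    if( exitFile.exists() )
        return 'COMPLETED exit=' + exitFile.text.trim()  // normal completion, exit code found on disk
    if( System.currentTimeMillis() - missingSinceMillis < graceMillis )
        return 'WAITING'                                 // give the scheduler/filesystem time to catch up
    return 'UNKNOWN'                                     // reported as "terminated for an unknown reason"
}

Before the patch the queueStatus map was built by querying the previous queue only, so a job re-submitted to another queue ended up in the UNKNOWN branch once the grace period expired.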

sergpolly commented 7 years ago

That was the scenario for nextflow 0.24.x. In the case of 0.25.3, nextflow would do everything that you are saying, but then it would re-submit that terminated job again (in my case, a 3rd time) instead of stopping the pipeline. I'm still puzzled whether this is expected behavior or not.

pditommaso commented 7 years ago

I'm starting to get confused between the different versions. Please do the tests, then we will discuss the results.

sergpolly commented 7 years ago

Alright.

sergpolly commented 7 years ago

I think the issue in the title, Grid queue status returns wrong data when a process changes the submission queue dynamically, is resolved; here are the logs from the big pipeline run: https://www.dropbox.com/s/x3b5g97h53pv9fc/fixed_nnn.txt?dl=0

nextflow handles dynamic queue switching beautifully. Thank you, Paolo!

pditommaso commented 7 years ago

Released with version 0.25.3.

pditommaso commented 7 years ago

@sergpolly This bug was really nasty. Thanks a lot for your detailed report!

IdoBar commented 3 years ago

@pditommaso I'm having the same problem on the QLD HPC system (QRIS Awoonga, running PBSpro): jobs that get redirected to another queue (on another server) are being reported as "failed" with the error message terminated for an unknown reason -- Likely it has been terminated by the external system, even though they complete successfully with exit status 0 (which nextflow is unable to get). These are the relevant messages in the log file:

Dec.-25 09:37:12.451 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exit status for process TaskHandler[jobId: 510838.awonmgr2; id: 122; name: genotyping_freebayes (D10); status: RUNNING; exit: -; error: -; workDir: /gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708 started: 1608852607445; exited: -; ] -- exitStatusReadTimeoutMillis: 270000; delta: 270020
Current queue status:
>   job: 510842.awonmgr2: PENDING
>   job: 510843.awonmgr2: PENDING
>   job: 510844.awonmgr2: PENDING
>   job: 510845.awonmgr2: PENDING
>   job: 510846.awonmgr2: PENDING
>   job: 510847.awonmgr2: PENDING
>   job: 510848.awonmgr2: PENDING
>   job: 510849.awonmgr2: PENDING
>   job: 510850.awonmgr2: PENDING
>   job: 510851.awonmgr2: PENDING
>   job: 510852.awonmgr2: PENDING
>   job: 510853.awonmgr2: PENDING
>   job: 510854.awonmgr2: PENDING
>   job: 510855.awonmgr2: PENDING

Content of workDir: /gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708
null
Dec.-25 09:37:12.452 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 510838.awonmgr2; id: 122; name: genotyping_freebayes (D10); status: COMPLETED; exit: -; error: -; workDir: /gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708 started: 1608852607445; exited: -; ]
Dec.-25 09:37:12.468 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'genotyping_freebayes (D10)' -- Cause: java.nio.file.NoSuchFileException: /gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708/.command.out
Dec.-25 09:37:12.482 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'genotyping_freebayes (D10)' -- Cause: java.nio.file.NoSuchFileException: /gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708/.command.err
Dec.-25 09:37:12.483 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'genotyping_freebayes (D10)' -- Cause: java.nio.file.NoSuchFileException: /gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708/.command.log
Dec.-25 09:37:12.486 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'genotyping_freebayes (D10)'
Caused by:
  Process `genotyping_freebayes (D10)` terminated for an unknown reason -- Likely it has been terminated by the external system
Command executed:
  freebayes -f CanFam3.1.fasta -p 2 -C 5 -g 0 D10.trimmed.bam > D10.freebayes.vcf
  pigz -p 2 D10.freebayes.vcf
Command exit status:
  -
Command output:
  (empty)

Work dir:
  /gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
Dec.-25 09:37:12.493 [Task monitor] INFO  nextflow.Session - Execution cancelled -- Finishing pending tasks before exit
Dec.-25 09:37:12.551 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'genotyping_freebayes (D10)' -- Cause: java.nio.file.NoSuchFileException: /gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708/.command.err
Dec.-25 09:37:12.552 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'genotyping_freebayes (D10)' -- Cause: java.nio.file.NoSuchFileException: /gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708/.command.out

And this is the process qstat summary:

Job Id: 510838.awonmgr2
    Job_Name = nf-genotyping_f
    Job_Owner = ibar@awoonga1.local
    job_state = M
    queue = Short@flashmgr2
    server = awonmgr2
    Account_Name = qris-gu
    Checkpoint = u
    ctime = Fri Dec 25 09:28:22 2020
    Error_Path = awoonga1.local:/gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708/nf-genotyping_f.e510838
    Hold_Types = n
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Fri Dec 25 09:29:59 2020
    Output_Path = awoonga1.local:/gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708/.command.log
    Priority = 0
    qtime = Fri Dec 25 09:28:22 2020
    Rerunable = True
    Resource_List.interact1 = 0
    Resource_List.mem = 4096mb
    Resource_List.ncpus = 2
    Resource_List.nodect = 1
    Resource_List.place = free
    Resource_List.select = 1:ncpus=2:mem=4096mb
    Resource_List.walltime = 04:00:00
    substate = 92
    Variable_List = PBS_O_SYSTEM=Linux,PBS_O_SHELL=/bin/bash,PBS_O_HOME=/home/ibar,PBS_O_LOGNAME=ibar,PBS_O_WORKDIR=/gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708,PBS_O_LANG=en_AU.UTF-8,PBS_O_PATH=/home/ibar/.pyenv/bin:/sw/Containers/singularity/3.5.0/bin:/sw/Containers/singularity/images:/home/ibar/.pyenv/bin:/home/ibar/bin:/home/ibar/.pyenv/versions/miniconda-latest/bin:/home/ibar/.pyenv/versions/miniconda-latest/condabin:/home/ibar/.pyenv/bin:/home/ibar/bin:/opt/gnu/gcc/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ganglia/bin:/opt/ganglia/sbin:/usr/java/latest/bin:/opt/pbs/bin:/opt/rocks/bin:/opt/rocks/sbin:/home/ibar/etc/tools/Annotation/BLAT:/opt/pbs/bin:/home/ibar/etc/tools/Annotation/BLAT,PBS_O_MAIL=/var/spool/mail/ibar,PBS_O_QUEUE=workq,PBS_O_HOST=awoonga1.local
    comment = Job finished at "flashmgr2"
    etime = Fri Dec 25 09:28:22 2020
    eligible_time = 35:46:34
    Submit_arguments = -N nf-genotyping_f .command.run
    history_timestamp = 1608852599
    project = _pbs_project_default

And the content of .command.log:

########################### Execution Started #############################
JobId:510838.awonmgr2
UserName:ibar
GroupName:qris-gu
ExecutionHost:fl118
###############################################################################
nxf-scratch-dir fl118.local:/nvme/pbs/tmpdir/pbs.510838.awonmgr2/nxf.urortU9s5E
########################### Job Execution History #############################
JobId:510838.awonmgr2
UserName:ibar
GroupName:qris-gu
JobName:nf-genotyping_f
SessionId:11147
ResourcesRequested:mem=4096mb,ncpus=2,place=free,walltime=04:00:00
ResourcesUsed:cpupercent=89,cput=00:12:28,mem=158040kb,ncpus=2,vmem=1129996kb,walltime=00:12:25
QueueUsed:Short
AccountString:qris-gu
ExitStatus:0
###############################################################################

This happens when jobs are being redirected from the main server (awonmgr2) to the flashmgr2 server. Happy to provide additional information if needed.
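
In case it helps, one thing I could try in the meantime is extending the grace period nextflow waits for the .exitcode file (the exitStatusReadTimeoutMillis: 270000 shown above); here is a minimal nextflow.config sketch, assuming the standard executor.exitReadTimeout setting is what controls that grace period:

// nextflow.config -- sketch only: raise the .exitcode grace period from the default 270 sec
// so that jobs redirected to another server have more time to surface their exit file.
executor {
    exitReadTimeout = '10 min'
}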

Thanks, Ido

pditommaso commented 3 years ago

Please open a new issue for that.