Interesting, what evidence do you have for this issue?
This particular case didn't fail, but I can show how the re-submitted process was invisible to nextflow:
Right before failing due to TIMELIMIT:
Jul-28 00:04:37.856 [Task monitor] TRACE n.executor.AbstractGridExecutor - JobId `4728290` active status: true
Jul-28 00:04:42.854 [Task monitor] TRACE n.executor.AbstractGridExecutor - Getting grid queue status: bjobs -o JOBID STAT SUBMIT_TIME delimiter=',' -noheader -q short
Jul-28 00:04:42.977 [Task monitor] TRACE n.executor.AbstractGridExecutor - LSF status result > exit: 0
TIMELIMIT failing ...
Jul-28 00:04:47.868 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 4728287; id: 3; name: proc1 (sleep 25 in proc1); status: COMPLETED; exit: 140; error: -; workDir: /home/sv49w/NXF_LSF_bug/work/b1/98aca7e3dd472ce9da2c1c9443f884 started: 1501213472774; exited: 2017-07-28T04:04:45.380667Z; ]
Jul-28 00:04:47.904 [Task monitor] WARN nextflow.processor.TaskProcessor - Process `proc1 (sleep 25 in proc1)` terminated with an error exit status (140) -- Execution is retried (1)
Jul-28 00:04:47.943 [Task monitor] TRACE n.executor.AbstractGridExecutor - Getting grid queue status: bjobs -o JOBID STAT SUBMIT_TIME delimiter=',' -noheader -q short
Jul-28 00:04:48.149 [Task monitor] TRACE n.executor.AbstractGridExecutor - LSF status result > exit: 0
resubmission:
Jul-28 00:04:48.432 [Task submitter] DEBUG nextflow.executor.GridTaskHandler - Submitted process proc1 (sleep 25 in proc1) > lsf jobId: 4728293; workDir: /home/sv49w/NXF_LSF_bug/work/d8/cb1deb6e67b4d47cf9aa4998e5e7af
Jul-28 00:04:48.432 [Task submitter] INFO nextflow.Session - [d8/cb1deb] Re-submitted process > proc1 (sleep 25 in proc1)
Jul-28 00:04:52.854 [Task monitor] TRACE n.executor.AbstractGridExecutor - Queue status:
invisibility:
Jul-28 00:04:52.865 [Task monitor] TRACE n.executor.AbstractGridExecutor - Queue status map does not contain jobId: `4728293`
Jul-28 00:04:57.855 [Task monitor] TRACE n.executor.AbstractGridExecutor - Queue status:
After some time nextflow starts inquiring about the long queue:
Jul-28 00:05:42.857 [Task monitor] TRACE n.executor.AbstractGridExecutor - JobId `4728289` active status: true
Jul-28 00:05:42.858 [Task monitor] TRACE n.executor.AbstractGridExecutor - Getting grid queue status: bjobs -o JOBID STAT SUBMIT_TIME delimiter=',' -noheader -q long
Jul-28 00:05:42.982 [Task monitor] TRACE n.executor.AbstractGridExecutor - LSF status result > exit: 0
My big pipeline failed again - see Gitter.
The difference was that the "big" pipeline also asked for the exit status of a job that was invisible to nextflow.
Can you upload the log file here?
The exact simple pipeline that I ran is here, just in case: https://github.com/sergpolly/NXF_LSF_bug
Yes, you are right, there's a gap of about 30 seconds in which the executor doesn't see that the job was submitted to a different queue. Good spot! I'm going to patch it soon.
However, I'm still not 100% sure that this is the reason that causes the failure in your big pipeline. Do you have the detailed log for that one?
Here is the piece of the log file starting from 10 minutes before the error around 00:49 (log time): https://pastebin.com/krv21au3
I was posting some things on the nextflow Gitter - there are some details there.
I need some coffee before entering Gitter, sorry :) I will upload a new snapshot later today. Thanks for the effort.
I've uploaded a patch that should solve the problem. Please run the following to update your current version:
NXF_VER=0.25.3-SNAPSHOT CAPSULE_RESET=1 nextflow info
It should print:
Version: 0.25.3-SNAPSHOT build 4503
Modified: 28-07-2017 10:42 UTC (12:42 CEST)
Then launch your pipeline with the command:
NXF_VER=0.25.3-SNAPSHOT nextflow -trace nextflow.executor.AbstractGridExecutor run .. etc
When done, please share the produced log file with me.
I re-ran the small example pipeline: updated_nnn.txt
nextflow is now much more thorough about checking the job status in different queues, but there is still a small time window when it couldn't find the right info.
Did it fail?
No, the small pipeline never failed.
This time, I think the delay is explained by a lag between job submission and the moment the job starts to show up in the output of bjobs.
It looks solved. I will wait for the test with your big pipeline. If it's OK I will upload a new release.
0.25.3 was unusual in the sense that it resubmitted my failing job 3 times instead of the 2 that I requested.
Because of that, my "big" pipeline is still running and I don't want to kill it just yet (I want the processed data). I'll resubmit it for testing purposes after it finishes.
The big pipeline takes ~8-10 hours to get to the point of the dynamic queue switch, so it's not quick.
No hurry. I will wait for your feedback.
Just a couple of quick questions before I proceed:
1) Just to make sure, does "resubmitted my failing job 3 times instead of 2" sound OK for 0.25.3? It was different in the previous version - it would kill everything instead.
2) Does nextflow allow for a lag between submission and the moment when a job starts to appear in the bjobs output? This lag can be up to ~1 min on our LSF cluster.
So, if I understood correctly, "process xxx terminated for an unknown reason" was so wrong that it deserved an extra retry by nextflow (correct me if I'm wrong).
Sounds great to me, as long as it is the expected behaviour.
Proceeding with further tests. Thank you, Paolo
What was happening is that NF was querying the job status on the previous queue even when the job had been submitted to a different one. Since the job was not found, it was assumed to have terminated, hence NF started to wait for the .exitcode termination file. That file was not found because in reality the job was still running, but in that undefined condition NF forced the job to terminate, causing the pipeline to stop.
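(A toy Groovy sketch of that failure mode - not Nextflow source code, just an illustration using the job IDs from the trace above: a status map built from a per-queue bjobs query against the original queue simply does not contain the re-submitted job.)

def statusByQueue = [
    'short': ['4728287': 'EXIT'],   // original proc1 attempt, killed by TIMELIMIT
    'long' : ['4728293': 'RUN']     // re-submitted proc1 attempt
]

// A monitor that only queries the original queue cannot see the
// re-submitted job and concludes that it has terminated ...
assert !statusByQueue['short'].containsKey('4728293')

// ... while a query against the new queue shows it is still running.
assert statusByQueue['long']['4728293'] == 'RUN'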
That was the scenario for nextflow 0.24... In the case of 0.25.3, nextflow would do everything that you are saying, but then it would re-submit that terminated job again (in my case, a 3rd time) instead of stopping the pipeline. I'm still puzzled whether this is expected behavior or not.
I'm starting to get confused between the different versions. Please do the tests, then we will discuss the results.
Alright.
I think the issue in the title (Grid queue status return wrong data when a process change the submission queue dynamically) is resolved. Here are the logs from the big pipeline run:
https://www.dropbox.com/s/x3b5g97h53pv9fc/fixed_nnn.txt?dl=0
nextflow handles dynamic queue switching beautifully.
Thank you, Paolo!
Released with version 0.25.3.
@sergpolly This bug was really nasty. Thanks a lot for your detailed report!
@pditommaso I'm having the same problem on the QLD HPC system (QRIS Awoonga, running PBSpro): jobs that get redirected to another queue (on another server) are being reported as "failed" with the error message "terminated for an unknown reason -- Likely it has been terminated by the external system", even though they complete successfully with exit status 0 (which nextflow is unable to get).
These are the relevant messages in the log file:
Dec.-25 09:37:12.451 [Task monitor] DEBUG nextflow.executor.GridTaskHandler - Failed to get exit status for process TaskHandler[jobId: 510838.awonmgr2; id: 122; name: genotyping_freebayes (D10); status: RUNNING; exit: -; error: -; workDir: /gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708 started: 1608852607445; exited: -; ] -- exitStatusReadTimeoutMillis: 270000; delta: 270020
Current queue status:
> job: 510842.awonmgr2: PENDING
> job: 510843.awonmgr2: PENDING
> job: 510844.awonmgr2: PENDING
> job: 510845.awonmgr2: PENDING
> job: 510846.awonmgr2: PENDING
> job: 510847.awonmgr2: PENDING
> job: 510848.awonmgr2: PENDING
> job: 510849.awonmgr2: PENDING
> job: 510850.awonmgr2: PENDING
> job: 510851.awonmgr2: PENDING
> job: 510852.awonmgr2: PENDING
> job: 510853.awonmgr2: PENDING
> job: 510854.awonmgr2: PENDING
> job: 510855.awonmgr2: PENDING
Content of workDir: /gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708
null
Dec.-25 09:37:12.452 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 510838.awonmgr2; id: 122; name: genotyping_freebayes (D10); status: COMPLETED; exit: -; error: -; workDir: /gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708 started: 1608852607445; exited: -; ]
Dec.-25 09:37:12.468 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'genotyping_freebayes (D10)' -- Cause: java.nio.file.NoSuchFileException: /gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708/.command.out
Dec.-25 09:37:12.482 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'genotyping_freebayes (D10)' -- Cause: java.nio.file.NoSuchFileException: /gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708/.command.err
Dec.-25 09:37:12.483 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'genotyping_freebayes (D10)' -- Cause: java.nio.file.NoSuchFileException: /gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708/.command.log
Dec.-25 09:37:12.486 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'genotyping_freebayes (D10)'
Caused by:
Process `genotyping_freebayes (D10)` terminated for an unknown reason -- Likely it has been terminated by the external system
Command executed:
freebayes -f CanFam3.1.fasta -p 2 -C 5 -g 0 D10.trimmed.bam > D10.freebayes.vcf
pigz -p 2 D10.freebayes.vcf
Command exit status:
-
Command output:
(empty)
Work dir:
/gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
Dec.-25 09:37:12.493 [Task monitor] INFO nextflow.Session - Execution cancelled -- Finishing pending tasks before exit
Dec.-25 09:37:12.551 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'genotyping_freebayes (D10)' -- Cause: java.nio.file.NoSuchFileException: /gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708/.command.err
Dec.-25 09:37:12.552 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'genotyping_freebayes (D10)' -- Cause: java.nio.file.NoSuchFileException: /gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708/.command.out
And this is the qstat summary for that job:
Job Id: 510838.awonmgr2
Job_Name = nf-genotyping_f
Job_Owner = ibar@awoonga1.local
job_state = M
queue = Short@flashmgr2
server = awonmgr2
Account_Name = qris-gu
Checkpoint = u
ctime = Fri Dec 25 09:28:22 2020
Error_Path = awoonga1.local:/gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708/nf-genotyping_f.e510838
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Fri Dec 25 09:29:59 2020
Output_Path = awoonga1.local:/gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708/.command.log
Priority = 0
qtime = Fri Dec 25 09:28:22 2020
Rerunable = True
Resource_List.interact1 = 0
Resource_List.mem = 4096mb
Resource_List.ncpus = 2
Resource_List.nodect = 1
Resource_List.place = free
Resource_List.select = 1:ncpus=2:mem=4096mb
Resource_List.walltime = 04:00:00
substate = 92
Variable_List = PBS_O_SYSTEM=Linux,PBS_O_SHELL=/bin/bash,PBS_O_HOME=/home/ibar,PBS_O_LOGNAME=ibar,PBS_O_WORKDIR=/gpfs1/scratch/30days/ibar/data/Dingo/Dingo_aDNA_NF13_process_24_12_2020/work/83/4cb7e477c5ad7a5236ba0d6d314708,PBS_O_LANG=en_AU.UTF-8,PBS_O_PATH=/home/ibar/.pyenv/bin:/sw/Containers/singularity/3.5.0/bin:/sw/Containers/singularity/images:/home/ibar/.pyenv/bin:/home/ibar/bin:/home/ibar/.pyenv/versions/miniconda-latest/bin:/home/ibar/.pyenv/versions/miniconda-latest/condabin:/home/ibar/.pyenv/bin:/home/ibar/bin:/opt/gnu/gcc/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ganglia/bin:/opt/ganglia/sbin:/usr/java/latest/bin:/opt/pbs/bin:/opt/rocks/bin:/opt/rocks/sbin:/home/ibar/etc/tools/Annotation/BLAT:/opt/pbs/bin:/home/ibar/etc/tools/Annotation/BLAT,PBS_O_MAIL=/var/spool/mail/ibar,PBS_O_QUEUE=workq,PBS_O_HOST=awoonga1.local
comment = Job finished at "flashmgr2"
etime = Fri Dec 25 09:28:22 2020
eligible_time = 35:46:34
Submit_arguments = -N nf-genotyping_f .command.run
history_timestamp = 1608852599
project = _pbs_project_default
And the content of .command.log:
########################### Execution Started #############################
JobId:510838.awonmgr2
UserName:ibar
GroupName:qris-gu
ExecutionHost:fl118
###############################################################################
nxf-scratch-dir fl118.local:/nvme/pbs/tmpdir/pbs.510838.awonmgr2/nxf.urortU9s5E
########################### Job Execution History #############################
JobId:510838.awonmgr2
UserName:ibar
GroupName:qris-gu
JobName:nf-genotyping_f
SessionId:11147
ResourcesRequested:mem=4096mb,ncpus=2,place=free,walltime=04:00:00
ResourcesUsed:cpupercent=89,cput=00:12:28,mem=158040kb,ncpus=2,vmem=1129996kb,walltime=00:12:25
QueueUsed:Short
AccountString:qris-gu
ExitStatus:0
###############################################################################
This happens when jobs are being redirected from the main server (awonmgr2) to the flashmgr2 server.
Happy to provide additional information if needed.
Thanks, Ido
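(Side note: the exitStatusReadTimeoutMillis: 270000 in the log above corresponds to the executor.exitReadTimeout setting, which defaults to 270 seconds. A nextflow.config sketch along these lines, with purely illustrative values and assuming these executor options are available in the Nextflow version in use, gives a redirected or slow scheduler more slack before a job that disappeared from the queue status is declared failed:)

executor {
    // how long to wait for the .exitcode file of a job that is no longer
    // visible in the queue status before declaring the task failed
    exitReadTimeout = '10 min'
    // how often the scheduler queue status (qstat/bjobs) is polled
    queueStatInterval = '30 sec'
}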
Please open a new issue for that.
Consider a simplistic pipeline: proc1 would fail after 20 minutes due to TIMELIMIT and would be re-submitted by nextflow to the long queue. At the same time, proc2 would keep running in the short queue for 30 more minutes. For some time the jobs corresponding to the re-submitted proc1 are not visible to nextflow, because nextflow inquires about the status of the submitted jobs using the command:
bjobs -o JOBID STAT SUBMIT_TIME delimiter=',' -noheader -q short
But the jobs for the re-submitted proc1 are now in a different queue: -q long. Such a mixed-queue situation may lead to a "terminated for an unknown reason" error, especially if nextflow decides to inquire about the .exitcode of the jobs associated with the re-submitted proc1.
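A minimal sketch of that setup, assuming the dynamic queue switch is driven by task.attempt; the directive values, sleep durations and maxRetries are illustrative, and the actual reproducer lives in the NXF_LSF_bug repository linked above:

process proc1 {
    executor 'lsf'
    // first attempt goes to the short queue; retries switch to long
    queue { task.attempt == 1 ? 'short' : 'long' }
    time  { task.attempt == 1 ? '20 min' : '4 h' }
    errorStrategy 'retry'
    maxRetries 1

    """
    # outlives the first attempt's 20-minute limit, so attempt 1 is
    # killed by LSF (TIMELIMIT, exit 140) and the task is re-submitted
    sleep 1500
    """
}

process proc2 {
    executor 'lsf'
    queue 'short'
    time  '1 h'

    """
    # keeps running in the short queue while proc1 is retried on long
    sleep 3000
    """
}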