Open nickhsmith opened 2 years ago
I am still dealing with this problem, this time with the nf-core/rnaseq pipeline, and the lack of throughput is quite frustrating. @csawye01 and I are both having the same issue. Is there anywhere I can look, or dive into, to see why this isn't working? Attached is my `.nextflow.log` file.
I tried running this pipeline with the same configuration without SLURM and it worked without a problem. But I really need it to work with SLURM in my shared compute environment.
@pditommaso I was wondering whether this problem has been explored? Based on other issues it seems like there may be problems with the platform/slurm tag, but I haven't seen much discussion of them.
Is there anything I can do to help push for a fix? Otherwise, the lack of throughput is severely limiting.
Please let me know; I know it would help other users (such as @csawye01) as well.
Hard to say anything w/o more detailed debug info. Please execute it again with the `-trace nextflow.executor` option as shown below, and upload the `.nextflow.log` file when done:
nextflow -trace nextflow.executor run [your pipeline params]
Thank you for the comment! I will upload the log file when I can. Thanks again!
Attached is the zipped log file. I have edited the PATH names, but otherwise, this is the result after 2 days (~1 day of hanging behavior)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi all, I also ran into the same problem. In case you would consider it helpful, I would be happy to provide further information to help the debugging process.
@ligrob I would love to know what you went through, and how you got around it! Thanks for responding!
@ligrob I have the exact same problem as @nickhsmith. I did notice that I could force Nextflow to resume by making a new ssh connection to the machine. If you could tell me the workaround, it would help me a lot. Thanks!
I have encountered the same problem with a nf-core/rnaseq pipeline and slurm. A thread dump shows where Nextflow hangs:
"Task monitor" 38 prio=5 os_prio=0 cpu=25638.16ms elapsed=76825.33s allocated=1170M defined_classes=248 tid=0x00007f05fd2e7740 nid=0x2af5ff runnable [0x00007f05d20c9000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.FileDispatcherImpl.read0(java.base@17.0.6/Native Method)
        at sun.nio.ch.FileDispatcherImpl.read(java.base@17.0.6/FileDispatcherImpl.java:48)
        ...
        at nextflow.executor.GridTaskHandler.readExitStatus(GridTaskHandler.groovy:371)
        at nextflow.executor.GridTaskHandler.checkIfCompleted(GridTaskHandler.groovy:452)
        at nextflow.processor.TaskPollingMonitor.checkTaskStatus(TaskPollingMonitor.groovy:611)
        at nextflow.processor.TaskPollingMonitor.checkAllTasks(TaskPollingMonitor.groovy:534)
It seems that Nextflow no longer recognizes completed Slurm jobs when it tries to read the `.exitcode` file. The Java call to read this file gets stuck and does not return. In my case, Nextflow resumes when I call `ls`, `lsof`, or `stat` on this file.
As a workaround, I have placed this snippet in my shell script before calling `nextflow run`:
```shell
WORK_DIR="./work"
# background lsof rescans the work dir every 600 s, which unsticks the reads
nohup lsof +D ${WORK_DIR} -r 600 &> /dev/null &
LSOF_PID=$!
# clean up the background lsof when the script exits
trap "kill $LSOF_PID" EXIT
nextflow run ... \
    -work-dir ${WORK_DIR} \
    ...
```
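If running `lsof` continuously is undesirable, a lighter variant of the same idea is a one-shot sweep that `stat`s every `.exitcode` file under the work directory. This is only a hypothetical sketch (the script name is mine, and it assumes, as reported above, that any metadata access such as `ls`/`lsof`/`stat` is enough to unstick the read); run it periodically, e.g. from cron or `watch`:

```shell
#!/usr/bin/env bash
# poke_exitcodes.sh -- hypothetical helper, not part of Nextflow.
# One sweep: stat every .exitcode under the work dir. Run it periodically
# (cron, or `watch -n 600 ./poke_exitcodes.sh ./work`) to mimic the lsof loop.
set -u
WORK_DIR="${1:-./work}"
# GNU stat (coreutils); prints path and mtime so a dry run is observable.
# stat may race with files deleted mid-sweep; that is harmless here.
find "$WORK_DIR" -name .exitcode -exec stat --format='%n %Y' {} + 2>/dev/null || true
```

In practice you would redirect the output to `/dev/null`; it is only the metadata access itself that matters for the workaround.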
Environment
nextflow version: 23.04.1 build 5866
Ubuntu 22.04
Ceph storage platform
openjdk version "17.0.7" 2023-04-18
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I seem to have the same problem, and it is very difficult to debug because it only occurs when submitting a lot of processes, and only after running successfully for 4+ hours. I've tried Nextflow version 23.10.1.5891 and version 24.02.0-edge.5907, so the latest updates haven't fixed this yet. Nextflow successfully submits and monitors hundreds of jobs to SLURM, and builds up its own queue because I set `queueSize = 50` to prevent overloading SLURM. It looks like after a long time, once Nextflow has missed enough jobs that have actually completed, it won't submit new jobs because it thinks the SLURM queue is full. So the pipeline works well for 4+ hours, continuously submitting new processes so the queue always holds 50, and then at some point it finishes processes without starting new ones, until only the master process is left in my SLURM queue.
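For reference, the throttling knobs mentioned above live in the `executor` scope of `nextflow.config`. This is only an illustrative sketch (the values are examples, not a fix for the hang); `exitReadTimeout` in particular controls how long Nextflow waits for a job's `.exitcode` file after the queue reports the job as terminated:

```groovy
// nextflow.config -- illustrative values only, not a fix for the hang
process.executor = 'slurm'

executor {
    queueSize         = 50        // cap on jobs queued in SLURM at once
    pollInterval      = '30 sec'  // how often task completion is checked
    queueStatInterval = '5 min'   // how often the global queue status is fetched
    exitReadTimeout   = '270 sec' // how long to wait for .exitcode after the
                                  // queue reports the job as terminated
}
```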
Apr-10 23:19:53.341 [Task submitter] DEBUG n.processor.TaskPollingMonitor - %% executor slurm > tasks in the submission queue: 368 -- tasks to be submitted are shown below
~> TaskHandler[jobId: null; id: 418; name: fastqc_trimmed_reads (QKS3-nextseq); status: NEW; exit: -; error: -; workDir: /REDACTED/PATH/20240409_Validation/retry_nextflow_24.02.0/work/69/54dad65c13535f215df445e6b10c94 started: -; exited: -; ]
~> TaskHandler[jobId: null; id: 430; name: fastqc_trimmed_reads (2022187367-miniseq); status: NEW; exit: -; error: -; workDir: /REDACTED/PATH/20240409_Validation/retry_nextflow_24.02.0/work/c8/1669049c5a1e2070b618d0c115cf06 started: -; exited: -; ]
~> TaskHandler[jobId: null; id: 448; name: unicycler (2022187367-miseq); status: NEW; exit: -; error: -; workDir: /REDACTED/PATH/20240409_Validation/retry_nextflow_24.02.0/work/5a/c02d66e8ea2293624de2c324a52741 started: -; exited: -; ]
~> TaskHandler[jobId: null; id: 681; name: prokka (2022187361-miseq); status: NEW; exit: -; error: -; workDir: /REDACTED/PATH/20240409_Validation/retry_nextflow_24.02.0/work/2a/c7c8be53daf59e3b1936c88bb38c3e started: -; exited: -; ]
~> TaskHandler[jobId: null; id: 683; name: fastqc_trimmed_reads (2022187369-miniseq); status: NEW; exit: -; error: -; workDir: /REDACTED/PATH/20240409_Validation/retry_nextflow_24.02.0/work/de/c8f00c57c95834c374b11c2796b663 started: -; exited: -; ]
~> TaskHandler[jobId: null; id: 900; name: fastqc_trimmed_reads (2022187352-miseq); status: NEW; exit: -; error: -; workDir: /REDACTED/PATH/20240409_Validation/retry_nextflow_24.02.0/work/f8/75da6626806692461015a62da7539a started: -; exited: -; ]
~> TaskHandler[jobId: null; id: 901; name: unicycler (2022187352-miseq); status: NEW; exit: -; error: -; workDir: /REDACTED/PATH/20240409_Validation/retry_nextflow_24.02.0/work/a5/f0fe0e1734fb36d1081d2eeca19a76 started: -; exited: -; ]
~> TaskHandler[jobId: null; id: 908; name: abricate (2022187356-nextseq.fna); status: NEW; exit: -; error: -; workDir: /REDACTED/PATH/20240409_Validation/retry_nextflow_24.02.0/work/4c/3a316a1e3d5ad595312096fe3dc005 started: -; exited: -; ]
~> TaskHandler[jobId: null; id: 909; name: quast (2022187356-nextseq.fna); status: NEW; exit: -; error: -; workDir: /REDACTED/PATH/20240409_Validation/retry_nextflow_24.02.0/work/78/7d9eb68755327dbfdcc08db232f963 started: -; exited: -; ]
~> TaskHandler[jobId: null; id: 910; name: gtdbtk_classify_wf (2022187356-nextseq); status: NEW; exit: -; error: -; workDir: /REDACTED/PATH/20240409_Validation/retry_nextflow_24.02.0/work/7d/bdf54b108e7f3af7510e61dce2a1bb started: -; exited: -; ]
.. remaining tasks omitted.
Apr-10 23:22:27.459 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > No more task to compute -- The following nodes are still active:
[process] multiqc_raw_fastqc
status=ACTIVE
port 0: (value) OPEN ; channel: fastqcs
port 1: (cntrl) - ; channel: $
[process] multiqc_trimmed_fastqc
status=ACTIVE
port 0: (value) OPEN ; channel: fastqcs
port 1: (value) bound ; channel: trim_logs
port 2: (cntrl) - ; channel: $
[process] prokka
status=ACTIVE
port 0: (queue) OPEN ; channel: -
port 1: (cntrl) - ; channel: $
...
But when I go to the working directory of the jobs that it hasn't recognized as COMPLETED, the `.exitcode` file exists and contains a 0. I tried @hmehlan's approach of calling `ls`, `lsof`, or `stat` on this directory/file, but it doesn't do anything for me. The only thing that works is killing the master process and restarting it with `-resume`. Then it recognizes the progress and finishes the pipeline smoothly, but it's not ideal to be restarting every pipeline at different times.
Here is an example of a process that was submitted by Nextflow but never recognized by it as completed:
Relevant snippet from `.nextflow.log`:
$ cat .nextflow.log | grep "13900461"
Apr-10 21:07:46.869 [Task submitter] DEBUG nextflow.executor.GridTaskHandler - [SLURM] submitted process indexRemapping (2022187387-miniseq) > jobId: 13900461; workDir: /PATH/REDACTED/20240409_Validation/retry_nextflow_24.02.0/work/92/7fa32e385dcc35628c53d2aa515e00
and the content of the working directory:
$ cd ./work/92/7fa32e385dcc35628c53d2aa515e00
$ ls -lha
total 10M
drwxrwsr-x 2 progal REDACTED 10M Apr 10 21:08 .
drwxrwsr-x 9 progal REDACTED 358M Apr 10 21:06 ..
-rw-rw-r-- 1 progal REDACTED 0 Apr 10 21:08 .command.begin
-rw-rw-r-- 1 progal REDACTED 34K Apr 10 21:08 .command.err
-rw-rw-r-- 1 progal REDACTED 132K Apr 10 21:08 .command.log
-rw-rw-r-- 1 progal REDACTED 0 Apr 10 21:08 .command.out
-rw-rw-r-- 1 progal REDACTED 12K Apr 10 21:07 .command.run
-rw-rw-r-- 1 progal REDACTED 345 Apr 10 21:07 .command.sh
-rw-rw-r-- 1 progal REDACTED 251 Apr 10 21:08 .command.trace
-rw-rw-r-- 1 progal REDACTED 1 Apr 10 21:08 .exitcode
-rw-rw-r-- 1 progal REDACTED 1.6K Apr 10 21:08 2022187387-miniseq_concatenated_contigs.fna.amb
-rw-rw-r-- 1 progal REDACTED 55 Apr 10 21:08 2022187387-miniseq_concatenated_contigs.fna.ann
-rw-rw-r-- 1 progal REDACTED 5.6M Apr 10 21:08 2022187387-miniseq_concatenated_contigs.fna.bwt
-rw-rw-r-- 1 progal REDACTED 1.4M Apr 10 21:08 2022187387-miniseq_concatenated_contigs.fna.pac
-rw-rw-r-- 1 progal REDACTED 2.8M Apr 10 21:08 2022187387-miniseq_concatenated_contigs.fna.sa
-rw-rw-r-- 1 progal REDACTED 68 Apr 10 21:08 bwa_index_version.txt
And the SLURM log also shows it was completed
$ sacct --format=User,JobID%50,Jobname%50,partition,state,time,nnode,ncpus,elapsed,start,end | grep "13900461"
progal 13900461 nf-indexRemapping_(2022187387-miniseq) standard COMPLETED 00:10:00 1 4 00:00:05 2024-04-10T21:08:00 2024-04-10T21:08:05
13900461.batch batch COMPLETED 1 4 00:00:05 2024-04-10T21:08:00 2024-04-10T21:08:05
13900461.extern extern COMPLETED 1 4 00:00:05 2024-04-10T21:08:00 2024-04-10T21:08:05
Did anyone find any other workaround? Perhaps some SLURM-specific argument that could be set? I tried making the queue size unlimited with `queueSize = 0`, but then I get a lot of errors from SLURM, which unfortunately crashes the pipeline :/
I looked into it further and noticed that there is a specific timepoint after which the Nextflow master node stops recognizing completed jobs (it is not missing random jobs throughout execution). For example, I started the pipeline at 22:34 on April 12, and any already-submitted job that was not finished by 01:32 on April 13 was never recognized as finished by the master node, even though the `.exitcode` files eventually existed and contained 0. The master node then continued running for another 20 hours. I've re-run with multiple SLURM queue sizes, global queue checking, etc., and the stalling happens at different timepoints, sometimes after 20 minutes (~450 completed tasks), sometimes after 3 hours (~2900 completed tasks). Does anyone have hints on how to troubleshoot this further? In the `.nextflow.log`, every minute the pipeline does
Apr-16 01:46:45.892 [Task monitor] DEBUG n.executor.AbstractGridExecutor - Executor 'slurm' fetching queue global status
and recognizes COMPLETED and FAILED jobs, but once it stalls it no longer does that; then it just gives this every 5 minutes, repeating for hours until I kill the master:
Apr-16 01:50:31.659 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > No more task to compute -- The following nodes are still active:
[process] prokka
status=ACTIVE
port 0: (queue) OPEN ; channel: -
port 1: (cntrl) - ; channel: $
[process] busco
status=ACTIVE
port 0: (queue) OPEN ; channel: -
port 1: (value) bound ; channel: busco_lineages_path
port 2: (cntrl) - ; channel: $
[process] get_busco_lineages
status=ACTIVE
port 0: (value) OPEN ; channel: busco_versions
port 1: (cntrl) - ; channel: $
[process] busco_plot
status=ACTIVE
port 0: (value) OPEN ; channel: short_summaries
port 1: (cntrl) - ; channel: $
[process] quast
status=ACTIVE
port 0: (queue) OPEN ; channel: -
port 1: (cntrl) - ; channel: $
[process] multiqc_assembly
status=ACTIVE
port 0: (value) bound ; channel: trim_logs
port 1: (value) OPEN ; channel: quast_stats
port 2: (value) OPEN ; channel: prokka_output
port 3: (value) OPEN ; channel: busco_summaries
port 4: (cntrl) - ; channel: $
[process] gtdbtk_classify_wf
status=ACTIVE
port 0: (queue) OPEN ; channel: -
port 1: (value) bound ; channel: gtdb_database
port 2: (cntrl) - ; channel: $
[process] rMLST
status=ACTIVE
port 0: (queue) OPEN ; channel: -
port 1: (value) bound ; channel: rMLST_database
port 2: (cntrl) - ; channel: $
[process] call_rMLST
status=ACTIVE
port 0: (queue) OPEN ; channel: -
port 1: (value) bound ; channel: bigs_database_rMLST
port 2: (cntrl) - ; channel: $
[process] make_one_contig
status=ACTIVE
port 0: (queue) OPEN ; channel: -
port 1: (cntrl) - ; channel: $
[process] indexRemapping
status=ACTIVE
port 0: (queue) OPEN ; channel: -
port 1: (cntrl) - ; channel: $
[process] alignRemapping
status=ACTIVE
port 0: (queue) OPEN ; channel: -
port 1: (cntrl) - ; channel: $
[process] parse_sam_for_insertsize
status=ACTIVE
port 0: (queue) OPEN ; channel: -
port 1: (cntrl) - ; channel: $
[process] samtoolsRemapping
status=ACTIVE
port 0: (queue) OPEN ; channel: -
port 1: (cntrl) - ; channel: $
[process] pilon_remapping
status=ACTIVE
port 0: (queue) OPEN ; channel: -
port 1: (cntrl) - ; channel: $
[process] coverage_pilon_corrected
status=ACTIVE
port 0: (queue) OPEN ; channel: -
port 1: (cntrl) - ; channel: $
[process] typing_16S
status=ACTIVE
port 0: (queue) OPEN ; channel: -
port 1: (cntrl) - ; channel: $
[process] abricate
status=ACTIVE
port 0: (queue) OPEN ; channel: -
port 1: (cntrl) - ; channel: $
[process] summary_sample
status=ACTIVE
port 0: (queue) OPEN ; channel: -
port 1: (cntrl) - ; channel: $
[process] merge_summaries
status=ACTIVE
port 0: (value) OPEN ; channel: sample_quality
port 1: (cntrl) - ; channel: $
[process] links_for_transfer
status=ACTIVE
port 0: (queue) OPEN ; channel: -
port 1: (cntrl) - ; channel: $
[process] write_software_versions
status=ACTIVE
port 0: (value) OPEN ; channel: version_files
port 1: (cntrl) - ; channel: $
Apr-16 01:55:31.676 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > No more task to compute -- The following nodes are still active:
[... same list of active nodes as above, repeated verbatim ...]
Apr-16 02:00:31.694 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > No more task to compute -- The following nodes are still active:
Does anyone have hints on what to look into further? I can't see any useful error messages anywhere in the log file.
@pbieberstein: have you tried to create a Java thread dump, either on the node where the calculation that wrote the `.exitcode` file was running, or on the node that tries to read this file? As can be seen from my thread dump https://github.com/nextflow-io/nextflow/issues/2695#issuecomment-1635939435, the problem in my case was not in Nextflow itself but, I suspect, somewhere in the interaction of Java with the file/storage system on the various Slurm nodes. Have you tried calling `ls`, `lsof`, or `stat` on the `.exitcode` file on the Slurm node where the calculation was running, or on the Nextflow control node?
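For anyone who wants to capture such a dump, the standard JDK tools work on the Nextflow master process. A sketch (the grep pattern for locating the launcher PID in the `jps` output is an assumption about how it appears there):

```shell
# Find the PID of the Nextflow master JVM and dump all thread stacks.
# jps and jstack ship with the JDK.
NXF_PID="$(jps -l 2>/dev/null | grep -i nextflow | awk '{print $1}')" || true
if [ -n "$NXF_PID" ]; then
    jstack "$NXF_PID" > nextflow-threads.txt
fi
# In the dump, check whether the "Task monitor" thread is RUNNABLE inside
# sun.nio.ch.FileDispatcherImpl.read0 -- that would indicate a read stuck
# in native file I/O rather than a deadlock inside Nextflow itself.
```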
Hi @hmehlan, many thanks for your helpful response! I called `ls` and `stat` on `.exitcode` and some other files in the work directory of one of the jobs that had finished but was not recognized as such, but it didn't do anything for me. I did it through a SLURM interactive session running on the same node as the control job, and also on the node where the calculation ran. The Nextflow master process was still running while I did this, but it still didn't recognize the process as completed. The Java thread dump does look interesting, but I can't quite identify what's going on yet (I'm not so familiar with Java debugging); do you see anything suspicious here?
2024-04-19 21:34:20
Full thread dump OpenJDK 64-Bit Server VM (21-internal-adhoc.conda.src mixed mode, sharing):
Threads class SMR info:
_java_thread_list=0x000015442c002c80, length=19, elements={
0x000015457401b350, 0x00001545741406c0, 0x0000154574142a20, 0x00001545741444b0,
0x0000154574145a70, 0x0000154574147020, 0x0000154574148b70, 0x000015457414a1f0,
0x000015457419f060, 0x00001545741a1630, 0x0000154574ec08a0, 0x00001544b0002a80,
0x00001544b0009d20, 0x00001544b0006a60, 0x00001544b0003fc0, 0x00001544a0127e90,
0x00001544980d6250, 0x00001544cc1dadf0, 0x00001544a0133ec0
}
"main" #1 [3720502] prio=5 os_prio=0 cpu=5116.78ms elapsed=18638.65s tid=0x000015457401b350 nid=3720502 waiting on condition [0x0000154578e09000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@21-internal/Native Method)
- parking to wait for <0x00000000820c89b0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(java.base@21-internal/LockSupport.java:371)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(java.base@21-internal/AbstractQueuedSynchronizer.java:519)
at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@21-internal/ForkJoinPool.java:3780)
at java.util.concurrent.ForkJoinPool.managedBlock(java.base@21-internal/ForkJoinPool.java:3725)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@21-internal/AbstractQueuedSynchronizer.java:1707)
at nextflow.util.Barrier$_awaitCompletion_closure3.doCall(Barrier.groovy:92)
at nextflow.util.Barrier$_awaitCompletion_closure3.call(Barrier.groovy)
at nextflow.extension.Bolts.withLock(Bolts.groovy:337)
at nextflow.extension.Bolts.withLock(Bolts.groovy)
at nextflow.util.Barrier.awaitCompletion(Barrier.groovy:89)
at nextflow.Session.await(Session.groovy:659)
at nextflow.script.ScriptRunner.await(ScriptRunner.groovy:255)
at nextflow.script.ScriptRunner.execute(ScriptRunner.groovy:143)
at nextflow.cli.CmdRun.run(CmdRun.groovy:372)
at nextflow.cli.Launcher.run(Launcher.groovy:500)
at nextflow.cli.Launcher.main(Launcher.groovy:672)
"Reference Handler" #9 [3720510] daemon prio=10 os_prio=0 cpu=6.12ms elapsed=18639.97s tid=0x00001545741406c0 nid=3720510 waiting on condition [0x0000154558a48000]
java.lang.Thread.State: RUNNABLE
at java.lang.ref.Reference.waitForReferencePendingList(java.base@21-internal/Native Method)
at java.lang.ref.Reference.processPendingReferences(java.base@21-internal/Reference.java:246)
at java.lang.ref.Reference$ReferenceHandler.run(java.base@21-internal/Reference.java:208)
"Finalizer" #10 [3720511] daemon prio=8 os_prio=0 cpu=0.43ms elapsed=18640.99s tid=0x0000154574142a20 nid=3720511 in Object.wait() [0x0000154558947000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait0(java.base@21-internal/Native Method)
- waiting on <0x00000000802b5590> (a java.lang.ref.NativeReferenceQueue$Lock)
at java.lang.Object.wait(java.base@21-internal/Object.java:366)
at java.lang.Object.wait(java.base@21-internal/Object.java:339)
at java.lang.ref.NativeReferenceQueue.await(java.base@21-internal/NativeReferenceQueue.java:48)
at java.lang.ref.ReferenceQueue.remove0(java.base@21-internal/ReferenceQueue.java:158)
at java.lang.ref.NativeReferenceQueue.remove(java.base@21-internal/NativeReferenceQueue.java:89)
- locked <0x00000000802b5590> (a java.lang.ref.NativeReferenceQueue$Lock)
at java.lang.ref.Finalizer$FinalizerThread.run(java.base@21-internal/Finalizer.java:173)
"Signal Dispatcher" #11 [3720512] daemon prio=9 os_prio=0 cpu=0.35ms elapsed=18641.47s tid=0x00001545741444b0 nid=3720512 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Service Thread" #12 [3720513] daemon prio=9 os_prio=0 cpu=410.47ms elapsed=18641.98s tid=0x0000154574145a70 nid=3720513 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Monitor Deflation Thread" #13 [3720514] daemon prio=9 os_prio=0 cpu=1675.44ms elapsed=18642.44s tid=0x0000154574147020 nid=3720514 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread0" #14 [3720515] daemon prio=9 os_prio=0 cpu=25514.02ms elapsed=18643.02s tid=0x0000154574148b70 nid=3720515 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
No compile task
"C1 CompilerThread0" #16 [3720516] daemon prio=9 os_prio=0 cpu=4900.05ms elapsed=18643.20s tid=0x000015457414a1f0 nid=3720516 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
No compile task
"Common-Cleaner" #17 [3720517] daemon prio=8 os_prio=0 cpu=25.19ms elapsed=18643.61s tid=0x000015457419f060 nid=3720517 runnable [0x0000154558270000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@21-internal/Native Method)
- parking to wait for <0x00000000802b56c8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@21-internal/LockSupport.java:269)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@21-internal/AbstractQueuedSynchronizer.java:1847)
at java.lang.ref.ReferenceQueue.await(java.base@21-internal/ReferenceQueue.java:71)
at java.lang.ref.ReferenceQueue.remove0(java.base@21-internal/ReferenceQueue.java:143)
at java.lang.ref.ReferenceQueue.remove(java.base@21-internal/ReferenceQueue.java:218)
at jdk.internal.ref.CleanerImpl.run(java.base@21-internal/CleanerImpl.java:140)
at java.lang.Thread.runWith(java.base@21-internal/Thread.java:1596)
at java.lang.Thread.run(java.base@21-internal/Thread.java:1583)
at jdk.internal.misc.InnocuousThread.run(java.base@21-internal/InnocuousThread.java:186)
"Notification Thread" #18 [3720518] daemon prio=9 os_prio=0 cpu=0.14ms elapsed=18645.02s tid=0x00001545741a1630 nid=3720518 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"ForkJoinPool-1-worker-1" #26 [3720526] daemon prio=5 os_prio=0 cpu=17155.27ms elapsed=18642.31s tid=0x0000154574ec08a0 [0x0000154549bf9000]
Carrying virtual thread #33
at jdk.internal.vm.Continuation.run(java.base@21-internal/Continuation.java:251)
at java.lang.VirtualThread.runContinuation(java.base@21-internal/VirtualThread.java:223)
at java.lang.VirtualThread$$Lambda/0x00001544fc342348.run(java.base@21-internal/Unknown Source)
at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(java.base@21-internal/ForkJoinTask.java:1423)
at java.util.concurrent.ForkJoinTask.doExec(java.base@21-internal/ForkJoinTask.java:387)
at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(java.base@21-internal/ForkJoinPool.java:1312)
at java.util.concurrent.ForkJoinPool.scan(java.base@21-internal/ForkJoinPool.java:1843)
at java.util.concurrent.ForkJoinPool.runWorker(java.base@21-internal/ForkJoinPool.java:1808)
at java.util.concurrent.ForkJoinWorkerThread.run(java.base@21-internal/ForkJoinWorkerThread.java:188)
"VirtualThread-unparker" #27 [3720527] daemon prio=5 os_prio=0 cpu=6497.62ms elapsed=18643.06s tid=0x00001544b0002a80 nid=3720527 runnable [0x0000154549af9000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@21-internal/Native Method)
- parking to wait for <0x0000000081db5df0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@21-internal/LockSupport.java:269)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@21-internal/AbstractQueuedSynchronizer.java:1758)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(java.base@21-internal/ScheduledThreadPoolExecutor.java:1182)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(java.base@21-internal/ScheduledThreadPoolExecutor.java:899)
at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@21-internal/ThreadPoolExecutor.java:1070)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@21-internal/ThreadPoolExecutor.java:1130)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@21-internal/ThreadPoolExecutor.java:642)
at java.lang.Thread.runWith(java.base@21-internal/Thread.java:1596)
at java.lang.Thread.run(java.base@21-internal/Thread.java:1583)
at jdk.internal.misc.InnocuousThread.run(java.base@21-internal/InnocuousThread.java:186)
"ForkJoinPool-1-worker-2" #162 [3720661] daemon prio=5 os_prio=0 cpu=18833.40ms elapsed=18638.53s tid=0x00001544b0009d20 nid=3720661 waiting on condition [0x00001545493f4000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@21-internal/Native Method)
- parking to wait for <0x0000000081db5aa8> (a java.util.concurrent.ForkJoinPool)
at java.util.concurrent.locks.LockSupport.park(java.base@21-internal/LockSupport.java:371)
at java.util.concurrent.ForkJoinPool.awaitWork(java.base@21-internal/ForkJoinPool.java:1893)
at java.util.concurrent.ForkJoinPool.runWorker(java.base@21-internal/ForkJoinPool.java:1809)
at java.util.concurrent.ForkJoinWorkerThread.run(java.base@21-internal/ForkJoinWorkerThread.java:188)
"ForkJoinPool-1-worker-3" #164 [3720662] daemon prio=5 os_prio=0 cpu=16503.86ms elapsed=18639.72s tid=0x00001544b0006a60 nid=3720662 runnable [0x0000154548eef000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@21-internal/Native Method)
- parking to wait for <0x0000000081db5aa8> (a java.util.concurrent.ForkJoinPool)
at java.util.concurrent.locks.LockSupport.parkUntil(java.base@21-internal/LockSupport.java:449)
at java.util.concurrent.ForkJoinPool.awaitWork(java.base@21-internal/ForkJoinPool.java:1891)
at java.util.concurrent.ForkJoinPool.runWorker(java.base@21-internal/ForkJoinPool.java:1809)
at java.util.concurrent.ForkJoinWorkerThread.run(java.base@21-internal/ForkJoinWorkerThread.java:188)
"ForkJoinPool-1-worker-4" #165 [3720663] daemon prio=5 os_prio=0 cpu=18765.67ms elapsed=18640.00s tid=0x00001544b0003fc0 nid=3720663 waiting on condition [0x0000154548aeb000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@21-internal/Native Method)
- parking to wait for <0x0000000081db5aa8> (a java.util.concurrent.ForkJoinPool)
at java.util.concurrent.locks.LockSupport.park(java.base@21-internal/LockSupport.java:371)
at java.util.concurrent.ForkJoinPool.awaitWork(java.base@21-internal/ForkJoinPool.java:1893)
at java.util.concurrent.ForkJoinPool.runWorker(java.base@21-internal/ForkJoinPool.java:1809)
at java.util.concurrent.ForkJoinWorkerThread.run(java.base@21-internal/ForkJoinWorkerThread.java:188)
"ForkJoinPool-1-worker-5" #14867 [3762619] daemon prio=5 os_prio=0 cpu=7270.12ms elapsed=9525.28s tid=0x00001544a0127e90 nid=3762619 waiting on condition [0x00001544f5fee000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@21-internal/Native Method)
- parking to wait for <0x0000000081db5aa8> (a java.util.concurrent.ForkJoinPool)
at java.util.concurrent.locks.LockSupport.park(java.base@21-internal/LockSupport.java:371)
at java.util.concurrent.ForkJoinPool.awaitWork(java.base@21-internal/ForkJoinPool.java:1893)
at java.util.concurrent.ForkJoinPool.runWorker(java.base@21-internal/ForkJoinPool.java:1809)
at java.util.concurrent.ForkJoinWorkerThread.run(java.base@21-internal/ForkJoinWorkerThread.java:188)
"PublishDir-4135" #17107 [3766036] prio=5 os_prio=0 cpu=0.33ms elapsed=8400.33s tid=0x00001544980d6250 nid=3766036 runnable [0x00001545494f5000]
java.lang.Thread.State: RUNNABLE
at sun.nio.fs.LinuxNativeDispatcher.directCopy0(java.base@21-internal/Native Method)
at sun.nio.fs.LinuxFileSystem.directCopy(java.base@21-internal/LinuxFileSystem.java:159)
at sun.nio.fs.UnixFileSystem.copyFile(java.base@21-internal/UnixFileSystem.java:682)
at sun.nio.fs.UnixFileSystem.copy(java.base@21-internal/UnixFileSystem.java:1060)
at sun.nio.fs.UnixFileSystemProvider.copy(java.base@21-internal/UnixFileSystemProvider.java:300)
at nextflow.file.FileHelper.copyPath(FileHelper.groovy:951)
at nextflow.processor.PublishDir.processFileImpl(PublishDir.groovy:496)
at nextflow.processor.PublishDir.processFile(PublishDir.groovy:385)
at nextflow.processor.PublishDir.safeProcessFile(PublishDir.groovy:362)
at java.lang.invoke.LambdaForm$DMH/0x00001544fc3f2800.invokeVirtual(java.base@21-internal/LambdaForm$DMH)
at java.lang.invoke.LambdaForm$MH/0x00001544fc59a000.invoke(java.base@21-internal/LambdaForm$MH)
at java.lang.invoke.LambdaForm$MH/0x00001544fc15d000.invokeExact_MT(java.base@21-internal/LambdaForm$MH)
at jdk.internal.reflect.DirectMethodHandleAccessor.invokeImpl(java.base@21-internal/DirectMethodHandleAccessor.java:155)
at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(java.base@21-internal/DirectMethodHandleAccessor.java:103)
at java.lang.reflect.Method.invoke(java.base@21-internal/Method.java:580)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1254)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1030)
at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:1036)
at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:1019)
at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:97)
at nextflow.processor.PublishDir$_apply1_closure2.doCall(PublishDir.groovy:337)
at nextflow.processor.PublishDir$_apply1_closure2.call(PublishDir.groovy)
at groovy.lang.Closure.run(Closure.java:498)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base@21-internal/Executors.java:572)
at java.util.concurrent.FutureTask.run(java.base@21-internal/FutureTask.java:317)
at java.lang.Thread.runWith(java.base@21-internal/Thread.java:1596)
at java.lang.Thread.run(java.base@21-internal/Thread.java:1583)
"leveldb-compaction-0" #24118 [3779302] prio=5 os_prio=0 cpu=177.93ms elapsed=6090.47s tid=0x00001544cc1dadf0 nid=3779302 waiting on condition [0x0000154548dee000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@21-internal/Native Method)
- parking to wait for <0x00000000820b4348> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(java.base@21-internal/LockSupport.java:371)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(java.base@21-internal/AbstractQueuedSynchronizer.java:519)
at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@21-internal/ForkJoinPool.java:3780)
at java.util.concurrent.ForkJoinPool.managedBlock(java.base@21-internal/ForkJoinPool.java:3725)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@21-internal/AbstractQueuedSynchronizer.java:1707)
at java.util.concurrent.LinkedBlockingQueue.take(java.base@21-internal/LinkedBlockingQueue.java:435)
at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@21-internal/ThreadPoolExecutor.java:1070)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@21-internal/ThreadPoolExecutor.java:1130)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@21-internal/ThreadPoolExecutor.java:642)
at java.lang.Thread.runWith(java.base@21-internal/Thread.java:1596)
at java.lang.Thread.run(java.base@21-internal/Thread.java:1583)
"ForkJoinPool-1-worker-6" #24650 [3780205] daemon prio=5 os_prio=0 cpu=3025.30ms elapsed=5971.12s tid=0x00001544a0133ec0 nid=3780205 waiting on condition [0x00001544f50df000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@21-internal/Native Method)
- parking to wait for <0x0000000081db5aa8> (a java.util.concurrent.ForkJoinPool)
at java.util.concurrent.locks.LockSupport.parkUntil(java.base@21-internal/LockSupport.java:449)
at java.util.concurrent.ForkJoinPool.awaitWork(java.base@21-internal/ForkJoinPool.java:1891)
at java.util.concurrent.ForkJoinPool.runWorker(java.base@21-internal/ForkJoinPool.java:1809)
at java.util.concurrent.ForkJoinWorkerThread.run(java.base@21-internal/ForkJoinWorkerThread.java:188)
"VM Thread" os_prio=0 cpu=660.25ms elapsed=18652.10s tid=0x00001545741305c0 nid=3720509 runnable
"GC Thread#0" os_prio=0 cpu=546.58ms elapsed=18652.42s tid=0x0000154574073da0 nid=3720503 runnable
"GC Thread#1" os_prio=0 cpu=531.21ms elapsed=18652.22s tid=0x00001544f0006870 nid=3720521 runnable
"GC Thread#2" os_prio=0 cpu=572.57ms elapsed=18652.89s tid=0x00001544f0007320 nid=3720522 runnable
"GC Thread#3" os_prio=0 cpu=559.83ms elapsed=18653.25s tid=0x00001544f000f4b0 nid=3720523 runnable
"G1 Main Marker" os_prio=0 cpu=0.70ms elapsed=18654.45s tid=0x00001545740853a0 nid=3720504 runnable
"G1 Conc#0" os_prio=0 cpu=180.53ms elapsed=18655.01s tid=0x0000154574086350 nid=3720505 runnable
"G1 Refine#0" os_prio=0 cpu=2204.85ms elapsed=18655.38s tid=0x00001545741033e0 nid=3720506 runnable
"G1 Service" os_prio=0 cpu=548.14ms elapsed=18655.79s tid=0x00001545741043a0 nid=3720507 runnable
"VM Periodic Task Thread" os_prio=0 cpu=7946.37ms elapsed=18655.87s tid=0x0000154574116c60 nid=3720508 waiting on condition
JNI global refs: 17, weak refs: 0
Heap
garbage-first heap total 324608K, used 282114K [0x0000000080000000, 0x0000000100000000)
region size 1024K, 165 young (168960K), 12 survivors (12288K)
Metaspace used 43518K, committed 45504K, reserved 1114112K
class space used 5081K, committed 6016K, reserved 1048576K
@pbieberstein: As far as I can see from the thread dump, Nextflow is busy copying files:
java.lang.Thread.State: RUNNABLE
at sun.nio.fs.LinuxNativeDispatcher.directCopy0(java.base@21-internal/Native Method)
at sun.nio.fs.LinuxFileSystem.directCopy(java.base@21-internal/LinuxFileSystem.java:159)
at sun.nio.fs.UnixFileSystem.copyFile(java.base@21-internal/UnixFileSystem.java:682)
at sun.nio.fs.UnixFileSystem.copy(java.base@21-internal/UnixFileSystem.java:1060)
Copying by itself should not be the problem, unless Nextflow is stuck at this point. It is best to take another thread dump after some time and compare the two. There are free Java thread-dump analysers available online, but even these turned up nothing further.
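The comparison suggested above can be scripted. A minimal sketch (the `jps` match string and the five-minute interval in the usage comment are assumptions, not values from this thread):

```shell
# first_frames: reduce a `jstack` thread dump to one line per thread
# (thread name plus its topmost stack frame) so two dumps taken a few
# minutes apart can be diffed quickly.
first_frames() {
  awk '/^"/ {name=$1} /^[[:space:]]*at / && name != "" {print name, $2; name=""}' "$1"
}

# Usage sketch (not executed here; PID discovery via jps is an assumption):
#   pid=$(jps -l | awk '/nextflow/ {print $1; exit}')
#   jstack "$pid" > dump1.txt; sleep 300; jstack "$pid" > dump2.txt
#   diff <(first_frames dump1.txt) <(first_frames dump2.txt)
```

A RUNNABLE thread showing the identical top frame in both dumps, such as the `PublishDir` thread parked in `directCopy0`, is likely stuck rather than merely slow.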
@hmehlan
The "lsof trick" seems to do the job for us. I saw that, like us, you are using CephFS. In our case the problem arose right after upgrading from an older Ceph version to Ceph Reef 18.2.2 (not sure that is related, though).
Thanks for your interesting fix.
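For reference, a hedged sketch of how a periodic `lsof` pass over the work directory might be automated. The exact invocation @hmehlan used is not quoted in this thread, so the `+D` walk, the path, and the cron schedule below are all illustrative assumptions:

```shell
# touch_work_dir: one lsof pass over the given directory, opening/statting
# every file so the (CephFS) client revalidates its cached state. Returns 0
# even if lsof is absent, since only the side effect matters.
touch_work_dir() {
  lsof +D "$1" > /dev/null 2>&1 || true
}

# e.g. from cron every five minutes (path is an assumption):
#   */5 * * * * /usr/bin/lsof +D /scratch/nextflow/work > /dev/null 2>&1
```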
Bug report
I am noticing a problem with the Nextflow-SLURM interface when processing a large number of jobs. For example, I am running the nf-core sarek workflow and have spawned ~10,000 tasks (50 samples × 150 intervals/sample, plus downstream tasks). However, I notice that my tasks are hanging, and as a result nothing is being submitted to SLURM despite the existing jobs being completed.
Expected behavior and actual behavior
Nextflow generates a task. Nextflow submits the task to SLURM. SLURM runs the job. SLURM finishes the job (creating `.exitcode` with value 0). Nextflow recognizes this and submits the next task. This last step, recognizing a finished job, seems to be failing.
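That final recognition step can be checked by hand. A small sketch, assuming the default `work` directory layout (`work/xx/hash/.exitcode`), that lists every task SLURM has already finished, for comparison against what `.nextflow.log` still reports as running:

```shell
# finished_tasks: print each task work directory that already has an
# .exitcode file, together with the exit status recorded in it.
finished_tasks() {
  find "$1" -name .exitcode -type f 2>/dev/null |
  while read -r f; do
    printf '%s exit=%s\n' "$(dirname "$f")" "$(cat "$f")"
  done
}

# Usage: finished_tasks work | grep 'exit=0'   # tasks completed on the SLURM side
```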
Steps to reproduce the problem
More tangibly, I have used `queueSize` to limit my workflow to 100 jobs at a time. Some of those jobs finish (`.exitcode` exists with the value 0 in the local work dir, e.g. `work/76/69b9b4...`), BUT these jobs are still considered RUNNING when I look at the `.nextflow.log` file (see last photo).
Looking through the log file, I see that the SLURM job ID `636541` is submitted at 15:17 on March 2. It finishes and creates the `.exitcode` file (with a successful 0 exit) at 18:10, so SLURM considers the job completed at that point. But as of yesterday, March 3, there is still no completed status for this job, and as a result Nextflow won't submit any of the remaining ~7,000 jobs to the scheduler.
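A few executor settings in `nextflow.config` bear directly on how often Nextflow checks job status. A sketch with illustrative values; only `queueSize = 100` reflects the setup described above, the two interval settings are assumptions worth experimenting with when status updates lag:

```groovy
// nextflow.config -- SLURM executor throttling and polling (values illustrative).
executor {
    name = 'slurm'
    queueSize = 100               // cap of 100 concurrent jobs, as described above
    pollInterval = '30 sec'       // how often each task's status is checked
    queueStatInterval = '5 min'   // how often the cluster queue status is re-read
}
```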
Any help would be appreciated. Is there a reason the completed jobs aren't being recognized as such, and is there anything I can do to force a re-recognition? Thanks in advance.
Environment
nextflow info
  Version: 21.10.6 build 5660
  Created: 21-12-2021 16:55 UTC (17:55 CEST)
  System: Linux 4.18.0-348.12.2.el8_5.x86_64
  Runtime: Groovy 3.0.9 on OpenJDK 64-Bit Server VM 10.0.2+13
  Encoding: UTF-8 (UTF-8)
Shell: bash