nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

nextflow hanging with slurm executor #361

Closed · msmootsgi closed this issue 7 years ago

msmootsgi commented 7 years ago

Nextflow seems to be hanging after the pipeline fails when running on a slurm cluster. Here is the full .nextflow.log. This particular pipeline has a workflow.onComplete block that also isn't being called (a minimal example of such a block is sketched after the log below).

May-31 19:17:58.816 [main] DEBUG nextflow.cli.Launcher - $> /usr/local/bin/nextflow run http://git.l.synthgeno.global/SGI-Pipelines/GeneFlagging.git -resume -hub gitlab -r aws_changes -latest -params-file /mnt/efs/ubuntu/nextflow.run.47342766-b597-4f67-8726-82c4e9a0d267/params.yaml -process.executor slurm
May-31 19:17:58.964 [main] INFO  nextflow.cli.CmdRun - N E X T F L O W  ~  version 0.24.4
May-31 19:17:59.521 [main] DEBUG nextflow.scm.AssetManager - Repository URL: http://git.l.synthgeno.global/SGI-Pipelines/GeneFlagging.git; Project: SGI-Pipelines/GeneFlagging; Hub provider: gitlab
May-31 19:17:59.529 [main] INFO  nextflow.cli.CmdRun - Pulling SGI-Pipelines/GeneFlagging ...
May-31 19:17:59.533 [main] DEBUG nextflow.scm.RepositoryProvider - Request [credentials -:-] -> http://git.l.synthgeno.global/api/v3/projects/SGI-Pipelines%2FGeneFlagging
May-31 19:17:59.884 [main] DEBUG nextflow.scm.RepositoryProvider - Request [credentials -:-] -> http://git.l.synthgeno.global/api/v3/projects/SGI-Pipelines%2FGeneFlagging/repository/files?file_path=nextflow.config&ref=master
May-31 19:18:00.018 [main] DEBUG nextflow.scm.RepositoryProvider - Request [credentials -:-] -> http://git.l.synthgeno.global/api/v3/projects/SGI-Pipelines%2FGeneFlagging/repository/files?file_path=main.nf&ref=master
May-31 19:18:00.139 [main] DEBUG nextflow.scm.AssetManager - Pulling SGI-Pipelines/GeneFlagging -- Using remote clone url: http://git.l.synthgeno.global/SGI-Pipelines/GeneFlagging.git
May-31 19:18:01.244 [main] INFO  nextflow.cli.CmdRun -  downloaded from http://git.l.synthgeno.global/SGI-Pipelines/GeneFlagging.git
May-31 19:18:01.675 [main] DEBUG nextflow.scm.AssetManager - Git config: /tools/nextflow/assets/SGI-Pipelines/GeneFlagging/.git/config; branch: master; remote: origin; url: http://git.l.synthgeno.global/SGI-Pipelines/GeneFlagging.git
May-31 19:18:01.675 [main] INFO  nextflow.cli.CmdRun - Launching `SGI-Pipelines/GeneFlagging` [big_jennings] - revision: e8dc1db4ec [aws_changes]
May-31 19:18:01.686 [main] DEBUG nextflow.config.ConfigBuilder - Found config base: /tools/nextflow/assets/SGI-Pipelines/GeneFlagging/nextflow.config
May-31 19:18:01.690 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /tools/nextflow/assets/SGI-Pipelines/GeneFlagging/nextflow.config
May-31 19:18:01.825 [main] DEBUG nextflow.config.ConfigBuilder - Setting config profile: 'standard'
May-31 19:18:01.874 [main] WARN  nextflow.config.ConfigBuilder - It seems you never run this project before -- Option `-resume` is ignored
May-31 19:18:01.936 [main] DEBUG nextflow.Session - Session uuid: 57c9f9a1-cc45-41cd-be69-4bc9818b2b3e
May-31 19:18:01.936 [main] DEBUG nextflow.Session - Run name: big_jennings
May-31 19:18:01.936 [main] DEBUG nextflow.Session - Executor pool size: 4
May-31 19:18:01.951 [main] DEBUG nextflow.cli.CmdRun -
  Version: 0.24.4 build 4341
  Modified: 22-05-2017 11:18 UTC
  System: Linux 4.4.0-64-generic
  Runtime: Groovy 2.4.10 on OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11
  Encoding: UTF-8 (UTF-8)
  Process: 4001@ip-172-20-22-81 [172.20.22.81]
  CPUs: 4 - Mem: 15.7 GB (14.3 GB) - Swap: 0 (0)
May-31 19:18:02.043 [main] DEBUG nextflow.Session - Work-dir: /mnt/efs/ubuntu/nextflow.run.47342766-b597-4f67-8726-82c4e9a0d267/work [nfs]
May-31 19:18:02.044 [main] DEBUG nextflow.Session - Script base path does not exist or is not a directory: /tools/nextflow/assets/SGI-Pipelines/GeneFlagging/bin
May-31 19:18:02.428 [main] DEBUG nextflow.Session - Session start invoked
May-31 19:18:02.432 [main] DEBUG nextflow.processor.TaskDispatcher - Dispatcher > start
May-31 19:18:02.433 [main] DEBUG nextflow.trace.TraceFileObserver - Flow starting -- trace file: /mnt/efs/ubuntu/nextflow.run.47342766-b597-4f67-8726-82c4e9a0d267/trace.tsv
May-31 19:18:02.446 [main] DEBUG nextflow.script.ScriptRunner - > Script parsing
May-31 19:18:10.194 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
May-31 19:18:10.197 [main] WARN  nextflow.script.ScriptBinding - Access to undefined parameter `clean_up` -- Initialise it to a default value eg. `params.clean_up = some_value`
May-31 19:18:10.234 [main] DEBUG nextflow.file.FileHelper - Creating a file system instance for provider: S3FileSystemProvider
May-31 19:18:10.245 [main] DEBUG nextflow.file.FileHelper - AWS S3 config details: {region=us-west-2}
May-31 19:18:10.989 [main] DEBUG nextflow.processor.ProcessFactory - << taskConfig executor: slurm
May-31 19:18:10.989 [main] DEBUG nextflow.processor.ProcessFactory - >> processorType: 'slurm'
May-31 19:18:11.000 [main] DEBUG nextflow.executor.Executor - Initializing executor: slurm
May-31 19:18:11.002 [main] INFO  nextflow.executor.Executor - [warm up] executor > slurm
May-31 19:18:11.009 [main] DEBUG n.processor.TaskPollingMonitor - Creating task monitor for executor 'slurm' > capacity: 100; pollInterval: 5s; dumpInterval: 5m
May-31 19:18:11.012 [main] DEBUG nextflow.processor.TaskDispatcher - Starting monitor: TaskPollingMonitor
May-31 19:18:11.013 [main] DEBUG n.processor.TaskPollingMonitor - >>> barrier register (monitor: slurm)
May-31 19:18:11.016 [main] DEBUG nextflow.executor.Executor - Invoke register for executor: slurm
May-31 19:18:11.017 [main] DEBUG n.executor.AbstractGridExecutor - Creating executor 'slurm' > queue-stat-interval: 1m
May-31 19:18:11.056 [main] DEBUG nextflow.Session - >>> barrier register (process: fetch_blast_nr_db)
May-31 19:18:11.069 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > fetch_blast_nr_db -- maxForks: 4
May-31 19:18:11.091 [main] DEBUG nextflow.processor.ProcessFactory - << taskConfig executor: slurm
May-31 19:18:11.091 [main] DEBUG nextflow.processor.ProcessFactory - >> processorType: 'slurm'
May-31 19:18:11.091 [main] DEBUG nextflow.executor.Executor - Initializing executor: slurm
May-31 19:18:11.092 [main] DEBUG n.executor.AbstractGridExecutor - Creating executor 'slurm' > queue-stat-interval: 1m
May-31 19:18:11.092 [main] DEBUG nextflow.Session - >>> barrier register (process: fetch_pfam_db)
May-31 19:18:11.095 [Actor Thread 2] DEBUG nextflow.processor.TaskProcessor - <fetch_blast_nr_db> Poison pill arrived
May-31 19:18:11.102 [Actor Thread 1] DEBUG nextflow.processor.StateObj - <fetch_blast_nr_db> State before poison: StateObj[submitted: 1; completed: 0; poisoned: false ]
May-31 19:18:11.110 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > fetch_pfam_db -- maxForks: 4
May-31 19:18:11.128 [main] DEBUG nextflow.processor.ProcessFactory - << taskConfig executor: slurm
May-31 19:18:11.128 [main] DEBUG nextflow.processor.ProcessFactory - >> processorType: 'slurm'
May-31 19:18:11.128 [main] DEBUG nextflow.executor.Executor - Initializing executor: slurm
May-31 19:18:11.129 [main] DEBUG n.executor.AbstractGridExecutor - Creating executor 'slurm' > queue-stat-interval: 1m
May-31 19:18:11.129 [main] DEBUG nextflow.Session - >>> barrier register (process: fetch_superfam_db)
May-31 19:18:11.136 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > fetch_superfam_db -- maxForks: 4
May-31 19:18:11.138 [Actor Thread 4] DEBUG nextflow.processor.TaskProcessor - <fetch_pfam_db> Poison pill arrived
May-31 19:18:11.138 [Actor Thread 1] DEBUG nextflow.processor.StateObj - <fetch_pfam_db> State before poison: StateObj[submitted: 1; completed: 0; poisoned: false ]
May-31 19:18:11.157 [Actor Thread 6] DEBUG nextflow.processor.TaskProcessor - <fetch_superfam_db> Poison pill arrived
May-31 19:18:11.162 [main] DEBUG nextflow.processor.ProcessFactory - << taskConfig executor: slurm
May-31 19:18:11.162 [main] DEBUG nextflow.processor.ProcessFactory - >> processorType: 'slurm'
May-31 19:18:11.162 [Actor Thread 1] DEBUG nextflow.processor.StateObj - <fetch_superfam_db> State before poison: StateObj[submitted: 1; completed: 0; poisoned: false ]
May-31 19:18:11.163 [main] DEBUG nextflow.executor.Executor - Initializing executor: slurm
May-31 19:18:11.163 [main] DEBUG n.executor.AbstractGridExecutor - Creating executor 'slurm' > queue-stat-interval: 1m
May-31 19:18:11.164 [main] DEBUG nextflow.Session - >>> barrier register (process: order_by_gc)
May-31 19:18:11.172 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > order_by_gc -- maxForks: 4
May-31 19:18:11.197 [main] DEBUG nextflow.processor.ProcessFactory - << taskConfig executor: slurm
May-31 19:18:11.198 [Actor Thread 8] DEBUG nextflow.processor.TaskProcessor - <order_by_gc> Poison pill arrived
May-31 19:18:11.203 [Actor Thread 1] DEBUG nextflow.processor.StateObj - <order_by_gc> State before poison: StateObj[submitted: 1; completed: 0; poisoned: false ]
May-31 19:18:11.198 [main] DEBUG nextflow.processor.ProcessFactory - >> processorType: 'slurm'
May-31 19:18:11.205 [main] DEBUG nextflow.executor.Executor - Initializing executor: slurm
May-31 19:18:11.205 [main] DEBUG n.executor.AbstractGridExecutor - Creating executor 'slurm' > queue-stat-interval: 1m
May-31 19:18:11.206 [main] DEBUG nextflow.Session - >>> barrier register (process: contig_stats)
May-31 19:18:11.224 [Actor Thread 9] DEBUG nextflow.processor.TaskProcessor - Copying to process workdir foreign file: s3:///sgi-pipeline-dev/geneflagging_data/contigs.fa
May-31 19:18:11.240 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > contig_stats -- maxForks: 4
May-31 19:18:11.269 [Actor Thread 10] DEBUG nextflow.processor.TaskProcessor - <contig_stats> Poison pill arrived
May-31 19:18:11.271 [Actor Thread 12] DEBUG nextflow.processor.TaskProcessor - Copying to process workdir foreign file: s3:///sgi-pipeline-dev/geneflagging_data/contigs.fa
May-31 19:18:11.278 [Actor Thread 1] DEBUG nextflow.processor.StateObj - <contig_stats> State before poison: StateObj[submitted: 1; completed: 0; poisoned: false ]
May-31 19:18:11.288 [Actor Thread 3] INFO  nextflow.processor.TaskProcessor - [skipping] Stored process > fetch_blast_nr_db
May-31 19:18:11.293 [Actor Thread 7] INFO  nextflow.processor.TaskProcessor - [skipping] Stored process > fetch_superfam_db
May-31 19:18:11.293 [Actor Thread 5] INFO  nextflow.processor.TaskProcessor - [skipping] Stored process > fetch_pfam_db
May-31 19:18:11.295 [Actor Thread 1] DEBUG nextflow.processor.TaskProcessor - <fetch_pfam_db> Sending poison pills and terminating process
May-31 19:18:11.298 [Actor Thread 13] DEBUG nextflow.processor.TaskProcessor - <fetch_superfam_db> Sending poison pills and terminating process
May-31 19:18:11.299 [Actor Thread 6] DEBUG nextflow.processor.TaskProcessor - <fetch_superfam_db> After stop
May-31 19:18:11.299 [Actor Thread 2] DEBUG nextflow.processor.TaskProcessor - <fetch_blast_nr_db> After stop
May-31 19:18:11.299 [Actor Thread 14] DEBUG nextflow.processor.TaskProcessor - <fetch_blast_nr_db> Sending poison pills and terminating process
May-31 19:18:11.301 [Actor Thread 4] DEBUG nextflow.processor.TaskProcessor - <fetch_pfam_db> After stop
May-31 19:18:11.303 [Actor Thread 1] DEBUG nextflow.Session - <<< barrier arrive (process: fetch_pfam_db)
May-31 19:18:11.304 [Actor Thread 14] DEBUG nextflow.Session - <<< barrier arrive (process: fetch_blast_nr_db)
May-31 19:18:11.308 [Actor Thread 13] DEBUG nextflow.Session - <<< barrier arrive (process: fetch_superfam_db)
May-31 19:18:11.892 [Actor Thread 9] DEBUG nextflow.executor.GridTaskHandler - Launching process > order_by_gc (1) -- work folder: /mnt/efs/ubuntu/nextflow.run.47342766-b597-4f67-8726-82c4e9a0d267/work/c2/2abdaf28c7057602e6edbf6dd85444
May-31 19:18:12.072 [Actor Thread 12] DEBUG nextflow.executor.GridTaskHandler - Launching process > contig_stats (1) -- work folder: /mnt/efs/ubuntu/nextflow.run.47342766-b597-4f67-8726-82c4e9a0d267/work/c3/9e29e54e126749a4cda9a366422f24
May-31 19:18:12.073 [Actor Thread 9] INFO  nextflow.Session - [c2/2abdaf] Submitted process > order_by_gc (1)
May-31 19:18:12.090 [Actor Thread 8] DEBUG nextflow.processor.TaskProcessor - <order_by_gc> After stop
May-31 19:18:12.181 [Actor Thread 12] INFO  nextflow.Session - [c3/9e29e5] Submitted process > contig_stats (1)
May-31 19:18:12.185 [Actor Thread 10] DEBUG nextflow.processor.TaskProcessor - <contig_stats> After stop
May-31 19:23:16.031 [Thread-3] DEBUG n.processor.TaskPollingMonitor - !! executor slurm > tasks to be completed: 2 -- first: TaskHandler[jobId: 2; id: 4; name: order_by_gc (1); status: RUNNING; exit: -; workDir: /mnt/efs/ubuntu/nextflow.run.47342766-b597-4f67-8726-82c4e9a0d267/work/c2/2abdaf28c7057602e6edbf6dd85444 started: 1496258571038; exited: -; ]
May-31 19:23:26.058 [Thread-3] WARN  nextflow.processor.TaskProcessor - Process `order_by_gc (1)` terminated with an error exit status (125) -- Execution is retried (1)
May-31 19:23:26.077 [pool-2-thread-1] DEBUG nextflow.executor.GridTaskHandler - Launching process > order_by_gc (1) -- work folder: /mnt/efs/ubuntu/nextflow.run.47342766-b597-4f67-8726-82c4e9a0d267/work/88/a767bf3eb8c490cc267e6e25ec21b2
May-31 19:23:26.201 [pool-2-thread-1] INFO  nextflow.Session - [88/a767bf] Re-submitted process > order_by_gc (1)
May-31 19:23:26.203 [Thread-3] WARN  nextflow.processor.TaskProcessor - Process `contig_stats (1)` terminated with an error exit status (125) -- Execution is retried (1)
May-31 19:23:26.220 [pool-2-thread-2] DEBUG nextflow.executor.GridTaskHandler - Launching process > contig_stats (1) -- work folder: /mnt/efs/ubuntu/nextflow.run.47342766-b597-4f67-8726-82c4e9a0d267/work/23/d2a1c73986ef90e0e685cf32def4df
May-31 19:23:26.329 [pool-2-thread-2] INFO  nextflow.Session - [23/d2a1c7] Re-submitted process > contig_stats (1)
May-31 19:24:01.085 [Thread-3] ERROR nextflow.processor.TaskProcessor - Error executing process > 'order_by_gc (1)'

Caused by:
  Process `order_by_gc (1)` terminated with an error exit status (125)

Command executed [/tools/nextflow/assets/SGI-Pipelines/GeneFlagging/templates/orderFastaByGc.py]:

  #!/usr/bin/env python

  from gene_annotation.fasta.sortFastaByGc import sortFastaByGc

  def main():

      inputFile = 'contigs.fa'
      outputFile = 'contigsByGc.fa'

      sortFastaByGc(inputFile, outputFile)

  if __name__ == '__main__':
      main()

Command exit status:
  125

Command output:
  (empty)

Command error:
  Unable to find image 'dockreg-dev01.awsv.l.synthgeno.global/compbio/gene_annotation:0.1.0' locally
  docker: Error response from daemon: Get https://dockreg-dev01.awsv.l.synthgeno.global/v1/_ping: dial tcp 172.17.0.30:443: getsockopt: no route to host.
  See 'docker run --help'.

Work dir:
  /mnt/efs/ubuntu/nextflow.run.47342766-b597-4f67-8726-82c4e9a0d267/work/88/a767bf3eb8c490cc267e6e25ec21b2

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
May-31 19:24:01.088 [Actor Thread 13] DEBUG nextflow.processor.TaskProcessor - <order_by_gc> Sending poison pills and terminating process
May-31 19:24:01.088 [Actor Thread 13] DEBUG nextflow.Session - <<< barrier arrive (process: order_by_gc)
May-31 19:24:01.097 [Thread-3] DEBUG nextflow.Session - Session aborted -- Cause: Process `order_by_gc (1)` terminated with an error exit status (125)
May-31 19:24:01.119 [Actor Thread 13] DEBUG nextflow.processor.TaskProcessor - <contig_stats> Sending poison pills and terminating process
May-31 19:24:01.120 [Actor Thread 13] DEBUG nextflow.Session - <<< barrier arrive (process: contig_stats)
May-31 19:24:01.122 [Thread-3] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: slurm)
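
For reference, a minimal workflow.onComplete block of the kind mentioned in the report above might look like the following (a generic sketch, not the reporter's actual code):

  workflow.onComplete {
      // print a short run summary whether the pipeline succeeded or failed
      println "Completed at: ${workflow.complete}"
      println "Duration    : ${workflow.duration}"
      println "Status      : ${workflow.success ? 'OK' : 'failed'}"
  }
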
pditommaso commented 7 years ago

In the log file I can't see the Await termination entry written by this method. This means there's something that hangs the execution of your script, which never reaches termination.

Possible candidates: a manually managed channel that is not closed correctly, or a .val / .getVal() applied to an empty channel.
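
To make that concrete, a hypothetical sketch (using the channel API of that Nextflow release, not code taken from this pipeline):

  // a manually created channel must be terminated explicitly; without the
  // STOP signal the downstream operators keep waiting and the run never ends
  ch = Channel.create()
  ch << 'sample_1'
  ch << 'sample_2'
  ch << Channel.STOP      // forgetting this line leaves the pipeline hanging at shutdown

  // likewise .val / .getVal() block until a value is available, so calling
  // them on a channel that never receives one blocks forever
  empty = Channel.create()
  x = empty.val           // <-- never returns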

msmootsgi commented 7 years ago

I'm going to piggyback on this ticket, because now I'm not sure the hanging is related to the error above. I've now seen the hanging behavior multiple times. I've attached an example nextflow.log and jstack output.

The behavior I see is that the "tasks to be completed" task listed in the log, blast_clusters_parse (173), has been submitted, the directory work/bc/0709... has been created and the .command.run* files have been written, but none of the file inputs from the channel have been symlinked into the directory. There are no tasks in the slurm queue and the nextflow process is still running.

frozen.jstack.out.txt frozen.nextflow.log.txt

In a separate case, I saw the identical behavior, but the only difference was that the blast_clusters_parse task had been submitted and failed because of a docker error and had been resubmitted. The resubmitted task was the one hanging.

I see the same behavior with 0.24.4 and 0.25.0-RC4.

pditommaso commented 7 years ago

There's at least one task which does not start as expected. These are the relevant entries in the log:

Jun-21 22:43:07.301 [Pending tasks thread] DEBUG nextflow.executor.GridTaskHandler - Launching process > blast_clusters_parse (173) -- work folder: /mnt/efs/nextflow/run.b0b42c39-12cf-411a-b4d5-bd8730d0fe59/work/bc/0709426de81ca3dcb79e4827d78605
Jun-21 22:43:07.419 [Pending tasks thread] INFO  nextflow.Session - [bc/070942] Submitted process > blast_clusters_parse (173)
: 
Jun-21 22:56:02.200 [Running tasks thread] DEBUG n.processor.TaskPollingMonitor - !! executor slurm > tasks to be completed: 12 -- first: TaskHandler[jobId: 14953; id: 4834; name: blast_clusters_parse (173); status: SUBMITTED; exit: -; workDir: /mnt/efs/nextflow/run.b0b42c39-12cf-411a-b4d5-bd8730d0fe59/work/bc/0709426de81ca3dcb79e4827d78605 started: -; exited: -; ]
:
Jun-21 23:41:02.737 [Running tasks thread] DEBUG n.processor.TaskPollingMonitor - !! executor slurm > tasks to be completed: 2 -- first: TaskHandler[jobId: 14953; id: 4834; name: blast_clusters_parse (173); status: SUBMITTED; exit: -; workDir: /mnt/efs/nextflow/run.b0b42c39-12cf-411a-b4d5-bd8730d0fe59/work/bc/0709426de81ca3dcb79e4827d78605 started: -; exited: -; ]

The task remains in SUBMITTED status; this means it has been submitted for execution to SLURM, which assigned it job-id 14953. When a task is executed, the first thing the .command.run wrapper does is create the .command.begin marker. NF uses this file, or the .exitcode file, to detect that the job has started. Hence, there are three possibilities:

1) SLURM for some reason didn't execute the task.
2) The task was executed but failed immediately (but then at least the .exitcode file should exist).
3) For some reason the files on the file system were not written / got lost.

Do you have any way to troubleshoot these conditions? Does SLURM keep a history/accounting of the jobs executed?

msmootsgi commented 7 years ago

I have full access to the cluster, so I should be able to sort this out. While I've got logging enabled, I didn't have accounting or job completion logging enabled. Gonna enable those now and try to reproduce!

msmootsgi commented 7 years ago

I believe I've found one problem. When a cluster node is idle, it has logic to shut itself down. In one case I saw that a node had decided to shut itself down, but in the window between when it decided to shut down and when it was removed from the slurm configuration, a job got submitted and somehow got lost in the shuffle. I believe I've fixed this by having the node set itself to the DOWN state immediately before shutting down.

However, despite this fix, I'm still seeing a process get lost. Just like above, nextflow submits the job, the work dir and .command scripts get created, but there are no symlinked files or .exitcode. In this case the slurm job_comp.log lists the job and says that it completed. The node in question did not go down when the job was supposedly running.

I wonder if I can write a slurm epilog script that double-checks whether the .exitcode file actually exists? I'm not sure what else I can do to debug this.

pditommaso commented 7 years ago

In this case the slurm job_comp.log lists the job and says that it completed.

The first thing the job wrapper does is create a file named .command.begin to mark the job as started, so I don't see how it can complete without creating that file. Could it be an NFS problem?

Also, I would try running this with process.scratch = true, so that each task runs on the node's local storage and the results are copied back to the shared folder on job completion.
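
For example, in nextflow.config this would be something like the following (a minimal sketch; the executor line simply mirrors what this run already uses):

  process {
      executor = 'slurm'
      // run each task in node-local scratch storage and copy the declared
      // outputs back to the shared work directory when the task completes
      scratch = true
  }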

pditommaso commented 7 years ago

I'm going to close this because there's no more feedback. Feel free to comment / reopen if needed.

msmootsgi commented 7 years ago

Sorry for the lack of feedback - it just took a while to reconfigure things so that I could actually try process.scratch=true. That DID seem to help, so perhaps what I was seeing were weird NFS/EFS problems. If I run into anything reproducible I'll reopen.

stevekm commented 5 years ago

Don't have much to add to the discussion, but our SLURM system is having a lot of issues with the new GPFS storage system and I am seeing similar effects in my pipeline. Nextflow hangs, seemingly for days, after some tasks complete successfully. I have not tried enabling scratch yet, because we also have issues keeping node tmp from filling up, and I am not clear whether the issues with GPFS would still come into play (given the need to copy from the scratch dir on /tmp back to the work dir on /gpfs).

sanderthierens commented 1 year ago

We have the same issue with a setup using SLURM as an executor. After a while it seems like nextflow stops submitting the next task to run. However, if I open a new ssh connection to the managing node (where the nextflow script is running) it resumes. No error messages can be found in either the slurmctld.log file or the .nextflow.log. While it stalls, I tried executing an sbatch command to confirm that SLURM itself is not hanging, and that test job ran without any problems. I also noticed there is a big time difference between the timestamp in slurmctld.log indicating the completion of a job and the timestamp of that job's completion in the .nextflow.log (it can differ by more than 1h).