Closed: msmootsgi closed this issue 7 years ago
In the log file I can't see the `Await termination` entry written by this method. This means something hangs the execution of your script, which never reaches termination. Possible candidates: a manually managed channel not closed correctly, or a `.val` / `.getVal()` applied to an empty channel.
I'm going to piggyback on this ticket, because now I'm not sure the hanging is related to the error above. I've now seen the hanging behavior multiple times. I've attached an example nextflow.log and jstack output.
The behavior I see is that the task listed in the logs as still to be completed, `blast_clusters_parse (173)`, has been submitted, the directory `work/bc/0709...` has been created, and the `.command.run*` files have been created, but none of the file inputs from the channel have been symlinked into the directory. There are no tasks in the SLURM queue and the Nextflow process is still running.
frozen.jstack.out.txt frozen.nextflow.log.txt
In a separate case I saw identical behavior; the only difference was that the `blast_clusters_parse` task had been submitted, had failed because of a Docker error, and had been resubmitted. The resubmitted task was the one hanging.
I see the same behavior with 0.24.4 and 0.25.0-RC4.
There's at least one task which does not start as expected. These are the entries in the log:
```
Jun-21 22:43:07.301 [Pending tasks thread] DEBUG nextflow.executor.GridTaskHandler - Launching process > blast_clusters_parse (173) -- work folder: /mnt/efs/nextflow/run.b0b42c39-12cf-411a-b4d5-bd8730d0fe59/work/bc/0709426de81ca3dcb79e4827d78605
Jun-21 22:43:07.419 [Pending tasks thread] INFO nextflow.Session - [bc/070942] Submitted process > blast_clusters_parse (173)
:
Jun-21 22:56:02.200 [Running tasks thread] DEBUG n.processor.TaskPollingMonitor - !! executor slurm > tasks to be completed: 12 -- first: TaskHandler[jobId: 14953; id: 4834; name: blast_clusters_parse (173); status: SUBMITTED; exit: -; workDir: /mnt/efs/nextflow/run.b0b42c39-12cf-411a-b4d5-bd8730d0fe59/work/bc/0709426de81ca3dcb79e4827d78605 started: -; exited: -; ]
:
Jun-21 23:41:02.737 [Running tasks thread] DEBUG n.processor.TaskPollingMonitor - !! executor slurm > tasks to be completed: 2 -- first: TaskHandler[jobId: 14953; id: 4834; name: blast_clusters_parse (173); status: SUBMITTED; exit: -; workDir: /mnt/efs/nextflow/run.b0b42c39-12cf-411a-b4d5-bd8730d0fe59/work/bc/0709426de81ca3dcb79e4827d78605 started: -; exited: -; ]
```
The task remains in `SUBMITTED` status, which means it has been submitted for execution to SLURM, which assigned the job-id 14953. When a task is executed, the first thing the `.command.run` wrapper does is to create the `.command.begin` marker. NF uses this file, or the `.exitcode` file, to detect that the job has started. Hence, there could be three possibilities:
1) SLURM for some reason didn't execute the task
2) The task was executed but it failed immediately (but at least the `.exitcode` file should exist)
3) For some reason the files were not written to the file system / got lost.
Do you have any way to troubleshoot these conditions? Does SLURM keep a history/accounting of the jobs executed?
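If accounting is enabled, a query along these lines should show what SLURM did with job 14953 (a sketch; the exact fields available depend on your SLURM version and accounting configuration):

```bash
# Query SLURM's accounting records for the stuck job seen in the log above.
# Requires the accounting storage plugin (and/or job completion logging).
sacct -j 14953 --format=JobID,JobName,State,ExitCode,Start,End,NodeList
```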
I have full access to the cluster, so I should be able to sort this out. While I've got logging enabled, I didn't have accounting or job completion logging enabled. Gonna enable those now and try to reproduce!
I believe I've found one problem. When a cluster node is idle it has logic to shut itself down. In one case I saw that a node had decided to shut itself down, but in the time between when it decided to shut down and when the node was removed from the slurm configuration a job got submitted and somehow got lost in the shuffle. I believe I've fixed this by having the node set itself to the DOWN state immediately before shutting down.
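Roughly, the pre-shutdown step looks like this (a sketch, not the exact script; it assumes the node's short hostname matches its SLURM node name):

```bash
# Mark the node DOWN in SLURM right before powering off, so the controller
# stops dispatching jobs to it during the shutdown window.
scontrol update NodeName="$(hostname -s)" State=DOWN Reason="idle auto-shutdown"
```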
However, despite this fix, I'm still seeing a process get lost. Just like above, Nextflow submits the job, the work dir and `.command` scripts get created, but there are no symlinked files or `.exitcode`. In this case the SLURM job_comp.log lists the job and says that it completed. The node in question did not go down while the job was supposedly running.
I wonder if I can write a SLURM epilog script that double-checks whether the `.exitcode` actually exists? Not sure what else I can do to debug this.
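Something along these lines is what I have in mind (just a sketch; it assumes the WorkDir recorded by SLURM for the job is the Nextflow task work dir, and the log path is arbitrary):

```bash
#!/bin/bash
# SLURM epilog sketch: flag completed jobs whose Nextflow work dir is missing
# the .exitcode marker. The work dir is read from the job record via scontrol,
# since SLURM_SUBMIT_DIR may not be exported to every epilog variant.
workdir=$(scontrol show job "$SLURM_JOB_ID" | awk -F= '/^ *WorkDir=/{print $2}')

if [[ -n "$workdir" && ! -e "$workdir/.exitcode" ]]; then
    echo "$(date) job $SLURM_JOB_ID completed but $workdir/.exitcode is missing" \
        >> /var/log/slurm/missing_exitcode.log
fi
```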
> In this case the slurm job_comp.log lists the job and says that it completed.
The first thing the job wrapper does is to create a file named `.command.begin` to mark the job as started. So I don't see how it can complete without creating that file. Could it be an NFS problem?
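In essence the wrapper does something like this (a heavily simplified sketch, not the actual generated script):

```bash
# Simplified sketch of the marker-file handshake performed by .command.run
# (the real wrapper generated by Nextflow does much more than this).
touch .command.begin                              # tells NF the job has started
bash .command.sh > .command.out 2> .command.err   # runs the actual task script
echo $? > .exitcode                               # tells NF the job has finished
```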
Also, I would try to run this with `process.scratch=true`, so that the task will run on the node's local storage and the results will be copied to the shared folder on job completion.
I'm going to close this because there's no feedback anymore. Feel free to comment / reopen if needed.
Sorry for the lack of feedback - it just took a while to reconfigure things so that I could actually try `process.scratch=true`. That DID seem to help, so perhaps what I was seeing were weird NFS/EFS problems. If I run into anything reproducible I'll reopen.
Don't have much to add to the discussion, but our SLURM system is having a lot of issues with the new GPFS storage system and I am getting similar effects on my pipeline. Nextflow hangs, seemingly for days, after some tasks complete successfully. I haven't tried enabling `scratch` yet because we also have issues keeping node tmp from filling up, and I am not clear whether the issues with GPFS would still come into play (the need to copy from the scratch dir on /tmp back to the work dir on /gpfs).
We got the same issue with a setup using SLURM as the executor. After a while it seems like Nextflow stops sending the next task to run. However, if I create a new ssh connection to the managing node (where the nextflow script is running), it resumes. No error messages can be found in either the slurmctld.log file or the .nextflow.log. While it was stalled I tried executing an sbatch command to confirm that SLURM itself is not hanging, and that test job ran without any problems. I also noticed there is a big time difference between the timestamp in slurmctld.log indicating the completion of a job and the timestamp of that job's completion in the .nextflow.log (it can differ by more than 1h).
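The probe I used was roughly equivalent to this (not the exact command):

```bash
# Submit a throwaway test job to check that the scheduler still accepts and
# runs work while Nextflow appears stalled.
jobid=$(sbatch --parsable --wrap='hostname')
squeue -j "$jobid"
```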
Nextflow seems to be hanging after the pipeline fails when running on a SLURM cluster. Here is the full .nextflow.log. This particular pipeline has a `workflow.onComplete` block that also isn't being called.