labsyspharm / mcmicro

Multiple-choice microscopy pipeline
https://mcmicro.org/
MIT License

SLURM and nextflow going out of sync #169

Open ArtemSokolov opened 4 years ago

ArtemSokolov commented 4 years ago

There appears to be a recurring problem where quantification jobs finish but are detected as failures, with the following error:

Caused by:
  Process `quantification (57)` terminated for an unknown reason -- Likely it has been terminated by the external system

The Python process finishes and produces the expected output in the corresponding work directory. However, the file never gets published to quantification/, because nextflow detects (or fails to detect) something and terminates the entire pipeline run.

Possible explanation: .exitcode is written to scratch3 before the output files, causing nextflow to look for output files that are not there yet (as described in https://github.com/nextflow-io/nextflow/issues/931).

Starting point for a possible minimal reproducible example: core 66.tif from TMA11

ArtemSokolov commented 4 years ago

Temporary workaround: use the -resume feature.

The outputs are actually properly produced by the quantification module, but nextflow looks for them before SLURM finishes writing them to the working directory. As a result, the pipeline terminates, but the files eventually appear in the workdir. Thus, -resume will detect the presence of those output files and treat them as cached results for the process.

ArtemSokolov commented 4 years ago

This was observed to happen with ASHLAR as well.

ArtemSokolov commented 4 years ago

Issue reported to nextflow devs: https://github.com/nextflow-io/nextflow/issues/1644

Unfortunately, since it's somewhat intermittent and difficult to reproduce consistently, it may be a while before this is fully resolved.

pditommaso commented 3 years ago

We have seen this happen when the shared file system has an aggressive caching strategy, so the remote node (running nextflow) is not able to detect the files created by the compute node, in particular the .exitcode file.

A possible solution is to increase exitReadTimeout to a higher value. See here for details: https://www.nextflow.io/docs/latest/config.html#scope-executor
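
For reference, a minimal sketch of how that setting might look in nextflow.config. The '10 min' value is an assumption chosen purely for illustration; nobody in this thread has suggested or tested a specific value, so it would likely need tuning against how long the shared file system takes to expose files written by the compute node.

```groovy
// nextflow.config (sketch; the timeout value is an illustrative assumption)
executor {
    // Wait longer for the task's .exitcode file to become visible on the
    // shared file system before reporting the task as terminated.
    exitReadTimeout = '10 min'
}
```

A longer timeout only gives the file system more time to catch up. It also means genuinely killed jobs take longer to be reported as failures.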

ArtemSokolov commented 3 years ago

Thank you, Paolo. We'll play around with it.