vib-singlecell-nf / vsn-pipelines

A repository of pipelines for single-cell data in Nextflow DSL2
GNU General Public License v3.0

[BUG] Publish processes are not cached when resuming #258

Open cflerin opened 3 years ago

cflerin commented 3 years ago

**Describe the bug**
When running the pipeline a second time with the -resume option, the publish processes always re-run instead of being restored from the cache.

**To Reproduce**
Steps to reproduce the behavior:

1. Run one of the test workflows:

        nextflow pull vib-singlecell-nf/vsn-pipelines -r v0.21.0
        nextflow run vib-singlecell-nf/vsn-pipelines -profile scenic,test__scenic,singularity -entry scenic -r v0.21.0

2. Re-run using resume:

        nextflow run vib-singlecell-nf/vsn-pipelines -profile scenic,test__scenic,singularity -entry scenic -r v0.21.0 -resume

3. Publish steps are not cached:

        $ nextflow run vib-singlecell-nf/vsn-pipelines -profile scenic,test__scenic,singularity -entry scenic -r v0.21.0 -resume
        N E X T F L O W ~ version 20.04.1
        Launching vib-singlecell-nf/vsn-pipelines [reverent_faggin] - revision: 3cc43ce065 [v0.21.0]
        NOTE: Your local project version looks outdated - a different revision is available in the remote repository [b6577d79a5]
        WARN: DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE
        executor > local (2)
        [ff/3b0342] process > scenic:SCENIC:ARBORETO_WITH_MULTIPROCESSING (1) [100%] 1 of 1, cached: 1 ✔
        [3a/5b0e9f] process > scenic:SCENIC:CISTARGETMOTIF (1) [100%] 1 of 1, cached: 1 ✔
        [58/e8f22b] process > scenic:SCENIC:AUCELLMOTIF (1) [100%] 1 of 1, cached: 1 ✔
        [74/45ef7f] process > scenic:SCENIC:VISUALIZE (1) [100%] 1 of 1, cached: 1 ✔
        [64/f374d2] process > scenic:SCENIC:PUBLISH_LOOM (1) [100%] 1 of 1, cached: 1 ✔
        [81/6c7b8d] process > scenic:PUBLISH_SCENIC:COMPRESS_HDF5 (1) [100%] 1 of 1, cached: 1 ✔
        [8c/19d3a4] process > scenic:PUBLISH_SCENIC:SC__PUBLISH (1) [100%] 1 of 1 ✔
        [26/c75c86] process > scenic:PUBLISH_SCENIC:SC__PUBLISH_PROXY (1) [100%] 1 of 1 ✔
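For longer runs it can be tedious to spot the uncached steps by eye. The Python sketch below scans progress lines in the format shown in the log above and lists processes whose `cached:` count is lower than the number of completed tasks. The function name `uncached_processes` and the regular expression are my own assumptions about that line format, not part of Nextflow.

```python
import re

# Matches progress lines such as:
# [8c/19d3a4] process > scenic:PUBLISH_SCENIC:SC__PUBLISH (1) [100%] 1 of 1 ✔
# The ", cached: N" suffix is optional; it is absent when nothing was cached.
LINE_RE = re.compile(
    r"\[(?P<hash>[0-9a-f]{2}/[0-9a-f]{6})\]\s+process\s+>\s+(?P<name>\S+)"
    r".*?(?P<done>\d+) of (?P<total>\d+)(?:, cached: (?P<cached>\d+))?"
)


def uncached_processes(log_text):
    """Return names of processes that ran at least one task without a cache hit."""
    names = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a process progress line
        done = int(match.group("done"))
        cached = int(match.group("cached") or 0)
        if cached < done:
            names.append(match.group("name"))
    return names


# Three lines copied from the log in this report.
log = """\
[74/45ef7f] process > scenic:SCENIC:VISUALIZE (1) [100%] 1 of 1, cached: 1 ✔
[81/6c7b8d] process > scenic:PUBLISH_SCENIC:COMPRESS_HDF5 (1) [100%] 1 of 1, cached: 1 ✔
[8c/19d3a4] process > scenic:PUBLISH_SCENIC:SC__PUBLISH (1) [100%] 1 of 1 ✔
"""
print(uncached_processes(log))  # ['scenic:PUBLISH_SCENIC:SC__PUBLISH']
```

Saving the console output of a `-resume` run to a file and passing its contents to `uncached_processes` should give the list of steps that were re-executed.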



**Expected behavior**
All steps should be cached on resume

**Please complete the following information:**
 - OS: CentOS Linux release 7.8.2003 (Core)
 - Nextflow Version: 20.04.1
 - vsn-pipelines Version: v0.21.0

**Additional context**
N/A

cflerin commented 3 years ago

After some digging, this is caused by the getOutputFileName function (https://github.com/vib-singlecell-nf/vsn-pipelines/blob/b5167f5b31129f51dece9a309dc20dc686612d10/src/utils/processes/utils.nf#L374-L406): replacing the output of this function with a fixed string in the publish functions results in proper and consistent caching of these processes on resume. I think the problem could be the way getOutputFileName is run inside the publish process, which creates dynamic inputs and forces the process to be re-executed every time.
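To illustrate the hypothesis above: a resume mechanism can only reuse a task when the key it derives from the task's definition is identical between runs. The Python sketch below is a toy model of that idea, not Nextflow's actual hashing code. `task_key`, the script string, and the file names are made up for illustration, and the per-run value is simulated with a random UUID rather than anything getOutputFileName is known to do.

```python
import hashlib
import json
import uuid


def task_key(process_name, script, inputs):
    """Toy cache key: a digest over everything that defines the task."""
    payload = json.dumps(
        {"process": process_name, "script": script, "inputs": inputs},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]


script = "cp input.loom output.loom"  # hypothetical task script

# Stable inputs: both "runs" yield the same key, so the second run could be
# served from the cache.
stable_a = task_key("SC__PUBLISH", script, {"outputFileName": "sample.SCENIC.loom"})
stable_b = task_key("SC__PUBLISH", script, {"outputFileName": "sample.SCENIC.loom"})
print(stable_a == stable_b)  # True

# Dynamic inputs (simulated): a value that differs on every run changes the
# key, so the task looks new each time and is re-executed.
dynamic_a = task_key("SC__PUBLISH", script, {"outputFileName": f"sample.{uuid.uuid4().hex}.loom"})
dynamic_b = task_key("SC__PUBLISH", script, {"outputFileName": f"sample.{uuid.uuid4().hex}.loom"})
print(dynamic_a == dynamic_b)  # False
```

If the real inputs are in fact identical between runs, this model predicts a cache hit, so a consistently re-executed process would point to some other part of the task definition changing.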

As a side note, this is a major issue for disk space in larger (e.g. mapping) projects. I ran into a problem where PBS jobs were failing to be submitted late in the atac_preprocess workflow. Each time I re-ran with -resume, all of the upstream published files (fastq, bam files, etc.) were copied again within work/, leaving 100s of GBs of extra data.
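To get a rough idea of how much space these repeated copies take, one option is to look for byte-identical regular files under work/. The Python sketch below is only an assumption about how one might measure this; `redundant_bytes` and `file_digest` are hypothetical helper names, and symlinks are skipped because they occupy no real space.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large BAM/FASTQ files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def redundant_bytes(root):
    """Return (bytes used by extra copies, {digest: [paths]}) for files under root."""
    # Group by size first: only files of equal size can be identical,
    # which avoids hashing most files.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # symlinks and special files cost no extra space
            by_size[os.path.getsize(path)].append(path)

    wasted = 0
    duplicates = {}
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path)].append(path)
        for digest, same in by_digest.items():
            if len(same) > 1:
                duplicates[digest] = sorted(same)
                # Every copy beyond the first is redundant.
                wasted += size * (len(same) - 1)
    return wasted, duplicates
```

Calling `redundant_bytes("work")` from the launch directory returns the redundant byte count and the groups of identical files. Hard-linked copies are counted as duplicates here even though they share storage, so the figure is an upper bound.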

dweemx commented 3 years ago

I understand, this is quite annoying! Calling getOutputFileName looks deterministic to me. Also, if getOutputFileName were the root of the issue, I would expect scenic:PUBLISH_SCENIC:COMPRESS_HDF5 not to resume either (since it also calls this function), but it does. A very intriguing bug.

dweemx commented 3 years ago

I just noticed that the NXF processes that do not resume are the ones using the outputFileName variable in their publishDir directive.

cflerin commented 3 years ago

> I just noticed that the NXF processes that do not resume are the ones using the outputFileName variable in their publishDir directive.

Nice! That's an interesting point there. I hadn't tested COMPRESS_HDF5.

cflerin commented 3 years ago

COMPRESS_HDF5 still appears to cache properly when changing the publishDir output to use outputFileName.