nextflow-io / nf-prov

Apache License 2.0

Manifest does not include published outputs from cached tasks #22

Closed (robsyme closed 5 months ago)

robsyme commented 5 months ago

When a run is resumed with -resume, the nf-prov plugin omits the published outputs of cached tasks from the manifest.

Users will likely expect a record of data provenance to be independent of run-specific details, such as whether a task was executed in the current run or restored from the cache of a previous one.

This can be confirmed by running a dummy pipeline:

process MakeFile {
    publishDir "results"
    output: path("out.txt")
    script: "echo done > out.txt"
}

workflow {
    MakeFile()
}

with configuration:

plugins {
  id 'nf-prov'
}

prov {
  enabled = true
  formats {
    legacy {
      file = 'manifest.json'
      overwrite = true
    }
  }
}

On the first run, the manifest.json contains (with some paths truncated):

{
    "pipeline": null,
    "published": [
        {
            "source": "work/eb/25ff99ca14310428cec6a13c17435c/out.txt",
            "target": "results/out.txt",
            "publishingTaskId": "1"
        }
    ],
    "tasks": {
        "1": {
            "id": "1",
            "name": "MakeFile",
            "cached": false,
            "process": "MakeFile",
            "inputs": [

            ],
            "outputs": [
                {
                    "name": null,
                    "emit": null,
                    "value": "work/eb/25ff99ca14310428cec6a13c17435c/out.txt"
                }
            ]
        }
    }
}

But if we run again with -resume, the "published" list is empty:

{
    "pipeline": null,
    "published": [

    ],
    "tasks": {
        "1": {
            "id": "1",
            "name": "MakeFile",
            "cached": true,
            "process": "MakeFile",
            "inputs": [

            ],
            "outputs": [
                {
                    "name": null,
                    "emit": null,
                    "value": "work/eb/25ff99ca14310428cec6a13c17435c/out.txt"
                }
            ]
        }
    }
}

bentsherman commented 5 months ago

Which version of Nextflow are you using? There was a related bug in Nextflow that was fixed in 23.04.4.

robsyme commented 5 months ago

23.10.1

I see the same empty "published" list behaviour on 23.04.4 as well.

bentsherman commented 5 months ago

This is happening because the published files are symlinked by default. On a resume, the symlink from the previous run already exists at the target and resolves to the source, so the source and target are considered the "same real path" and the publish event is not sent:

https://github.com/nextflow-io/nextflow/blob/4debd56e4ee425d7d1766b9015ad14b2ed5b0a00/modules/nextflow/src/main/groovy/nextflow/processor/PublishDir.groovy#L384-L401

For now, it can be worked around by using the copy publish mode. In the long term, I guess the publish event should still be emitted in this case, but the issue is with Nextflow rather than nf-prov.
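
A minimal sketch of that workaround, applied to the dummy pipeline from the original report. The only change is the `mode: 'copy'` option on `publishDir`, which makes Nextflow copy the file into `results` instead of symlinking it. Whether the manifest is then populated on a resumed run has not been verified here; it is an expectation based on the explanation above.

```nextflow
process MakeFile {
    // Copy the output into results/ rather than symlinking it (the default),
    // so the published target is a distinct file from the work-dir source.
    publishDir "results", mode: 'copy'

    output:
    path("out.txt")

    script:
    "echo done > out.txt"
}

workflow {
    MakeFile()
}
```

The trade-off is extra disk usage, because each published file then exists both in the work directory and in `results`.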

robsyme commented 5 months ago

Ah, gotcha. Thanks for the clarification! Will close this out for now.