nextflow-io / nf-prov


make workflow named outputs show up in the manifest #23

Closed: stevekm closed this issue 8 months ago

stevekm commented 8 months ago

Right now, the manifest JSON output looks something like this:

{
    "pipeline": null,
    "published": [
        {
            "source": "/pipeline/work/f6/48b7d3a739069878e46051b5a7bbc4/file1.txt",
            "target": "/pipeline/output/file1.txt",
            "publishingTaskId": "16"
        },
        {
            "source": "/pipeline/work/d6/5aede3bcd70eb8ac3fff17b60c033b/file2.txt",
            "target": "/pipeline/output/file2.txt",
            "publishingTaskId": "18"
        },
 ...

However, I am able to define my pipeline's main workflow to have named outputs, like this:

// main.nf
nextflow.enable.dsl=2

include { MY_SUBWORKFLOW } from './workflows/do_things.nf'

workflow {
    main:
    samples_ch = Channel.from(file(params.samplesheet))

    MY_SUBWORKFLOW(samples_ch)

    emit:
    myfiles = MY_SUBWORKFLOW.out.allmyfiles
}

It would be really helpful if we could somehow keep a label such as myfiles associated with the published files, maybe something like this:

{
    "pipeline": null,
    "published": [
        {
            "source": "/pipeline/work/f6/48b7d3a739069878e46051b5a7bbc4/file1.txt",
            "target": "/pipeline/output/file1.txt",
            "publishingTaskId": "16",
            "emit": "myfiles"
        },
        {
            "source": "/pipeline/work/d6/5aede3bcd70eb8ac3fff17b60c033b/file2.txt",
            "target": "/pipeline/output/file2.txt",
            "publishingTaskId": "18",
            "emit": "myfiles"
        },
 ...

This would be really helpful for downstream processing, since you could parse the manifest JSON and identify specific files. For example, if you had an emit channel multiqc_ch for MultiQC files, you could find all the files labeled multiqc_ch and more easily pass them into some other process, such as a chained post-processing workflow.
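
For instance, a rough Groovy sketch of what I have in mind (this assumes the proposed emit field existed in the manifest, and uses a made-up multiqc_ch label and manifest path):

import groovy.json.JsonSlurper

// read the nf-prov manifest and keep only files published from the multiqc_ch emit
def manifest = new JsonSlurper().parse(new File('manifest.json'))
def multiqcFiles = manifest.published
    .findAll { it.emit == 'multiqc_ch' }   // "emit" is the proposed field, it does not exist yet
    .collect { it.target }
multiqcFiles.each { println it }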

@pinin4fjords

I noticed that under the tasks section of the manifest JSON there is already an emit field in the outputs list for each task. However, in all my pipelines so far the value seems to be null. I'm not sure what it was meant to be used for, but maybe that functionality overlaps with this request?

bentsherman commented 8 months ago

I think the original creator of nf-prov tried to associate published outputs with the process emit, but maybe they never got it to work. As long as a file is emitted by any process output channel it can be published, but the same file could be emitted by multiple process outputs (toy example below).
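
For example, nothing stops the same file from being captured by two named outputs, so a published file can't always be pinned to a single emit (made-up names, just to illustrate):

process MAKE_REPORT {
    publishDir 'results', mode: 'copy'

    output:
    path 'report.txt', emit: report
    path 'report.txt', emit: summary   // same file captured again under a second emit

    script:
    """
    echo hello > report.txt
    """
}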

But the problem with your request is that published outputs are not related to workflow emits at all. More fundamentally, I'm not sure that the provenance manifest is the best way to facilitate the chaining of pipelines.

I think we need some kind of workflow output schema that can be easily matched to the input schema of a downstream workflow, without involving workflow emits at all.

Alternatively, you could write a "meta-pipeline" which imports entire pipelines as modules and chains them together with regular dataflow logic. That would use the workflow takes/emits but not the input/output schemas, which in this case would be an unnecessary extra step. I am working on a proof-of-concept for this using fetchngs+rnaseq, and I hope to finish it at the hackathon next week.
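
Roughly this shape (all names here are made up, and it assumes each pipeline exposes a named entry workflow that can be included; this is not the actual fetchngs+rnaseq proof-of-concept):

// meta.nf -- rough sketch of a meta-pipeline
nextflow.enable.dsl=2

// each include points at a named workflow exposed by an existing pipeline
include { PIPELINE_A } from './pipeline-a/main.nf'
include { PIPELINE_B } from './pipeline-b/main.nf'

workflow {
    input_ch = Channel.fromPath(params.input)

    PIPELINE_A(input_ch)
    PIPELINE_B(PIPELINE_A.out.results)   // chained through the upstream workflow's emit
}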

pinin4fjords commented 8 months ago

Alternatively, you could write a "meta-pipeline" which imports entire pipelines as modules and chains them together with regular dataflow logic.

This should definitely be a thing. The main blockers on this (in nf-core at least) have been config-based, and @drpatelh's related plans should help.

stevekm commented 8 months ago

Honestly, I am not really a big fan of the idea of writing "meta-pipelines" because then it seems you would have to write one for every combination of pipelines you want to chain together.

I feel like this is the better approach:

I think we need some kind of workflow output schema which can be easily matched to the input schema of a downstream workflow

(which feels related to https://github.com/nextflow-io/nextflow/issues/4670)

An idea floated elsewhere was some mechanism by which you could chain pipelines, something like this:

nextflow run main1.nf -output-schema-stdout ... |  nextflow run main2.nf -input-schema-stdin

The topic of 'pipeline chaining' per se is likely out of scope for this issue and repo; maybe it can be moved somewhere else. But if "named outputs" were available in nf-prov (or elsewhere?) then at least we could more easily hack it together ourselves :)

Feel free to close this issue if you think there's a better place for this discussion, thanks.

bentsherman commented 8 months ago

I see you have commented on https://github.com/nextflow-io/nextflow/issues/4670, so let's move the discussion over there. Your feedback might help us finalize the design of the workflow output schema, which should be the easiest way to chain pipelines.