Flag files for downstream reporting via `publishDir`

ewels commented 1 year ago

New feature

An emerging standard for Nextflow pipelines is a root tower.yml file, used for providing reports to Tower.

A potential alternative is to instead define this metadata as part of publishDir, within the Nextflow config. This has a few advantages:

Removes the need for yet-another-config-file in the repository root
Keeps configuration of published files in a single location, not spread across multiple files
Less Tower-specific, more community friendly

In this location, Nextflow will know about the report status of files during the publish step and could potentially match patterns against actual files created, allowing some kind of metadata with precise file paths + report status to be generated in memory / in some kind of report.

Suggest implementation

My suggestion is to add a new directive: report (int). Non-zero values (or values > 0) could indicate that files should be shown within downstream reporting functionality. The integer value itself could then be used as a weighting factor when sorting that list.

The directive should be paired together with the ability to filter the published files for a given process based on filename / a closure.

Usage scenario

Based on the publishDir config for a process in the nf-core/rnaseq pipeline, syntax / usage could potentially look something like this:

  withName: '.*:BAM_RSEQC:RSEQC_READDISTRIBUTION' {
      publishDir = [
          path: { "${params.outdir}/${params.aligner}/rseqc/read_distribution" },
          mode: params.publish_dir_mode,
          saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
+         report: { filename -> filename ==~ /.*\.pdf/ ? 10 : null }
      ]
  }

Here, any PDF files published by this process would be given a report priority of 10. An integer > 0 indicates that the files should be shown in a report interface, and the value 10 is the weighting score used to sort the list of files there.
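As a rough illustration of the filter-then-sort behaviour described above, here is a minimal plain-Groovy sketch. It is not existing Nextflow functionality: the `report` closure, the extra `.html` rule, the filenames, and the `reportEntries` variable are all made up for the example.

```groovy
// Hypothetical sketch only: nothing here is an existing Nextflow API.
// The closure mimics the proposed `report` option: it returns a positive
// integer weight for files to show, or null for files to leave out.
def report = { String filename ->
    if (filename ==~ /.*\.pdf/) return 10
    if (filename ==~ /.*\.html/) return 5   // extra rule, added for illustration
    return null
}

// Made-up filenames standing in for what a process might publish.
def published = ['versions.yml', 'summary.html', 'read_distribution.pdf']

// Evaluate the closure per file, keep weights > 0, sort by descending weight.
def reportEntries = published
    .collect { name -> [file: name, weight: report(name)] }
    .findAll { it.weight != null && it.weight > 0 }
    .sort { a, b -> b.weight <=> a.weight }

reportEntries.each { println "${it.weight}\t${it.file}" }
```

A list like `reportEntries` is the sort of in-memory metadata that could then be written out to the optional output file discussed next.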

The results of this directive then need to be handled somehow. I expect this to be the most contentious part of this suggestion! My suggestion would be a new optional output file, similar to reports and trace files. This could potentially tie into future efforts for provenance tracking of published files.

evanfloden commented 1 year ago

The file filtering is a nice touch.

Do you envision either the need or the ability to have multiple report: lines within the directive? For example, if a user wanted to have two different reports with different weightings.
Would it be possible to add meta data such as the display name as we do in the tower.yml?

ewels commented 1 year ago

Yes, as publishDir can take a list - see here for an example.
Yes, good point - I guess we need a new directive for that. report_title?

ewels commented 1 year ago

Hmm, I keep saying directive but I guess that these are actually new options for the existing publishDir directive, not new directives. Apologies, but hopefully you understand what I'm saying anyway 🙄

evanfloden commented 1 year ago

Brilliant. So a complete example could be:

publishDir = [
    [
        path: { "${params.outdir}/${params.trimmer}/fastqc" },
        mode: params.publish_dir_mode,
        report: { filename -> filename ==~ /.*\.pdf/ ? 10 : null },
        report_title: { "FASTQC Report" }
    ],
    [
        path: { "${params.outdir}/${params.trimmer}" },
        mode: params.publish_dir_mode,
        report: { filename -> filename ==~ /.*\.tsv/ ? 5 : null },
        report_title: { "Trimmer Gene Counts" }
    ],
]

jordeu commented 1 year ago

The current implementation also supports defining the mimeType in the YAML configuration. In general we don't use it because it is correctly deduced from the file extension.

And maybe in the future we'll have more things... I can imagine options to choose the icon to show, to select a specific viewer for that file, or to pass config parameters to some viewers...

We can keep adding reportTitle, reportMimeType... to the publishDir or we can make report expect a map instead of an int.

Something like:

publishDir = [
      [
          path: { "${params.outdir}/${params.trimmer}/fastqc" },
          mode: params.publish_dir_mode,
          report: { path -> path ==~ /.*\.pdf/ ? [weight: 10, title: "FASTQC Report"] : null }
      ],
      [
          path: { "${params.outdir}/${params.trimmer}" },
          mode: params.publish_dir_mode,
          report: { filename -> filename ==~ /.*\.tsv/ ? [weight: 5, title: "Trimmer Gene Counts", mimeType: "text/plain"] : null }
      ],
]
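If both forms were to be accepted during a transition, a consumer could normalise the closure's return value before using it. The following plain-Groovy sketch assumes a hypothetical helper named `normaliseReport`; it is not part of Nextflow, and the decision to accept bare integers alongside maps is an assumption made only for illustration.

```groovy
// Hypothetical helper, not a Nextflow API: accept either the original
// integer form or the map form of the proposed `report` value.
def normaliseReport(Object value) {
    if (value == null) return null                                  // file is not reported
    if (value instanceof Number) return [weight: value.intValue()]  // legacy int form
    if (value instanceof Map) return value                          // extensible map form
    throw new IllegalArgumentException("Unsupported report value: ${value}")
}

// Closure in the map style shown above.
def report = { String filename ->
    filename ==~ /.*\.tsv/ ? [weight: 5, title: 'Trimmer Gene Counts', mimeType: 'text/plain'] : null
}

println normaliseReport(10)
println normaliseReport(report('counts.tsv'))
println normaliseReport(report('counts.txt'))
```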

ewels commented 1 year ago

Yup, like the idea of a map - much more extensible and clearly associated 👍🏻

maxulysse commented 10 months ago

Could one add files from collectFile to this report too?

pditommaso commented 9 months ago

I believe we have reached the limit of the publishDir model; above all because it was designed for the DSL1 syntax and never worked properly in the DSL2 world.

If you look at the config of the nf-core/rnaseq pipeline, there are more than 1k lines of code, mostly to configure publishDir!

This should be redesigned from scratch in order to get rid of all the configuration boilerplate and, even more, make it possible to define a formal output definition (i.e. schema) both at process and workflow level.

I think the key points should be:

allow the definition of the data type of each process output
decouple the output type definition from the process definition, likely using a module level schema definition
include other metadata in this schema definition, such as: description, file extensions to be captured, report file, etc
allow composing process output schemas into a top-level workflow output schema

pditommaso commented 9 months ago

Looking again at the rnaseq config, most of the code is there to define the sub-directory where the process output should be written.

I think this could be dramatically simplified by reversing the problem. Instead of specifying process by process where the output should be written, I'd like to define an output (directory) tree, listing the processes that contribute to each path e.g.

'genome': { GFFREAD, GTF2BED, GTF_FILTER, .. }
'genome/index': { SALMON_INDEX, KALLISTO_INDEX, .. }

Though, it can still be too verbose. Likely some kind of semantic annotation should be introduced that would allow tagging all processes that need to contribute to a specific path e.g. genome_files, genome_index, etc. This annotation could then be used to (re)map to the target storage path.

Thoughts?

bentsherman commented 9 months ago

It seems there are two ways to think about process output "data types":

the in-memory data type
the output directory structure

For example, a process output that emits a list of files for each task will have an in-memory type of List<Path>, but in the output directory it might just be a subdirectory or glob pattern. You could also think about the file type (i.e. mime type) of these files.

I like the idea of separating these concepts, and defining the output directory structure in terms of the process outputs. I did a similar thing with the annotation API to enable custom types for process inputs:

// process inputs
take 'sample', type: Sample
// file staging
path { sample.files }

And a symmetric approach to enable custom types for process outputs:

// file unstaging
path '$file1', '*.fastq'
// process outputs
emit { new Sample(id, path('$file1')) }, name: 'samples'

Don't worry so much about the syntax, it's just to illustrate how the staging/unstaging of files to/from the task environment is separated from the process inputs/outputs definition in order to enable custom types. Now, the "publishing" of process outputs to the output directory of a workflow run is basically the same thing at a higher level.

What I am imagining is the ability to specify the entire output directory structure of a workflow:

[
  'fastqc': [
    FASTQC.out.html
  ],
  'genome': [
    GFFREAD.out, GTF2BED.out, GTF_FILTER.out, // ...
  ],
  'genome/index': [
    SALMON_INDEX.out, KALLISTO_INDEX.out, // ...
  ],
  'multiqc': [
    MULTIQC.out
  ]
]

Again, just an illustrative syntax. Probably would need to be extended to support metadata and maybe file types. Maybe use a builder syntax instead of a map. Although it would be verbose for a large pipeline, it would be much simpler than the current modules.config approach as seen in nf-core/rnaseq, because there is much less duplicate/boilerplate code.
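To make the "reversed" mapping concrete, here is a small plain-Groovy sketch that inverts such a tree into a per-output destination lookup. It is only a sketch: process outputs are represented as plain strings, because real channel references such as `FASTQC.out.html` only exist inside a workflow, and the `outputTree` and `destinations` names are hypothetical.

```groovy
// Hypothetical sketch: outputs are plain strings standing in for process
// output channels, and none of these names are part of Nextflow.
def outputTree = [
    'fastqc'      : ['FASTQC.out.html'],
    'genome'      : ['GFFREAD.out', 'GTF2BED.out', 'GTF_FILTER.out'],
    'genome/index': ['SALMON_INDEX.out', 'KALLISTO_INDEX.out'],
    'multiqc'     : ['MULTIQC.out'],
]

// Invert the tree: for each output, record the directory it is published to.
def destinations = [:]
outputTree.each { dir, outputs ->
    outputs.each { output -> destinations[output] = dir }
}

destinations.each { output, dir -> println "${output} -> ${dir}" }
```

Each output appears once in the tree, so the directory is stated once per group of outputs rather than once per process.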

The main question I have is where to put it. It probably needs to be configurable separately from the pipeline code, which suggests it should be in the config file. But it also seems to be tied to workflow definitions, and ideally the pipeline output schema would be a composition of the subworkflow schemas.

This makes me think we should take a similar approach to the module config effort:

the output schema for a process or workflow is defined in a module config file alongside the module script
a process output schema isn't useful by itself but can be reused in workflow output schemas
a workflow output schema can reference the outputs of processes that it calls just like in the workflow emit: section

I think this is the right direction, but will need to develop a prototype and iterate on it to find a syntax that is intuitive and meets all of our needs. If we can come up with a comprehensive syntax that can handle the complexity of nf-core/rnaseq (plus the extra metadata requested in this issue), it should be easier from there to build some shorthands for simpler use cases.

pditommaso commented 9 months ago

> It seems there are two ways to think about process output "data types":
>
> the in-memory data type
> the output directory structure

Good point. Though I'd argue the first is related to internal inter-task "communication", while the latter is related to the external workflow output, which should be the focus of the replacement for publishDir.

Likely the first could be generalised to also capture the workflow output, but I fear it could become too complex.

bentsherman commented 9 months ago

I agree I'd rather not try to tackle both at once. Maybe we can design the workflow output schema in a way that doesn't require new functionality in the process output definition.

If we only consider output files, then it should be straightforward. But if we also want to include metadata in the output schema (i.e. val process outputs), I'm not yet sure how to do that. Static metadata like descriptions should be easy. But it sounds like people will want to include things like the meta map in this output schema so that it can be queried by downstream workflows. Since people usually encode metadata in the output file names, maybe we could start with that. I will have to think on it further.

pditommaso commented 9 months ago

it turns out, nf-core people have already done most of the job! 😆

https://github.com/nf-core/modules/blob/master/modules/nf-core/parabricks/fq2bam/meta.yml#L48C8-L77

I think we should build on this, add the missing metadata and "formalise" it as a core spec

ewels commented 9 months ago

hah, yes we have the meta.yml file. We currently mostly auto-generate this file by parsing the Nextflow code for the process. Then the developer adds the descriptions. The original idea when I made it was that at some point in the future (when we have time™️ ) it could be used to create some kind of visual workflow builder. I figured it might be useful for something either way and didn't want to retrofit it for 1000s of modules, so we put the file in place from the beginning. However, it's used for very little at the moment. Possibly just the website docs I think.

It has a few issues as it stands:

At module level, not pipeline
It's specifying output channels, not which files to publish
It's a separate file - not part of the current pipeline or config files

But having it or something like it as part of a solution could be good 👍🏻

pditommaso commented 9 months ago

> At module level, not pipeline

yeah. that's good! the workflow schema will be managed separately

> It's specifying output channels, not which files to publish

But it can be extended to also include the report files, the tags that should be applied to the files, etc

> It's a separate file - not part of the current pipeline or config files

That's good as well!

bentsherman commented 9 months ago

Let's move the discussion of output schema to #4670
