Open rjpbonnal opened 2 years ago
I can see the need you're describing.
I don't know if it's necessary for every module. Does the cat_fastq module need this?
I think this could be brought back with a functions.nf in those modules that you pointed out, this could be useful.
Maybe a POC of one or two of these and some test examples of the output could spark some more interest.
This may also just be pipeline specific for now, you're not going to use cellranger in chipseq for example.
Is your feature request related to a problem? Please describe
Intro
nf-core, with its community, is building a collection of reusable pipelines. With the introduction of DSL2, nf-core created the modules project, which aims to decouple the inner parts of the pipelines, find the common elements, and create small bricks that can be reused as well. Each module is usually associated with a single application of a precise functionality, and generates outputs in the form of Nextflow channels. Channels can be "attached" to files, stdout and, in general, tuples defined in the output block of the module's process. The output channels are used to communicate with the main orchestrator and to pass data to the subsequent steps. It is a best practice to emit the files that contain logs or information about the result of the software, such as statistics, info, logs and metrics. The emitted files/channels then feed software that extracts and collects this information in a centralized manner. The main tool for this task, which also generates a browsable and interactive HTML document, is MultiQC.
Problem
Even though MultiQC is a widespread tool in bioinformatics for collecting and visualizing the metrics from the analyses performed during a workflow, it may be a suboptimal way to extract the information. Moreover, Nextflow's design is built around the dataflow paradigm: information comes out of a process and is digested by other processes like a continuous stream. In this regard, the metrics (info, log, statistics, summary) generated by the software inside the processes are digested mainly (only) by MultiQC, which usually runs at the end of the pipeline. If users want to collect the metrics immediately after they are produced and eventually store them in a database for further analytics, they must post-process the files, either those generated by the single processes or the MultiQC outputs. Another issue arises when the output of a piece of software is not yet supported by MultiQC. In this case the user has two options: 1) contribute to MultiQC by adding the specific plugin, or 2) customize the pipeline and create a separate report. Both solutions are suboptimal. The first is slow and requires the developer to join another community, learn how to develop a plugin and then wait for the contribution to be approved; the second may break the pipeline and can make it hard to follow the nf-core community guidelines (elaborate more on this).
Describe the solution you'd like
Solution
To follow the Nextflow paradigms/approaches and the nf-core guidelines at the same time, while giving the developer more flexibility and freedom in consuming the metrics, I propose adding to each module an output channel which emits the metrics. The name of the channel is metrics. This could be a convention similar to the versions.yml, and the developer has the ability to develop their own way to generate the metrics starting from the software's output. The output format can be tabular or JSON (must be decided in advance). My experiments are based on the long table format, which I prefer for simplicity and because it can be easily provided as input to many database engines. The emitting channel can avoid emitting the header because the number of columns must be the same for ALL the modules. If all the channels emit the header, it must be skipped at collection time to avoid repetitions. The basic columns:
In this example the data type is missing; I don't know how to encode it or which "standard" to use.
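As a rough illustration of the long table idea, here is a minimal Python sketch of what a per-module conversion script might look like. It assumes the tool writes a wide CSV with one row per sample. The column names (sample, process, metric, value), the function names and the example data are all hypothetical, since the actual column list is not reproduced in this issue.

```python
import csv
import io

# Hypothetical column names, used only for illustration.
COLUMNS = ["sample", "process", "metric", "value"]


def wide_to_long(wide_csv_text, sample, process):
    """Turn a wide CSV (one column per metric) into long-format rows."""
    reader = csv.DictReader(io.StringIO(wide_csv_text))
    rows = []
    for record in reader:
        for metric, value in record.items():
            # One output row per metric, always with the same four columns.
            rows.append([sample, process, metric, value])
    return rows


def write_long(rows, header=True):
    """Serialize long-format rows as tab-separated text, header optional."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    if header:
        writer.writerow(COLUMNS)
    writer.writerows(rows)
    return out.getvalue()


# Made-up tool output, not taken from any real software.
example = "total_reads,mapped_reads\n1000,950\n"
rows = wide_to_long(example, sample="S1", process="ALIGN")
print(write_long(rows, header=False), end="")
```

Because every module would emit the same columns, the header flag lets a module drop the header when the collection step does not expect one. Values are kept as strings here, which leaves the data type question open.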
I usually add the nxf_session_id when I collect the channels in the main.nf; eventually nxf_session_id could be added directly inside the process. The nxf_session_id is crucial to connect the Nextflow log with the metrics from the different processes.
My experiments are not at the "module" level. I process the metrics channels in the main.nf; remember that this is a proof of concept. This approach has many disadvantages and does not allow decentralized development by every nf-core module's owner (which is what I aim for).
Ideally each module should have a Python? script for processing its own output, assuming that:
So this is a proposal and I hope to have the chance to discuss more on this.
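To make the collection step more concrete, the following is a minimal Python sketch of merging the metrics tables emitted by several modules. It keeps a single header, skips the repeated ones, and prepends an nxf_session_id column to every row. The function name, the column names and the sample data are hypothetical. In a real pipeline the session id would presumably come from Nextflow's workflow.sessionId, and the merge would more likely happen on the channels in main.nf.

```python
import csv
import io


def collect_metrics(tables, session_id, has_header=True):
    """Merge tab-separated long tables and tag every row with a session id.

    tables: list of strings, one per module output.
    has_header: set to False if the modules emit rows without a header.
    """
    header = None
    merged = []
    for text in tables:
        rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
        if not rows:
            continue  # an empty module output contributes nothing
        if has_header:
            if header is None:
                # Keep the first header only, extended with the session id.
                header = ["nxf_session_id"] + rows[0]
            rows = rows[1:]  # skip the header of every table
        merged.extend([session_id] + row for row in rows)
    return header, merged


# Made-up module outputs with hypothetical columns.
align = "sample\tprocess\tmetric\tvalue\nS1\tALIGN\tmapped_reads\t950\n"
dedup = "sample\tprocess\tmetric\tvalue\nS1\tDEDUP\tduplicates\t12\n"

header, rows = collect_metrics([align, dedup], session_id="abc-123")
print(header)
for row in rows:
    print(row)
```

The resulting rows all share one shape, so they could be bulk-loaded into a database table and joined to the Nextflow log on the session id.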
Describe alternatives you've considered
Examples
ChIP-seq
An example of parsing the results from CellRanger output
Additional context
No response