nick-youngblut opened 1 year ago
As another example, what is the best practice for aggregating multiple sets of files? For example:
per-file job:

```nextflow
input:
tuple val(sample_id), path("file.txt")

output:
path "${sample_id}_output1.txt", emit: output1
path "${sample_id}_output2.txt", emit: output2
```
For one output channel, one could just do the following:
```nextflow
process SUMMARY {
    input:
    path "*"

    output:
    path "summary.txt"

    script:
    """
    # parse {sample_id} back out of each staged file name, then aggregate,
    # keeping the id as a column
    for f in *_output1.txt; do
        sample_id=\$(basename "\$f" _output1.txt)
        awk -v id="\$sample_id" -v OFS='\t' '{print id, \$0}' "\$f"
    done > summary.txt
    """
}
```
...but for multiple output channels, the following doesn't work:
```nextflow
process SUMMARY {
    input:
    path "*"
    path "*"

    output:
    path "summary.txt"

    script:
    """
    # for output1:
    # [somehow get {sample_id} from the file path, and then aggregate the files, while including {sample_id}]
    # for output2:
    # [somehow get {sample_id} from the file path, and then aggregate the files, while including {sample_id}]
    """
}
```
Must one create a `SUMMARY` process for each set of files (each path channel) emitted from the per-file process (e.g., `SUMMARY_OUTPUT1` and `SUMMARY_OUTPUT2`), or is there a better, more scalable and maintainable approach?
Assuming that one must pass metadata values via the file paths when aggregating multiple files in Nextflow, I created a script that aggregates files (it assumes structured table files), extracts values from the input file names (e.g., the sample name from files named `{sample_name}.tsv`), and adds those extracted values as extra columns in the aggregated output table. I reused the same Python code that I use in Snakemake for this purpose.
Is this the "best" way of aggregating files and associated file metadata?
Currently, there is no example of aggregating files AND their associated metadata. For instance, in many (if not most) nf-core pipelines the process outputs look something like:
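A typical output declaration in that style, where `meta` is the usual nf-core map of sample metadata, looks roughly like (paraphrased):

```nextflow
output:
tuple val(meta), path("file.txt"), emit: results
```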
...but what if one wants to aggregate all of the `file.txt` outputs into one table AND include the `meta` metadata in that output table? As far as I can tell from scouring the Nextflow Slack channel, one must "embed" the metadata in the file paths and then parse the file paths in the aggregation step. For example:
Per-file process:
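A sketch of such a process, embedding `meta.id` in the output file name (the `my_tool` command and file names are illustrative):

```nextflow
process PER_FILE {
    input:
    tuple val(meta), path("file.txt")

    output:
    // metadata (meta.id) embedded in the output file name
    path "${meta.id}.tsv", emit: results

    script:
    """
    my_tool file.txt > ${meta.id}.tsv
    """
}
```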
Aggregation process:
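And a matching aggregation sketch that parses the sample id back out of each staged file name (again illustrative; header handling omitted for brevity):

```nextflow
process AGGREGATE {
    input:
    path "inputs/*"    // all per-sample tables, staged under inputs/

    output:
    path "summary.tsv"

    script:
    """
    # recover each sample id by stripping the ".tsv" suffix from the file
    # name, then prepend it as a column
    for f in inputs/*.tsv; do
        id=\$(basename "\$f" .tsv)
        awk -v id="\$id" -v OFS='\t' '{print id, \$0}' "\$f"
    done > summary.tsv
    """
}
```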
Is there a better way, especially given the substantial limitations of trying to embed metadata into a file path (e.g., dealing with multiple values and special characters in the metadata values)?
I'm sure a lot of pipeline developers would like a best-practices example of how to handle this situation (without having to decipher how `meta` is dealt with in the aggregation steps of nf-core pipelines).