Closed mahesh-panchal closed 1 year ago
Potential alternative: https://github.com/nextflow-io/nextflow/issues/2527
Awesome summary!
If we just include tests for the edge cases as a form of documentation of how best to use the modules, I think those are fine solutions.
Maybe a README in these special modules' directory as well, with some further explanation? Possibly we put them under subworkflows/ to easily differentiate them from the "pure" modules that have a single process. This accomplishes the same thing as a subworkflow; it's just using a process instead of a workflow to get around Nextflow's submission limitations.
Hi there!
We’ve noticed there hasn’t been much activity here. Are you still planning on working on this? If not, you can ignore this message and we’ll close your issue in about 2 weeks. If you think this is still relevant, you can also add it to the hackathon2023 project board.
Cheers the nf-core maintainers
Is your feature request related to a problem? Please describe
Subworkflows are currently composed only of modules and other subworkflows. This is a nice design for easily composing smaller workflows. However, some subworkflows launch short-running tasks, and running these as separate processes is computationally inefficient on both HPC and cloud systems (unnecessary resources spent on scheduling, early resource release, input and output over the network between processes, etc.).
Describe the solution you'd like
Allow a subworkflow to be optimized and replaced with a single process definition when the processes in it are short-running (< 5 mins with a normal-sized data set).
For example, the subworkflow
becomes:
Computationally, everything runs on the same compute resources: no scheduling is required between processes and no extra files are staged between them, saving both time and resources.
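As a rough illustration of the transformation, here is a minimal sketch in Nextflow DSL2. The tool names (tool_a, tool_b), the process and workflow names, and the file suffixes are hypothetical placeholders, not the example from the original request; ext.args and ext.args2 are assumed to follow the usual nf-core convention.

```nextflow
// Hypothetical sketch: tool_a / tool_b and all names below are placeholders.

// Before: two short-running processes chained in a subworkflow.
// Every sample is scheduled twice and the intermediate file is staged between tasks.
process TOOL_A {
    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("*.a.txt"), emit: result

    script:
    def args   = task.ext.args ?: ''
    def prefix = task.ext.prefix ?: "${meta.id}"
    """
    tool_a $args $reads > ${prefix}.a.txt
    """
}

process TOOL_B {
    input:
    tuple val(meta), path(a_out)

    output:
    tuple val(meta), path("*.b.txt"), emit: result

    script:
    def args   = task.ext.args ?: ''
    def prefix = task.ext.prefix ?: "${meta.id}"
    """
    tool_b $args $a_out > ${prefix}.b.txt
    """
}

workflow TOOL_A_TOOL_B {
    take:
    ch_reads                      // channel: [ val(meta), path(reads) ]

    main:
    TOOL_A(ch_reads)
    TOOL_B(TOOL_A.out.result)

    emit:
    result = TOOL_B.out.result
}

// After: a single process running both tools in one task.
// The intermediate file never leaves the task work directory.
process TOOL_A_TOOL_B_MERGED {
    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("*.b.txt"), emit: result

    script:
    def args   = task.ext.args  ?: ''   // options for tool_a
    def args2  = task.ext.args2 ?: ''   // options for tool_b
    def prefix = task.ext.prefix ?: "${meta.id}"
    """
    tool_a $args  $reads           > ${prefix}.a.txt
    tool_b $args2 ${prefix}.a.txt  > ${prefix}.b.txt
    """
}
```

Both versions take the same input channel and emit the same `result` channel, so a calling workflow could swap one for the other; only the configuration selectors (one process name instead of two) would need to change.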
The workflow block generally remains unchanged. Only configuration needs to be updated if a subworkflow is replaced with a process. This may even make configuration files smaller.
All the ext.prefix and ext.argsX should be annotated with which tool they're supplying data to; however, I think a better solution would be to be more explicit with names, e.g.
Edge case
An edge case that isn't covered by the current system is running the same short-running process on multiple files. Currently workflows will spawn a task for each file, since modules are atomic. Subworkflows cannot improve on this design, but a process can. One can provide TOOL_ITERATOR processes (under the subworkflows folder, or a better-named folder that reflects the additional optimization going on) which apply the tool to a collection of files, e.g., input from ch.collect(), ch.buffer(), ch.collate(), ch.groupTuple(), or ch.collectFile().
Overall, this leads to better-practice workflows that are not only easy to compose, but computationally more efficient too.
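A TOOL_ITERATOR process might look roughly like the following sketch. The command name (tool), the params.input glob, and the batch size of 50 are hypothetical assumptions for illustration; the loop also assumes file names without spaces.

```nextflow
// Hypothetical sketch: `tool`, params.input and the batch size are placeholders.
params.input = 'data/*.txt'

process TOOL_ITERATOR {
    input:
    path(files)                   // one or more files staged together into a single task

    output:
    path("*.out.txt"), emit: results

    script:
    def args = task.ext.args ?: ''
    """
    # A list of staged files is rendered as a space-separated string by Nextflow
    for f in $files; do
        tool $args "\$f" > "\$f.out.txt"
    done
    """
}

workflow {
    ch_files = Channel.fromPath(params.input)

    // One task per batch of up to 50 files instead of one task per file.
    // ch_files.collect() would instead hand every file to a single task.
    TOOL_ITERATOR(ch_files.collate(50))
}
```

The choice of operator controls the trade-off: collect() minimizes scheduling overhead but gives up parallelism, while collate() or buffer() keep several tasks in flight with fewer submissions than one task per file.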
Describe alternatives you've considered
An alternative that is often suggested is requesting process grouping, like Snakemake's (https://snakemake.readthedocs.io/en/stable/executing/grouping.html), from Nextflow. However, given how Nextflow is programmed, this is also computationally inefficient, as files are still staged in and out of the workdir, whereas the above would avoid all of that (and is easier to implement).