nf-core / modules

Repository to host tool-specific module files for the Nextflow DSL2 community!
https://nf-co.re/modules
MIT License

[FEATURE] Improve computational efficiency of subworkflows #1179

Closed · mahesh-panchal closed 1 year ago

mahesh-panchal commented 2 years ago

Is your feature request related to a problem? Please describe

Subworkflows are currently composed only of modules and other subworkflows. This is a nice design for easily composing smaller workflows. However, some subworkflows launch short-running tasks, which are computationally inefficient when run as separate processes on both HPC and cloud systems (unnecessary resources spent on scheduling, early resource release, input and output transferred over the network between processes, etc.).

Describe the solution you'd like

Allow a subworkflow to be optimized and replaced with a single process definition when the subworkflow contains short-running processes (< 5 min with a normal-sized data set).

For example, the subworkflow

workflow BAM_STATS_SAMTOOLS {
    take:
    ch_bam_bai // channel: [ val(meta), [ bam ], [bai/csi] ]

    main:
    ch_versions = Channel.empty()

    SAMTOOLS_STATS ( ch_bam_bai, [] )
    ch_versions = ch_versions.mix(SAMTOOLS_STATS.out.versions.first())

    SAMTOOLS_FLAGSTAT ( ch_bam_bai )
    ch_versions = ch_versions.mix(SAMTOOLS_FLAGSTAT.out.versions.first())

    SAMTOOLS_IDXSTATS ( ch_bam_bai )
    ch_versions = ch_versions.mix(SAMTOOLS_IDXSTATS.out.versions.first())

    emit:
    stats    = SAMTOOLS_STATS.out.stats       // channel: [ val(meta), [ stats ] ]
    flagstat = SAMTOOLS_FLAGSTAT.out.flagstat // channel: [ val(meta), [ flagstat ] ]
    idxstats = SAMTOOLS_IDXSTATS.out.idxstats // channel: [ val(meta), [ idxstats ] ]

    versions = ch_versions                    // channel: [ versions.yml ]
}

becomes:

process BAM_STATS_SAMTOOLS {

    // conda + container stuff

    input:
    tuple val(meta), path(bam), path(index)

    output:
    tuple val(meta), path("*.stats")   , emit: stats
    tuple val(meta), path("*.flagstat"), emit: flagstat
    tuple val(meta), path("*.idxstats"), emit: idxstats
    path "versions.yml"                , emit: versions

    script:
    // samtools stats opts
    def args    = task.ext.args   ?: ''
    def prefix  = task.ext.prefix ?: meta.id
    // samtools flagstat opts
    def args2   = task.ext.args2   ?: ''
    def prefix2 = task.ext.prefix2 ?: meta.id
    // samtools idxstats opts
    def args3   = task.ext.args3   ?: ''
    def prefix3 = task.ext.prefix3 ?: meta.id
    """
    samtools stats --threads $task.cpus $args $bam > ${prefix}.stats
    samtools flagstat --threads $task.cpus $args2 $bam > ${prefix2}.flagstat
    samtools idxstats $args3 $bam > ${prefix3}.idxstats

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        samtools: \$(samtools --version |& sed 's/^.*samtools //; s/Using.*\$//')
    END_VERSIONS
    """

}

Computationally, everything runs on the same compute resources: no scheduling is required between processes and no extra files are staged between processes, saving both time and resources.

The workflow block generally remains unchanged; only the configuration needs to be updated if a subworkflow is replaced with a process. This may even make configuration files smaller.
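
For instance, the call site stays the same whichever form is used (a minimal sketch; as in the subworkflow's take block, ch_bam_bai is assumed to carry [ val(meta), bam, bai ] tuples):

// Identical invocation whether BAM_STATS_SAMTOOLS is the subworkflow or the fused process
BAM_STATS_SAMTOOLS ( ch_bam_bai )

// The emitted channel names are also unchanged
BAM_STATS_SAMTOOLS.out.stats.view()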

Each ext.prefix and ext.argsX should be annotated with the tool it supplies options to; however, I think a better solution would be to use more explicit names, e.g.

ext.prefix -> ext.samtools_stats_prefix
ext.args   -> ext.samtools_stats_args
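
As a configuration sketch under this naming scheme (assuming the fused process reads the renamed task.ext keys; the option values here are purely illustrative):

process {
    withName: 'BAM_STATS_SAMTOOLS' {
        // options for the samtools stats step only
        ext.samtools_stats_args   = '--remove-dups'
        ext.samtools_stats_prefix = { "${meta.id}.rmdup" }
        // options for the samtools flagstat step only
        ext.samtools_flagstat_args = ''
    }
}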

Edge case

An edge case not covered by the current system is running the same short-running process on multiple files. Currently, workflows spawn a task for each file since modules are atomic. Subworkflows cannot improve on this design, but a process can. One could provide TOOL_ITERATOR processes (under the subworkflows folder, or a better-named folder reflecting the additional optimization going on) which apply the tool to a collection of files, e.g., input from ch.collect(), ch.buffer(), ch.collate(), ch.groupTuple(), or ch.collectFile().

process GUNZIP_ITERATOR {

    input:
    path archives // list of archives from collect(), buffer(), or a similar channel operator

    output:
    path "*", emit: files // decompressed files; staged input files are excluded from outputs by default

    script:
    // The tool is parallelized across the list of files using `xargs` ( or `parallel` if present in the container )
    """
    printf "%s\\n" $archives | \\
        xargs -P $task.cpus -I {} \\
        bash -c 'gzip -cdf {} > \$( basename {} .gz )'
    """
}
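
A sketch of how such an iterator might be invoked (params.archives is a hypothetical glob of gzipped inputs):

workflow {
    ch_archives = Channel.fromPath( params.archives )

    // One task decompresses the whole batch instead of one task per file
    GUNZIP_ITERATOR ( ch_archives.collect() )
}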

Overall, this leads to better-practice workflows that are not only easy to compose but also computationally more efficient.

Describe alternatives you've considered

An alternative that is often suggested is adding process grouping like Snakemake's (https://snakemake.readthedocs.io/en/stable/executing/grouping.html) to Nextflow. However, given how Nextflow is implemented, this is also computationally inefficient since files are still staged in and out of the work directory, whereas the approach above avoids all of that (and is easier to implement).

mahesh-panchal commented 2 years ago

Potential alternative: https://github.com/nextflow-io/nextflow/issues/2527

edmundmiller commented 2 years ago

Awesome summary!

I think if we just include tests for the edge cases, as a form of documentation of how best to use the modules, those are fine solutions.

Maybe a README in these special modules' directories as well, with some further explanation? Possibly we put them under subworkflows/ to easily differentiate them from the "pure" modules that have a single process. Since this accomplishes the same thing as a subworkflow, it's just using a process instead of a workflow to get around Nextflow's submission limitations.

jasmezz commented 1 year ago

Hi there!

We’ve noticed there hasn’t been much activity here. Are you still planning on working on this? If not, you can ignore this message and we’ll close your issue in about 2 weeks. If you think this is still relevant, you can also add it to the hackathon2023 project board.

Cheers, the nf-core maintainers