lcdb / lcdb-workflows

DEPRECATED. Please see https://github.com/lcdb/lcdb-wf

Rules, Workflows oh my! #7

Open -- jfear opened this issue 8 years ago

jfear commented 8 years ago

The big question is how to organize rules/workflows.

My pipeline had the hierarchy:

snakefile
    - contains sample names and basic organization
    - imports workflows

workflow
    - contains the processing logic
    - sets up the output file naming scheme
    - imports rules and scripts

rules
    - each rule in a separate file
    - could import scripts

scripts
    - various Python scripts/classes

I like this setup, but we can debate it. The big question is whether we should separate the snakefile from the workflows. I like having the top-level snakefile because it allows project-specific pre-processing and hides the guts from the user. However, maybe this is not really an issue if the workflows are super clean and readable.
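
For concreteness, here is a minimal sketch of what such a top-level snakefile could look like, assuming workflows are pulled in with Snakemake's include: directive; the file names and config keys are hypothetical:

    # Hypothetical top-level Snakefile: project-specific setup, then hand
    # off to the workflows. File names and config keys are illustrative.
    configfile: "config.yaml"

    # Project-specific pre-processing, e.g. deriving the sample list.
    SAMPLES = config["samples"]

    # Pull in the workflow(s) that hold the actual processing logic.
    include: "workflows/rnaseq.snakefile"

    # The default target collects the final outputs per sample.
    rule all:
        input: expand("output/{sample}.done", sample=SAMPLES)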

daler commented 8 years ago

I like the one workflow to rule them all.

Rules that just call an external tool -- star, fastqc, whatever -- should, I think, be handled by wrappers. That's the equivalent of one rule per file, but with the added benefits of documentation and isolated environments (once the latter gets support from snakemake). Workflows then become just a series of input/output/params configurations, with most rules looking like:

rule one:
    input: '{sample}.{suffix}'
    output: '{sample}.{suffix}.{newsuffix}'
    params: dict_of_params
    wrapper:
        wrapper_for('tool name')
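
Here wrapper_for is not a built-in; as a rough sketch, it could simply map a tool name onto a local wrapper path, since Snakemake's wrapper: directive accepts file:// URIs (the wrappers/ layout assumed here matches the tree below):

    import os

    def wrapper_for(tool_name):
        # Hypothetical helper: resolve a tool name to the local wrapper
        # directory. Assumes the workflow is run from the repository root.
        return "file://" + os.path.join(os.getcwd(), "wrappers", tool_name)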

Some rules won't fit cleanly into wrappers -- things like aggregation or building bigBeds from RNA-seq results. At this point I would guess that these kinds of rules aren't that common, and they could be simple enough to live in the workflow rather than one rule per file. Alternatively, they can be factored out into an externally called script, so the complicated stuff stays hidden in the script and the rule remains simply defined inside the workflow, like this (only the last two lines change):

rule two:
    input: '{sample}.{suffix}'
    output: '{sample}.{suffix}.{newsuffix}'
    params: dict_of_params
    script:
        script_for('task')
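
When a rule uses script:, Snakemake injects a snakemake object into the script, carrying the rule's input, output, and params. A hypothetical aggregation script along those lines (the pandas-based merging is made up for illustration):

    # Hypothetical workflows/scripts/aggregate_counts.py, called via script:.
    # The `snakemake` object is provided automatically by Snakemake.
    import pandas as pd

    # Read one counts table per sample and join them into a single matrix.
    tables = [pd.read_table(f, index_col=0) for f in snakemake.input]
    merged = pd.concat(tables, axis=1)
    merged.to_csv(snakemake.output[0], sep="\t")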

The benefit of this is that we don't need additional infrastructure for pulling in individual rules and dealing with the hierarchical overrides of parameter options. The directory structure stays relatively flat:

config.yaml
Snakefile

workflows/
   rnaseq.snakefile
   qc.snakefile
   references.snakefile
   ...

   scripts/
      aggregate_counts.py
      bigbeds_from_rnaseq.py
      ...

   wrappers/
      fastqc/
      samtools/
      ...

Well, scripts might need one level of nesting, but it wouldn't be too bad. Anyway, I think we should consider your last point to be a guiding principle -- workflows should be super clean and readable.