Open jfear opened 8 years ago
I like the one workflow to rule them all.
Rules that just call an external tool -- star, fastqc, whatever -- should, I think, be handled by wrappers. That's the equivalent of one rule per file, but with the added benefit of documentation and isolated environments (once the latter gets support from snakemake). Workflows then are just a series of rules that configure input/output/params, with most rules looking like:
    rule one:
        input: '{sample}.{suffix}'
        output: '{sample}.{suffix}.{newsuffix}'
        params: dict_of_params
        wrapper:
            wrapper_for('tool name')
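For what it's worth, here is a minimal sketch of what a helper like `wrapper_for` could look like. It assumes wrappers live in a local `wrappers/` directory, one subdirectory per tool, and that the `wrapper:` directive is given a `file://` URI pointing at that directory. The `WRAPPER_DIR` constant and the signature are made up for illustration, not an existing API.

```python
from pathlib import Path

# Assumption: wrappers sit in ./wrappers/<tool>/ relative to the working directory.
WRAPPER_DIR = Path("wrappers")


def wrapper_for(tool, wrapper_dir=WRAPPER_DIR):
    """Return a file:// URI for the local wrapper directory of `tool`.

    Hypothetical helper: it only builds the URI and does not check
    that the wrapper directory actually exists.
    """
    path = Path(wrapper_dir) / tool
    # resolve() makes the path absolute, which as_uri() requires.
    return path.resolve().as_uri()
```

With something like this, `wrapper_for('fastqc')` would give a URI ending in `wrappers/fastqc`, so the rule itself never hard-codes where wrappers are stored.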
Some rules won't fit cleanly into wrappers -- things like aggregation or building bigBeds from RNA-seq results. At this point I would guess that these kinds of rules aren't that common, and could be simple enough that they live in the workflow rather than one rule per file. Alternatively they can be factored out into an externally called script, so the more complicated stuff remains hidden in the script and the rule remains simply defined inside the workflow, like this (only the last two lines change):
    rule two:
        input: '{sample}.{suffix}'
        output: '{sample}.{suffix}.{newsuffix}'
        params: dict_of_params
        script:
            script_for('task')
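To make the "complicated stuff hidden in the script" idea concrete, here is a rough sketch of what an aggregation script such as `scripts/aggregate_counts.py` might contain. The input format (tab-separated `gene<TAB>count` files, one per sample, with the sample name taken from the file name) is purely an assumption for illustration. The only Snakemake-specific part is the global `snakemake` object that Snakemake provides to scripts run through the `script:` directive.

```python
import csv
from pathlib import Path


def aggregate_counts(count_files, out_path):
    """Merge per-sample count files into one gene-by-sample table.

    Assumed (hypothetical) input format: two tab-separated columns,
    gene and integer count, with no header line.
    """
    samples = []
    table = {}  # gene -> {sample: count}
    for path in count_files:
        sample = Path(path).stem  # sample name taken from the file name
        samples.append(sample)
        with open(path, newline="") as handle:
            for gene, count in csv.reader(handle, delimiter="\t"):
                table.setdefault(gene, {})[sample] = int(count)

    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["gene"] + samples)
        for gene in sorted(table):
            # Genes missing from a sample are reported as 0.
            writer.writerow([gene] + [table[gene].get(s, 0) for s in samples])


# When run via `script:`, Snakemake injects a global `snakemake` object.
# The guard keeps the file importable outside of Snakemake.
if "snakemake" in globals():
    aggregate_counts(snakemake.input, snakemake.output[0])
```

The rule in the workflow stays a plain input/output/params declaration, and the function can still be imported and tested on its own.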
The benefit of this is that we don't need additional infrastructure for pulling in individual rules and dealing with the hierarchical overrides of parameter options. The directory structure stays relatively flat:
    config.yaml
    Snakefile
    workflows/
        rnaseq.snakefile
        qc.snakefile
        references.snakefile
        ...
    scripts/
        aggregate_counts.py
        bigbeds_from_rnaseq.py
        ...
    wrappers/
        fastqc/
        samtools/
        ...
Well, scripts might need one level of nesting, but it wouldn't be too bad. Anyway, I think we should consider your last point to be a guiding principle -- workflows should be super clean and readable.
The big question is how to organize rules/workflows.
My pipeline had the hierarchy:
I like this setup, but we can debate it. The big question is whether we should separate the Snakefile from the workflows. I like having the top-level Snakefile, because you can then do project-specific pre-processing and it hides the guts from the user. However, maybe this is not really an issue if the workflows are super clean and readable.