lcdb / lcdb-workflows

DEPRECATED. Please see https://github.com/lcdb/lcdb-wf

Rules, Workflows oh my! #7

Open -- jfear opened this issue 8 years ago

jfear commented 8 years ago

The big question is how to organize rules/workflows.

My pipeline had the hierarchy:

snakefile
    - contains sample names and basic organization
    - imports workflows

workflow
    - contains the processing logic
    - sets up the output file naming scheme
    - imports rules and scripts

rules
    - each rule in a separate file
    - could import scripts

scripts
    - various Python scripts/classes

I like this setup, but we can debate it. The big question is whether we should separate the snakefile from the workflows. I like having the top-level snakefile because it allows project-specific pre-processing and hides the guts from the user. However, maybe this is not really an issue if the workflows are super clean and readable.
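
For concreteness, here is a minimal sketch of what such a top-level snakefile could look like, assuming workflows are pulled in with Snakemake's include: directive; the file names and config keys are hypothetical:

    # Hypothetical top-level Snakefile: project-specific setup, then hand
    # off to the workflows. File names and config keys are illustrative.
    configfile: "config.yaml"

    # Project-specific pre-processing, e.g. deriving the sample list.
    SAMPLES = config["samples"]

    # Pull in the workflow(s) that hold the actual processing logic.
    include: "workflows/rnaseq.snakefile"

    # The default target collects the final outputs per sample.
    rule all:
        input: expand("output/{sample}.done", sample=SAMPLES)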

daler commented 8 years ago

I like the one workflow to rule them all.

Rules that just call an external tool -- star, fastqc, whatever -- should, I think, be handled by wrappers. That's the equivalent of one rule per file, but with the added benefits of documentation and isolated environments (once the latter gets support from snakemake). Workflows then become just a series of input/output/params configurations, with most rules looking like:

rule one:
    input: '{sample}.{suffix}'
    output: '{sample}.{suffix}.{newsuffix}'
    params: dict_of_params
    wrapper:
        wrapper_for('tool name')
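
Here wrapper_for is not a built-in; as a rough sketch, it could simply map a tool name onto a local wrapper path, since Snakemake's wrapper: directive accepts file:// URIs (the wrappers/ layout assumed here matches the tree below):

    import os

    def wrapper_for(tool_name):
        # Hypothetical helper: resolve a tool name to the local wrapper
        # directory. Assumes the workflow is run from the repository root.
        return "file://" + os.path.join(os.getcwd(), "wrappers", tool_name)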

Some rules won't fit cleanly into wrappers -- things like aggregation or building bigBeds from RNA-seq results. At this point I would guess that these kinds of rules aren't that common, and they could be simple enough to live in the workflow rather than one rule per file. Alternatively, they can be factored out into an externally called script, so the complicated stuff stays hidden in the script and the rule remains simply defined inside the workflow, like this (only the last two lines change):

rule two:
    input: '{sample}.{suffix}'
    output: '{sample}.{suffix}.{newsuffix}'
    params: dict_of_params
    script:
        script_for('task')
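
When a rule uses script:, Snakemake injects a snakemake object into the script, carrying the rule's input, output, and params. A hypothetical aggregation script along those lines (the pandas-based merging is made up for illustration):

    # Hypothetical workflows/scripts/aggregate_counts.py, called via script:.
    # The `snakemake` object is provided automatically by Snakemake.
    import pandas as pd

    # Read one counts table per sample and join them into a single matrix.
    tables = [pd.read_table(f, index_col=0) for f in snakemake.input]
    merged = pd.concat(tables, axis=1)
    merged.to_csv(snakemake.output[0], sep="\t")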

The benefit of this is that we don't need additional infrastructure for pulling in individual rules and dealing with the hierarchical overrides of parameter options. The directory structure stays relatively flat:

config.yaml
Snakefile

workflows/
   rnaseq.snakefile
   qc.snakefile
   references.snakefile
   ...

   scripts/
      aggregate_counts.py
      bigbeds_from_rnaseq.py
      ...

   wrappers/
      fastqc/
      samtools/
      ...

Well, scripts might need one level of nesting, but it wouldn't be too bad. Anyway, I think we should consider your last point to be a guiding principle -- workflows should be super clean and readable.