lcdb / lcdb-workflows

DEPRECATED. Please see https://github.com/lcdb/lcdb-wf
MIT License

Config System YAML #9

Open jfear opened 8 years ago

jfear commented 8 years ago

Starting to think about the config system YAML.

#### General Settings ####
settings: # location of system level settings
    title: My Very Cool Project
    Author: Bob
    data: /data/bob/original_data # path like settings
    python2: py2.7 # conda environment names
    env: HOME # names to access specific envs by 

#### Experiment Level Settings ####
exp.settings: # experiment-level settings that apply to all samples
    sampleinfo: sample_metadata.csv # Sample information relating sample specific settings to sample ids
    fastq_suffix: '.fastq.gz'  # it would be nice to be able to define a setting here that applies to all samples, or define it per sample in the sampleinfo table in case they differ.
    annotation: # Need some way to specify which annotation to use; maybe here is not the best place.
        genic: /data/...
        transcript: /data/....
        intergenic: /data/...
    models: # add modeling information here
        formula: ~ sex + tissue + time
        factors: # tell which columns in sample table should be treated like factors
             - sex
             - tissue
             - time

#### Workflow Settings ####
# I think using a naming scheme that follows the folder structure would be useful. For example:
# if there is a workflows folder then we would have
workflows.qc: # could define workflow specific settings
    steps_to_run: # List pieces of the pipeline to run (or maybe better, the pieces to skip)
        - fastqc
        - rseqc
    trim: True # or could have logical switches to change workflow behavior

workflows.align:
    aligner: 'tophat2=2.1.0' # define what software to use and optionally what version
    aggregated_output_dir: /data/...
    report_output_dir: /data/...

workflows.rnaseq: ...

workflows.references: ... 

#### Rule Specific Settings ####
rules.align.bowtie2: # rule-level settings, again named after the folder structure if we need folder structure
    cluster: # It would be nice to keep cluster settings alongside rule settings; can't think of a way to get this to work, probably just need a separate cluster config.
        threads: 16
        mem: 60g
        walltime: 8:00:00
    index: /data/... # bowtie index prefix
    params: # Access to any parameters that need to be set
        options: -p 16 -k 8 # place to change the options
    aln_suffix: '.bt2.bam'  # place to change how files are named
    log_suffix: '.bt2.log'
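The `sampleinfo` table above is the piece that ties sample-specific settings to sample ids. A minimal sketch of reading it with the stdlib `csv` module; the column names here are assumptions, since the thread never pins them down:

```python
import csv
import io

# Hypothetical contents of sample_metadata.csv; the real column names
# are not fixed anywhere in this thread.
SAMPLEINFO = """\
sampleid,sex,tissue,time
s1,M,brain,0
s2,F,brain,0
s3,M,gut,12
"""

def read_sampleinfo(handle):
    """Return a list of per-sample dicts keyed by column name."""
    return list(csv.DictReader(handle))

samples = read_sampleinfo(io.StringIO(SAMPLEINFO))
sample_ids = [row["sampleid"] for row in samples]
```

Each row dict could then be merged with the experiment-level defaults, so a per-sample `fastq_suffix` column would override the global one when present.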
daler commented 8 years ago

I was thinking about this some more. I tried #10 as a way of using the github code review tools to help discussion, but figured I'd just post here.

I really like having the One True Config split by workflows. I made some mostly organizational changes to what you have above:

global:
  title: My Very Cool Project
  Author: Bob
  assembly: dm6
  sampleinfo: sample_metadata.csv

workflows.qc:
  rules.trim:
    adapters: adapters.fa
    extra: "-q 20"

workflows.align:
  rules:
    align:
      aligner: 'bowtie==2.0.2'
      index: /data/...
      cluster:
        threads: 16
        mem: 60g
        walltime: 8:00:00
      aln_suffix: '.bt2.bam'
      log_suffix: '.bt2.log'
      extra: "-p {threads} -k 8"

workflows.rnaseq:
  factors:
    - sex
    - tissue
    - time
  models:
    full_model: ~ sex + tissue + time
    reduced_1: ~ sex + tissue

  rules:

    featurecounts:
      annotation: /data/gene.gtf
      extra: "-s 1"

    featurecounts_intergenic:
      annotation: /data/intergenic.gtf

config lookups

Specifying so much in the config will let us write some pretty generic workflows where input, output and params are basically just a ton of config dict lookups.

rule align:
    input:
        index=config['workflows.align']['rules']['align']['index']
    threads: config['workflows.align']['rules']['align']['cluster']['threads']
   ...

Some options to think about: if we wrap the config in an object with dotted access, then it becomes slightly more readable:

rule align:
    input:
        index=config.workflows_align.rules.align.index
    threads: config.workflows_align.rules.align.cluster.threads
   ...
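A minimal sketch of what such a wrapper could look like, assuming we wrap (rather than replace) the config dict; `DotConfig` and the underscored key are hypothetical names:

```python
class DotConfig:
    """Read-only attribute access over a nested dict; nested dicts are
    wrapped on the fly so lookups chain, e.g. cfg.rules.align.index."""

    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        try:
            value = self._data[name]
        except KeyError:
            raise AttributeError(name)
        # Wrap nested dicts so attribute access keeps chaining.
        return DotConfig(value) if isinstance(value, dict) else value

config = {"workflows_align": {"rules": {"align": {"index": "/data/idx"}}}}
cfg = DotConfig(config)
index = cfg.workflows_align.rules.align.index
```

Note the original dict is untouched; the wrapper is just a view over it.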

Or syntax like the conda_build Metadata object,

rule align:
    input:
        index=config.get('workflows.align/rules/align/index')
    threads: config.get('workflows.align/rules/align/cluster/threads')
   ...
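A sketch of that slash-path `get` (modeled loosely on conda-build's metadata lookups; the function name is an assumption):

```python
def get_path(config, path, default=None):
    """Look up 'a/b/c' in nested dicts, returning default when any
    component along the way is missing."""
    node = config
    for part in path.split("/"):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

config = {"workflows.align": {"rules": {"align": {"cluster": {"threads": 16}}}}}
threads = get_path(config, "workflows.align/rules/align/cluster/threads")
```

One nice side effect of slash separators: top-level keys that themselves contain dots, like `workflows.align`, survive as single path components instead of being split apart.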

cluster config

I really like having the cluster config specified here alongside the rule. It could work if we provide a wrapper for calling snakemake that passes most arguments through, but extracts the cluster config info from the config file and builds a temporary cluster_config.yaml to pass to snakemake.

The threads configured here can be injected into the rules at the end of the workflow by modifying [rule.threads for rule in workflow.rules]
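The extraction step might look something like this sketch, which pulls each rule's `cluster:` section into the flat rule-name-keyed mapping a cluster config expects (the nesting layout follows the reorganized YAML above and is otherwise an assumption):

```python
def extract_cluster_config(config):
    """Collect every rules.<name>.cluster section into one flat dict
    keyed by rule name, suitable for dumping to cluster_config.yaml."""
    cluster = {}
    for section in config.values():
        if not isinstance(section, dict):
            continue
        for rule_name, rule_cfg in section.get("rules", {}).items():
            if isinstance(rule_cfg, dict) and "cluster" in rule_cfg:
                cluster[rule_name] = rule_cfg["cluster"]
    return cluster

config = {
    "workflows.align": {
        "rules": {
            "align": {
                "cluster": {"threads": 16, "mem": "60g", "walltime": "8:00:00"},
                "index": "/data/idx",
            }
        }
    }
}
cluster_config = extract_cluster_config(config)
```

The wrapper would then dump `cluster_config` to a tempfile and hand that path to snakemake's cluster-config option.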

jfear commented 8 years ago

I like the re-organization; the nesting cleans things up a bit. I think the dot-notation lookups seem the cleanest.

Since the wrapper system is pulling the complexity out of the rules, I am thinking that each "workflow" should contain all of its own rules and make all of its settings some sort of lookup from the global config.

I also like having the cluster config side-by-side.

I will look at #10 and make individual comments there. Did not know you could do line-by-line comments with PRs.

daler commented 8 years ago

Seems like an elegant option for the dot notation from http://stackoverflow.com/a/7534478. Given the function:

def cfg(val):
    """Walk the global `config` dict along a dotted path like 'a.b.c'."""
    current_data = config
    for chunk in val.split('.'):
        # .get() with a {} default means missing keys yield {} rather than raising
        current_data = current_data.get(chunk, {})
    return current_data

the lookup becomes:

rule align:
    input:
        index=cfg('workflows_align.rules.align.index')
    threads: cfg('workflows_align.rules.align.cluster.threads')
   ...

The reason I like this is that the global config dict remains unchanged as a dict. The other answers in that stackoverflow question have other options, but I worry about converting the global config dict to something else in case snakemake is using it for other things we don't know about that assume the full dict interface.
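One caveat with chaining `.get(chunk, {})` is that a typo in the path silently returns `{}`. A variant that fails loudly instead, keeping the dict-as-dict approach (it takes the config explicitly here just to make it self-contained):

```python
def cfg_strict(config, val):
    """Like cfg(), but raise KeyError naming the full path when any
    component is missing, instead of silently returning {}."""
    current_data = config
    for chunk in val.split('.'):
        try:
            current_data = current_data[chunk]
        except (KeyError, TypeError):
            # TypeError covers walking "past" a leaf value.
            raise KeyError("no such config path: %r" % val)
    return current_data

config = {"workflows_align": {"rules": {"align": {"index": "/data/idx"}}}}
```

Failing at parse time on a bad lookup is probably preferable to a rule quietly receiving `{}` as, say, an index path.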

daler commented 8 years ago

Also, we really should have config validation once things settle into a format. For example, we could keep a validation schema file that includes default values, then have code that builds an example config from that schema and validates the generated config. The user edits that config, which is then validated again before use.

Luckily I have existing code for exactly this! I'll port it over.
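Not the existing code mentioned above, just a toy sketch of the idea: a schema whose leaves are default values, a generator that emits an example config from it, and a validator that flags unknown keys:

```python
def example_config(schema):
    """Build an example config by taking every default in the schema."""
    return {k: example_config(v) if isinstance(v, dict) else v
            for k, v in schema.items()}

def validate(config, schema, path=""):
    """Return a list of dotted keys present in config but absent from the schema."""
    errors = []
    for key, value in config.items():
        here = path + "." + key if path else key
        if key not in schema:
            errors.append("unknown key: %s" % here)
        elif isinstance(value, dict) and isinstance(schema[key], dict):
            errors.extend(validate(value, schema[key], here))
    return errors

schema = {"global": {"title": "My Project", "assembly": "dm6"}}
generated = example_config(schema)
problems = validate({"global": {"assembly": "dm6", "typo": 1}}, schema)
```

A real version would also check types and required keys, but even this much catches the most common failure mode (a misspelled setting silently ignored).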