Closed jfear closed 8 years ago
In your example converting this:
rule test:
    input: '{prefix}.cutadapt2.fastq.gz'
    output: '{prefix}.cutadapt2.{aligner_tag}.bam'
    wrapper:
        ...
to this:
rule test:
    input: SH.make_input(prefix='prefix', midfix='.cutadapt2', suffix='.{aligner_tag}.bam')
    output: '{prefix}.cutadapt2.{aligner_tag}.bam'
    wrapper:
        ...
should the suffix be changed to:
rule test:
    input: SH.make_input(prefix='prefix', midfix='.cutadapt2', suffix='.fastq.gz')
    output: '{prefix}.cutadapt2.{aligner_tag}.bam'
    wrapper:
        ...
? That is, it's not clear whether the suffix is the suffix of the requested input file or the suffix of the output file. I need to stare at this some more, looks very cool.
That was a typo, it should be the suffix of the input files.
This PR is for the discussion of the SampleHandler interface. I don't think this is ready to be merged yet.
To test, you should be able to just run the mapping workflow. Everything should run as before, with the addition of a treatment-level merge.
General Idea
I don't want to confine myself to a specific folder structure, and want the added flexibility of aggregating across different levels. The basic idea is that the user can define levels of folder/file structure in the config. I have envisioned 4 main levels of control:
rawLevel: This is the folder/file structure off of the sequencer ('raw data').
runLevel: This is the technical replicate level (think machine 'run'). This level may be identical to the rawLevel, unless the FASTQ files are split into bins that need to be combined. This would also allow the user to change the folder/file structure if they find the rawLevel structure unwieldy.
sampleLevel: This is the biological sample level. It would allow the combining of tech reps. Most of what we care about downstream would be at this level.
aggLevel: This would be more of an experiment-wide aggregation level. I am still not entirely sure of the best way to set this up. I was thinking the aggLevel would be the high-level aggregation spot for doing summaries and reports. I also think it would be nice to create a fluid aggLevel2, which would be a list of levels that would allow the user to specify different metadata to aggregate across.
The Main Course
The SampleHandler class is in lcdb/interface.py. It is a single class that serves two main functions.
Target List
SampleHandler.build_targets
This is not much different than what we had before. It takes a list of filename patterns and builds the list of file names that we want snakemake to create. The big difference is that the patterns now contain a string format reference to one of the levels mentioned above (see workflows\mapping\Snakemake: patterns). Because it uses the folder/file structure patterns from the config, the user can change the structure by editing only the config.
Input Functions
SampleHandler.make_input
This is the bread and butter of the system. It generates snakemake input functions that make it easy to move up levels. You can also stay at the same level, but then you gain nothing over just using the snakemake wildcard system.
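To illustrate what "generating an input function" means, here is a minimal sketch, not the actual code in lcdb/interface.py. It assumes only that snakemake calls an input function with a wildcards object exposing the matched values as attributes. The `CONFIG` dict, the 'bowtie2' tag, and the 'samples/s1/s1' path are hypothetical values chosen for the example.

```python
from types import SimpleNamespace

# Hypothetical config values; the real ones would come from the workflow config.
CONFIG = {"aligner_tag": "bowtie2"}


def make_input(prefix, midfix, suffix):
    """Return a snakemake-style input function (illustrative sketch only).

    prefix is the NAME of the wildcard to read, while midfix and suffix are
    plain strings whose '{...}' fields are filled in from CONFIG.
    """
    def input_function(wildcards):
        # Read the matched wildcard value, e.g. wildcards.prefix.
        stem = getattr(wildcards, prefix)
        # Fill fields such as '{aligner_tag}' in the midfix/suffix only,
        # leaving the matched prefix untouched.
        return stem + (midfix + suffix).format(**CONFIG)

    return input_function


# Stand-in for the wildcards object snakemake would pass to the function.
wildcards = SimpleNamespace(prefix="samples/s1/s1")
same_level = make_input(prefix="prefix", midfix=".cutadapt2.{aligner_tag}", suffix=".bam")
result = same_level(wildcards)
print(result)
```

Returning a closure keeps the rule definition short: the rule only names the pieces of the file name, and the lookup happens later, when the wildcards are known.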
Parameters
prefix: Usually just pass 'prefix' if you used '{prefix}' in the output file name.
midfix: This is really hard for the snakemake wildcard system to parse correctly, so I think it is best to pass a complete string with the midfix (e.g., '.cutadapt.{aligner_tag}'). As long as '{aligner_tag}' is in the config or sample table, it will be filled in.
suffix: Similar to midfix, it is safest to just pass a string.
agg: A bool to signal whether you want to move up a level for the input files.
Example staying at the sample level
Given a basic rule
You could re-write this using SampleHandler
not much fun :-(
Example of aggregation
However, when we want to move between levels SampleHandler is useful.
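As a rough illustration of moving between levels, the sketch below resolves a sample-level prefix to the run-level files that would feed it. It is a simplified stand-in, not the SampleHandler implementation: the sample table, the level patterns, and the helper names `find_level` and `aggregated_inputs` are all hypothetical, and only two of the four levels are configured.

```python
# Hypothetical sample table: one row per sequencing run.
SAMPLE_TABLE = [
    {"sample": "s1", "run": "s1_r1"},
    {"sample": "s1", "run": "s1_r2"},
    {"sample": "s2", "run": "s2_r1"},
]

# Hypothetical level patterns, as they might be written in the config.
LEVELS = {
    "sampleLevel": "samples/{sample}/{sample}",
    "runLevel": "runs/{sample}/{run}/{run}",
}

# Each level takes its inputs from the level that follows it in this list.
LEVEL_ORDER = ["aggLevel", "sampleLevel", "runLevel", "rawLevel"]


def find_level(prefix):
    """Return (level name, matching rows) for the level whose pattern renders to prefix."""
    for level, pattern in LEVELS.items():
        rows = [row for row in SAMPLE_TABLE if pattern.format(**row) == prefix]
        if rows:
            return level, rows
    raise ValueError(f"prefix {prefix!r} matches no configured level")


def aggregated_inputs(prefix, midfix, suffix):
    """List the files from the next level that feed the file identified by prefix."""
    level, rows = find_level(prefix)
    # This sketch only configures sampleLevel and runLevel, so it can only
    # step from the sample level to the run level.
    next_level = LEVEL_ORDER[LEVEL_ORDER.index(level) + 1]
    pattern = LEVELS[next_level]
    # dict.fromkeys drops duplicate prefixes while keeping table order.
    prefixes = dict.fromkeys(pattern.format(**row) for row in rows)
    return [p + midfix + suffix for p in prefixes]


inputs = aggregated_inputs("samples/s1/s1", ".cutadapt2", ".bam")
print(inputs)
```

Here the sample-level prefix for s1 matches two rows of the table, so the result lists one run-level file per technical replicate.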
SampleHandler will look at the prefix, figure out what level it belongs to and what samples it represents. It will then go up a level (agg > sample > run > raw) and create files named with the other level's naming pattern.
Dessert
I have also begun unit testing in lcdb/test. I have added some basic tests for lcdb/helpers.py and more tests for lcdb/interface.py. I have also created a Makefile in the root to run tests. More tests need to be added, but basic functionality is at least tested.
Summary
I am sure I have overcomplicated some things. I tried to add enough comments/docstrings to be self-explanatory. Please feel free to hack/comment.