jfear commented 8 years ago

This PR is for the discussion of the SampleHandler interface. I don't think this is ready to be merged yet.

To test, you should be able to just run the mapping workflow. Everything should run as before, with the addition of a treatment-level merge.

General Idea

I don't want to confine myself to a specific folder structure, and want the added flexibility of aggregating across different levels. The basic idea is that the user can define levels of folder/file structure in the config. I have envisioned 4 main levels of control:

- rawLevel: This is the folder/file structure off of the sequencer ('raw data').
- runLevel: This is the technical replicate level (think machine 'run'). This level may be identical to the rawLevel, unless the FASTQ files are split into bins that need to be combined. This would also allow the user to change folder/file structure if they find the rawLevel structure unwieldy.
- sampleLevel: This is the biological sample level. It would allow the combining of tech reps. Most of what we care about downstream would be at this level.
- aggLevel: This would be more of an experiment-wide aggregation level. I am still not entirely sure of the best way to set this up. I was thinking the aggLevel would be the high-level aggregation spot for doing summaries and reports. I also think it would be nice to create a fluid aggLevel2, which would be a list of levels that would allow the user to specify different metadata to aggregate across.
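To make the idea of config-defined levels concrete, here is a minimal sketch. It assumes each level is a format-string layout filled in from the sample table. The layouts and the field names (`sampleID`, `run`, `bin`, `treatment`) are hypothetical and are not taken from the actual config.

```python
# Hypothetical level layouts; the key names mirror the four levels above,
# but the folder structures and field names are made up for illustration.
LEVELS = {
    "rawLevel": "raw/{sampleID}_{run}_{bin}",
    "runLevel": "runs/{sampleID}/{sampleID}_{run}",
    "sampleLevel": "samples/{sampleID}/{sampleID}",
    "aggLevel": "agg/{treatment}/{treatment}",
}

# One row of a hypothetical sample table.
row = {"sampleID": "s1", "run": "r1", "bin": "001", "treatment": "ctrl"}

# str.format ignores unused keyword arguments, so one row can fill every level.
prefixes = {level: layout.format(**row) for level, layout in LEVELS.items()}

for level, prefix in prefixes.items():
    print(level, "->", prefix)
```

Under this assumption, changing the folder structure only means editing the layout strings; nothing else has to know about the paths.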
The Main Course

The SampleHandler class is in lcdb/interface.py. It is a single class that serves two main functions.

- Constructs the target list
- Builds input functions to move between levels

Target List `SampleHandler.build_targets`

This is not much different than what we had before. It takes a list of filename patterns and builds the list of file names that we want snakemake to create. The big difference is that the patterns now contain a string format reference to one of the levels mentioned above (see workflows\mapping\Snakemake: patterns). Because it uses the folder/file structure patterns from the config, the user can change the structure by editing only the config.
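The target-building step could look roughly like the sketch below. This is not the code in lcdb/interface.py: the function signature, the layouts, and the sample table fields are all assumptions made for illustration.

```python
def build_targets(patterns, levels, sample_table, **extra):
    """Expand level-aware patterns into concrete file names (illustrative only)."""
    targets = []
    for pattern in patterns:
        for row in sample_table:
            fields = {**extra, **row}
            # Resolve every level layout for this row, e.g. 'samples/s1'.
            prefixes = {name: layout.format(**fields) for name, layout in levels.items()}
            target = pattern.format(**{**fields, **prefixes})
            # Several rows (tech reps) can map to the same higher-level file.
            if target not in targets:
                targets.append(target)
    return targets


# Hypothetical layouts and sample table.
levels = {"runLevel": "runs/{sampleID}_{run}", "sampleLevel": "samples/{sampleID}"}
sample_table = [
    {"sampleID": "s1", "run": "r1"},
    {"sampleID": "s1", "run": "r2"},
]

run_targets = build_targets(
    ["{runLevel}.cutadapt2.fastq.gz"], levels, sample_table
)
sample_targets = build_targets(
    ["{sampleLevel}.cutadapt2.{aligner_tag}.bam"], levels, sample_table,
    aligner_tag="bowtie2",
)
```

Here the two technical replicates produce two run-level targets but only one sample-level target, which is the behaviour the levels are meant to give.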

Input Functions `SampleHandler.make_input`

This is the bread and butter of the system. It generates snakemake input functions that make it easy to move up levels. You can also stay at the same level, but then you gain nothing over just using the snakemake wildcard system.

Parameters

There are three parameters that refer to different parts of the file name. Each can be given either the name of a string format code (i.e., the name inside the '{}') or a string containing the format itself.
- prefix Usually just pass 'prefix' if you used '{prefix}' in the output file name.
- midfix This is really hard for the snakemake wildcard system to parse correctly, so I think it is best to pass a complete string with the midfix (e.g., '.cutadapt.{aligner_tag}'). As long as '{aligner_tag}' is in the config or sample table it will be filled in.
- suffix Similar to midfix, it is safest to just pass a complete string.

agg is a bool that signals whether you want to move up a level for the input files.
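One way to read the "name or format string" rule is sketched below. The helper `resolve` is hypothetical and only illustrates the two accepted forms. Plain dicts stand in for the snakemake wildcards object and the config.

```python
def resolve(part, wildcards, config):
    """Return the text for one file-name part (hypothetical helper)."""
    if part in wildcards:
        # A bare name such as 'prefix' refers to the wildcard of that name.
        return wildcards[part]
    # Otherwise treat it as a format string such as '.cutadapt.{aligner_tag}'.
    return part.format(**{**config, **wildcards})


wildcards = {"prefix": "samples/s1/s1"}
config = {"aligner_tag": "bowtie2"}

file_name = "".join(
    resolve(part, wildcards, config)
    for part in ("prefix", ".cutadapt.{aligner_tag}", ".bam")
)
```

Passing midfix and suffix as complete strings sidesteps the ambiguity of letting a wildcard regex decide where one part ends and the next begins.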
Example staying at the sample level

Given a basic rule

rule test:
    input: '{prefix}.cutadapt2.fastq.gz'
    output: '{prefix}.cutadapt2.{aligner_tag}.bam'
    wrapper:
         ...

You could rewrite this using SampleHandler

rule test:
    input: SH.make_input(prefix='prefix', midfix = '.cutadapt2', suffix='.{aligner_tag}.bam')
    output: '{prefix}.cutadapt2.{aligner_tag}.bam'
    wrapper:
         ...

not much fun :-(

Example of aggregation

However, when we want to move between levels SampleHandler is useful.

rule test:
    input: SH.make_input(prefix='prefix', midfix = '.cutadapt2.{aligner_tag}', suffix='.sort.dedup.bam', agg=True)
    output: '{prefix}.cutadapt2.{aligner_tag}.merged.sort.dedup.bam'
    wrapper:
         ...

SampleHandler will look at the prefix, figure out what level it belongs to and what samples it represents. It will then go up a level (agg > sample > run > raw) and create file names using the other level's naming pattern.
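A rough sketch of such an input function follows. It assumes that the merged sample-level output is built from run-level files, and that the level of a prefix can be found by matching it against each layout. The layouts, the sample table, and this `make_input` signature are illustrative and differ from the real `SampleHandler.make_input`.

```python
# Hypothetical layouts, ordered from the level being merged to the level above it.
LEVELS = {
    "runLevel": "runs/{sampleID}/{sampleID}_{run}",
    "sampleLevel": "samples/{sampleID}/{sampleID}",
}
ORDER = ["runLevel", "sampleLevel"]

SAMPLE_TABLE = [
    {"sampleID": "s1", "run": "r1"},
    {"sampleID": "s1", "run": "r2"},
    {"sampleID": "s2", "run": "r1"},
]


def make_input(midfix, suffix, config, agg=False):
    """Return a function mapping wildcards to a list of input files."""

    def _input(wildcards):
        prefix = wildcards["prefix"]
        for position, level in enumerate(ORDER):
            # Rows whose layout at this level reproduces the requested prefix.
            rows = [r for r in SAMPLE_TABLE if LEVELS[level].format(**r) == prefix]
            if rows:
                break
        else:
            raise ValueError(f"no level matches prefix {prefix!r}")

        # With agg=True, name the inputs with the neighbouring level's layout.
        source = ORDER[max(position - 1, 0)] if agg else level
        files = []
        for row in rows:
            name = LEVELS[source].format(**row) + (midfix + suffix).format(**config)
            if name not in files:
                files.append(name)
        return files

    return _input


merge_inputs = make_input(
    ".cutadapt2.{aligner_tag}", ".sort.dedup.bam", {"aligner_tag": "bowtie2"}, agg=True
)
inputs_for_s1 = merge_inputs({"prefix": "samples/s1/s1"})
```

With these made-up layouts, the sample-level prefix `samples/s1/s1` resolves to the two run-level BAM files for s1, which is the list a merge rule would need.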

Dessert

I have also begun unit testing in lcdb/test. I have added some basic tests for lcdb/helpers.py and more tests for lcdb/interface.py. I have also created a Makefile in root to run the tests. More tests need to be added, but basic functionality is at least tested.

Summary

I am sure I have overcomplicated some things. I tried to add enough comments/docstrings to be self-explanatory. Please feel free to hack/comment.

daler commented 8 years ago

In your example converting this:

rule test:
    input: '{prefix}.cutadapt2.fastq.gz'
    output: '{prefix}.cutadapt2.{aligner_tag}.bam'
    wrapper:
         ...

to this:

rule test:
    input: SH.make_input(prefix='prefix', midfix = '.cutadapt2', suffix='.{aligner_tag}.bam')
    output: '{prefix}.cutadapt2.{aligner_tag}.bam'
    wrapper:
         ...

should the suffix be changed to:

rule test:
    input: SH.make_input(prefix='prefix', midfix = '.cutadapt2', suffix='.fastq.gz')
    output: '{prefix}.cutadapt2.{aligner_tag}.bam'
    wrapper:
         ...

? That is, it's not clear whether the suffix is the suffix of the requested input file or the suffix of the output file. I need to stare at this some more, looks very cool.

jfear commented 8 years ago

That was a typo, it should be the suffix of the input files.

lcdb / lcdb-workflows

Mapping workflow regex #40
