lcdb / lcdb-workflows

DEPRECATED. Please see https://github.com/lcdb/lcdb-wf
MIT License

Mapping workflow regex #40

Closed: jfear closed this 8 years ago

jfear commented 8 years ago

This PR is for the discussion of the SampleHandler interface. I don't think this is ready to be merged yet.

To test, you should be able to just run the mapping workflow. Everything should run as before, with the addition of a treatment-level merge.

General Idea

I don't want to confine myself to a specific folder structure, and I want the added flexibility of aggregating across different levels. The basic idea is that the user can define levels of folder/file structure in the config. I have envisioned 4 main levels of control: agg, sample, run, and raw.

The SampleHandler class is in lcdb/interface.py. It is a single class that serves two main functions.

  1. Constructs the target list
  2. Builds input functions to move between levels.

Target List SampleHandler.build_targets

This is not much different than what we had before. It takes a list of filename patterns and builds the list of file names that tells snakemake which files we want. The big difference is that the patterns now contain a string format reference to one of the levels mentioned above (see workflows\mapping\Snakemake: patterns). Because it uses the folder/file structure patterns from the config, the user can change the structure by editing only the config.
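To make the idea concrete, here is a minimal sketch of that kind of target expansion. It is not the code in lcdb/interface.py: the level names (`run_prefix`, `sample_prefix`), the level patterns, and the sample table are all hypothetical stand-ins for whatever the config defines.

```python
# Hypothetical level patterns, standing in for what the config would define.
LEVELS = {
    "run_prefix": "{sample}/{sample}_{run}",
    "sample_prefix": "{sample}/{sample}",
}

# Hypothetical sample table: one row per sequencing run.
SAMPLES = [
    {"sample": "s1", "run": "r1"},
    {"sample": "s1", "run": "r2"},
    {"sample": "s2", "run": "r1"},
]


def build_targets(patterns, levels=LEVELS, samples=SAMPLES):
    """Expand each pattern for every row; drop duplicates, keep order."""
    targets = []
    for pattern in patterns:
        for row in samples:
            # Resolve every level pattern for this row ...
            prefixes = {name: tmpl.format(**row) for name, tmpl in levels.items()}
            # ... then substitute the level reference used in the target pattern.
            target = pattern.format(**prefixes)
            if target not in targets:
                targets.append(target)
    return targets


targets = build_targets([
    "{run_prefix}.cutadapt2.fastq.gz",
    "{sample_prefix}.merged.bam",
])
```

Changing the folder layout then only means editing the entries in `LEVELS`; the target patterns stay the same.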

Input Functions SampleHandler.make_input

This is the bread and butter of the system. It generates snakemake input functions that make it easy to move between levels. You can also stay at the same level, but then you gain nothing over just using snakemake's wildcard system.
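For readers less familiar with input functions: snakemake accepts a callable for `input` and calls it with the rule's wildcards. The sketch below shows the same-level case as a plain closure. It is an illustration only, not the actual SampleHandler.make_input; the `fill` helper is hypothetical, and `SimpleNamespace` stands in for snakemake's wildcards object so the snippet runs on its own.

```python
import re
from types import SimpleNamespace


def fill(text, wildcards):
    """Replace each '{name}' in text with the wildcard of that name (hypothetical helper)."""
    return re.sub(r"\{(\w+)\}", lambda m: str(getattr(wildcards, m.group(1))), text)


def make_input(prefix, midfix="", suffix=""):
    """Return a function(wildcards) -> file name, the shape snakemake expects."""
    def input_func(wildcards):
        # 'prefix' names the wildcard that holds the level-specific path.
        return getattr(wildcards, prefix) + fill(midfix + suffix, wildcards)
    return input_func


# Stand-in for the wildcards snakemake would pass in.
wc = SimpleNamespace(prefix="s1/s1_r1", aligner_tag="bowtie2")
same_level = make_input(prefix="prefix", midfix=".cutadapt2", suffix=".fastq.gz")
print(same_level(wc))
```

At the same level this just rebuilds the string that '{prefix}.cutadapt2.fastq.gz' would give, which is why nothing is gained over plain wildcards there.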

Parameters

Given a basic rule

rule test:
    input: '{prefix}.cutadapt2.fastq.gz'
    output: '{prefix}.cutadapt2.{aligner_tag}.bam'
    wrapper:
         ...

You could rewrite this using SampleHandler:

rule test:
    input: SH.make_input(prefix='prefix', midfix = '.cutadapt2', suffix='.{aligner_tag}.bam')
    output: '{prefix}.cutadapt2.{aligner_tag}.bam'
    wrapper:
         ...

not much fun :-(

Example of aggregation

However, when we want to move between levels SampleHandler is useful.

rule test:
    input: SH.make_input(prefix='prefix', midfix = '.cutadapt2.{aligner_tag}', suffix='.sort.dedup.bam', agg=True)
    output: '{prefix}.cutadapt2.{aligner_tag}.merged.sort.dedup.bam'
    wrapper:
         ...

SampleHandler will look at the prefix, figure out what level it belongs to and what samples it represents. It will then go up a level (agg > sample > run > raw) and create file names using the other level's naming pattern.
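A rough sketch of that lookup follows, under several assumptions: the level patterns and sample table are made up, only two levels are modelled, and reading "agg > sample > run > raw" as coarsest to finest is my interpretation. The real logic lives in lcdb/interface.py and may differ; `make_agg_input`, `pattern_to_regex`, and `fill` are hypothetical names.

```python
import re
from types import SimpleNamespace

# Hypothetical level patterns, ordered coarsest to finest.
LEVEL_PATTERNS = [
    ("sample", "{sample}/{sample}"),
    ("run", "{sample}/{sample}_{run}"),
]

# Hypothetical sample table: one row per sequencing run.
SAMPLE_TABLE = [
    {"sample": "s1", "run": "r1"},
    {"sample": "s1", "run": "r2"},
    {"sample": "s2", "run": "r1"},
]


def pattern_to_regex(pattern):
    """Turn '{sample}/{sample}_{run}' into a regex with named groups."""
    pieces = re.split(r"\{(\w+)\}", pattern)  # literal, field, literal, ...
    seen, out = set(), []
    for i, piece in enumerate(pieces):
        if i % 2 == 0:
            out.append(re.escape(piece))            # literal text
        elif piece in seen:
            out.append("(?P={})".format(piece))     # repeated field: backreference
        else:
            seen.add(piece)
            out.append("(?P<{}>[^/]+)".format(piece))
    return re.compile("".join(out))


def fill(text, wildcards):
    """Replace each '{name}' in text with the wildcard of that name."""
    return re.sub(r"\{(\w+)\}", lambda m: str(getattr(wildcards, m.group(1))), text)


def make_agg_input(prefix, midfix="", suffix=""):
    """Return an input function listing the finer-level files behind a prefix."""
    def input_func(wildcards):
        value = getattr(wildcards, prefix)
        # Find which level the prefix belongs to (the finest level has nothing below it).
        for idx, (_, pattern) in enumerate(LEVEL_PATTERNS[:-1]):
            match = pattern_to_regex(pattern).fullmatch(value)
            if match:
                break
        else:
            raise ValueError("prefix matches no aggregatable level: " + value)
        finer_pattern = LEVEL_PATTERNS[idx + 1][1]
        fields = match.groupdict()
        tail = fill(midfix + suffix, wildcards)
        files = []
        for row in SAMPLE_TABLE:
            # Keep rows that belong to the matched prefix.
            if all(row[key] == val for key, val in fields.items()):
                name = finer_pattern.format(**row) + tail
                if name not in files:
                    files.append(name)
        return files
    return input_func


wc = SimpleNamespace(prefix="s1/s1", aligner_tag="bowtie2")
agg_input = make_agg_input(prefix="prefix", midfix=".cutadapt2.{aligner_tag}",
                           suffix=".sort.dedup.bam")
print(agg_input(wc))
```

Building the regex from the same pattern that names the files keeps the config as the single source of truth for the folder structure.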

Dessert

I have also begun unit testing in lcdb/test. I have added some basic tests for lcdb/helpers.py and more tests for lcdb/interface.py. I have also created a Makefile in the root to run tests. More tests need to be added, but basic functionality is at least tested.

Summary

I am sure I have overcomplicated some things. I tried to add enough comments/docstrings to be self-explanatory. Please feel free to hack/comment.

daler commented 8 years ago

In your example converting this:

rule test:
    input: '{prefix}.cutadapt2.fastq.gz'
    output: '{prefix}.cutadapt2.{aligner_tag}.bam'
    wrapper:
         ...

to this:

rule test:
    input: SH.make_input(prefix='prefix', midfix = '.cutadapt2', suffix='.{aligner_tag}.bam')
    output: '{prefix}.cutadapt2.{aligner_tag}.bam'
    wrapper:
         ...

should the suffix be changed to:

rule test:
    input: SH.make_input(prefix='prefix', midfix = '.cutadapt2', suffix='.fastq.gz')
    output: '{prefix}.cutadapt2.{aligner_tag}.bam'
    wrapper:
         ...

? That is, it's not clear whether the suffix is the suffix of the requested input file or the suffix of the output file. I need to stare at this some more; it looks very cool.

jfear commented 8 years ago

That was a typo; it should be the suffix of the input files.