orzechoj / piq-single

PIQ DNA footprinting. Cloned from https://bitbucket.org/thashim/piq-single
2 stars 3 forks source link

IMPORTANT - READ ME

Frequently asked questions

If you have a question read me first

common.r

MODIFY genome to fit your data, match bis("BSgenome.*") to your genome

As of bioconductor 2.14 some of the package names have changed. Description below:

Organism package name description
Human BSgenome.Hsapiens.UCSC.hg19.masked Human genome + repeatmask
Human BSgenome.Hsapiens.UCSC.hg19 Human genome (set mapq=0 or blacklist)
Mouse BSgenome.Mmusculus.UCSC.mm10 Mouse genome (set mapq=0 or blacklist)
Yeast BSgenome.Scerevisiae.UCSC.sacCer2 Yeast (set mapq=0)

In general we suggest repeatmasking the genome when using unique maps (mapq>=1). hg19.masked does this by default for mm10 and sacCer2, the masked genomes do not have repeatmask, and we'd suggest blacklisting them via a blacklist or use non-unique maps.

input files

use jasparfix.txt in the pwms folder (all JASPAR including PBM hits) for a comprehensive list. If for whatever reason you only want JASPAR CORE hits, use jaspar.txt.

How to run

Running a single motif

  1. Make sure you can execute common.r without errors (this make sure your R is set up correctly).

  2. Generate the PWM hits across genome (does not depend on choice of BAM)

    Rscript pwmmatch.exact.r /cluster/thashim/basepiq/common.r /cluster/thashim/basepiq/pwms/jasparfix.txt 139 /cluster/thashim/PIQ/motif.matches/

    This uses the genome and PWM cutoffs in common.r with the 139th motif in jaspar.txt (CTCF) and writes the matches as a binary R file called 139.RData in tmppiq.

  3. Convert BAM to internal binary format (does not depend on choice of motif).

    Rscript bam2rdata.r /cluster/thashim/basepiq/common.r /cluster/thashim/PIQ/d0.RData /cluster/cwo/dnase_seq/bams/D0_50-100_130801.bwa.mapq20.mm10.bam

    This takes the D0_50-100_130801.bwa.mapq20.mm10.bam file and retains only the read end locations, storing it into d0.RData.

  4. Combine BAM + PWM to make calls, depends on choice of BAM and PWM. If you have multiple simultaneous runs, make sure each run gets a unique tmp folder.

    Rscript pertf.r /cluster/thashim/basepiq/common.r /cluster/thashim/PIQ/motif.matches/ /scratch/tmp/ /cluster/thashim/130130.mm10.d0/ /cluster/thashim/PIQ/d0.RData 139

    This takes settings in common.r to call the motif match 139.RData (CTCF, from above) using data d0.RData (from bam2rdata) and writing the output into 130130.mm10.d0. The tmp folder stores some large temporary matrices and can be wiped after the run.

Running multiple motifs

Bash scripts for running multiple motifs are provided as part of the utils/sgepwm.sh and utils/sgecalls.sh

Special use cases

Multiple replicates

If you have replicates, please merge them using samtools merge.

Multiple experiments

If you have multiple experiments in the same experimental condition but different assays, bam2rdata.r will take multiple bamfiles as part of its argument and use them simultaneously.

Control experiments

If you have control experiments such as genomic DNA or naked DNase-I digestion and want to know if your results are significant with respect to control. Do the following:

`Rscript pertf.bg.r /cluster/thashim/basepiq/common.r /cluster/thashim/PIQ/motif.matches/ /scratch/tmp/ /cluster/thashim/130130.mm10.d0/ /cluster/thashim/PIQ/d0.RData /cluter/thashim/PIQ/control.RData 139`

This will use the profile / footprint learned from the training data (d0.RData) and output scores over the control data.

Train and test separately

The control experiments case can be used to learn footprints from a separate dataset than used for calls. This may be useful if

Parameters

Check before every run

Quality control / blacklisting

Changing motif match candidates

Note: Increasing motif match candidate count may adversely affect performance due to inability to distinguish non-target TF binding with similar DNase profiles.

About the output

For each PWM, PIQ outputs 3 files: two .csv files containing call scores and a .pdf containing summary and goodness of fit info.

*-calls.all.csv

This file contains ALL instances of motif matches from pwmmatch.r and its associated scores. This file is useful if you want to re-do the cutoffs yourself or if you want to look at instances where the motif is matched but the factor is unbound.

*-calls.csv

This file contains instances of motif match whose scores had purity above the purity.cuotff value in common.r. Purity is a proxy for PPV. A purity of 0.7 means 70% of the entries in calls.csv will be bound (estimated using background binding sites).

*-diag.pdf

This file contains diagnostic pdf output of the motif and its binding sites.

page 1 (input motif used to match)

page 2 (global statistics)

page 3 (site to site statistics)