shawnzhangyx / PePr

a peak-calling and differential analysis tool for replicated ChIP-Seq data
GNU General Public License v3.0

PePr v1.1.10 or newer

Introduction

PePr is a ChIP-Seq Peak-calling and Prioritization pipeline that uses a sliding window approach and models read counts across replicates and between groups with a negative binomial distribution. PePr empirically estimates the optimal shift/fragment size and sliding window width, and estimates dispersion from the local genomic area. Regions with less variability across replicates are ranked more favorably than regions with greater variability. Optional post-processing steps are also made available to filter out peaks not exhibiting the expected shift size and/or to narrow the width of peaks.
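The shift-and-count idea described above can be illustrated with a small sketch. This is not PePr's implementation: the function names, the half-window step, and the toy reads are assumptions made for illustration, and the negative binomial modelling and dispersion estimation are left out entirely.

```python
def shifted_positions(reads, shift_size):
    """Move each read's 5' end toward the fragment centre.

    reads: iterable of (five_prime_position, strand), strand is '+' or '-'.
    Forward-strand reads move right by shift_size, reverse-strand reads move left.
    """
    shifted = []
    for pos, strand in reads:
        shifted.append(pos + shift_size if strand == "+" else pos - shift_size)
    return shifted


def window_counts(positions, window_size, chrom_length):
    """Count shifted reads in sliding windows.

    Assumption for this sketch: windows overlap by half (step = window_size // 2).
    """
    step = window_size // 2
    counts = []
    for start in range(0, chrom_length - window_size + 1, step):
        end = start + window_size
        counts.append(sum(1 for p in positions if start <= p < end))
    return counts


# Toy data: two forward and two reverse reads flanking one binding site,
# plus one isolated forward read.
reads = [(100, "+"), (120, "+"), (300, "-"), (320, "-"), (900, "+")]
positions = shifted_positions(reads, shift_size=100)
counts = window_counts(positions, window_size=200, chrom_length=1200)
print(positions)  # [200, 220, 200, 220, 1000]
print(counts)     # [0, 4, 4, 0, 0, 0, 0, 0, 0, 1, 1]
```

After shifting, the forward and reverse reads around the binding site pile up at the same positions, so the windows covering that site stand out from the background.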

Installation

  1. Make sure your Python version is higher than 2.6. Version 3.x may not be fully supported yet.
  2. Install pip on your system if you don't have it.
  3. Run pip install PePr, or pip install PePr --user if you don't have administrator privileges. Alternatively, download the tarball (PePr-[version].tar.gz) from GitHub and install it with pip install PePr-[version].tar.gz
  4. If the installation is successful, you can invoke the script directly by typing PePr. A help message will show up.

Supported File Formats

Scripts to call PePr

The following scripts are available after the installation and can be called directly from bash console/terminal.

Basic Usage Examples

Warning: These are working examples with minimal required parameters. For the best performance (or to avoid bad fitting) on your data, please read this manual carefully and choose the right parameters.

Parameters

Parameter Description
-p/--parameter-file Use a parameter file instead of command line options. When a parameter file is used, all other command line options are ignored. See the next section for parameter file configuration.
-i/--input1 Group 1 input files. Multiple file names are separated by commas, e.g. input1.bam,input2.bam. You can also give relative paths to the files, e.g. folder1/input1.bam,folder2/input2.bam,folder3/input3.bam
-c/--chip1 Group 1 ChIP files.
--input2 Group 2 input files. Used in differential binding analysis.
--chip2 Group 2 ChIP files. Used in differential binding analysis.
-n/--name Experiment name. It is used as the prefix of all output files from PePr. Default: "NA"
-f/--file-format Read file format. Currently supported: bed, sam, bam, sampe (sam paired-end), bampe (bam paired-end)
-s/--shiftsize Half the fragment size. The number of bases to shift forward and reverse strand reads toward each other. If not specified by the user, PePr will empirically estimate this number from the data for each ChIP sample.
-w/--windowsize Sliding window size. If not specified by the user, PePr will estimate this by calculating the average width of potential peaks. The lower and upper bounds of the PePr estimate are 100bp and 1000bp. A user-provided window size is not constrained, but we recommend staying within this range (100-1000bp).
--diff Tell PePr to perform differential binding analysis.
--threshold P-value cutoff. Default: 1e-5.
--peaktype sharp or broad. Default is broad. PePr treats broad peaks (like H3K27me3) and sharp peaks (like most transcription factors) slightly differently. Specify this option if you know which type your peaks are.
--normalization inter-group, intra-group, scale, or no. Default is intra-group for peak-calling and inter-group for differential binding analysis. PePr uses a modified TMM method to normalize for differences in IP efficiency between samples (see the supplementary methods of the paper). This method implicitly assumes that peaks overlap substantially across all samples. That assumption sometimes fails between groups (for example, between TF ChIP-seq and TF knockout); for such differential binding analyses, switch to intra-group normalization. scale simply scales the reads so that the total library sizes are the same. no disables normalization.
--keep-max-dup Maximum number of duplicated reads to keep at each single position. If not specified, no duplicates are removed.
--num-processors Number of CPUs to run in parallel.
--input-directory Directory containing the data files. The path specified here is added as a prefix to each file name. Best practice is to always use an absolute path here.
--output-directory Directory for the output files. PePr adds this path as a prefix to the output files. An absolute path is recommended.
--version Will show the version number and exit.
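As a rough illustration of how the options above combine, the following sketch assembles a PePr command line as an argument list. Only flags listed in the table are used. The helper name build_pepr_command, the file names, and the rule that --diff requires Group 2 ChIP files are assumptions of this sketch, not part of PePr.

```python
def build_pepr_command(chip1, input1, file_format, name=None,
                       chip2=None, input2=None, diff=False,
                       peaktype=None, threshold=None):
    """Build an argument list for PePr from the documented options.

    chip1/input1/chip2/input2 are lists of file names; PePr expects
    multiple files joined by commas. threshold is passed as a string.
    """
    cmd = ["PePr", "-c", ",".join(chip1), "-i", ",".join(input1),
           "-f", file_format]
    if name is not None:
        cmd += ["-n", name]
    if diff:
        # Assumption of this sketch: a differential run needs Group 2 ChIP files.
        if not chip2:
            raise ValueError("differential analysis needs Group 2 ChIP files")
        cmd += ["--chip2", ",".join(chip2)]
        if input2:
            cmd += ["--input2", ",".join(input2)]
        cmd.append("--diff")
    if peaktype is not None:
        cmd += ["--peaktype", peaktype]
    if threshold is not None:
        cmd += ["--threshold", threshold]
    return cmd


# Placeholder file names; replace with your own data.
cmd = build_pepr_command(
    chip1=["chip_rep1.bam", "chip_rep2.bam"],
    input1=["input_rep1.bam", "input_rep2.bam"],
    file_format="bam",
    name="test",
    peaktype="sharp",
)
print(" ".join(cmd))
# With PePr installed, subprocess.call(cmd) would run the command.
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.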

Parameter File

The parameter file is an easier way of running PePr by including the running parameters in one file. It is effectively the same as running from the command line. A basic example is provided below:

#filetype       filename
chip1   chip_rep1.bed
chip1   chip_rep2.bed
input1  input_rep1.bed
input1  input_rep2.bed
file-format     bed
peaktype     broad
difftest     FALSE
keep-max-dup 2
threshold     1e-5
name    test

PePr will also output a complete parameter file, so you can keep a record of your running parameters and reproduce the same results.
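A parameter file in the layout shown in the example can also be generated from a script. The sketch below reproduces that example; the helper name render_parameter_file and the use of a tab as the separator are assumptions (the example only shows that keys and values are separated by whitespace).

```python
def render_parameter_file(chip1, input1, settings):
    """Return parameter-file text: one 'key<TAB>value' pair per line.

    chip1 and input1 are lists of file names (one line per file);
    settings is a list of (key, value) pairs written in the given order.
    """
    lines = ["#filetype\tfilename"]
    lines += ["chip1\t%s" % f for f in chip1]
    lines += ["input1\t%s" % f for f in input1]
    lines += ["%s\t%s" % (key, value) for key, value in settings]
    return "\n".join(lines) + "\n"


text = render_parameter_file(
    chip1=["chip_rep1.bed", "chip_rep2.bed"],
    input1=["input_rep1.bed", "input_rep2.bed"],
    settings=[
        ("file-format", "bed"),
        ("peaktype", "broad"),
        ("difftest", "FALSE"),
        ("keep-max-dup", "2"),
        ("threshold", "1e-5"),
        ("name", "test"),
    ],
)
print(text)
```

The returned text can be written to a file of your choice and passed to PePr with the -p/--parameter-file option.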

Output Files

Links

Questions?

You're also welcome to shoot me an e-mail at yanxiazh@umich.edu; I'll try to reply as soon as possible. In the e-mail, please include [1] a copy of the command/script you used to call PePr, [2] the parameters.txt file, and [3] the log file. This will speed up the troubleshooting process.

Cite PePr

Zhang Y, Lin YH, Johnson TD, Rozek LS, Sartor MA. PePr: A peak-calling prioritization pipeline to identify consistent or differential peaks from replicated ChIP-Seq data. Bioinformatics. 2014.