tdhock / PeakSegFPOP

Command line limited memory PeakSeg functional pruning optimal partitioning algorithm
0 stars 2 forks source link

** Deprecated

Please don't use this code! If you want to use PeakSeg for genome-wide peak calling, please use [[https://github.com/tdhock/PeakSegPipeline][PeakSegPipeline]] instead. (it implements the same algorithms, but with a better user interface; it should be easier to install and use)

** Old README below

This repository contains scripts for genome-wide supervised ChIP-seq peak prediction, for a single experiment type (e.g. broad H3K36me3 or sharp H3K4me3 data), jointly using multiple samples and cell types.

[[https://travis-ci.org/tdhock/PeakSegFPOP][https://travis-ci.org/tdhock/PeakSegFPOP.png?branch=master]]

** Example output

[[http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-labels/H3K36me3_TDH_immune/][Broad H3K36me3 data: Train on 21 immune cell samples, genome-wide prediction on 29 samples of several cell types]].

[[http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-labels/H3K4me3_TDH_immune/][Sharp H3K4me3 data: train on 27 immune cell samples, genome-wide prediction on 37 samples of several cell types]].

[[http://cbio.mines-paristech.fr/~thocking/hubs/test/demo/][test_demo.R data: train on four labeled samples (two H3K36me3 + two Input), predict on 8 samples in a subset of chr10]]

** Installation

The following commands can be used to install all the required dependencies on Ubuntu precise (linux.x86_64 architecture). If you have another system, make sure to install R, BerkeleyDB STL, and UCSC tools (bigWigToBedGraph, bedToBigBed). Then execute [[file:packages.R]] to install all required R packages.

+BEGIN_SRC shell-script

sudo aptitude install build-essential r-recommended libdb6.0++-dev libdb6.0-stl-dev bedtools export BIN=~/bin # or any other directory on your $PATH curl -L http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigToBedGraph -o $BIN/bigWigToBedGraph curl -L http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed -o $BIN/bedToBigBed chmod u+x $BIN/bigWigToBedGraph $BIN/bedToBigBed make cp PeakSegFPOP $BIN Rscript packages.R

+END_SRC

BerkeleyDB is used by PeakSegFPOP to write a large temporary file to disk. If you don't have root or aptitude, then you can always install BerkeleyDB by hand to your home directory

+BEGIN_SRC shell-script

wget http://download.oracle.com/berkeley-db/db-6.2.23.NC.tar.gz tar xf db-6.2.23.NC.tar.gz cd db-6.2.23.NC/build_unix ../dist/configure --prefix=$HOME --enable-stl make make install

+END_SRC

If BerkeleyDB is installed to your home directory or some other non-standard directory, then make sure to edit the [[file:Makefile]] to tell the compiler where to find it.

+BEGIN_SRC

g++ -std=c++0x -o PeakSegFPOP PeakSegFPOPLog.cpp funPieceListLog.cpp -I$HOME/include -L$HOME/lib -ldb_stl -Wl,-rpath=$HOME/lib

+END_SRC

Once everything has been installed, you can test your installation by executing [[file:test_demo.R]]. It will first download some bigWigs and labels to the =test/demo= directory, then run [[file:pipeline.R]] on them. If everything worked, you can view the results by opening =test/demo/index.html= in a web browser, and it should be the same as the results shown on http://cbio.mines-paristech.fr/~thocking/hubs/test/demo/

** Input Files

The [[file:pipeline.R]] script uses PeakSegFPOP + [[https://github.com/tdhock/PeakSegJoint][PeakSegJoint]] to predict common and different peaks in multiple samples. It requires three kinds of input data:

The last step of [[file:pipeline.R]] is [[file:plot_all.R]] which creates

To give a concrete example, let us consider the data set that is used when you run [[file:test_demo.R]].

*** Coverage data

Each coverage data file should contain counts of aligned sequence reads at every genomic position, for one sample. These files can be in either [[https://genome.ucsc.edu/goldenpath/help/bedgraph.html][bedGraph]] or [[https://genome.ucsc.edu/goldenpath/help/bigWig.html][bigWig]] format (bigWig is preferable since it is indexed and thus faster to read than bedGraph). For example [[file:test_demo.R]] downloads 8 files:

+BEGIN_SRC

test/demo/samples/bcell/MS026601/coverage.bigWig test/demo/samples/bcell/MS010302/coverage.bigWig test/demo/samples/Input/MS002201/coverage.bigWig test/demo/samples/Input/MS026601/coverage.bigWig test/demo/samples/Input/MS002202/coverage.bigWig test/demo/samples/Input/MS010302/coverage.bigWig test/demo/samples/kidney/MS002201/coverage.bigWig test/demo/samples/kidney/MS002202/coverage.bigWig

+END_SRC

In the example above we have the =test/demo= directory which will contain all data sets, labels, and peak calls for this particular project. The =samples= directory contains a sub-directory for each sample group (experimental conditions or cell types, e.g. =bcell= or =kidney=). Each sample group directory should contain a sub-directory for each sample (e.g. =MS002201= or =MS010302=). Each sample sub-directory should contain either a =coverage.bedGraph= or =coverage.bigWig= file with counts of aligned sequence reads.

Note that in this demonstration project, the groups with underscores are un-labeled samples (e.g. =bcell_=), and the groups without underscores are labeled samples (e.g. =bcell=). In real projects typically you would combine those two groups into a single labeled group, but in this project we keep them separate in order to demonstrate the prediction accuracy of the learning algorithm.

*** Label Data

The =project/labels/*.txt= files contain genomic regions with or without peaks. These labels will be used to train the peak prediction models (automatically select model parameters that yield optimal peak prediction accuracy). A quick and easy way to create labels is by visual inspection as in the [[http://cbio.mines-paristech.fr/~thocking/chip-seq-chunk-db/][McGill ChIP-seq peak detection benchmark]] (for details please read [[http://bioinformatics.oxfordjournals.org/content/early/2016/10/23/bioinformatics.btw672.abstract][Hocking et al, Bioinformatics 2016]]).

To visually label your data first create a project directory on a webserver with =project/samples/groupID/sampleID/coverage.bigWig= files, then create a track hub using a command such as

+BEGIN_SRC shell-script

Rscript create_track_hub.R project http://your.server.com/~user/path- hg19 email@domain.com

+END_SRC

The arguments of the =create_track_hub.R= script are as follows:

If that command worked, then you should see a message =Created http://your.server.com/~user/path-project/hub.txt= and then you can paste that URL into [[http://genome.ucsc.edu/cgi-bin/hgHubConnect#unlistedHubs][My Data -> Track Hubs -> My Hubs]] then click Add Hub to tell the UCSC genome browser to display your data. Navigate around the genome until you have found some peaks, then add positive and negative labels in =project/labels/*.txt= files.

For example in [[file:test_demo.R]] the data set contains only one labels file,

+BEGIN_SRC

test/demo/labels/some_labels.txt

+END_SRC

which contains lines such as the following

+BEGIN_SRC

chr10:33,061,897-33,162,814 noPeaks chr10:33,456,000-33,484,755 peakStart kidney chr10:33,597,317-33,635,209 peakEnd kidney chr10:33,662,034-33,974,942 noPeaks

chr10:35,182,820-35,261,001 noPeaks chr10:35,261,418-35,314,654 peakStart bcell kidney

+END_SRC

A chunk is a group of nearby labels. In the example above there are two chunks (far apart genomic regions, separated by an empty line). The first chunk has two regions with noPeaks labels in all samples, and two regions with positive labels in kidney samples and noPeaks labels in bcell samples. The second chunk has one region with noPeaks in bcell and kidney samples, and one region with a peakStart label in bcell and kidney samples.

In general, the labels file is divided into separate chunks by empty lines. Each chunk should contain lines for several nearby genomic regions, the corresponding label (noPeaks, peakStart, peakEnd, peaks), and the sample groups to which that label should be assigned (all other groups mentioned in the labels file will receive the noPeaks label). Ideally, each chunk should contain

Visualizing labels. After having added some labels in =project/labels/*.txt= files, run =Rscript convert_labels.R project= to create =project/all_labels.bed=. Then when you re-run =Rscript create_track_hub.R ...= it will create a new hub with a track "Manually labeled regions with and without peaks" that displays the labels you have created.

*** Genomic segmentation problems

The last input file that you need to provide is a list of separate segmentation problems for your reference genome (regions without gaps). This file should be in [[https://genome.ucsc.edu/FAQ/FAQformat#format1][BED]] format (e.g. [[file:hg19_problems.bed]]).

If you don't use hg19, but you do use another standard genome that is hosted on UCSC, then you can use [[file:downloadProblems.R]]

+BEGIN_SRC shell-script

Rscript downloadProblems.R hg38 hg38_problems.bed

+END_SRC

If your reference genome does not exist on UCSC, you can use [[file:gap2problems.R]] to make a =problems.bed= file:

+BEGIN_SRC shell-script

Rscript gap2problems.R yourGenome_gap.bed yourGenome_chromInfo.txt yourGenome_problems.bed

+END_SRC

where the chromInfo file contains one line for every chromosome, and the gap file contains one line for every gap in the reference (unknown / NNN sequence). If there are no gaps in your genome, then you can use =yourGenome_chromInfo.txt= as a =problems.bed= file.

** Running steps of the pipeline in parallel

Since the human genome is so large, we recommend to do model training and peak prediction in parallel. To use a PBS/qsub cluster such as Compute Canada's [[http://www.hpc.mcgill.ca/index.php/guillimin-status][guillimin]], begin by editing the [[file:create_problems_all.R]] script to reflect your cluster configuration. Then run

+BEGIN_SRC shell-script

cd PeakSegFPOP Rscript convert_labels.R test/demo Rscript create_problems_all.R test/demo

+END_SRC

That will create problem sub-directories in =test/demo/samples///problems/*=. Begin model training by computing =target.tsv= files:

+BEGIN_SRC shell-script

for lbed in test/demo/samples///problems/*/labels.bed;do qsub $(echo $lbed|sed 's/labels.bed/target.tsv.sh/');done

+END_SRC

The target is the largest interval of log(penalty) values for which PeakSegFPOP returns peak models that have the minimum number of incorrect labels. The =target.tsv= files are used for training a machine learning model that can predict optimal penalty values, even for un-labeled samples and genome subsets. To train a model, use

+BEGIN_SRC shell-script

Rscript train_model.R test/demo

+END_SRC

which trains a model using =test/demo/samples///problems/*/target.tsv= files, and saves it to =test/demo/model.RData=. To compute peak predictions independently for each sample and genomic segmentation problem,

+BEGIN_SRC shell-script

for sh in test/demo/problems/*/jointProblems.bed.sh;do qsub $sh;done

+END_SRC

which will launch one job for each genomic segmentation problem. Each job will make peak predictions in all samples, then write =test/demo/problems//jointProblems/= directories with =target.tsv.sh= and =peaks.bed.sh= scripts. One directory and joint segmentation problem will be created for each genomic region which has at least one sample with a predicted peak. To train a joint peak calling model, run

+BEGIN_SRC shell-script

qsub test/demo/joint.model.RData.sh

+END_SRC

which will compute =test/demo/joint.model.RData= and =test/demo/jobs/*/jobProblems.bed= files. To make joint peak predictions, run

+BEGIN_SRC shell-script

for sh in test/demo/jobs/*/jobPeaks.sh;do qsub $sh;done

+END_SRC

To gather all the peak predictions in a summary on =test/demo/index.html=, run

+BEGIN_SRC shell-script

qsub test/demo/peaks_matrix.tsv.sh

+END_SRC

Finally, you can create =test/demo/hub.txt= which can be used as a track hub on the UCSC genome browser:

+BEGIN_SRC shell

Rscript create_track_hub.R test/demo http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP- hg19 email@domain.com

+END_SRC

The script will create =test/demo/samples///coverage.bigWig= and =test/demo/samples///joint_peaks.bigWig= files that will be shown together on the track hub in a multiWig container (for each sample, a colored coverage profile with superimposed peak calls as horizontal black line segments).

** The PeakSegFPOP command line program

The PeakSegFPOP program finds the peak positions and corresponding piecewise constant segment means which optimize the penalized Poisson likelihood.

+BEGIN_SRC shell-script

PeakSegFPOP coverage.bedGraph penalty [tmp.db]

+END_SRC

The first argument =coverage.bedGraph= is a plain text file with 4 tab-separated columns: chrom, chromStart, chromEnd, coverage (chrom is character and the others are integers). It should include data for only one chromosome, and no gap regions.

The second argument =penalty= is a non-negative penalty value, for example 0, 0.1, 1e3, or Inf.

The third argument =tmp.db= is optional. It is the path for a temporary file which takes O(N log N) disk space (N = number of lines in coverage.bedGraph). In practice you can expect the size of the temporary file and the computation time to be as in the table below. Min and max values show the variation over several values of the penalty parameter (larger penalties require more time and disk space), on an Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz.

| N | min(MB) | max(MB) | min(time) | max(time) | |---------+---------+---------+-----------+-----------| | 10000 | 12 | 43 | 1 sec | 2 sec | | 100000 | 189 | 627 | 12 sec | 25 sec | | 1000000 | 3462 | 7148 | 3 min | 5 min | | 7135956 | 5042 | 41695 | 18 min | 56 min | | 7806082 | 5270 | 33425 | 35 min | 167 min |

For a single run with penalty parameter =X=, the PeakSegFPOP program outputs two files. The =coverage.bedGraph_penalty=X_segments.bed= file has one line for each segment, and the following tab-separated columns: =chrom=, =chromStart=, =chromEnd=, =segment.type=, =segment.mean=. The =coverage.bedGraph_penalty=X_loss.tsv= has just one line and the following tab-separated columns:

** Related work

An in-memory implementation of PeakSegFPOP is available in the [[https://github.com/tdhock/coseg][coseg]] R package.

| implementation | time | memory | disk | |----------------+------------+------------+------------| | command line | O(N log N) | O(log N) | O(N log N) | | R pkg coseg | O(N log N) | O(N log N) | 0 |

Note that although both implementations are O(N log N) time complexity for N data points, the command line program is slower due to disk read/write overhead.