ijuric / MAPS

18 stars 11 forks source link

!! An Important Announcement !!

March 25th 2022: MAPS has a new home! For a newest version check here: https://github.com/HuMingLab/MAPS

MAPS pipeline user manual

What is MAPS pipeline?

MAPS (Model-based Analysis of PLAC-Seq data) pipeline is a a set of multiple scripts used to analyze PLAC-Seq and HiChIP data. MAPS pipeline contains multiple scripts. The run_pipeline.sh is a prototypes of a shell script containing the locations of files needed to run MAPS is entered as well as calls for different MAPS pipeline scripts. Internally, there are two parts of MAPS pipeline. First part, called feather, does mapping and preprocessing of pair-end reads and creates long and short .bed/.bedpe files. Second part, called MAPS, does read binning and peak calling. For additional description of MAPS method, check our paper: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006982 For any questions regarding MAPS please send mail to Ivan Juric ( ivan.juric.gen@gmail.com ) or Ming Hu ( hum@ccf.org )

MAPS runs with Arima HiChIP kit using a slightly modified run_pipeline file. Please refer to https://github.com/ijuric/MAPS/tree/master/Arima_Genomics for more information

MAPS pipeline setup

Downloading prerequisites and MAPS pipeline

MAPS requires following programs and packages. Install them prior to using MAPS. MAPS runs on Linux.

python 3.4 (or later)

Python libraries:

R 3.4.3

R Packages:

Juicer tools (needed only if you want .hic file; in feather dir on github) https://github.com/theaidenlab/juicer/wiki/Download

Numbers in parentheses are versions we used, and while it is very likely that MAPS will run with newer versions, we do not guarantee it.

Add bwa, bedtools, samtools to the path. Check here how to add to set up path on linux: http://hmgaudecker.github.io/econ-python-environment/paths.html

Index your reference genome using bwa To index genome using bwa, type: ''' bwa index [ref_genome_fastq_file] ''' Download reference genome. In our paper, we use ones downloaded from ucsc genome browser. For example: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz

Alternatively, you can download genomes mentioned in Heng Li's blog here: https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

MAPS requires files with genome features information to run properly. We provide those files for 5kb and 10kb resolution. If you want to run MAPS at different resolution, download appropriate genome features files from here: http://enhancer.sdsc.edu/yunjiang/resources/genomic_features/

Download or clone from here: https://github.com/ijuric/MAPS

Running MAPS

Run one run_pipeline script per biological replica. For each biological replica, run_pipeline script runs feather, MAPS and MAPS-tools parts. Run MAPS pipeline for each replica before running it for merged dataset. That way, we will not need to map reads for merged dataset, but will use .bed/.bedpe files from each replica. It is crucial to add appropriate information about all biological replicas in the run_pipeline script. We show below how to do that.

Copy MAPS/bin/run_pipeline.sh to MAPS/bin/runpipeline[PROJECT_NAME].sh, where [PROJECT_NAME] is the name of your set of biological replicas.

In runpipeline[PROJECT_NAME].sh set parameters:

Running MAPS

Run appropriately set up run_pipeline script

MAPS Output

MAPS generates multiple files. Main output is .peaks.bedpe file which contains the list of significant 3D interactions in .bedpe format.

Example scripts

test example

File bin/run_pipeline_test.sh is a small examlpe that runs fast. Use it to confirm that MAPS is running properly. Currently, it is set to run from my home directory (/home/jurici/). You will need to change the following paths to correspond to locations of files on your computer:

python_path=/home/abnousa/software/python3.6.5/bin/python #should have pysam, pybedtools installed. bedtools, samtools should be in the path
Rscript_path=/opt/R-3.4.3/lib64/R/bin/Rscript
fastq_dir="/home/jurici/MAPS/examples/test_set1"
outdir="/home/jurici/MAPS/examples/test_set1/output"
macs2_filepath="/home/jurici/MAPS/examples/test_set1/macs2_peaks_final.replicated.narrowPeak"
bwa_index="/home/jurici/MAPS/MAPS_data_files/"$organism"/BWA_index/mm10_chrAll.fa"

fastq and 1D peaks (macs2) files are in your MAPS folder (/home/[USER]/MAPS/examples)

Run run_pipeline_test.sh script

Results should looks omething like this: https://github.com/ijuric/MAPS/blob/master/examples/expected_output_test_set1.bedpe

GM12878 cells example:

In this example (bin/run_pipeline_example.sh), we show how to run MAPS on one HiChIP dataset on Smc1a cohesin subunit in human B lymphocyte GM12878 cells. After successfully finishing setting described in MAPS pipeline setup, you will need to download data files:

To download HiChIP GM data from NCBI site, you will need NCBI SRA Toolkit, which can be downloaded here: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software Once you download and set up SRA Toolkit, to download .fastq files type:

./fastq-dump --split-files SRR3467175
./fastq-dump --split-files SRR3467176

This will download SRR3467175_1.fastq, SRR3467175_2.fastq, SRR3467176_1.fastq and SRR3467176_2.fastq files to your computer. Next, merge fastq files from technical replicas:

cat SRR3467175_1.fastq SRR3467176_1.fastq > GM12878_R1.fastq
cat SRR3467175_2.fastq SRR3467176_2.fastq > GM12878_R2.fastq

This is a proper way of merging fastq files from technical replicas becasue all duplicated reads will be removed during feather step of MAPS pipeline. We cannot merge biological replicas this way, since then we risk loosing some reads when removing duplicated. This is why, when we run MAPS on merged biologal replicas, we first need to run MAPS on each replica first, and then use mapped and filtered reads to call 3D interactions.

run run_pipeline_example.sh

After you run this example, output should look like this: https://github.com/ijuric/MAPS/blob/master/examples/GM12878.5k.2.peaks.bedpe