makovalab-psu / AmpliCoNE-tool

AmpliCoNE: Ampliconic Copy Number Estimator
1 stars 1 forks source link

AmpliCoNE: Ampliconic Copy Number Estimator

A tool to estimate the copy number of ampliconic gene families in human Y chromosome using Illumina whole genome sequencing data.

Installation

Dependencies

To run AmpliCoNE, you need the following list of tools and packages:

Python 2.7.x
Numpy   1.14.2
Pandas 0.23.4   
Pysam   0.15.0.1
Biopython 1.71
Bowtie2 version 2.2.9 

Steps to install the dependencies using conda

Create an environment.yml file as described below.

vim amplicone-environment.yml

name: amplicone
channels:
  - bioconda
dependencies:
  - python=2.7
  - pandas=0.23.4
  - numpy=1.14.2
  - pysam=0.15.0.1
  - biopython=1.71
  - bowtie2=2.*

Use the amplicone-environment.yml file to create an environment named "amplicone"

conda env create -f amplicone-environment.yml

Load the environment each time you run AmpliCoNE.

source activate amplicone

Install AmpliCoNE

Once you installed the things above, do the following:

Clone the repository https://github.com/makovalab-psu/AmpliCoNE-tool by running:

git clone https://github.com/makovalab-psu/AmpliCoNE-tool.git

AmpliCoNE usage for human chromosome Y ampliconic genes using hg38

Download the Y chromosome annotation file and the gene definition file here.

#download annotation and gene definition files for hg38
wget http://www.bx.psu.edu/medvedev_lab/amplicone/hg38/hg38_amplicone_files.tar.gz

#uncompress the files
tar -zxvf hg38_amplicone_files.tar.gz

#list files downloaded (gene_definition_hg38.tab  hg38_Ychromosome_annotation.tab)
ls hg38_amplicone_files/

Generate input BAM files

Generate BAM files by aligning your Illumina whole genome sequencing dataset to hg38 using BWA-mem. (hg38 version reference can be downloaded from UCSC genome browser or ENSEMBL). The BAM files must be sorted by position and indexed.

Download test dataset here.

#download test BAM file
wget http://www.bx.psu.edu/medvedev_lab/amplicone/hg38/test_data.tar.gz

#uncompress the files
tar -zxvf test_data.tar.gz

#list files downloaded (test.bam  test.bam.bai)
ls test_data/

AmpliCoNE-count usage

AmpliCoNE-count.py takes the gene definition file, annotation file and BAM file as input to estimate the copy number of the nine ampliconic gene families. In addition, the name of the Y chromosome (chrY|Y) as defined in reference and its sequence length are required parameters.

python AmpliCoNE-count.py --GENE_DEF hg38_amplicone_files/gene_definition_hg38.tab --ANNOTATION hg38_amplicone_files/hg38_Ychromosome_annotation.tab --BAM test_data/test.bam --CHR Y --LENGTH 57227415

Other parameters in AmpliCoNE-count.py

The length of the chromosome must be set while using AmpliCoNE for a reference other than hg38.

--LENGTH <int> (default:57227415; length of Y chromosome in hg38) 

IMPORTANT: DO NOT USE DEFAULT VALUE FOR NON HG38 REFERENCE

Parameter to define if the BAM file contains single end reads.

--READ PAIRED or SINGLE (default:PAIRED)

Output description

AmpliCoNE-count will generate two output files:

AmpliCoNE usage with other reference genomes / species

Before running in a different species, you must first perform steps 1-3 below. This needs to be run only once, and then the resulting files can be reused when analyzing dataset from same species. After these steps, AmpliCoNE-count can be run for each sample to estimate copy number, as described above.

Step 1: Download pre-requisite files

The pre-requisite files for most genomes can be downloaded from UCSC genome browser. Make sure that the chromosome annotation (chrY, Y) is the same in all the files. If the files are not available for download, please generate them using GEM-library, RepeatMasker, and TandemRepeatFinder tools.

Step2: Generate gene definition file

NOTE: For gene copies in control, the gene family name (3rd column) should be "CONTROL".(case sensitive)

How to generate the file:

NOTE: All the locations >99% identity representing a gene family should be present on Y chromosome only. Each gene copy within a family should be represented as a row in the gene definition file.

For human ampliconic genes the gene definition file is available here. Below is an example file:

START END TYPE
100 300 GF1
750 1050 GF1
3000 3300 GF1
9900 10900 GF2
2220 2320 GF2
3500 3750 CONTROL
4190 4420 CONTROL
4770 5040 CONTROL

Step 3: Generate Ychr_annotation file using AmpliCoNE-build

sh AmpliCoNE-build.sh -c chrY -i <REF.fasta> -m <REF_MAPPABILITY.bed> -r <REF_REPMASK.out> -t <REF_TRF.bed> -g <Gene_Definition.tab> -o chrY_annotation.tab

Use the output file (annotation file) from AmpliCoNE-build.sh and gene definition file to run AmpliCoNE-count.py and estimate the gene family copy numbers.

python AmpliCoNE-count.py --GENE_DEF <Gene_Definition.tab> --ANNOTATION <chrY_annotation.tab> --BAM <BAM> --CHR <chr> --LENGTH <int>