tprodanov / parascopy

Copy number estimation and variant calling for duplicated genes using WGS.
MIT License
22 stars 3 forks source link

GitHub Bioconda Last updated

Parascopy

Parascopy is designed for robust and accurate estimation of paralog-specific copy number for duplicated genes using whole-genome sequencing.

Created by Timofey Prodanov timofey.prodanov[at]hhu.de and Vikas Bansal vibansal[at]ucsd.edu at the University of California San Diego.

Table of contents

Citing Parascopy

If you use Parascopy, please cite:

Installation

Parascopy is written in Python (≥ v3.6) and is available on Linux and macOS. You can install Parascopy using conda:

conda config --add channels bioconda
conda config --add channels conda-forge
conda install -c bioconda parascopy

In most cases, it takes 2-6 minutes to install Parascopy using conda. If you have problems with installing Parascopy using conda, it is possible to create a new conda environment (see more details here):

conda create --name paras_env parascopy
conda activate paras_env

Alternatively, you can install it manually using the following commands:

git clone https://github.com/tprodanov/parascopy.git
cd parascopy
python3 setup.py install

Parascopy depends on several Python modules (see here). To install Parascopy without dependencies, you can run

python3 setup.py develop --no-deps

Additionally, you can specify installation path using --prefix <path>.

In case of manual installation, Parascopy requires

In order to run variant calling (parascopy call), Parascopy requires a modified Freebayes executable. It is installed automatically via conda, but for manual installation you need to run

# Within the main parascopy directory.
git clone --recursive https://github.com/tprodanov/freebayes
cd freebayes
./compile.sh
cd ..

General usage

See parascopy help or parascopy <command> --help for more information. Additionally, you can find a test dataset and pipeline here.

Homology table

Parascopy uses a database of duplications - homology table. To construct a homology table you would need to run:

parascopy pretable -f genome.fa -o pretable.bed.gz
parascopy table -i pretable.bed.gz -f genome.fa -o table.bed.gz

Note, that the reference genome should be indexed with both samtools faidx and bwa index. Alternatively, you can download a precomputed homology table.

Estimating copy number

In order to find aggregate and paralog-specific copy number profiles (agCN and psCN), you can run the following commands:

# Calculate background read depth.
parascopy depth -I input.list -g hg38 -f genome.fa -o depth
# Estimate agCN and psCN for multiple samples.
parascopy cn -I input.list -t table.bed.gz -f genome.fa -R regions.bed -d depth -o analysis1
# Estimate agCN and psCN for single/multiple samples using model parameters from a previous run.
parascopy depth -I input2.list -g hg38 -f genome.fa -o depth2
parascopy cn-using analysis1/model -I input2.list -t table.bed.gz -f genome.fa -d depth2 -o analysis2

Calling variants

Parascopy variant calling in duplicated regions is available since version v1.9. Currently, variant calling is performed on top of an existing copy number analysis. To run variant calling you can execute the following command:

parascopy call -p analysis1 -f genome.fa -t table.bed.gz

Input files

Input alignment files should be sorted and indexed, both .bam and .cram formats are supported. Input filenames can be passed into Parascopy as -i in1.bam in2.bam ... or as -I input-list.txt, where input-list.txt is a text file with a single filename on each line.

Additionally, you can provide/override sample names with -i in1.bam::sampleA in2.bam::sampleB or, in case of -I input-list.txt, as a second entry on each line:

in1.bam sampleA
in2.bam sampleB

Modifying reference copy number

After agCN estimation, parascopy searches for reliable PSVs. However, only samples that are consistent with the reference copy number are used (in a two-copy duplication, only samples with agCN = 4 will be used). Additionally, by default, sex chromosomes X and Y are treated as regular chromosomes. To solve both issues, one can use --modify-ref argument and provide updated reference copy numbers for some samples and some regions.

The BED file can contain the following lines:

#CHROM  START    END  SAMPLES  CN  # This line is not necessary.
chr1    10000  20000  *         0  # It is known that the region is missing from all samples.
chrX        0    inf  SAMPLE1,SAMPLE2,SAMPLE3  1  # Male samples have one chrX and one chrY.
chrY        0    inf  SAMPLE1,SAMPLE2,SAMPLE3  1
chrY        0    inf  SAMPLE4,SAMPLE5,SAMPLE6  0  # Female samples are missing chrY.

Visualizing output

It is possible to visualize agCN and psCN detection process. To do that you need to clone this repository and run scripts in draw directory. The scripts are written in R language and require a number of R packages:

install.packages(c('argparse', 'tidyverse', 'ggplot2', 'ComplexHeatmap', 'viridis', 'circlize', 'ggthemes', 'RColorBrewer'))

Output files

Copy number variation analysis (parascopy cn|cn-using) produces a range of files, described here.

Variant calling produces regular VCF files, additional information can be found here.

Precomputed data

You can use the following precomputed data:

Compatible reference genomes (without the ALT contigs) can be dowloaded from

Note that if the BAM/CRAM files contain ALT contigs, you should provide modified reference copy number values using --modify-ref argument, where all ALT contigs have the reference copy number 0. Additionally, you should use homology table constructed on a reference file with ALT contigs.

Test dataset

You can find the full test pipeline here. Alternatively, you can follow step-by-step instructions below:

First, place reference human genome (hg38), precomputed homology tables and model parameters in data directory. For single-sample analysis, we subsampled HG00113 human genome, which can downloaded here (360 Mb). Please, index the cram file using samtools index HG00113.cram.

We start by calculating background read depth, which should take 2-5 minutes:

parascopy depth -i HG00113.cram -f data/hg38.fa -g hg38 -o depth_HG00113

Next, we calculate copy number profiles for 167 duplicated loci using precomputed model parameters obtained using 503 European samples. This step takes 10-40 minutes depending on the number of threads (controlled by -@ N).

parascopy cn-using data/models_v1.2.5/EUR \
    -i HG00113.cram -f data/hg38.fa -t data/homology_table/hg38.bed.gz \
    -d depth_HG00113 -o parascopy_HG00113

You can analyze a subset of loci by specifying data/models_v1.2.5/EUR/<locus>.gz, for example analysis of the SMN1 locus should take less than a minute.

For multi-sample analysis, we extracted reads aligned to the GBA/GBAP1 locus for 503 European samples, can be downloaded here (195 Mb) (extract using tar xf GBA.tar.gz). Background read depth is already calculated and is located in GBA/1kgp_depth.csv.gz. Calculating copy number profiles for the GBA/GBAP1 locus should take around 25-30 minutes using a single core.

parascopy cn -I GBA/input.list  -f data/hg38.fa \
    -t data/homology_table/hg38.bed.gz -d GBA/1kgp_depth.csv.gz \
    -r chr1:155231479-155244699::GBA -o parascopy_GBA

Here, the region is supplied using the -r chr:start-end[::name] format, alternatively you can supply regions in a BED file using the -R argument (optional: fourth column with region names).

Sample output can be found here (251 Mb), output files description can be found here.

Frequently asked questions

You can find FAQ here.

Issues

Please submit issues here or send them to timofey.prodanov[at]gmail.com.

See also

Additionally, you may be interested in these tools: