oscarlr / IGenotyper

7 stars 3 forks source link

IGenotyper

Introduction
Installation
Getting IGH specific reference
Testing IGenotyper installation
Usage
Running IGenotyper
Explanation of steps
Output directories
Output files
Notes

Installation


git clone https://github.com/oscarlr/IGv2.git
cd IGv2

conda env create -f pygenometracks.yml

conda env create -f environment.yml
conda activate IGv2
python setup.py install
conda deactivate

conda create -n whatshap-latest python=3.7
conda activate whatshap-latest
pip install git+https://github.com/whatshap/whatshap
conda deactivate

conda activate IGv2
cd ..
git clone https://github.com/oscarlr/cluster
cd cluster
python setup.py install
export SJOB_DEFALLOC=NONE

# Install Kalign
conda deactivate
cd ..
wget https://github.com/TimoLassmann/kalign/archive/refs/tags/v3.3.tar.gz
tar -zxvf v3.3.tar.gz
cd kalign-3.3
./autogen.sh
./configure --prefix=$HOME
make
make check
make install

Getting IGH specific reference

wget http://immunogenomics.louisville.edu/immune_receptor_genomics/current/reference.fasta .
samtools faidx reference.fasta

sawriter reference.fasta
cp reference.fasta* ~/anaconda3/envs/IGv2/lib/python2.7/site-packages/IGenotyper-1.1-py2.7.egg/IGenotyper/data/

# For rheMac10
sawriter reference.fasta
cp reference.fasta* ~/anaconda3/envs/IGv2/lib/python2.7/site-packages/IGenotyper-1.1-py2.7.egg/IGenotyper/data/rhesus/

Testing IGenotyper installation

cd test
IG phase subset_bam/reads.bam test # There will be an warning message about no reads for hap0.bw -- you can ignore this
IG assembly test

Running IGenotyper

# <pacbio bam file> must have a pbi and bai index. They can be created using pbindex and samtools index, respectively.

IG phase <pacbio bam file> <output> 
IG assembly <output> 
IG detect <output> 

Usage

IG phase --help
usage: IG phase [-h] [--rhesus] [--sample SAMPLE] [--threads THREADS]
                [--mem MEM] [--cluster] [--queue QUEUE] [--walltime WALLTIME]
                [--tmp TMP] [--input_vcf VCF]
                BAM OUTDIR

positional arguments:
  BAM                  PacBio bam file
  OUTDIR               Directory for output

optional arguments:
  -h, --help           show this help message and exit
  --rhesus
  --sample SAMPLE      Name of sample
  --threads THREADS    Number of threads
  --mem MEM            Memory for cluster
  --cluster            Use cluster
  --queue QUEUE        Queue for cluster
  --walltime WALLTIME  Walltime for cluster
  --tmp TMP            Temporary folder
  --input_vcf VCF      Phased VCF file to phase reads
IG assembly --help
usage: IG assembly [-h] [--threads THREADS] [--mem MEM] [--cluster]
                   [--queue QUEUE] [--walltime WALLTIME]
                   OUTDIR

positional arguments:
  OUTDIR               Directory for output

optional arguments:
  -h, --help           show this help message and exit
  --threads THREADS    Number of threads
  --mem MEM            Memory for cluster
  --cluster            Use cluster
  --queue QUEUE        Queue for cluster
  --walltime WALLTIME  Walltime for cluster
IG detect --help
usage: IG detect [-h] [--hom HOM] OUTDIR

positional arguments:
  OUTDIR      Directory for output

optional arguments:
  -h, --help  show this help message and exit
  --hom HOM   Add homozygous reference genotype

Explanation of steps

Phase

In the first step --phase, the subreads and CCS reads are phased and aligned to the IGH specific reference. Each read has a read group annotation. A read group annotation of 1 and 2 corresponds to haplotype 1 and 2. The read group annotation of 0 corresponds to unassignable reads. In IGV, you can seperate these reads by left clicking and selecting group by read group.

Assemble

In the second step --assembly, the haplotypes are assembled. During this process folders will be created for each region/haplotype block. Within each folder there is a bash script that runs the assembly process. These can be submitted as a single job into the cluster (this speeds up the process).

Detect

In the third step --detect, SNVs, indels, SVs and gene/alleles are genotyped. A VCF file is created for the SNVs, a BED file for the indels and SVs, a TAB-delimited file for the gene/alleles calls.

Output directories

alignments alleles assembly logs plots preprocessed report.html tmp variants Directories Description
<output>/alignments Alignments of CCS, subreads and contigs (phased and unphased)
<output>/assembly Assembly of IGH locus
<output>/variants SNVs, indels and SVs
<output>/alleles Alleles in sample
<output>/logs Log files with input parameters
<output>/tmp Temporary files. Could be deleted.

Output files

  1. alignments/
    1. ccs_to_ref*: CCS reads aligned to reference
    2. contigs_to_ref*: All assembled contigs aligned to refernece
    3. igh_contigs_to_ref*: IGH assembled contigs aligned to igh reference
  2. assembly/
    1. contigs.fasta: All assembled contigs
    2. igh_contigs.fasta: IGH assembled contigs
  3. alleles/
    1. assembly_alleles.bed: Alleles extracted from the assembly for each gene
    2. assembly_genes.fasta: Fasta sequence from the assembly for each gene/allele
    3. ccs_alleles.bed: Alleles extracted from the CCS reads for each gene
    4. ccs_genes.fasta: Fasta sequence from the CCS reads for each gene/allele
  4. logs/
    1. gene_cov.txt: Haplotype coverage for each gene
  5. variants/
    1. snvs_phased_from_ccs.vcf: Phased SNVs detected from the CCS reads
    2. snvs_assembly.vcf: SNVS detected from the assembly
    3. indel_assembly.bed: Indels detected from the assembly
    4. sv_assembly.bed: SVs detected from the assembly
    5. phased_blocks.txt: Phased haplotype blocks

Todo

  1. Add IGL and IGK alleles to data/alleles.fasta