rwtourdot / linker

Tools for analyzing long and linked read sequencing
5 stars 3 forks source link

Linker - Tools for Analyzing Long and Linked Read Sequencing

Table of Contents

Installation

This code requires bamtools, htslib, c++11, and zlib libraries.

From the linker directory run build.sh or install manually:

Installing htslib locally

git clone https://github.com/samtools/htslib
cd htslib
autoheader
autoconf
./configure --prefix=/path/to/linker/packages/htslib/
make; make install
cd ..

Installing bamtools locally

git clone git://github.com/pezmaster31/bamtools.git
cd bamtools
mkdir build; cd build
cmake -DCMAKE_INSTALL_PREFIX=/path/to/linker/packages/bamtools/ ..
make; make install
cd ../..

Installing python dependencies and scripts:

cd <root of linker>
virtualenv -p python3 env
source env/bin/activate
./setup.py [develop|install]

Make an empty output directory if not present:

mkdir ./output

Description

Linker is a suite of C++ tools useful for interpreting long and linked read sequencing of cancer genomes. The most significant information long read sequencing provides is the local haplotype of a sample. In cancer cell lines where Aneuplpoidy, Loss of Heterozygosity, and Structural Variation is common, haplotypes can provide a better resolution of the samples karyotype, and clarify the cancer cells genomic evolution.

This program was used to determine Whole Chromosome Haplotypes as detailed here: biorxiv. Linker currently supports 10X, Pacbio, Oxford Nanopore, and HiC sequencing technologies. A diagram of the germline haplotype phasing workflow is shown below.

Associated plotting scripts found in /R_plots can be used to determine block energy cutoffs, assess haplotype accuracy, and plot haplotype specific copy number.

Commands

List Commands

./linker -h

Extract Phasing Information from Long and Linked Reads

This command extracts all long and linked read phasing information from an aligned bamfile given a corresponding vcf file containing heterozygous sites. The long read technology flag (tenx,pacbio,nanopore,hic) should correspond to the bamfile chosen.

./linker extract -i ./input.bam -v het_sites.vcf -e tenx -c chr21 -n sample_name

The output of this command is a graph_variant file which lists all of the unique hashes associated with each het site base. This file can be concatenated with other graph_variant files, to combine all phasing information from multiple technologies.

Solve For Local Haplotype Blocks from Long Read Links

After extracting all phasing information into graph_variant files and combining or trimming certain hashes the samples haplotype can be solved for by:

./linker solve -i ./graph_variant_{}_chr21.dat -c chr21 -n sample_name

The haplotype file contains a Block Switch Energy column which can be used to define blocks. The lower (more negative) the block energy the more likely a heterozygous site is phased correctly. More details on defining blocks can be found in the paper.

Generate A Whole Chromosome Haplotype Scaffold

This command takes a haplotype solution file and combines it with hic phasing information in a corresponding graph_variant_hic file to generate a full chromosome haplotype scaffold. An energy cutoff which defines haplotype blocks is specified by the -e flag.

./linker scaffold -i ./hap_solution_{}_chr21.dat -g ./graph_variant_hic_chr21.dat -e -700 -c chr21 -n sample_name

The hap_full_scaffold file contains less heterozygous sites than the haplotype solution file but is accurate over the full length of the chromosome.

Recover and Phase Variants to a Haplotype Scaffold

This command takes a haplotype scaffold file and phases all variants in a graph file to that scaffold. This is useful since some small haplotype blocks and their variants are lost when generating a haplotype scaffold. Though this command was built to recover variants, it can also be used to phase somatic variants to the germline haplotype - given a corresponding graph file.

./linker recover -i ./hap_full_scaffold_{}_chr21.dat -g ./graph_variant_{}_chr21.dat -e tech -c chr21 -n sample_name

The recovered haplotype file is similar to the scaffold file but its additional columns explicitly count the number of links connecting the reference/variant base to haplotype A/B.

Extract Heterozygous Site Coverage

This command takes a vcf and bam file and extracts the read coverage of each base. Base counts at each heterozygous site must pass a base quality and map quality cutoff.

./linker coverage -i ./input.bam -v het_sites.vcf -e illumina -c chr21 -n sample_name

Input/Output

Required Files

.bam file

The .bam file can originate from Long Read or Linked Read sequencing. The alignment method will depend on the type of sequencing, but a consistent genome reference should be used between samples.

.vcf file

The .vcf file can be obtained with GATK (https://software.broadinstitute.org/gatk/) or another variant caller and should contain all heterozygous sites. Indels will not be considered in linkers phasing methods, and SNP's called in centromere and variable regions of the genome should be removed. .vcf files should have the same reference as the long or linked read .bam file

Generated Files

graph_variant file

A file which extracts all phasing information in a .bam file. These files can be concatenated given the same sample and chromosome.

variant_id  chrom position  tech_hash readname
46692825_C_C    chr21   46692825    GATAACCGTACTGCTA-1  HMVT3CCXX:6:2107:9790:22739

hap_solution file

A file that contain all haplotype blocks solved for from a graph_variant file. The haplotype column defines each het sites local haplotype in a block defined by blockE.

index position  reference_id  alternate_id  haplotype ref_count alt_count spinE blockE  default_block range
0   13000241    chr21_13000241_A_A  chr21_13000241_A_C  1   23  30  -2083.39    -2083.39    0   27362

scaffold file

A file that contains the final haplotype accurate over the whole chromosome. The scaffold_hap column is a spin vector which defines the germline haplotype of the sample.

chrom arm position  reference_id  alternate_id  scaffold_hap  block_hap block spinE blockE
chr21   q   13207111    13207111_T_T    13207111_T_C    -1  block:  1   18  tenx_energy(spin/block):    -168    -168

het_coverage file

A file that contains base counts at each heterozygous site.

variant_id  position  variant:reference Indel Deletion  Gbase Cbase Abase Tbase Total
190204289_T_C   190204289   T:C I|0 D|0 G|0 C|44    A|0 T|30    74