piquelab / QuASAR

Quantitative Allele Specific Analysis of Reads. Joint genotyping & ASE inference for RNA-seq data
MIT License
26 stars 10 forks source link

QuASAR: Quantitative Allele Specific Analysis of Reads

QuASAR (Harvey et al, 2015) is an R package, that implements a statistical method for: i) genotyping from next-generation sequencing reads, and ii) conducting inference on allelic imbalance at heterozygous sites. The sequencing data can be RNA-seq, DNase-seq, ATAC-seq or any other type of high-throughput sequencing data. The input data to QuASAR is a processed pileup file (as detailed later). Here, we do not cover in depth important pre-processing steps such as choice of the aligner, read filtering and duplicate removal. For our new method on MPRA data analysis please check the mpra folder.

We also want to emphasize that the current software is still in development, we would kindly appreciate any comments and bug reports.

1. Installation

To install from within an R session:

require(devtools)
install_github('piquelab/QuASAR')
library('QuASAR')

However, this method is occasionally problematic. Alternatively, you can clone/fork this repository and then build the package:

git clone https://github.com/piquelab/QuASAR.git
R CMD build QuASAR

then in R,

install.packages('QuASAR_x.y.tar.gz')
library(QuASAR)

2. Preprocessing

Alignment & filtering

Raw reads can be aligned to the reference genome using your favorite aligner. Because allele-specific analysis is extremely sensitive to read biases and mapping errors, we strongly recommend adding steps to remove PCR duplicates and to remove reads aligning to areas with known mappability issues (e.g., Degner et al, 2009).

Pileups & cleaned pileups

Note: These steps require samtools and bedtools.

Using the samtools mpileup command, create a pileup file from aligned reads. Provide a fasta-formatted reference genome (hg19.fa) and a bed file of positions you wish to pileup on (e.g., 1K genomes SNP positions 1KG snp file):

samtools mpileup -f hg19.fa -l snps.af.bed input.bam | gzip > input.pileup.gz

Next, convert the pileup file into bed format and use intersectBed to include the allele frequencies from a bed file. The bed file with allele frequencies should be seven columns: 1-3) coordinate, 4) SNP ID, 5) reference allele, 6) alternate allele, 7) allele frequency. This can be the same 1KG snp file used in the pileup stage. The awk filter step (below) removes positions not covered by a read, positions covered by indels, and reference skips:

less input.pileup.gz | awk -v OFS='\t' '{ if ($4>0 && $5 !~ /[^\^][<>]/ && $5 !~ /\+[0-9]+[ACGTNacgtn]+/ && $5 !~ /-[0-9]+[ACGTNacgtn]+/ && $5 !~ /[^\^]\*/) print $1,$2-1,$2,$3,$4,$5,$6}' | sortBed -i stdin | intersectBed -a stdin -b snps.af.bed -wo | cut -f 1-7,11-14 | gzip > input.pileup.bed.gz

Finally, get the read counts at each position, and, if desired, perform any additional filtering. The result will be the input file for QuASAR. An example processing script is provided here: scripts/convertPileupToQuasar.R.

R --vanilla --args input.pileup.bed.gz < convertPileupToQuasar.R

Here is an example of how the QuASAR infput file should look:

zless input.quasar.in.gz | head -5
chr1    879910  879911  G   A   rs143853699 0.02    21  0   0
chr1    892379  892380  G   A   rs150615968 0.0041  22  0   0
chr1    893384  893385  G   A   rs140972868 0.01    6   0   0
chr1    894101  894102  A   T   rs188691615 0.01    6   0   0
chr1    894430  894431  G   A   rs201791495 9e-04   9   0   0

The fields are as follows:

  1. Chromosome
  2. Start position
  3. End position
  4. Reference allele
  5. Alternate allele
  6. SNP ID
  7. SNP allele frequency
  8. Number of reads mapping to the reference allele
  9. Number of reads mapping to the alternate allele
  10. Number of reads not mapping to either allele

3. Running QuASAR

Prepare the input samples

For a test run we provide a small sample dataset containing 6 samples from the same individual. The following commands will download the data to the current folder:

urlData="http://genome.grid.wayne.edu/quasar/sampleinput/"
fileNames <- paste0("t",c(2,4,6,12,18,24),"hr_Huvec_Rep1.quasar.in.gz")
sapply(fileNames,function (ii) download.file(paste0(urlData,ii),ii))

To run the sample data, or any data, we provide a few helper functions to merge samples across the union of all annotated sites (UnionExtractFields), and to filter sites with insufficient coverage across all samples (PrepForGenotyping). Note: these functions utilize calls to bedtools.

ase.dat <- UnionExtractFields(fileNames, combine=TRUE)
ase.dat.gt <- PrepForGenotyping(ase.dat, min.coverage=5)
sample.names <- colnames(ase.dat.gt$ref)

Genotyping an individual from multiple samples

Genotyping an individual using fitAseNullMulti requires a matrix of reference counts and a matrix of alternate counts where where the columns are ordered by sample. The final argument is a matrix of priors for the minor allele frquency, for which we use the 1K genomes MAFs assumed to be at Hardy-Weinberg equilibrium.

ase.joint <- fitAseNullMulti(ase.dat.gt$ref, ase.dat.gt$alt, log.gmat=log(ase.dat.gt$gmat))

This function returns a list with the following members:

names(ase.joint)
[1] "gt"        "log.gt"    "eps"       "loglik"    "logliksum"

where the posterior probability of the genotypes, gt, across all samples are accessed as follows:

head(ase.joint$gt)
               g0           g1           g2
[1,] 2.870026e-98 1.000000e+00 2.939460e-70
[2,] 1.465195e-27 7.773259e-04 9.992227e-01
[3,] 3.732811e-61 4.308038e-07 9.999996e-01
[4,] 9.992226e-01 7.774208e-04 1.714236e-27
[5,] 9.435425e-87 9.726281e-10 1.000000e+00
[6,] 9.999863e-01 1.372351e-05 6.274482e-46

g0=homozygous reference, g1=heterozygous, & g2=homozygous alternate. To save the output genotype probabilities together with the SNP annotation, we do:

out_dat <- data.frame(ase.dat.gt$annotations[, -5], map=ase.joint$gt)
write.table(out_dat, file='genotypes.txt', row.names=FALSE, col.names=FALSE, quote=FALSE,sep="\t")

Estimates of error parameters eps for each sample are:

ase.joint$eps
[1] 0.0008748778 0.0007617141 0.0008152132 0.0007819780 0.0008956686
[6] 0.0007597717

Inference on ASE

Using aseInference to conduct inference on ASE for an individual requires the posterior probabilities of each genotypes from the previous step "gt", estimates of sequencing error for each sample "eps", the same priors used in the previous step, reference counts, alternate counts, minimum coverage, sample names, and variant annotations.

ourInferenceData <- aseInference(gts=ase.joint$gt, eps.vect=ase.joint$eps, priors=ase.dat.gt$gmat, ref.mat=ase.dat.gt$ref, alt.mat=ase.dat.gt$alt, min.cov=10, sample.names=sample.names, annos=ase.dat.gt$annotations)

This function returns a list where each element corresponds to an input sample:

names(ourInferenceData[[1]])
[1] "dat"        "n.hets"     "dispersion"

where dat contains estimates of allelic imbalance betas, standard errors betas.se, & pvalues from an LRT for ASE detailed in [Harvey et al, 2014]. Note that the number of rows (SNPs) in each sample corresponds to the the number of heterozygous SNPs passing a minimum coverage filter.

 head(ourInferenceData[[1]]$dat)
 annotations.rsID annotations.chr annotations.pos0       betas  betas.se    pval3 
1        rs2272757            chr1           881626  0.15175892 0.6005410 0.80049721
2        rs2465128            chr1           981930  0.17948875 0.6445723 0.78065789
3        rs9442391            chr1           984301 -0.15175892 0.6005410 0.80049721
4       rs12142199            chr1          1249186 -0.43478406 0.4845478 0.36955958
5           rs7290            chr1          1477243 -0.99328368 0.5969363 0.09611857
6           rs7533            chr1          1479332 -0.09853221 0.3981711 0.80455070

The final members of the list are the number of heterozygotes and the esimtate of dispersion for each sample.

head(ourInferenceData[[1]]$n.hets)
[1] 2856
head(ourInferenceData[[1]]$dispersion)
[1] 64.07152

The code for this sample workflow is located here: scripts/exampleWorkflow.R