This is the beta version of the R package for DER Finder, a method for analyzing differential expression with RNA-seq data. DER Finder is described in detail in the paper *Differential expression analysis of RNA-seq data at single-base resolution*. Code for the analysis used in the paper is in `analysis_code.R`.
The package in this repository is an exact implementation of the methods described in the DER Finder manuscript. The final version of the software is significantly more efficient than this version and is available on Bioconductor. For research projects, please use the Bioconductor package instead of the package in this repository, which mostly exists for historical/reproducibility reasons. The official package is written and maintained by Leonardo Collado Torres, and issues can be submitted to its GitHub repository.
You can install this package directly from GitHub using the `install_github` function from the `devtools` package:

```r
library(devtools)
install_github('derfinder', 'alyssafrazee')
```
You may also need to install `Genominator`, `limma`, `HiddenMarkov`, `splines`, and `locfdr`.
Before using the `derfinder` R package, raw RNA-seq data should be processed as follows:

We use TopHat to align reads, but any junction-aware aligner that writes alignments in `.bam` format is appropriate.
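As a sketch, a TopHat run looks like the following (the index name and read files are placeholders, not files shipped with this repository; the command is echoed here rather than executed):

```shell
# Hypothetical TopHat invocation: -o names the output directory, followed by
# the Bowtie index prefix and the paired-end FASTQ files.
echo "tophat -o tophat_out hg19_bowtie_index reads_1.fastq reads_2.fastq"
```

Drop the `echo` to actually run the aligner; the alignments land in `tophat_out/accepted_hits.bam`.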
After aligning reads, `.bam` files need to be indexed with `samtools index`. (See the SAMtools page for information about using and installing SAMtools.)
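With several samples, the indexing step can be scripted. A minimal sketch, assuming hypothetical file names `sample1.bam` and `sample2.bam` (the loop echoes each command; drop `echo` to actually run samtools):

```shell
# Index each sample's alignments so downstream tools can seek by position.
for bam in sample1.bam sample2.bam; do
  echo "samtools index $bam"   # creates $bam.bai when run for real
done
```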
DER Finder operates at single-base resolution, so we need per-nucleotide read coverage for every genomic position. We provide a Python script, `countReads.py`, that does this on a per-chromosome basis. For each sample, and for each chromosome you would like per-base counts for, you need to run:

```shell
python countReads.py --file accepted_hits.bam --output <OUTFILE> --kmer <READ_LENGTH> --chrom <CHROMOSOME> --stranded <IS_STRANDED>
```
where `accepted_hits.bam` is the file of read alignments for the sample (this is the default output name from TopHat; if you used a different aligner, this filename might differ), `<OUTFILE>` is the name of the file that will contain the per-nucleotide counts, `<READ_LENGTH>` is the length of the RNA-seq reads in your sample, and `<CHROMOSOME>` is the chromosome you want to count. The chromosome name should match the chromosome names in `accepted_hits.bam`. `<IS_STRANDED>` should be one of `TRUE`, `REVERSE`, or `FALSE`. If you used a stranded protocol, `<IS_STRANDED>` should be `TRUE`. If you used a reversely-stranded protocol (e.g., an Illumina protocol where strand labels are reversed), it should be `REVERSE`. If your protocol wasn't stranded, you can omit this option (it defaults to `FALSE`).
`countReads.py` depends on the pysam module, which requires a Python version >2.4 and <3.0.
`<OUTFILE>` will be a two-column tab-delimited text file, with genomic position in the first column and coverage in the second column. If `--stranded` is `TRUE`, two output files, called `OUTFILE_plus` and `OUTFILE_minus`, will be produced (one for each strand). If you used a stranded protocol and would like to find separate DERs for each strand, you will need to run the pipeline twice, starting here (once on each of the resulting coverage matrices).
The `<OUTFILE>`s for each sample will need to be merged to create a nucleotide-by-sample matrix for analysis. If you have a giant computer (i.e., if you can hold this matrix in memory), you can do this with code like this:
```r
# which chromosome?
chr = "chr22"

# sample IDs
samps = c("orbFrontalF1", "orbFrontalF2", "orbFrontalF3", "orbFrontalF11",
    "orbFrontalF23", "orbFrontalF32", "orbFrontalF33", "orbFrontalF40",
    "orbFrontalF42", "orbFrontalF43", "orbFrontalF47", "orbFrontalF53",
    "orbFrontalF55", "orbFrontalF56", "orbFrontalF58")

# read in each sample:
countlist = list()
for(s in 1:length(samps)){
    print(paste("reading: sample", samps[s]))
    y = read.table(paste0(samps[s], "-outfile.txt"), sep="\t", header=FALSE)
    print("done reading.")
    countlist[[s]] = y$V2
    # keep the longest position vector seen so far:
    if(s==1) pos = y$V1
    if(s>1 && length(y$V1)>length(pos)) pos = y$V1
    rm(y); gc()
}

# put samples together and zero-pad shorter samples as needed:
thelen = length(pos)
for(i in 1:length(countlist)){
    countlist[[i]] = c(countlist[[i]], rep(0, thelen-length(countlist[[i]])))
}
names(countlist) = samps
chr.table = data.frame(pos, as.data.frame(countlist))
write.table(chr.table, file=paste0(chr, '_allcounts.txt'),
    row.names=FALSE, quote=FALSE, sep="\t")
```
This isn't super efficient, but it gets the job done. You can also use something like the Unix `paste` utility, making sure to account for the fact that each sample may start/end at slightly different positions.
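A minimal sketch of the `paste` approach, assuming (unrealistically) that every sample's file already covers identical positions; in practice you would zero-pad first. The toy files and sample names here are made up; real input comes from `countReads.py`:

```shell
# Toy two-column coverage files: position<TAB>count.
printf "1\t5\n2\t7\n" > sampleA.txt
printf "1\t3\n2\t9\n" > sampleB.txt
# Keep position + count from the first sample, then only the count
# column (field 2) from each additional sample:
paste sampleA.txt <(cut -f2 sampleB.txt)
```

This prints a position column followed by one count column per sample, i.e. the nucleotide-by-sample matrix the next step expects.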
After aligning and counting reads and merging each sample's read counts, we can begin using the `derfinder` package. We don't want to read the huge matrices into R, so we use SQLite databases behind the scenes, but everything is built into the R package, so no SQL commands are actually needed. The `makeDb` command creates the database, and you can explore it with `Genominator`:
```r
# create database from merged file:
library(derfinder)
makeDb(dbfile = 'chr22_allcounts.db', textfile = 'chr22_allcounts.txt', tablename = 'chr22')

# explore data:
library(Genominator)
dat = ExpData(dbFilename = 'chr22_allcounts.db', tablename = 'chr22')
head(dat)
getColnames(dat)
length(dat[,1]$pos)
```
As described in the manuscript, at each nucleotide, we see if a covariate of interest is associated with expression (i.e., coverage) using a linear model with coverage as outcome. The code to do this is:
```r
# define covariate:
sex = c(1,1,0,0,1,1,0,0,1,1,1,1,0,0,1)

# fit initial model (no shrinkage):
limma.input = getLimmaInput(dbfile = 'chr22_allcounts.db', tablename = 'chr22', group = sex)

# get positions where model was fit:
pos = limma.input$pos

# calculate moderated t-statistics and estimated fold changes:
tstats = getTstats(fit = limma.input$ebobject, trend = TRUE)
tt = tstats$tt
logfchange = tstats$logfchange
```
We now fit a Hidden Markov Model to the test statistics from each base-pair to get regions of similar differential expression signal:
```r
params = getParams(tt)
regions = getRegions(method = 'HMM', chromosome = 'chr22', pos = pos,
    tstats = tt, stateprobs = params$stateprobs, params = params$params,
    includet = TRUE, includefchange = TRUE, fchange = logfchange)  # may take some time
head(regions$states)
ders = subset(regions$states, state==3 | state==4)
head(ders)
```
We can get region-level p-values using a permutation test (details in the manuscript). P-values can be adjusted for multiple testing (for false discovery rate control) by being converted to q-values.
```r
pvals = get.pvals(regions = regions$states, dbfile = 'chr22_allcounts.db',
    tablename = 'chr22', num.perms = 1000, group = sex, est.params = params,
    chromosome = 'chr22')  # takes a while
regions.withp = data.frame(regions$states, pvals = pvals, qvals = p.adjust(pvals, 'fdr'))
head(regions.withp)
```
The `getAnnotation` function allows users to download exon annotation information from UCSC, though changes to this database often break this function, so a more stable version will be available in the release version of the package. You can match regions to annotated exons using `getExons`:
```r
exons = getAnnotation('hg19', 'knownGene')
chr22exons = subset(exons, chr=='chr22')
getExons(region = c('chr22', 60234591, 60234791), annotation = chr22exons)
```
You can also plot specific regions, exons, or genes using one of the available plotting functions. For example, to plot the second region in your `regions` object:

```r
plotRegion(regions, ind = 2, tstats = tt, pos = pos, annotation = chr22exons,
    counts = 'chr22_allcounts.db', group = ifelse(sex==1, 'male', 'female'),
    tabname = 'chr22', chromosome = 'chr22')
```
The DER Finder paper analysis code lives in this repository. To reproduce the analysis fully, you can follow these steps:
Clone this repository onto your local machine: `git clone git@github.com:alyssafrazee/derfinder.git` (for cloning over SSH).
Change to the `data` subdirectory and index each `.bam` file with samtools, e.g. `samtools index orbFrontalF11_Y.bam` (run for all samples).
Change back to the root directory (`cd ..`) and run the `countReads.py` script on each file in the `data` subdirectory. This may take some time. Here is the code:

```shell
python countReads.py -f data/orbFrontalF11_Y.bam -o orbFrontalF11_Y_bybp -k 101 -c Y
```
Again, run for each sample, replacing "orbFrontalF11_Y" with each different sample name.
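The per-sample runs can be scripted with a loop. This sketch only echoes each command (drop `echo` to execute), and the two sample IDs shown stand in for the full list from the merging code:

```shell
# Loop over sample IDs; each sample's BAM lives in data/<ID>_Y.bam.
for s in orbFrontalF1 orbFrontalF2; do   # ...add the remaining sample IDs
  echo "python countReads.py -f data/${s}_Y.bam -o ${s}_Y_bybp -k 101 -c Y"
done
```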
Next, merge the per-sample count files in R (this is the same merging step described above, here writing `tophatY-updated`):

```r
# which chromosome?
chr = "Y"

# sample IDs
samps = c("orbFrontalF1", "orbFrontalF2", "orbFrontalF3", "orbFrontalF11",
    "orbFrontalF23", "orbFrontalF32", "orbFrontalF33", "orbFrontalF40",
    "orbFrontalF42", "orbFrontalF43", "orbFrontalF47", "orbFrontalF53",
    "orbFrontalF55", "orbFrontalF56", "orbFrontalF58")

# read in each sample:
countlist = list()
for(s in 1:length(samps)){
    print(paste("reading: sample", samps[s]))
    y = read.table(paste0(samps[s], "_Y_bybp"), sep="\t", header=FALSE)
    print("done reading.")
    countlist[[s]] = y$V2
    # keep the longest position vector seen so far:
    if(s==1) pos = y$V1
    if(s>1 && length(y$V1)>length(pos)) pos = y$V1
    rm(y); gc()
}

# put samples together and zero-pad shorter samples as needed:
thelen = length(pos)
for(i in 1:length(countlist)){
    countlist[[i]] = c(countlist[[i]], rep(0, thelen-length(countlist[[i]])))
}
names(countlist) = samps
chr.table = data.frame(pos, as.data.frame(countlist))
write.table(chr.table, file='tophatY-updated', row.names=FALSE, quote=FALSE, sep="\t")
```
This may also take some time, but it will recreate `tophatY-updated`, which is a dependency of the R code for the analysis.
Finally, run `analysis_code.R` to reproduce the analysis from the paper.
This version of derfinder is no longer actively maintained, since the official Bioconductor package is much more efficient. Please see the derfinder GitHub repo for support on that version.