nskvir / RepEnrich

RepEnrich is a method to estimate repetitive element enrichment using high-throughput sequencing data.
26 stars 14 forks source link

There is currently a newer version of RepEnrich located here: https://github.com/nerettilab/RepEnrich2

RepEnrich

Tutorial By Steven Criscione

Email: steven_criscione@brown.edu

Dependencies

This example is for mouse genome mm9. Before getting started you should make sure you have installed the dependencies for RepEnrich. RepEnrich requires python version 2.7.3. RepEnrich requires: Bowtie 1, bedtools, and samtools. I am using bedtools version 2.20.1, bowtie 1 version 0.12.9, samtools version 0.1.19. RepEnrich also requires a bowtie1 indexed genome in fasta format available. (Example mm9.fa) The RepEnrich python scripts also use BioPython which can be installed with the following command:

pip install BioPython

IMPORTANT: bedtools version 2.24.0 and greater yield an error due to altered functionality of coverageBed.

Step 1) Attain repetitive element annotation

I have temporarily provided the setup for the human genome (build hg19 and hg38) and the mouse genome (build mm9) available here. After downloading you can extract the files using:

gunzip hg19_repeatmasker_clean.txt.gz
tar -zxvf RepEnrich_setup_hg19.tar.gz

To yield hg19_repeatmasker_clean.txt annotation file and RepEnrich_setup_hg19 setup folder. The annotation files I am using are repeatmasker files with simple and low-complexity repeats removed (satellite repeats and transposons are still present). If you choose to use these files for the set-up you can skip ahead to step 2.

The RepEnrich setup script will build the annotation required by RepEnrich. The default is a repeatmasker file which can be downloaded from repeatmasker.org, (for instance, find the mm9.fa.out.gz download here. Once you have downloaded the file you can unzip it and rename it:

gunzip mm9.fa.out.gz
mv mm9.fa.out mm9_repeatmasker.txt

This is what the file looks like:

SW perc perc perc query position in query matching repeat position in repeat score div. del. ins. sequence begin end (left) repeat class/family begin end  (left) ID
687 17.4 0.0 0.0 chr1 3000002 3000156 (194195276) C L1_Mur2 LINE/L1 (4310) 1567 1413 1
917 21.4 11.4 4.5 chr1 3000238 3000733 (194194699) C L1_Mur2 LINE/L1 (4488) 1389 913 1
845 23.3 7.6 11.4 chr1 3000767 3000792 (194194640) C L1_Mur2 LINE/L1 (6816) 912 887 1
621 25.0 6.5 3.7 chr1 3001288 3001583 (194193849) C Lx9 LINE/L1 (1596) 6048 5742 3

The RepEnrich setup script will also allow you to build the annotation required by RepEnrich for a custom set of elements using a bed file. So if you want to examine mm9 LTR repetitive elements; you can build this file using the the repeatmasker track from UCSC genome table browser.

To do this, select genome mm9, click the edit box next to Filter, fill in the repclass does match with LTR, then click submit. Back at the table browser select option Selected fields from primary and related tables, name the output file something like mm9_LTR_repeatmasker.bed, and click Get output. On the next page select genoName, genoStart, genoEnd, repName, repClass, repFamily then download the file.

The UCSC puts a header on the file that needs to be removed:

tail -n +3 mm9_LTR_repeatmasker.bed | head -n -4 > mm9_LTR_repeatmasker_fix.bed
mv mm9_LTR_repeatmasker_fix.bed mm9_LTR_repeatmasker.bed

This is what our custom mm9 LTR retrotransposon bed file looks like:

$ head mm9_LTR_repeatmasker.bed

chr1 3001722 3002005 RLTR25A LTR ERVK
chr1 3002051 3002615 RLTR25A LTR ERVK
chr1 3016886 3017193 RLTRETN_Mm LTR ERVK
chr1 3018338 3018653 RLTR14 LTR ERV1

Note: It is important to get the column format right:

The file should be tab delimited. If there is no information on class or family, you can replace these columns with the repeat name or an arbitrary label such as group1.

Step 2) Run the setup for RepEnrich

Now that we have our annotation files we can move on to running the setup for RepEnrich. First load the dependencies (if you use Environment Modules - otherwise just make sure that these programs are available in your PATH).

module load bowtie
module load bedtools
module load samtools

Next run the setup using the type of annotation you have selected (default):

python RepEnrich_setup.py /data/mm9_repeatmasker.txt /data/mm9.fa /data/setup_folder_mm9

custom bed file:

python RepEnrich_setup.py /data/mm9_LTR_repeatmasker.bed /data/mm9.fa /data/setup_folder_mm9 --is_bed TRUE

The previous commands have setup RepEnrich annotation that is used in downstream analysis of data. You only have to do the setup step once for an organism of interest. One cautionary note is that RepEnrich is only as reliable as the genome annotation of repetitive elements for your organism of interest. Therefore, RepEnrich performance may not be optimal for poorly annotated genomes.

Step 3) Map the data to the genome using bowtie1

After the setup of the RepEnrich we now have to map our data uniquely to the genome before running RepEnrich. This is because RepEnrich treats unique mapping and multi-mapping reads separately. This requires use of specific bowtie options. The bowtie command below is recommended for RepEnrich:

bowtie /data/mm9 -p 16 -t -m 1 -S --max /data/sampleA_multimap.fastq sample_A.fastq /data/sampleA_unique.sam

An explanation of bowtie options:

For paired-end reads the bowtie command is:

bowtie /data/mm9 -p 16 -t -m 1 -S --max /data/sampleA_multimap.fastq -1 sample_A_1.fastq -2 sample_A_2.fastq /data/sampleA_unique.sam

The Sam file should be converted to a bam file with samtools:

samtools view -bS sampleA_unique.sam > sampleA_unique.bam
samtools sort sampleA_unique.bam sampleA_unique_sorted
mv sampleA_unique_sorted.bam sampleA_unique.bam
samtools index sampleA_unique.bam
rm sampleA_unique.sam

You should now compute the total mapping reads for your alignment. This includes the reads that mapped uniquely (sampleA_unique.bam) and more than once (sample_A_multimap.fastq). The .out file from your bowtie batch script contains this information (or stdout from an interactive job).

It should looks like this:

Seeded quality full-index
search: 00:32:26
    # reads processed: 92084909
    # reads with at least one reported alignment: 48299773 (52.45%)
    # reads that failed to align: 17061693 (18.53%)
    # reads with alignments suppressed due to -m: 26723443 (29.02%)
Reported 48299773 alignments to 1 output stream(s)

The total mapping reads is the # of reads processed - # reads that failed to align. Here our total mapping reads are: 92084909 - 17061693 = 75023216

Step 4) Run RepEnrich on the data

Now we have all the information we need to run RepEnrich. Here is an example (for default annotation):

python RepEnrich.py /data/mm9_repeatmasker.txt /data/sample_A sample_A /data/hg19_setup_folder sampleA_multimap.fastq sampleA_unique.bam --cpus 16

for custom bed file annotation:

python RepEnrich.py /data/mm9_LTR_repeatmasker.bed /data/sample_A sample_A /data/hg19_setup_folder sampleA_multimap.fastq sampleA_unique.bam --is_bed TRUE --cpus 16

An explanation of the RepEnrich command:

python RepEnrich.py 
    <repeat_annotation>
    <output_folder>
    <output_prefix>
    <RepEnrich_setup_folder>
    <multimapping_reads.fastq>
    <unique_mapping_reads.bam>
    (--is_bed TRUE)
    (--cpus 16)

If you have paired-end data the command is very similar. There will be two sampleA_multimap.fastq and sampleA_multimap_1.fastq and sampleA_multimap_2.fastq from the bowtie step.

The command for running RepEnrich in this case is (for default annotation):

python RepEnrich.py /data/mm9_repeatmasker.txt /data/sample_A sample_A /data/hg19_setup_folder sampleA_multimap_1.fastq --fastqfile2 sampleA_multimap_2.fastq sampleA_unique.bam --cpus 16 --pairedend TRUE

for custom bed file annotation:

python RepEnrich.py /data/mm9_LTR_repeatmasker.bed /data/sample_A sample_A /data/hg19_setup_folder sampleA_multimap_1.fastq --fastqfile2 sampleA_multimap_2.fastq sampleA_unique.bam --is_bed TRUE --cpus 16 --pairedend TRUE

Step 5) Processing the output of RepEnrich

The final outputs will be in the path /data/sample_A. This will include a few files. The most important of which is the sampleA_fraction_counts.txt file. This is the estimated counts for the repeats. I use this file to build a table of counts for all my conditions (by pasting the individual *_fraction_counts.txt files together for my complete experiment).

You can use the compiled counts file to do differential expression analysis similar to what is done for genes. We use EdgeR or DESEQ to do the differential expression analysis. These are R packages that you can download from bioconductor.

When running the EdgeR differential expression analysis you can follow the examples in the EdgeR manual. I manually input the library sizes (the total mapping reads we obtained in the tutorial). Some of the downstream analysis, though, is left to your discretion. There are multiple ways you can do the differential expression analysis. I use the GLM method within the EdgeR packgage, although DESeq has similar methods and EdgeR also has a more straightforward approach called exactTest. Below is a sample EdgeR script used to do the differential analysis of repeats for young, old, and very old mice. The file counts.csv contains the ouput from RepEnrich that was made by pasting the individual *_fraction_counts.txt files together for my complete experiment.

Example Script for EdgeR differential enrichment analysis

# EdgeR example

# Setup - Install and load edgeR
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")
library('edgeR')

# In the case of a pre-assembled file of the fraction count output do the following:
# counts <- read.csv(file = "counts.csv")

# In the case of seperate outputs, load the RepEnrich results - fraction counts
young_r1 <- read.delim('young_r1_fraction_counts.txt', header=FALSE)
young_r2 <- read.delim('young_r2_fraction_counts.txt', header=FALSE)
young_r3 <- read.delim('young_r3_fraction_counts.txt', header=FALSE)
old_r1 <- read.delim('old_r1_fraction_counts.txt', header=FALSE)
old_r2 <- read.delim('old_r2_fraction_counts.txt', header=FALSE)
old_r3 <- read.delim('old_r3_fraction_counts.txt', header=FALSE)
v_old_r1 <- read.delim('veryold_r1_fraction_counts.txt', header=FALSE)
v_old_r2 <- read.delim('veryold_r2_fraction_counts.txt', header=FALSE)
v_old_r3 <- read.delim('veryold_r3_fraction_counts.txt', header=FALSE)

#' Build a counts table
counts <- data.frame(
  row.names = young_r1[,1],
  young_r1 = young_r1[,4], young_r2 = young_r2[,4], young_r3 = young_r3[,4],
  old_r1 = old_r1[,4], old_r2 = old_r2[,4], old_r3 = old_r3[,4],
  v_old_r1 = v_old_r1[,4], v_old_r2 = v_old_r2[,4], v_old_r3 = v_old_r3[,4]
)

# Build a meta data object. I am comparing young, old, and veryold mice.
# I manually input the total mapping reads for each sample.
# The total mapping reads are calculated using the bowtie logs:
# # of reads processed - # reads that failed to align
meta <- data.frame(
    row.names=colnames(counts),
    condition=c("young","young","young","old","old","old","veryold","veryold","veryold"),
    libsize=c(24923593,28340805,21743712,16385707,26573335,28131649,34751164,37371774,28236419)
)

# Define the library size and conditions for the GLM
libsize <- meta$libsize
condition <- factor(meta$condition)
design <- model.matrix(~0+condition)
colnames(design) <- levels(meta$condition)

# Build a DGE object for the GLM
y <- DGEList(counts=counts, lib.size=libsize)

# Normalize the data
y <- calcNormFactors(y)
y$samples
plotMDS(y)

# Estimate the variance
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
plotBCV(y)

# Build an object to contain the normalized read abundance
logcpm <- cpm(y, log=TRUE, lib.size=libsize)
logcpm <- as.data.frame(logcpm)
colnames(logcpm) <- factor(meta$condition)

# Conduct fitting of the GLM
yfit <- glmFit(y, design)

# Initialize result matrices to contain the results of the GLM
results <- matrix(nrow=dim(counts)[1],ncol=0)
logfc <- matrix(nrow=dim(counts)[1],ncol=0)

# Make the comparisons for the GLM
my.contrasts <- makeContrasts(
    veryold_old = veryold – old,
    veryold_young = veryold – young,
    old_young = old – young,
    levels = design
)

# Define the contrasts used in the comparisons
allcontrasts = c(
    "veryold_old",
    "veryold_young",
    "old_young"
)

# Conduct a for loop that will do the fitting of the GLM for each comparison
# Put the results into the results objects
for(current_contrast in allcontrasts) {
    lrt <- glmLRT(yfit, contrast=my.contrasts[,current_contrast])
    plotSmear(lrt, de.tags=rownames(y))
    title(current_contrast)
    res <- topTags(lrt,n=dim(c)[1],sort.by="none")$table
    colnames(res) <- paste(colnames(res),current_contrast,sep=".")
    results <- cbind(results,res[,c(1,5)])
    logfc <- cbind(logfc,res[c(1)])
}

# Add the repeat types back into the results.
# We should still have the same order as the input data
results$class <- young_r1[,2]
results$type <- young_r1[,3]

# Sort the results table by the logFC
results <- results[with(results, order(-abs(logFC.old_young))), ]

# Save the results
write.table(results, 'results.txt', quote=FALSE, sep="\t")

# Plot Fold Changes for repeat classes and types
for(current_contrast in allcontrasts) {
  logFC <- results[, paste0("logFC.", current_contrast)]
  # Plot the repeat classes
  classes <- with(results, reorder(class, -logFC, median))
  par(mar=c(6,10,4,1))
  boxplot(logFC ~ classes, data=results, outline=FALSE, horizontal=TRUE,
          las=2, xlab="log(Fold Change)", main=current_contrast)
  abline(v=0)
  # Plot the repeat types
  types <- with(results, reorder(type, -logFC, median))
  boxplot(logFC ~ types, data=results, outline=FALSE, horizontal=TRUE,
          las=2, xlab="log(Fold Change)", main=current_contrast)
  abline(v=0)
}

Note that the objects logfc contains the differential expression for the contrast, logcpm contains the normalized read abundance, and result contains both the differential expression and the false discovery rate for the experimental comparison. I recommended reading more about these in the EdgeR manual.