shenlab-sinai / ngsplot

Quick mining and visualization of NGS data by integrating genomic databases
Other
252 stars 65 forks source link

Reads mapping to mitochondrial genes are not counted when BAM file uses Ensembl chromsome names #59

Open jdblischak opened 8 years ago

jdblischak commented 8 years ago

ngsplot only accepts BAM files where the chromosomes either all start with "chr" or none start with "chr". This has caused problems in the past, e.g. Issues #25 and #39, and I demonstrate below that the current implementation has trouble with reads mapping to mitochondrial reads.


From my quick inspection of the code, it is my understanding that the purpose of checking for the presence of "chr" at the beginning of the chromosome names is to account for the differences in how databases name them.

For example, the annotation data for hg19 uses the UCSC naming scheme.

load("database/hg19/hg19.ensembl.genebody.protein_coding.RData")
table(genome.coord$chrom)
## 
##  chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19  chr2 
## 13096  4519  9733  9072  1794  5490  5183  7676 10122  2467 10709  9912 
## chr20 chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chrM  chrX 
##  2983  1475  3269  8883  5438  6540  6014  6707  5525  4545    13  4163 
##  chrY 
##   203

On the other hand, the [example BAM file]() provided for testing uses the Ensembl naming scheme.

samtools view example.bam/hesc.RNAseq.1M.bam | cut -f3 | sort | uniq -c
##   79105 1
##   22307 10
##   47344 11
##   65155 12
##   11966 13
##   21653 14
##   27158 15
##   41635 16
##   48328 17
##    9010 18
##   65128 19
##   41781 2
##   21366 20
##    9584 21
##   21701 22
##   42862 3
##   23356 4
##   34332 5
##   51076 6
##   36167 7
##   25598 8
##   29591 9
##  100662 MT
##   27058 X
##    3250 Y

My understanding of the current solution is that the function chrTag checks for the presence of "chr" at the beginning of the chromosomes. If it detects "chr", it sets chr.tag to TRUE. Here is the function.

chrTag <- function(sn.inbam) {
# Check whether the chromosome name format in the bam file contains 'chr' or not.
# Args:
#   sn.inbam: seqnames in the bam file.

    n.chr <- length(grep('^chr', sn.inbam))
    if(n.chr == 0) {
        chr.tag <- F
    } else if(n.chr == length(sn.inbam)) {
        chr.tag <- T
    } else {
        return("Inconsistent chromosome names in bam file. Check bam header.")
    }

    chr.tag
}

chr.tag is then passed to the function doCov. If it is TRUE, the leading "chr" is removed from the chromosome names in the annotation file. Here is the relevant code:

    v.chrom <- coord.mat$chrom
    if(!chr.tag) {
        v.chrom <- sub('chr', '', v.chrom)
    }
    v.chrom <- as.factor(v.chrom)

This workaround fixes the naming for the majority of the chromosomes. However, it does not fix the name of the mitochondrial sequence. UCSC uses "chrM", and Ensembl uses "MT".

To confirm this problem, I created a list of the mitochondrial genes.

(mito <- genome.coord$gid[genome.coord$chrom == "chrM"])
##  [1] "ENSG00000198899" "ENSG00000228253" "ENSG00000198804"
##  [4] "ENSG00000198712" "ENSG00000198938" "ENSG00000198727"
##  [7] "ENSG00000198888" "ENSG00000198763" "ENSG00000198840"
## [10] "ENSG00000198886" "ENSG00000212907" "ENSG00000198786"
## [13] "ENSG00000198695"
cat(mito, file = "mito-genes.txt", sep = "\n")

I then run ngsplot on the example BAM file for only these genes.

NGSPLOT=~/repos/ngsplot bash -c 'bin/ngs.plot.r -G hg19 -R genebody -C example.bam/hesc.RNAseq.1M.bam -O mito-fail -F rnaseq,protein_coding -E mito-genes.txt'
## Configuring variables...
## Using database:
## /home/jdb-work/repos/ngsplot/database/hg19/hg19.ensembl.genebody.protein_coding.RData
## Done
## Loading R libraries.....Done
## Analyze bam files and calculate coverageWarning message:
## 'isNotPrimaryRead' is deprecated.
## Use 'isSecondaryAlignment' instead.
## See help("Deprecated") 
## .Done
## Plotting figures...Done
## Saving results...Done
## Wrapping results up...Done
## All done. Cheers!

Because the chromsome names do not match, there is zero enrichment.

unzip("mito-fail.zip")
mito_fail_avprof <- read.table("mito-fail/avgprof.txt", header = TRUE)
summary(mito_fail_avprof)
##      Noname 
##  Min.   :0  
##  1st Qu.:0  
##  Median :0  
##  Mean   :0  
##  3rd Qu.:0  
##  Max.   :0

Now creating a new BAM file with the mitochondrial chromosome converted from MT to M (inspired by this Biostars post).

samtools view -H example.bam/hesc.RNAseq.1M.bam | sed -e 's/SN:MT/SN:M/' | samtools reheader - example.bam/hesc.RNAseq.1M.bam > example.bam/hesc.RNAseq.1M.mito.bam
samtools index example.bam/hesc.RNAseq.1M.mito.bam
samtools view example.bam/hesc.RNAseq.1M.mito.bam -c M
## 100662
samtools view example.bam/hesc.RNAseq.1M.bam -c MT
## 100662

Re-running ngsplot with the updated chromosome name.

NGSPLOT=~/repos/ngsplot bash -c 'bin/ngs.plot.r -G hg19 -R genebody -C example.bam/hesc.RNAseq.1M.mito.bam -O mito-pass -F rnaseq,protein_coding -E mito-genes.txt'
## Configuring variables...
## Using database:
## /home/jdb-work/repos/ngsplot/database/hg19/hg19.ensembl.genebody.protein_coding.RData
## Done
## Loading R libraries.....Done
## Analyze bam files and calculate coverageWarning message:
## 'isNotPrimaryRead' is deprecated.
## Use 'isSecondaryAlignment' instead.
## See help("Deprecated") 
## .Done
## Plotting figures...Done
## Saving results...Done
## Wrapping results up...Done
## All done. Cheers!

Now that the chromosome names match, there is enrichment.

unzip("mito-pass.zip")
mito_pass_avprof <- read.table("mito-pass/avgprof.txt", header = TRUE)
summary(mito_pass_avprof)
##      Noname    
##  Min.   : 253  
##  1st Qu.:1312  
##  Median :1565  
##  Mean   :1715  
##  3rd Qu.:2061  
##  Max.   :4414

Instead of removing "chr" from the chromosome names in the annotation file, what if instead ngsplot converted the chromsome names in the BAM file from Ensembl to UCSC? Maybe this could be a command-line option? I could help try to implement something to fix this if you could give me some direction on what you would be willing to accept as a PR.

lishen commented 7 years ago

Can this problem be solved with the new 2.63 release? I have abandoned the chromosome check which is often problematic. Users are now responsible for making sure their bam files contain the same chromosome names as their references.