mskilab-org / bamUtils

3 stars 4 forks source link

Incorrect 3' end coordinates imported using `bamUtils::read.bam` #11

Open mlweilert opened 4 months ago

mlweilert commented 4 months ago

To our knowledge, bamUtils doesn't correctly import the 3' of a .bam file due to 0-based .bam files not correctly being converted to a 1-based GRanges object.

The start of any read imported through bamUtils is correct, but the base of the end of the read is off by one base upstream. This is true no matter the strand orientation of the aligned read. Reads seem to be imported from 5' to 3' even if they are mapped to the reverse strand of the genome. The orientation is stored in the flag. So when a read maps on the reverse strand, it takes the end coordinate, which still is off by one base. When a read map on the sense strand, it takes the start which is correct.

We would suggest re-checking the conversion of 0-based .bam coordinates to 1-based GRanges coordinates. In the case of our applications looking transcription initiation sites, this could be a meaningful bug in many applications.

Below is a sample of the code we used for bamUtils as well as a screenshot showing the result of using an aligned .bam file, applying bamUtils:read.bam, and the resulting .bw file, which is off by one base.

reads <- bamUtils::read.bam(bam = opt$input_bamfile, 
                              intervals = chrom_sizes_gr[seqnames(chrom_sizes_gr) == chromosome],
                              bai = paste(opt$input_bamfile, ".bai", sep = ""), 
                              pairs.grl = FALSE,
                              stripstrand = TRUE,
                              what = scanBamWhat(),
                              unpack.flag = FALSE,
                              verbose = FALSE,
                              tag = NULL,
                              isPaired = NA,
                              isProperPair = NA,
                              isUnmappedQuery = NA,
                              hasUnmappedMate = NA,
                              isNotPassingQualityControls = FALSE,
                              isDuplicate = NA,
                              pairs.grl.split = FALSE,
                              ignore.indels = TRUE, all = TRUE)
  reads <- GenomicRanges::resize(x = reads, width = 1, fix = 'start')

#Exported the file to coverage object and .bigwig after this

image

SimonBourdareau commented 4 months ago

To complement what Melanie is saying, the corresponding line in the source code is likely: out$pos2 <- out$pos + pmax(rowSums(cigs[, c("D", "M", "N"), drop = FALSE], na.rm=T) - 1, 0)