lamortenera / bamsignals

R package to quickly obtain count vectors from indexed bam files

bamCoverage coverage calculation results differ from IGV and deeptools bamCoverage #24

Open hagen-wende opened 1 month ago

hagen-wende commented 1 month ago

I want to use bamCoverage to produce some publication-quality graphs in R. I am a bit puzzled about how coverage is actually calculated, because it seems to be higher than what I see in IGV and also higher than what bamCoverage from deeptools calculates. Furthermore, it seems to be smoothed, but I would like to have 1 bp resolution and then smooth it with roll_mean myself.

Here is an example. IGV: [screenshot]. bamCoverage (deeptools): [screenshot]. And this is a data frame from bamsignals bamCoverage: [screenshot].

This is my code to get there:

library(bamsignals)   # bamCoverage()
library(genomation)   # readGeneric()
library(tibble)

bampath <- "./mybam.bam"

annotation.file <- "./genomic_ranges.csv"
genomic_ranges <- readGeneric(annotation.file, chr = 1, start = 2, end = 3, strand = 4,
                    meta.cols = NULL, keep.all.metadata = FALSE, zero.based = FALSE,
                    remove.unusual = FALSE, header = TRUE, skip = 0, sep = ",")

covSigs <- bamCoverage(bampath, genomic_ranges, verbose=TRUE)

genomic_range_ID <- 1

#generate tibble from data
df <- tibble(x = seq(start(genomic_ranges)[genomic_range_ID],end(genomic_ranges)[genomic_range_ID]), y = covSigs[genomic_range_ID])
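The smoothing step I mention above would then look roughly like this (just a sketch; it assumes the RcppRoll package for roll_mean, and the 51 bp window is an arbitrary choice):

library(RcppRoll)

# smooth the 1 bp coverage with a centered rolling mean
df$y_smooth <- roll_mean(df$y, n = 51, fill = NA, align = "center")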

In other words, how can I get to the "real" coverage with 1 bp resolution? Best, Hagen

your-highness commented 1 month ago

Dear @hagen-wende ,

Thanks for using bamsignals :)

When using bamCoverage() with default options, all reads, including MAPQ0 reads, are counted over the whole read span. You can adapt this by setting mapqual=<threshold>. I think this explains the discrepancy with IGV, because IGV does not use MAPQ0 reads for coverage estimation by default. When using mapqual=1 it should correspond to the IGV coverage estimation. Please see https://www.bioconductor.org/packages/release/bioc/manuals/bamsignals/man/bamsignals.pdf for documentation of the options; the options tlenFilter and filteredFlag can also have an impact.
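For example (a minimal sketch; mapqual is the only change relative to your call above):

# default: mapqual=0, i.e. MAPQ0 reads are included
covSigs_q0 <- bamCoverage(bampath, genomic_ranges, verbose=TRUE)
# exclude MAPQ0 reads, which should be closer to what IGV shows
covSigs_q1 <- bamCoverage(bampath, genomic_ranges, mapqual=1, verbose=TRUE)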

Best

hagen-wende commented 1 month ago

Thanks for your super swift reply. Actually, mapqual=1 does not change anything, probably because I already filter the reads with samtools -q 20, so low-MAPQ reads should not be in the BAM anyway.

I also did some more tests, and I can't really explain what is going on.

So this is the coverage in IGV (the y range is 0-40): [screenshot]

This is what I get with bamsignals: [screenshot]

Now I also tried it with Rsamtools with this code

library(Rsamtools)
library(GenomicRanges)
library(GenomicAlignments)
library(genomation)   # readGeneric()

# Load the BAM file
bam_file <- BamFile("./my_bam.bam")

annotation.file <- "./genomic_ranges.csv"

genomic_ranges <- readGeneric(annotation.file, chr = 1, start = 2, end = 3, strand = 4,
                              meta.cols = NULL, keep.all.metadata = FALSE, zero.based = FALSE,
                              remove.unusual = FALSE, header = TRUE, skip = 0, sep = ",")

# Define the genomic range of interest
region <- genomic_ranges[1]

# Extract reads from the BAM file for the specified region
param <- ScanBamParam(which = region)
aln <- readGAlignments(bam_file, param = param)

# Calculate the coverage at 1 bp resolution
cov <- coverage(aln)

# Extract coverage for the specific region
chr_cov <- cov[["chr8"]]
region_cov <- chr_cov[36785298:36788019]

# Plot the coverage
plot(36785298:36788019, as.numeric(region_cov), type = "l", xlab = "Position", ylab = "Coverage")

and I get this: [screenshot]

So this is the same as with bamsignals: both are higher than IGV and look smoothed. I already checked whether I'm really using the same BAM file, and that seems to be the case. So could it be an issue with the GRanges object?

I'm really puzzled ...

your-highness commented 1 month ago

Dear @hagen-wende ,

The IGV coverage plot subtracts low-BASEQ bases, insertions and deletions, which can result in a more erratic profile. I think you could test using 1 bp GRanges and work with bamProfile() (which only counts canonical read starts) and then modify the numeric vector with your own routine; see the sketch after the questions below. bamCoverage() counts every position of a mapped read towards coverage estimation (when using paired-end data it is possible to get whole-fragment coverage).

  1. Are your GRanges at 1 bp resolution?
  2. Is this paired-end data?
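A minimal sketch of what I mean (it uses bamProfile()'s binsize argument and only looks at the first range; binsize=1 gives per-base counts of read starts):

# per-base counts of read starts over your ranges
profSigs <- bamProfile(bampath, genomic_ranges, binsize=1, verbose=TRUE)
sig <- profSigs[1]   # numeric vector, one value per base of the first range
# post-process / smooth 'sig' with your own routine afterwards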
hagen-wende commented 1 month ago

Hi @your-highness (kind of funny),

That is actually a good point about indels. This is nanopore long-read data, which is very noisy with respect to indels. So the CIGAR information is not taken into account, which explains the difference to IGV. Actually, when I count the reads I end up with 30 at the 3' end, where IGV is reporting 26. I think I have to look into this a bit more; it might actually be what I want. I just have to see how long deletions (~1 kb) are counted for long reads, because those should not be considered part of the coverage. Is my GRanges at 1 bp resolution? I don't know (I haven't really worked with GRanges so far); at least when I export it to a data frame I get data for every single base.
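I guess I could check that like this (a one-liner using the genomic_ranges object from my first post):

# widths of the query ranges; values > 1 mean each range spans more than one base
width(genomic_ranges)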

This is how the alignment actually looks (I should have shown this in the beginning): [screenshot]

So I think this is more about me not understanding how the coverage calculations work. But just in case, would you know of any CIGAR-aware coverage calculator (not sure whether I actually need it)?

Thanks for your help! Much appreciated ...

hagen-wende commented 1 month ago

So unfortunately this is not working for me, because in this example I don't want the deletion in one of the alleles to count towards the coverage ...

This is IGV: [screenshot]

and this is bamCoverage from bamsignals: [screenshot]

your-highness commented 1 month ago

Dear @hagen-wende ,

Sorry for the delay! I think you correctly identified the problem:

bamsignals does not consider deletions in the coverage estimation and assumes that all bases between the start and end points of an alignment are covered; it is a matter of definition what read coverage actually is.

I don't see a quick fix for how to deal with this, but one could possibly split a read into two "synthetic" reads as soon as a D character is found in the CIGAR (one upstream and one downstream of the deletion). I think the adaptation should be done in this function: https://github.com/lamortenera/bamsignals/blob/4c6bbe209c23fc61b1d073d4cd4bcde5ed3fb9be/src/bamsignals.cpp#L92-L135
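As an interim workaround outside of bamsignals, you could also try something along these lines (just a sketch, untested; it relies on the drop.D.ranges argument of GenomicAlignments::grglist, which splits alignments at D CIGAR operations so that deleted bases are not counted):

library(GenomicAlignments)

# reuse 'aln' from your Rsamtools snippet above;
# split each alignment at N and (with drop.D.ranges = TRUE) at D CIGAR operations
aln_parts <- grglist(aln, drop.D.ranges = TRUE)

# coverage over the split pieces: deleted bases no longer count as covered
cov_del_aware <- coverage(unlist(aln_parts))
region_cov <- cov_del_aware[["chr8"]][36785298:36788019]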

I am sorry that I cannot be of more help, but I will not be able to implement this feature in the near future. You are more than welcome to open a pull request.

hagen-wende commented 1 month ago

Thanks for your reply. At least we figured out what is going on.