hansenlab / bsseq

Devel repository for bsseq
36 stars 26 forks source link

getCoverage perBase perRegionAverage different outcomes #99

Closed jkrainer closed 4 years ago

jkrainer commented 4 years ago

Hello, first of all, thank you for this amazing package.

I have a questions regarding the getCoverage function. I want to look at the average methylation levels per samples for a specific region. Therefore I did the following:

ov <- findOverlaps(bs.object, region.of.interest)
per.region.av <- getCoverage(bs.object[queryHits(ov)], what="perRegionAverage", type="M")/getCoverage(bs.object[queryHits(ov)], what="perRegionAverage", type="Cov")

This results in the following average Methylation levels: 0.04347826 0.06896552 0.10000000 0.23255814 0.06666667 0.30434783 0.36734694 0.58333333 0.13461538 0.03571429 0.35135135 0.05555556

Looking at the methylation levels per CpG for this region (only 11 CpGs), I saw that the means of those values are not the same as the average Methylation levels above.

per.base.mean <- getCoverage(bs.object[queryHits(ov)], what="perBase", type="M")/getCoverage(bs.object[queryHits(ov)], what="perBase", type="Cov")
colMeans(per.base.mean, na.rm=T)

The above results in the following average Methylation levels: 0.03030303 0.04545455 0.17060213 0.25757576 0.07575758 0.30303030 0.36706349 0.58333333 0.13181818 0.03636364 0.44848485 0.05714286

How are the average Methylation levels for "perRegionAverage" calculated? Is there an explanation for the difference?

edit: I wrongly wrote colSums but I actually wanted to write colMeans edit2: edited the bs.object name

kasperdanielhansen commented 4 years ago

Thanks for the praise.

This is easy to explain: let us say that M and Cov are vectors giving us the numbers for a single region. There is a difference between sum(M) / sum(Cov) and mean(M / Cov)

It is not clear which is best I would say, but it;s clear there is a difference.

I think, for "raw" methylation values it makes sense to do sum(M) / sum(Cov) where a highly covered CpG contributes more. I think for "smooth" methylation values, mean(M/Cov) is probably better (and - well - you cannot really do sum(M) / sum(Cov) on the smoothed values)

If this is unclear from the help page on getRegion I am very happy to get suggestions for more text giving clarification on this.

Best, Kasper

On Thu, Oct 22, 2020 at 10:32 AM jkrainer notifications@github.com wrote:

Hello, first of all, thank you for this amazing package.

I have a questions regarding the getCoverage function. I want to look at the average methylation levels per samples for a specific region. Therefore I did the following:

ov <- findOverlaps(bs.object, region.of.interest) per.region.av <- getCoverage(bs.object[queryHits(ov)], what="perRegionAverage", type="M")/getCoverage(bs.object[queryHits(ov)], what="perRegionAverage", type="Cov")

This results in the following average Methylation levels: 0.04347826 0.06896552 0.10000000 0.23255814 0.06666667 0.30434783 0.36734694 0.58333333 0.13461538 0.03571429 0.35135135 0.05555556

Looking at the methylation levels per CpG for this region (only 11 CpGs), I saw that the means of those values are not the same as the average Methylation levels above.

per.base.mean <- getCoverage(bs_full[queryHits(ov)], what="perBase", type="M")/getCoverage(bs_full[queryHits(ov)], what="perBase", type="Cov") colSums(per.base.mean, na.rm=T)

The above results in the following average Methylation levels: 0.03030303 0.04545455 0.17060213 0.25757576 0.07575758 0.30303030 0.36706349 0.58333333 0.13181818 0.03636364 0.44848485 0.05714286

How are the average Methylation levels for "perRegionAverage" calculated? Is there an explanation for the difference?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/hansenlab/bsseq/issues/99, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DH55LNN4UO2DITGOMGLSL7UY7ANCNFSM4S23TEIQ .

-- Best, Kasper

jkrainer commented 4 years ago

Thank you for your answer. Of course there is a difference between the mean and the sum. I edited my question because I actually wanted to write colMeans(per.base.mean, na.rm=T) but accidentaly wrote colSums(per.base.mean, na.rm=T). Sorry for the confusion.

kasperdanielhansen commented 4 years ago

Sorry, I was not reading clearly. I thought you were asking about getMeth.

I see you're using bs.object vs bs_full in the two calls? "perRegionAverage" doesn't really make sense when you don't have regions.

Perhaps make an example using the data in example(getCoverage)?

On Thu, Oct 22, 2020 at 10:41 AM Kasper Daniel Hansen < kasperdanielhansen@gmail.com> wrote:

Thanks for the praise.

This is easy to explain: let us say that M and Cov are vectors giving us the numbers for a single region. There is a difference between sum(M) / sum(Cov) and mean(M / Cov)

It is not clear which is best I would say, but it;s clear there is a difference.

I think, for "raw" methylation values it makes sense to do sum(M) / sum(Cov) where a highly covered CpG contributes more. I think for "smooth" methylation values, mean(M/Cov) is probably better (and - well - you cannot really do sum(M) / sum(Cov) on the smoothed values)

If this is unclear from the help page on getRegion I am very happy to get suggestions for more text giving clarification on this.

Best, Kasper

On Thu, Oct 22, 2020 at 10:32 AM jkrainer notifications@github.com wrote:

Hello, first of all, thank you for this amazing package.

I have a questions regarding the getCoverage function. I want to look at the average methylation levels per samples for a specific region. Therefore I did the following:

ov <- findOverlaps(bs.object, region.of.interest) per.region.av <- getCoverage(bs.object[queryHits(ov)], what="perRegionAverage", type="M")/getCoverage(bs.object[queryHits(ov)], what="perRegionAverage", type="Cov")

This results in the following average Methylation levels: 0.04347826 0.06896552 0.10000000 0.23255814 0.06666667 0.30434783 0.36734694 0.58333333 0.13461538 0.03571429 0.35135135 0.05555556

Looking at the methylation levels per CpG for this region (only 11 CpGs), I saw that the means of those values are not the same as the average Methylation levels above.

per.base.mean <- getCoverage(bs_full[queryHits(ov)], what="perBase", type="M")/getCoverage(bs_full[queryHits(ov)], what="perBase", type="Cov") colSums(per.base.mean, na.rm=T)

The above results in the following average Methylation levels: 0.03030303 0.04545455 0.17060213 0.25757576 0.07575758 0.30303030 0.36706349 0.58333333 0.13181818 0.03636364 0.44848485 0.05714286

How are the average Methylation levels for "perRegionAverage" calculated? Is there an explanation for the difference?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/hansenlab/bsseq/issues/99, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DH55LNN4UO2DITGOMGLSL7UY7ANCNFSM4S23TEIQ .

-- Best, Kasper

-- Best, Kasper

kasperdanielhansen commented 4 years ago

Like

getCoverage(hdf5_BS.chr22, what = "perRegionAverage") [1] 11.05917 13.32552 colMeans(getCoverage(hdf5_BS.chr22, what = "perBase")) r1 r2 11.05917 13.32552

On Thu, Oct 22, 2020 at 10:46 AM jkrainer notifications@github.com wrote:

Thank you for your answer. Of course there is a difference between the mean and the sum. I edited my question because I actually wanted to write colMeans(per.base.mean, na.rm=T) but accidentaly wrote colSums(per.base.mean, na.rm=T).

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/hansenlab/bsseq/issues/99#issuecomment-714336097, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DH7VEXGI7RZ3G3NCOVLSL7WP7ANCNFSM4S23TEIQ .

-- Best, Kasper

jkrainer commented 4 years ago

No you were reading clearly, I wrote the wrong function (colSums) and changed it after I read your comment to colMeans. bs.object vs. bs_full are actually the same. I changed bs_full for this Issue to bs.object but only for the first half (changed it in the Issue now). In my actual dataset I have more than one region, but I see this behaviour for every region and subsetted it for this Issue to only one region.

Here is the Issue for the example data:

data(BS.chr22)
reg <- GRanges(seqnames = c("chr22", "chr22"),
               ranges = IRanges(start = c(1, 2*10^7), end = c(2*10^7 +1, 4*10^7)))

getCoverage(BS.chr22, what = "perRegionAverage", type="M")/bsseq::getCoverage(BS.chr22, what = "perRegionAverage", type="Cov") [1] 0.7691270 0.7683111 colMeans(getCoverage(BS.chr22, what = "perBase", type="M")/getCoverage(BS.chr22, what = "perBase", type="Cov"), na.rm=T) r1 r2 0.7789734 0.7777111

I hope this makes sense

kasperdanielhansen commented 4 years ago

Thanks, this is clear.

So this is the difference between sum(M) / sum(Cov) vs mean(M / Cov). One aspect of this is also how to treat CpGs with Cov == 0. When you have na.rm=TRUE they become NA when you do M/Cov and therefore get removed.

Anyway, to make it super clear:

M = getCoverage(BS.chr22, what = "perBase", type="M") Cov = getCoverage(BS.chr22, what = "perBase", type="Cov") colSums(M) / colSums(Cov) r1 r2 0.7691270 0.7683111 colMeans(M / Cov) r1 r2 NaN NaN colMeans(M / Cov, na.rm = TRUE) r1 r2 0.7789734 0.7777111

The call getCoverage(BS.chr22, what = "perRegionAverage", type="M") is equal to colMeans(M) because there is no regions giving so it just does the average of everything

(and - if you're confused - colSums(M) / colSums(Cov) is equal to colMeans(M) / colMeans(Cov) because 1/n cancels out

On Thu, Oct 22, 2020 at 11:01 AM jkrainer notifications@github.com wrote:

No you were reading clearly, I wrote the wrong function (colSums) and changed it after I read your comment to colMeans. bs.object vs. bs_full are actually the same. I changed bs_full for this Issue to bs.object but only for the first half (changed it in the Issue now). In my actual dataset I have more than one region, but I see this behaviour for every region and subsetted it for this Issue to only one region.

Here is the Issue for the example data:

data(BS.chr22) reg <- GRanges(seqnames = c("chr22", "chr22"), ranges = IRanges(start = c(1, 210^7), end = c(210^7 +1, 4*10^7)))

getCoverage(BS.chr22, what = "perRegionAverage", type="M")/bsseq::getCoverage(BS.chr22, what = "perRegionAverage", type="Cov") [1] 0.7691270 0.7683111 colMeans(getCoverage(BS.chr22, what = "perBase", type="M")/getCoverage(BS.chr22, what = "perBase", type="Cov"), na.rm=T) r1 r2 0.7789734 0.7777111

I hope this makes sense

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/hansenlab/bsseq/issues/99#issuecomment-714345509, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DH4TUGRHLWVMJOSY35TSL7YIBANCNFSM4S23TEIQ .

-- Best, Kasper

jkrainer commented 4 years ago

Ok, now it is clear to me too! Thank you very very much.

kasperdanielhansen commented 4 years ago

It can be confusing, especially when you add in regions. There are slight differences in how zeroes are treated etc etc.

On Thu, Oct 22, 2020 at 11:14 AM jkrainer notifications@github.com wrote:

Ok, now it is clear to me too! Thank you very very much.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/hansenlab/bsseq/issues/99#issuecomment-714353434, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DH7JPHEQTA4RDG3EGWDSL7ZYTANCNFSM4S23TEIQ .

-- Best, Kasper