Confusion using methylSigCalc: min.per.group

bmreilly commented 8 years ago

apologies in advance if this is not the right forum, but I could not find a better way to contact you about this issue.

I am confused on what the "min.per.group" option is doing in the methylSigCalc function. I was under the impression that this option set the minimum number of samples with adequate coverage a site must have to be included in the differential methylation calculation. Do I have that correct?

I am confused because when I am comparing datasets with higher number of samples, I end up with less sites included in the diffmeth calculation, where I would expect more sites included due to having more samples to choose from to meet the min.per.group=3 threshold.

Direct example: comparison 1: treatment group, n = 6 control group, n = 70 number of sites included = ~180,000 comparison 2: treatment group, n = 6 control group, n = 10 number of sites included = ~300,000

Have I misunderstood the purpose of the min.per.group option?

rcavalcante commented 8 years ago

This is a fine forum to discuss the issue, thanks for bringing this to our attention. You are correct about the intended behavior of min.per.group and it is odd that it behaves in that way.

Can you share with us the code you're using to run the analyses? Also, which version of methylSig are you using? If you'd prefer not to provide the code here, you can email it to rcavalca [at] umich.edu.

bmreilly commented 8 years ago

I am using this version of methylSig: Version: 0.1.12 Date: 2015-04-26 12:15 R version 3.2.2 (2015-08-14) -- "Fire Safety"

The code is pretty long, so I will email you the full code. Here is the portion that you might be able to spot a simple error in:

library(methylSig)
cores=10
minreads=10
maxreads=500
fileList = c("/path/to/file1.methylScore", "/path/to/file2.methylScore")
sample.id = c("samp1", "samp2")
treatment = c(0,1)

methData <- methylSigReadData(fileList, sample.ids=sample.id, assembly="hg19",
                          treatment=treatment, context="CpG", destranded=TRUE,
                          minCount=minreads, maxCount=maxreads, num.cores=cores)

methDiff <- methylSigCalc(methData, groups=c(Treatment=0, Control=1),
                          min.per.group=c(2,2), num.cores=cores, dispersion="both",
                          local.disp=F)

Thanks, look out for my email.

rcavalcante commented 8 years ago

Thanks for the additional information. To be clear, are the treatment samples the same in the two comparisons? And are the 10 control group samples in comparison 2 among the 70 control group samples in comparison 1?

rcavalcante commented 8 years ago

The only bug that we've fixed since v0.1.12 that may have a bearing on this was a problem with destranded=TRUE that didn't correctly recover sites based on the given minCount. As per your email:

I've done a bit more investigating, and it looks like when I go from raw methylscore files to methData, the number of included sites decreases quite a bit. When I checked the raw methylScore files with an awk script each file had > 500,000 sites that passed the > 10x < 500x filters, so somehow the methlSigReadData function is reducing the number of sites included, and I'm not sure why.

I would bet this destranding bug is causing the discrepancy you're seeing. A quick thing to try would be to update methylSig:

devtools::install_github('sartorlab/methylSig')

and see if that helps. If not, we can try other things.

bmreilly commented 8 years ago

Yes, the treatment samples in comparison 1 are the same as in comparison 2, the control samples are different for comparison 1 and 2. I just updated to the newest version, will let you know if that fixes it.

bmreilly commented 8 years ago

Do you think setting destranded=FALSE would help here? My pre-methylSig pipeline already destranded the data to generate the methylScore files.

rcavalcante commented 8 years ago

I don't think changing to destranded=FALSE will have an effect on already destranded data.

I just did a check with toy data that is similar to your data in that all data is on one strand ("destranded") and no sites are filtered by minCount or maxCount (pre-filtered). The result of methylSigReadData() was as expected.

bmreilly commented 8 years ago

Well I re-ran it using the updated version and both destranded=F and destranded=T and both yielded the same result as before.

After reading all of the files in I got this message: Sites > maxCount or < minCount: 9607295 / 9860719 = 0.974 This means that 97.4% of sites were not included right?

Is it possible that the context="CpG" option is causing my issue? I had to go through several coordinate changes to get my data into the methylScore format, is it possible that I have the bed coordinates off by 1 base and this is causing an error with the context option?

rcavalcante commented 8 years ago

The context parameter isn't used in the analyses or filtering.

I got very suspicious when I saw 97.4% of your sites were getting filtered out. I am pretty sure what the problem is. I fixed a bug related to destranding and the minCount/maxCount (060381bcff813fd2c550efcf9f5087e701788aee), but that fix made it so any site failing minCount/maxCount was removed from all samples, even if it failed in 1 of 76 samples, as in your case.

Increasing the number of samples increases the chance of a site being removed by the minCount/maxCount filter, resulting in fewer tested sites when you add in more samples.

I'll work on a fix for this.

bmreilly commented 8 years ago

Ahhh, that explains it. I look forward to seeing how this works when its fixed.

Thanks again for looking into it.

rcavalcante commented 8 years ago

Thanks for your patience. PR #15 should fix this issue.

Try your analysis with v0.4.3 and let me know if you're still seeing fewer sites tested with more samples. If it doesn't fix the problem, we can reopen the issue.

sartorlab / methylSig

Confusion using methylSigCalc: min.per.group #14