sartorlab / methylSig

R package for DNA methylation analysis

differences in pvalues after updating methylSig #43

Open shawpa opened 4 years ago

shawpa commented 4 years ago

I have been trying to work with the updated version of methylSig for the last couple of weeks. I was reanalyzing data that I previously put through the old methylSig program, and I'm noticing a very large discrepancy in p-values. My understanding is that the old methylSigCalc runs the same test as diff_methylsig. I know I shouldn't be going by the p-value, but I never had sites with q-values of 0.1 or less before, so I had to use a p-value cutoff instead.

When I used a p-value cutoff of 0.05 in the old version, I got about 48K sites, and the q-values for those sites were all 1. In the new version, I got 272K sites when I filtered for p-value < 0.05, and the FDR levels off at 0.8399.

If it is the same test by default, I am not sure why there is so much discrepancy. The only thing I know that was done differently was that I kept the local window size at 0. I used to set it at 200, but this doesn't seem to really work in the new version (I posted this on another issue page).

Obviously I know I have a lot of false positives, but I guess I have to work with what I have. Are there any appropriate filtering recommendations you can make? The manual states that for a large sample size I could switch to chi2. What is a "large" sample size?

rcavalcante commented 4 years ago

Hi,

I edited your comment above to make it a bit easier for me to read and locate the pertinent information. What I'm hearing is that you ran the same data set through the old and new versions of methylSig and found:

- Old version: about 48K sites at p < 0.05, with q-values for those sites all equal to 1.
- New version: about 272K sites at p < 0.05, with the FDR leveling off at 0.8399.

You also mentioned "The only thing I know that was done differently was I kept the local window size 0." Since one test used local information and the other didn't, I'm not surprised the number of significant loci is different.

The operations to compute test statistics and significance are the same, yes (you said "If it is the same test by default"), but by doing one test with local information and the other without, the numbers going into the computations are different, so you're going to get different results.

I used to set that at 200 but this doesn't seem to really work in the new version (I posted this on another issue page).

I would agree that it takes longer, and I'll try to work on that, but it does work in the sense that it uses the local information.

Are there any appropriate filtering recommendations you can make?

Some other relevant information to have would be the exact filters and parameters you used with each version. In other words, it would be helpful to see the code that you're running to get the results you describe.

shawpa commented 4 years ago

This might be a very long post, but I am just trying to be thorough and make sure I didn't make a mistake along the way. I found a site that was the same between my old and new output to compare the p-values. The old mydiffSig output is row 4; the new output is row 15. The first thing I noticed was the huge discrepancy between the methylation averages in cases and controls. I thought at first maybe the range was different (like comparing 0-1 vs. 0-100), but they both go 0-100.

The only other difference in my pipeline was that I used bsseq to import cytosine report files instead of using the bismark cov files and modifying the format myself. I wanted to rule that out as the cause, so I got the percent methylation values from each analysis for that site (rows 6 and 19). They are identical, except that I had a couple of extra samples in the new output, so I don't think this is an error due to coverage files versus cytosine reports. But if you look at the averages in the old output, they align very well with what is reported in the myDiffSig output. In the new output, they are way off.

Something is very wrong here and it may be on my end but I don't know. I just saw your response above and I will attach my code.

code_comparison.txt

comp_old_new.xlsx

rcavalcante commented 4 years ago

I would like to point out that this is really not a very clear example to try to figure out if a problem exists and what that problem might be. It is really important that the data be the same in the two instances and that the parameters be the same in the two instances. It is very hard to disentangle what is happening otherwise.

I will use some data that I have access to, run it through both versions using the same filters and the same parameters, and report back my findings. I will say that I did this kind of comparison prior to release and didn't notice anything strange.

shawpa commented 4 years ago

I am sorry that my example is not clear. I am going to run it with the exact same parameters and filters. Please let me know if there is any data I can provide.

rcavalcante commented 4 years ago

I think this repository will get to the heart of the matter.

https://github.com/rcavalcante/methylsig_comparison

I have provided bedGraph files (same format as bismark cov files) in the bedgraphs folder.

The code for the old methylSig is in code/old.R and the code for the new methylSig is in code/new.R. At the top of each R file are commands to pull publicly available docker images that contain old/new versions of methylSig.

That code generates rdata objects in rda/ with the results of the old/new code running the main methylSig test both with and without local information.

The code to compare the data is in code/compare.R and it generates a bunch of comparison plots in the images/ folder.

I created images comparing:

- old versus new, without local information,
- old versus new, with local information, and
- within each version, no local information versus local information.

The key thing to take away is that the results are the same when comparing the same tests between old and new versions. You can see that the plots comparing old versus new without local information and old versus new with local information have points falling perfectly on the y = x line.

Another thing to note is that within old and new methylSig, comparing no local information to with local information produces different results. This is expected because the underlying methylation and dispersion estimates are changed with the local information.
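The point that local information changes the underlying estimates (and therefore the results, even though the test itself is unchanged) can be sketched in a few lines. This is Python purely for illustration; methylSig itself is an R package, and the uniform coverage-weighted pooling below is a simplified stand-in for its actual local weighting scheme:

```python
# Illustrative sketch (NOT methylSig's implementation): how borrowing
# local information shifts per-locus methylation estimates.
positions = [100, 150, 220, 500, 530]   # bp coordinates of CpG loci
meth      = [  5,   9,   2,   8,   1]   # methylated read counts
cov       = [ 10,  10,  10,  10,  10]   # total read counts

def estimate(i, window=0):
    """Coverage-weighted methylation for locus i, pooling all loci
    within `window` bp. window=0 reduces to the locus alone."""
    m = c = 0
    for j, pos in enumerate(positions):
        if abs(pos - positions[i]) <= window:
            m += meth[j]
            c += cov[j]
    return m / c

no_local   = [estimate(i, window=0)   for i in range(len(positions))]
with_local = [estimate(i, window=200) for i in range(len(positions))]
# The two estimate vectors differ (e.g. locus 0: 0.5 alone vs. 16/30
# pooled), so any test statistic built on them differs as well.
```

The same input data and the same test still give different p-values once the per-locus inputs to the test change.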

I hope this clears up any doubts you may have had about the new and old versions. With that said, I agree that using local information has gotten slower and I'll try to find some time to fix that (I do have other priorities at work, but I am allowed some development time for this package).

If you feel that this resolves your issue, please go ahead and close it. And thanks for using methylSig.

shawpa commented 4 years ago

As an update, I ran my data with local information. Unfortunately, my results seem worse: almost half of my sites were significant using FDR cutoffs. Perhaps my issue is that I was using v0.4.4 (before the GRanges refactor) and now I am using v0.99.1. I am not doubting your data showing no difference.

I am attaching the differential methylation calculations (just 1000 rows) comparing my old output with the new output. Both are run with the exact same samples, same coverage cutoffs and using window size of 200.

mydiffold.txt mydiffnew.txt

rcavalcante commented 4 years ago

I will take a look at this in the next few days. I'll also run the data I included in the repo in methylSig v0.4.4. Sorry, I missed that particular detail.

rcavalcante commented 4 years ago

Hello,

I took a look at this in more depth over the weekend and found that the refactor to v0.5.0 caused some changes in the local calculation, which you've already noticed. The plots below are the same as those already in the https://github.com/rcavalcante/methylsig_comparison repo, but for v0.4.4 versus v0.5.0.

I tracked the change down to two issues:

  1. The v0.4.4 implementation iterates over the valid loci (according to min.per.group) and is allowed to borrow information from other loci that aren't going to be tested (i.e., loci that don't meet min.per.group), whereas the v0.5.0 refactor only allows information to be borrowed from loci that also meet min.per.group. Put another way, the v0.5.0 implementation has a smaller universe of loci from which to borrow information than v0.4.4.
  2. The v0.4.4 implementation has an additional check for valid loci from which to borrow information for dispersion estimates that wasn't included in the v0.5.0 refactor.
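The difference in the two borrowing universes can be sketched like this (Python for illustration only; `min_per_group` here is simplified to a per-locus count of samples with adequate coverage, rather than methylSig's actual per-group check):

```python
# Simplified sketch of the two borrowing universes (not methylSig code).
# Each locus records how many samples have adequate coverage there.
loci = {
    "chr1:100": 6,
    "chr1:150": 2,
    "chr1:220": 5,
    "chr1:300": 1,
}
min_per_group = 4

# Loci that will actually be tested (meet the threshold).
tested = {locus for locus, n in loci.items() if n >= min_per_group}

# v0.4.4-style: a tested locus may borrow from ANY nearby locus,
# including loci that will never be tested themselves.
borrow_old = set(loci)

# v0.5.0-style: borrowing is restricted to loci that also meet the
# threshold -- a smaller, "higher quality" universe.
borrow_new = tested

# borrow_new is a proper subset of borrow_old, so the local methylation
# and dispersion estimates (and the resulting p-values) differ.
```

With a smaller pool of neighbors feeding the local estimates, the dispersion and methylation values entering the test shift, which is exactly the behavior change seen between versions.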

I am going to discuss this with the authors and will follow up shortly. The case where no local information is used remains the same from v0.4.4 to v0.5.0 to v0.99.4.

Thanks for bringing up this issue, it has been helpful to work through.

Raymond

(Attached comparison plots: meth_case, meth_control, meth_diff, pvalue)

shawpa commented 4 years ago

Do you have any updates on this issue?

rcavalcante commented 4 years ago

Hi,

After discussing with the authors, we've decided that we prefer the current approach that methylSig is taking for local information. Namely, that it will only borrow information from loci that also pass the minimum per group threshold, as the data is "higher quality".

Going back to the message that began this thread:

I know I shouldn't be going by the pvalue but I never had sites that were q0.1 significant or less before so I just had to use the pvalue cutoff instead. When I used the pvalue cutoff of 0.05 in the old version, I got about 48K sites. The qvalues for those sites were all 1. In the new version, I got 272K sites when I filtered for pvalue 0.05. The FDR levels off at 0.8399.

I'm not entirely sure why so many more sites would be p < 0.05 without having the same data to examine. But I will say that it's typically a good idea to impose a methylation difference threshold on loci as well as a p-value/FDR threshold. This goes a long way to reducing false positives.

Now the choice of methylation difference threshold is dependent on the biology. In cancer/normal studies you can probably set it pretty high, but for subtler situations, such as with environmental exposures or complex diseases, lower thresholds (5-10%) are probably more appropriate.
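The two-threshold idea can be sketched as follows (Python for illustration; with real methylSig output you would apply the same logic to its methylation-difference and FDR columns, and the cutoffs below are arbitrary examples, not recommendations):

```python
# Sketch: combine an FDR threshold with a methylation-difference
# threshold. Benjamini-Hochberg is computed by hand for transparency.
def bh_fdr(pvalues):
    """Benjamini-Hochberg adjusted p-values (step-up, capped at 1)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adj = [0.0] * m
    running_min = 1.0
    for rank_from_end, i in enumerate(reversed(order)):
        rank = m - rank_from_end              # 1-based BH rank
        running_min = min(running_min, pvalues[i] * m / rank)
        adj[i] = running_min
    return adj

pvalue    = [0.001, 0.20, 0.004, 0.06, 0.80]  # toy values
meth_diff = [30.0,  2.0,  4.0,   25.0, 1.0]   # percentage points

fdr = bh_fdr(pvalue)
significant = [i for i in range(len(pvalue))
               if fdr[i] < 0.05 and abs(meth_diff[i]) >= 10]
# Locus 2 survives the FDR cutoff but shows only a 4% difference,
# so the effect-size filter removes it; only loci with both
# statistical signal and a meaningful difference remain.
```

Requiring both a small FDR and a meaningful methylation difference is what cuts down the false positives that a p-value cutoff alone lets through.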

All that said, this doesn't address the slow-down when using local information, and I will deal with that before the next release of Bioconductor in October.

Thanks, Raymond

shawpa commented 4 years ago

Thank you! I really appreciate you and your team looking into this for me.

Annie