smithlabcode / methpipe

A pipeline for analyzing DNA methylation data from bisulfite sequencing.
http://smithlabresearch.org/methpipe
66 stars 27 forks source link

p-values and counting methylated sites in DMR #115

Closed martinjvickers closed 6 years ago

martinjvickers commented 6 years ago

I'm a bit confused by what is happening with the radmeth part of section 3.2.2 in the methpipe-manual. I've been trying to run DM analysis on some data with three reps of each control/case.

Firstly, in the manual it mentions;

Individual differentially methylated sites: After completing the previous steps, individual differentially methy- lated sites can be obtained with ‘awk‘. To get all CpGs with FDR-corrected p-value below 0.01, run $ awk ’$5 < 0.01 "{ print $0; $}"’ cpgs.adjusted.bed > dm_cpgs.bed

However column 5 ($5) is the original p-value, column 7 is the FDR-corrected p-value.

The first five columns and the last four columns of radmeth adjust have the same meaning as those output by radmeth regression. The 6-th column gives the modified p-value based on the original p-value of the site and the p-values of its neighbors. The 7-th column gives the FDR-corrected p-value.

This leads me to where I'm getting confused. All my analysis has gone well until the radmeth merge step, specifically relating to the number of sites significantly methylated within the DMR. I have the following DMRs created using;

radmeth adjust -bins 1:200:1 cpgs.bed > cpgs.adjusted.bed
radmeth merge -p 0.01 cpgs.adjusted.bed > dmrs.bed
1    62012    62049    dmr    1    0.200759
1    87097    87178    dmr    1    -0.0200552
1    120284    120410    dmr    2    -0.138928

As you can see the number of significantly differentially methylated CpGs in this case is low (1 or 2). When looking at the sites that contribute to each of these DMRs I can see that rather than the FDR-corrected p-value being used to determine the number of significantly differentially methylated CpGs, it's the original p-value.

1     62012   -       CG      0.00828048      0.000786923     0.00865109      34      18      35      4
1     62042   +       CG      0.168587        0.000443204     0.00688977      15      15      10      9
1     62043   -       CG      0.147321        0.000443204     0.00688977      31      30      42      33
1     62047   +       CG      0.730155        0.000443204     0.00688977      16      15      10      9
1     62048   -       CG      0.275849        0.000443204     0.00688977      31      30      40      34
1     87097   +       CG      0.00768289      0.000157625     0.00483611      105     105     141     133
1     87098   -       CG      0.36221 0.000157625     0.00483611      18      17      9       9
1     87110   +       CG      0.284016        0.000102645     0.00427993      94      72      141     121
1     87111   -       CG      0.237685        0.000102645     0.00427993      20      12      13      11
1     87122   +       CG      0.583184        0.000227676     0.0054376       102     60      125     78
1     87123   -       CG      0.623292        0.000296443     0.00596351      18      12      12      9
1     87160   +       CG      0.521007        0.00055832      0.00753983      98      17      120     16
1     87161   -       CG      0.0552693       0.00055832      0.00753983      9       2       12      0
1     87177   +       CG      0.131208        0.000239744     0.00553684      113     18      109     9
1     120284  +       CG      0.000977469     0.000201554     0.00522071      51      35      54      52
1     120285  -       CG      0.488941        0.000201554     0.00522071      51      49      59      55
1     120408  +       CG      0.0363937       0.00026876      0.00576065      42      35      28      28
1     120409  -       CG      0.00150726      0.000134775     0.00461846      51      43      57      57

Is this the way it's supposed to be, shouldn't it be using the FDR-corrected p-value to determine the number of significantly differentially methylated sites?

Many thanks.

bdecato commented 6 years ago

Hi @martinjvickers,

Thanks for using our pipeline. radmeth merge uses uncorrected p-values to create a composite p-value for the DMR using the Liptak-Stouffer method for combining probabilities (See @egor-dolzhenko 's paper https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-215). Egor may be able to give more insight into the appropriateness of applying FDR before versus after the merge step, but my intuition tells me that by combining probabilities you're reducing the overall number of tests you're analyzing downstream, and therefore FDR after combining should be fine.

More generally, I tend to avoid the radmeth merge function because I believe it gives very strict results (typically the DMRs are quite small: only a few CpG sites in length at most) and because of this ambiguity about where to apply FDR. Rather, I prefer to take two alternate approaches to identifying DMRs using radmeth:

1) I calculate the average methylation in a set of genomic features of interest (such as CpG islands or gene promoters) and treat them like CpG sites as input to radmeth, where any "CpG site" called as differential is actually a region, or 2) I call DM CpGs using radmeth at single-CpG resolution and then count the number and direction of the significant CpGs (using the FDR-corrected p-value) in my regions of interest, ranking them by the number of significant CpGs in the region.

I hope this was helpful!!

Ben

egor-dolzhenko commented 6 years ago

Thanks for the question, Martin. You observed correctly that the CpG count reported by radmeth merge is based on the original, unmerged p-values. While the significance of a region is assessed using the combined p-values, knowing the number of CpGs within a DMR region that exhibit differential methylation is a useful metric for stratifying significant DMR. It helps to separate regions where each CpG exhibits modest amount of differential methylation from those where many CpGs individually exhibit strong differential methylation (both types of regions could lead to significant merged p-values).

martinjvickers commented 6 years ago

Hi @bdecato and @egor-dolzhenko thank you very much for those clarifications, that was very useful.