timpeters82 / DMRcate-devel

devel git for DMRcate

Invalid application of p-value combination methods for DMR ranking in dmrcate #8

Open joshscurll opened 1 month ago

joshscurll commented 1 month ago

I have just started exploring DMRcate for DMR detection, but I have concerns about the approaches dmrcate offers for ranking DMRs. DMRcate offers Fisher's combination, Stouffer's combination, and the harmonic mean FDR as statistical measures for ranking DMRs based on their BH-adjusted p-values. However, from the source code it appears that these p-value combination methods are applied to the FDR values (q-values) rather than to the raw p-values, which is statistically invalid. Uniformity of p-values under the null hypothesis is a fundamental assumption of all of these p-value combination methods, but adjusted p-values, such as BH q-values, are not valid p-values (they are not uniformly distributed under the null hypothesis).

The only way to meaningfully combine p-values while still controlling the FDR (or FWER) is therefore to combine the raw p-values first and then adjust the combined p-values, e.g. using the BH method. Even then, the adjusted Fisher and Stouffer "p-values" will not technically be correct, because the spatial correlation of CpG methylation breaks the independence assumption of Fisher's and Stouffer's methods, which in turn means that the combined p-values will not necessarily satisfy the assumptions (independence, and uniformity under the null) required by the BH procedure. Only the harmonic mean p-value would technically be suitable. This last technicality aside, could you modify the dmrcate code to combine the raw p-values before adjusting to control the FDR?
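For concreteness, here is a minimal sketch of the order of operations I'm suggesting, assuming raw per-CpG p-values were available grouped by DMR (the list `pvals_by_dmr` below is hypothetical and not something dmrcate currently outputs):

```r
# Hypothetical input: one vector of raw per-CpG p-values per DMR
pvals_by_dmr <- list(
  DMR1 = c(1e-6, 3e-4, 0.02),
  DMR2 = c(0.04, 0.11, 0.008, 0.3)
)

# Fisher's method on raw p-values: -2*sum(log(p)) ~ chi-squared with 2k df under the null
fisher_p <- sapply(pvals_by_dmr, function(p)
  pchisq(-2 * sum(log(p)), df = 2 * length(p), lower.tail = FALSE))

# Stouffer's method on raw p-values: sum of z-scores divided by sqrt(k)
stouffer_p <- sapply(pvals_by_dmr, function(p)
  pnorm(sum(qnorm(p, lower.tail = FALSE)) / sqrt(length(p)), lower.tail = FALSE))

# Only now adjust the combined per-DMR p-values across DMRs
fisher_fdr   <- p.adjust(fisher_p,   method = "BH")
stouffer_fdr <- p.adjust(stouffer_p, method = "BH")
```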

It would also be great if dmrcate could (optionally) output more of its intermediate results (such as raw p-values, or at least the combined raw p-values per DMR before BH adjustment). Outputting more of the intermediate stats (especially at the level of individual CpGs, grouped by DMR, if possible) would offer users more flexibility in how they use DMRcate's results.

Some other comments:

Thanks for considering my points, and otherwise for a great tool in DMRcate!

timpeters82 commented 1 month ago

Hi Josh,

Many thanks for looking into this. You are absolutely right: there are many statistical violations I've made in applying Fisher, Stouffer and HMFDR post hoc to the DMR calls. My only defence is that I can't think of an appropriate way of summarising the significance of a de novo-called DMR, constituted from a clump of irregularly spaced CpGs, that doesn't rely on generating a reliable background/null distribution, which takes forever computationally. Brent Pedersen's comb-p (https://academic.oup.com/bioinformatics/article/28/22/2986/240603) is the closest I can think of, but there is no R implementation and the empirical significance calculation is still time-consuming, especially with WGBS where you have tens of millions of CpGs.

When I first wrote DMRcate it was very much with a "give the people what they want" attitude, i.e. a ranked list of DMRs, each with a single significance value, just as with gene expression. The more I ran it, the more I realised it wasn't about the significance of individual DMRs, but about a set of discontinuous regions, taken collectively, where the contiguous kernel estimate exceeded a certain genome-wide threshold. Because of this, the most important controlling parameter is actually the fdr argument in cpg.annotate(). This determines the genome-wide threshold (based on the distribution of limma-generated FDRs) at which the kernel will be "clipped", and thus the endpoints of DMRs. So your significant result isn't actually a list of individual DMRs; it is first and foremost a group of regions that exceed a given cutoff, and should be reported as such. The Fisher, Stouffer and HMFDR values (wrongly calculated as they are, and I'll incorporate your correction in the next release) are afterthoughts and only provide a ranking mechanism by which the results can somewhat resemble a DE gene output. The real result is your DMR set.
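To make that concrete, a minimal sketch of a typical array workflow (object names and parameter values are illustrative only; check the cpg.annotate() and dmrcate() help pages for the exact arguments):

```r
library(DMRcate)

# myMs: matrix of M-values; design: model matrix from model.matrix() (both illustrative).
# The fdr argument sets the genome-wide threshold on the limma-derived FDRs at which the
# smoothed kernel estimate is "clipped", and hence determines the DMR endpoints.
myannotation <- cpg.annotate(datatype = "array", object = myMs, what = "M",
                             arraytype = "EPIC", analysis.type = "differential",
                             design = design, coef = 2, fdr = 0.05)

# Kernel smoothing and DMR calling (lambda = bandwidth in bp, C = scaling factor)
dmrcoutput <- dmrcate(myannotation, lambda = 1000, C = 2)

# The primary result: the set of regions exceeding the genome-wide cutoff
results.ranges <- extractRanges(dmrcoutput, genome = "hg19")
```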

A couple of extra points:

Thanks again, and well spotted ;)

Cheers, Tim

joshscurll commented 1 month ago

Hi Tim,

Thanks for your quick response! I appreciate your argument for the set of DMRs being the most important output of DMRcate -- and for most of my planned downstream analyses, the set of DMRs is more important than any ordering of the DMRs anyway. That said, a meaningful ranking can be very helpful -- for example, to decide which DMRs to plot, to correlate promoter DMR ranking with a differential gene expression ranking, and to identify a small subset of DMRs to consider for targeted methylation profiling of cfDNA in liquid biopsies. Plus, even with your genome-wide cutoff, there is a natural ranking: as you decrease the cutoff fdr, you will lose DMRs from the set, although I realise that the DMRs will also shrink in size and possibly also fragment as the cutoff decreases. With this in mind, I think the min_smoothed_fdr offers the ranking that is most consistent with how DMRs are defined -- this would rank DMRs according to how long any DMR segment remains in the DMR set as the fdr cutoff decreases, right?
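For illustration, ranking on that basis would just be ordering the extractRanges() output on min_smoothed_fdr, e.g. with Fisher as a tiebreak (a sketch, assuming the usual metadata column names):

```r
# results.ranges: the GRanges returned by extractRanges(), with per-DMR metadata columns.
# Order by min_smoothed_fdr (how far the fdr cutoff can drop before the region disappears),
# breaking ties with the Fisher column.
ranked <- results.ranges[order(results.ranges$min_smoothed_fdr, results.ranges$Fisher)]
head(ranked)
```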

I agree that any secondary measure of overall significance averaging site-level significance per DMR is not particularly meaningful within DMRcate's framework, since this will presumably inflate the apparent true significance of a DMR because the site-level q-values were already used to define the boundaries of DMRs (the summary statistic would be "double dipping"). I can see now that adjusting p-values to control FDR after combining p-values will be challenging and maybe even impossible to do in any statistically valid way. I guess the best approach would be to compute the harmonic mean p-value (HMP) per DMR and report this as is, unadjusted. This could still be suitable as an alternative to min_smoothed_fdr (or to break ties) for ranking DMRs, and I advocate for outputting a ranking metric without claiming it to be a per-DMR measure of statistical significance.
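A sketch of that per-DMR HMP, again assuming a hypothetical list of raw per-CpG p-values grouped by DMR (dmrcate doesn't currently export these); the harmonicmeanp package would be a more rigorous alternative to the plain unweighted formula used here:

```r
# Hypothetical input: raw per-CpG p-values grouped by DMR (not currently exported by dmrcate)
pvals_by_dmr <- list(
  DMR1 = c(1e-6, 3e-4, 0.02),
  DMR2 = c(0.04, 0.11, 0.008, 0.3)
)

# Unweighted harmonic mean p-value per DMR, reported as-is (no BH adjustment) and
# intended purely as a ranking metric, not a per-DMR significance claim
hmp_per_dmr <- sapply(pvals_by_dmr, function(p) length(p) / sum(1 / p))

# Rank DMRs by HMP, smallest first
hmp_per_dmr[order(hmp_per_dmr)]
```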

Good to know about the multifactorial model in "differential" mode. Because dmrcate requires a single column of the contrast matrix to be specified with the coef argument when passing a contrast matrix, I assumed only the groups in that contrast would be used for model fitting. Maybe this could be clarified in the documentation. Would you be able to make it possible for a single call to cpg.annotate to perform all contrasts in the contrast matrix, similarly to what is possible with limma, instead of only one contrast at a time? As things are, will multiple calls to cpg.annotate with different columns of the same contrast matrix (and same data and design matrix) produce identical model coefficients?
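As things stand, the workaround I have in mind is one cpg.annotate() call per contrast column, something like the sketch below (object names are illustrative, and I'm assuming the documented contrasts/cont.matrix/coef arguments for the "differential" analysis type):

```r
library(DMRcate)
library(limma)

# design: model matrix with group columns A, B and C; myMs: matrix of M-values (illustrative)
cont.matrix <- makeContrasts(BvsA = B - A, CvsA = C - A, levels = design)

# One cpg.annotate() call per contrast column; the full design (all groups) is used for
# model fitting each time, only the contrast being tested changes
annotations <- lapply(colnames(cont.matrix), function(cc)
  cpg.annotate(datatype = "array", object = myMs, what = "M",
               arraytype = "EPIC", analysis.type = "differential",
               design = design, contrasts = TRUE, cont.matrix = cont.matrix,
               coef = cc, fdr = 0.05))
names(annotations) <- colnames(cont.matrix)
```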

Regarding zeros and the HMFDR, the zeros do appear to be "real" (computationally, at least). Of course, no p-value is ever truly zero, just evaluated as such given the limit of computational precision, and the actual value of such small p-values is meaningless anyway, especially given precision errors. But if a set of p-values includes a value so small it is computationally evaluated as 0, I would expect their harmonic mean to also evaluate to 0 computationally. These values came from a dmrcate ANOVA, and even though ~half of the min_smoothed_fdr values are computationally 0, the smallest HMFDR value is just a little under 1e-45 (over 200 orders of magnitude larger than the smallest representable positive number).
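A quick illustration of that expectation in R (the harmonic mean evaluates to exactly 0 as soon as any p-value underflows to 0, while the smallest representable positive double is ~2.2e-308):

```r
p <- c(0.02, 1e-300, 0)   # one p-value has underflowed to exactly 0

# 1/0 is Inf, so the sum of reciprocals is Inf and the harmonic mean is exactly 0
length(p) / sum(1 / p)
#> [1] 0

# Smallest positive normalised double, for comparison
.Machine$double.xmin
#> [1] 2.225074e-308
```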

By the way, I came across this paper yesterday, which builds on DMRcate to address the variable CpG density between Illumina arrays and between chromosomes (and is also applicable to sequencing data): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11213316/#pone.0306036.s001. Would you consider integrating their smoothing kernel into the DMRcate package? It would be nice to have their methods and DMRcate brought together in one package. Their code clearly builds on DMRcate's, so I imagine it would be fairly straightforward to integrate it into the existing codebase.

Cheers, Josh

timpeters82 commented 4 weeks ago

Hi Josh,

Many thanks for the helpful suggestions and apologies for the time lag. Yes, min_smoothed_fdr would be the best ranking metric, insofar as it's monotonic with the maximum difference between the weighted and unweighted kernel estimates. As a representative figure, though, it's hugely permissive and not a good indicator of DMR significance, which is why I use Fisher instead, despite the issues you point out. You're also correct that the DMRs would shrink/fragment as the fdr cutoff tightens - so the question remains: do we consider the less significant fragment as a standalone DMR, or in context with its stronger neighbour? And how do we express this context mathematically? I don't have an answer for this, but in the meantime the former will have to suffice, and you're more than welcome to interpret any of the four per-DMR metrics (min_smoothed_fdr, Fisher, Stouffer or HMFDR) as a ranking metric without taking them as an indicator of significance.

Re contrasts, all groups are used in model fitting, not just those with non-zero contrast coefficients. See Aaron Lun's response here: https://support.bioconductor.org/p/73107/.

Thanks for linking the PLOS One paper, I was unaware of it. I haven't had a chance to look at it in great detail, but it seems their assumptions about CpG density are different from those of DMRcate. For DMRcate, CpG density itself is not a bias, but simply an indicator of the likelihood of calling a DMR, via the privilege conferred by the greater degrees of freedom in the chi-square test. Lior Pachter had an interesting take on the irregular spacing of CpGs many years ago: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1604-3. I'd also perform the benchmarking differently, by varying the fdr instead of just taking the default values.

I can't get to updating DMRcate yet, too many things happening at the moment, but all updates should be in Bioconductor by the time of the next release in October.

Cheers, Tim