kdkorthauer / dmrseq

R package for Inference of differentially methylated regions (DMRs) from bisulfite sequencing
MIT License

Conservativeness of test #39

Closed (bazyliszek closed this issue 3 years ago)

bazyliszek commented 4 years ago

Dear Keegan,

I have a few questions about this nice program.

  1. Conservativeness of the test and q-values. I read https://github.com/kdkorthauer/dmrseq/issues/19 and I have a similar problem. I applied a cutoff of 0.025 and ran 20-100 permutations; everything else was left at the program's default parameters. I do not observe any clear peak in the distribution of p-values.

I have 98 human serum samples, half of which are controls. There are 3 covariates available: “Smoking (factor)”, “Age (continuous)” and “time frame when samples were taken (TST) (factor)”. Later, I am considering using deconvolution methods to estimate cell composition and including those estimates as covariates too. For now, I passed all 3 covariates via the adjustCovariate parameter. I noticed that the variable “TST” is significant overall, so I started to wonder whether I should control for this variable when the null distribution is drawn …

I might be wrong, but when constructing a statistical test this peak is expected. When looking at small differences between 2 groups, the q-values are very large and the test seems to be very conservative. Therefore, people sometimes choose to look at regions with a p-value smaller than 0.05. I am about to make this choice.

I was wondering if I could apply matchCovariate to the permutations for the null distribution while also adding this variable as an adjustCovariate (I got an error when trying this, so I understand it is not allowed). I would like to somehow exclude the possibility that the null distribution is affected by this TST variable. Does that make sense at all?

  2. How can I get the individual coefficients for these 3 covariates?

  3. I am tempted to add these other variables (5 different cell populations), but I worry a bit about overfitting. Do you have any recommendations for that?

Cheers, Marcin

kdkorthauer commented 4 years ago

Hi Marcin,

Thanks for your questions. Please see my responses below.

Best, Keegan

> 1. Conservativeness of the test and q-values. I read #19 and I have a similar problem. I applied a cutoff of 0.025 and ran 20-100 permutations; everything else was left at the program's default parameters. I do not observe any clear peak in the distribution of p-values.

> I have 98 human serum samples, half of which are controls. There are 3 covariates available: “Smoking (factor)”, “Age (continuous)” and “time frame when samples were taken (TST) (factor)”. Later, I am considering using deconvolution methods to estimate cell composition and including those estimates as covariates too. For now, I passed all 3 covariates via the adjustCovariate parameter. I noticed that the variable “TST” is significant overall, so I started to wonder whether I should control for this variable when the null distribution is drawn …

For any variables you think might independently influence or correlate with methylation levels, yes, you should consider (1) controlling for them in the experimental design (preferable), or (2) adjusting for them in the analysis.

> I might be wrong, but when constructing a statistical test this peak is expected. When looking at small differences between 2 groups, the q-values are very large and the test seems to be very conservative. Therefore, people sometimes choose to look at regions with a p-value smaller than 0.05. I am about to make this choice.

I think you are referring here to the peak that can be seen in the distribution of p-values when the tests carried out are a mixture of nulls and alternatives. Yes, if the null hypothesis is not true in at least some cases, and if our experimental design and statistical test have sufficient power to detect the alternatives, then we see this as a peak in the small p-values. The rest of the p-values (under the null) should be uniformly distributed. Unfortunately, our study design and/or statistical test is not always powered to detect the non-nulls.

When choosing a significance cutoff, it is important to adjust for multiple comparisons. Just looking at p-values (e.g. a 0.05 cutoff) will likely yield a high false discovery rate. For this reason, q-values are provided.
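The point about multiple comparisons can be illustrated with a small simulation. This is only a sketch under stated assumptions: the null p-values are drawn as uniform, the non-null p-values come from an arbitrary Beta distribution chosen to pile up near zero, and the correction shown is a plain Benjamini-Hochberg adjustment. It is not meant to reproduce how dmrseq computes its permutation-based p-values or q-values; the names `bh_adjust` and `simulate_pvalues` are made up for this example.

```python
import random


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def simulate_pvalues(n_null, n_alt, rng):
    """Assumed toy model: uniform nulls, Beta(0.3, 8) alternatives near 0."""
    nulls = [rng.random() for _ in range(n_null)]
    alts = [rng.betavariate(0.3, 8.0) for _ in range(n_alt)]
    return nulls, alts


rng = random.Random(1)
nulls, alts = simulate_pvalues(900, 100, rng)
pvals = nulls + alts  # the first 900 entries are true nulls
qvals = bh_adjust(pvals)

raw_hits = sum(p < 0.05 for p in pvals)
raw_false = sum(p < 0.05 for p in nulls)
q_hits = sum(q < 0.05 for q in qvals)
q_false = sum(q < 0.05 for q in qvals[: len(nulls)])

print(f"p < 0.05: {raw_hits} hits, {raw_false} of them true nulls")
print(f"q < 0.05: {q_hits} hits, {q_false} of them true nulls")
```

With 900 true nulls, roughly 5% of them are expected to fall below a raw p-value cutoff of 0.05 by chance alone, which is why the raw cutoff tends to admit many false discoveries while the adjusted values are more conservative.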

> I was wondering if I could apply matchCovariate to the permutations for the null distribution while also adding this variable as an adjustCovariate (I got an error when trying this, so I understand it is not allowed). I would like to somehow exclude the possibility that the null distribution is affected by this TST variable. Does that make sense at all?

The matchCovariate argument can only accommodate a single two-group factor. If one of your covariates is a two-group factor, you can use it to restrict the permutations (which is a more direct way of adjusting for the covariate).
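As a rough illustration of what restricting permutations by a two-group factor means, the sketch below shuffles case/control labels only within each level of a matching covariate, so every permutation keeps the same case/control composition per level. This is a generic stratified permutation written for illustration; it is an assumption, not dmrseq's internal implementation, and the function names and the `early`/`late` levels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_permutation(labels, strata, rng):
    """Shuffle labels within each stratum, leaving strata untouched."""
    by_stratum = defaultdict(list)
    for idx, stratum in enumerate(strata):
        by_stratum[stratum].append(idx)

    permuted = list(labels)
    for idxs in by_stratum.values():
        shuffled = [labels[i] for i in idxs]
        rng.shuffle(shuffled)
        for i, label in zip(idxs, shuffled):
            permuted[i] = label
    return permuted


def composition(labels, strata):
    """Count how many samples of each label fall in each stratum."""
    return Counter(zip(strata, labels))


# Hypothetical design: 8 samples, a two-level matching covariate.
labels = ["case", "control"] * 4
strata = ["early"] * 4 + ["late"] * 4

rng = random.Random(0)
permuted = stratified_permutation(labels, strata, rng)

print(permuted)
print(composition(permuted, strata) == composition(labels, strata))
```

Because the labels never cross strata, a difference driven purely by the matching covariate cannot masquerade as a case/control difference in any of the permuted datasets.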

> 2. How can I get the individual coefficients for these 3 covariates?

dmrseq does not report the coefficients for the adjusted covariates. If you are interested in the effects of other covariates, you can use them as the testCovariate (only one at a time).

> 3. I am tempted to add these other variables (5 different cell populations), but I worry a bit about overfitting. Do you have any recommendations for that?

I'm not sure what you mean here.

bazyliszek commented 4 years ago

Hi Keegan,

Thank you for your answer. Yes, I was referring to the p-value distribution. I had expected any given statistical test to produce a tail of small p-values (that is, that a test is constructed in such a way that it always produces this tail), with the significance then disappearing after false discovery rate correction (q-values). I now understand this is not true. In that case, would it be worth changing the parameter settings for DMR detection? Is there a strategy for that? I use DNA isolated from human serum/blood. I understand that the parameters were optimized for human DNA independent of the cell type? I have a gut feeling that I will find some individual CpGs significant, for instance using methylKit.

Please ignore Q3. This was more about power. Since I have 45 samples with cancer and 45 controls matched by all these parameters, I might not have enough power to include all the covariates I have, so I plan to include only the most important one.

Many thanks,

kdkorthauer commented 4 years ago

Hi,

You may try to increase the smoothing bandwidth to see if you detect larger blocks of methylation changes (see this section of the dmrseq vignette).

On the other hand, it is also possible that you will find significant differentially methylated CpGs (DMCs) despite not finding significant DMRs. This is because these are two different types of signals, and while the presence of DMRs implies the presence of DMCs, the converse is not necessarily true.
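A toy example may help show why significant individual CpGs do not imply a detectable region. The sketch below calls a candidate region only when enough consecutive loci exceed a cutoff. It is a deliberately simplified assumption: it ignores the sign of the difference, genomic distance, smoothing, and the permutation test, and the parameters `cutoff` and `min_cpgs` are hypothetical names for this example rather than dmrseq arguments.

```python
def candidate_regions(diffs, cutoff, min_cpgs):
    """Return (start, end) index pairs for runs of at least `min_cpgs`
    consecutive loci whose absolute difference exceeds `cutoff`."""
    regions = []
    start = None
    for i, d in enumerate(diffs):
        if abs(d) > cutoff:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_cpgs:
                regions.append((start, i - 1))
            start = None
    # Close a run that extends to the last locus.
    if start is not None and len(diffs) - start >= min_cpgs:
        regions.append((start, len(diffs) - 1))
    return regions


# One strongly different CpG surrounded by unchanged neighbours.
isolated = [0.0, 0.0, 0.4, 0.0, 0.0, 0.0]
# A smaller but consistent shift across four neighbouring CpGs.
block = [0.0, 0.2, 0.2, 0.2, 0.2, 0.0]

print(candidate_regions(isolated, cutoff=0.1, min_cpgs=3))  # []
print(candidate_regions(block, cutoff=0.1, min_cpgs=3))     # [(1, 4)]
```

The isolated CpG has the larger difference and could well be significant in a per-CpG test, yet it never forms a region, whereas every locus inside the block is itself a changed CpG.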

Best, Keegan