kdkorthauer / dmrseq

R package for Inference of differentially methylated regions (DMRs) from bisulfite sequencing
MIT License

Computational efficiency for continuous testCovariates #8

lulizou closed this issue 6 years ago

lulizou commented 6 years ago

Hi,

Thank you for adding in the ability to deal with many categories/continuous covariates! I'm wondering what might be some "best practice" strategies for adjusting minInSpan, bpSpan, maxGapSmooth, and maxGap for continuous testCovariates. The defaults and the recommendations for block = T from the vignette seem to work great for categorical covariates, but for continuous ones it sometimes takes ~6 hours to score regions on one chromosome. Or, is this lengthy computational time unavoidable? Also, if I lower these parameters arbitrarily, will it affect the accuracy of the regions/inference?

Thanks!

kdkorthauer commented 6 years ago

Hi @lulizou,

Thanks for your question. Unfortunately, the variation in computation time is largely unavoidable. As you have observed, the smoothing parameters, the block setting, and the type of covariate all influence how long inference takes. However, you can adjust these parameters to favor lower computation time, as long as the resulting settings still allow you to answer your biological question. For example, if you are interested in finding shorter 'local' regions, it would not be advisable to increase the smoothing parameters and use the block setting, since that combination is aimed at detecting large-scale differences.
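
To make the local-versus-block distinction concrete, here is a minimal sketch of the two kinds of settings. It assumes a BSseq object named `bs` and a continuous covariate column named `age`; both names are hypothetical, and `run_dmrseq` is an illustrative helper rather than part of the package. The block values are the ones suggested in the vignette, and the local values are what the package defaults are believed to be, so please confirm them with `?dmrseq` for your installed version.

```r
# Settings aimed at short, local DMRs. These are believed to match the
# dmrseq defaults; confirm with ?dmrseq before relying on them.
local_args <- list(block = FALSE, minInSpan = 30, bpSpan = 1000,
                   maxGapSmooth = 2500, maxGap = 1000)

# Settings aimed at large-scale blocks, as suggested in the vignette.
block_args <- list(block = TRUE, minInSpan = 500, bpSpan = 5e4,
                   maxGapSmooth = 1e6, maxGap = 5e3)

# Hypothetical helper: run dmrseq with one of the argument sets above.
# `bs` is a BSseq object; `covariate` names a column of pData(bs).
run_dmrseq <- function(bs, covariate, args) {
  if (!requireNamespace("dmrseq", quietly = TRUE)) {
    stop("the dmrseq package is not installed")
  }
  do.call(dmrseq::dmrseq,
          c(list(bs = bs, testCovariate = covariate), args))
}

# Example (not run; `bs` and "age" are assumptions):
# regions <- run_dmrseq(bs, "age", local_args)
```

Keeping the two argument sets side by side makes it easy to switch between them without changing the rest of the call.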

Here are some guidelines to keep in mind about how changing the parameters will affect computation time. In general, timing is influenced to the largest extent by the following factors:

The good news is that adjusting these parameters will not affect the accuracy of inference. Your error rate in the form of the False Discovery Rate will still be controlled at the region level (i.e. you aren't going to see an increase in false positives as a result of tuning the settings). However, loss of power could occur if, for example, not enough candidate regions are detected.
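
One way to check for a loss of power before committing to a long run is to time a pilot run on a subset of loci and count how many candidate regions come back. The sketch below assumes `bs` is a BSseq object and that the result carries a `qval` column; `pilot_run`, `n_loci`, and the 0.05 threshold are hypothetical choices for illustration, not package recommendations.

```r
# Hypothetical helper: time a pilot run on the first `n_loci` loci and
# summarise how many candidate regions were found and how many pass `fdr`.
# Extra arguments in `...` (e.g. bpSpan, minInSpan) are passed to dmrseq.
pilot_run <- function(bs, covariate, n_loci = 5000, fdr = 0.05, ...) {
  if (!requireNamespace("dmrseq", quietly = TRUE)) {
    stop("the dmrseq package is not installed")
  }
  sub <- bs[seq_len(min(n_loci, nrow(bs))), ]
  timing <- system.time(
    regions <- dmrseq::dmrseq(bs = sub, testCovariate = covariate, ...)
  )
  list(
    elapsed_sec   = unname(timing["elapsed"]),
    n_candidates  = length(regions),  # 0 if no candidate regions were found
    n_significant = sum(regions$qval < fdr, na.rm = TRUE)
  )
}

# Example (not run; `bs` and "age" are assumptions):
# pilot_run(bs, "age", minInSpan = 30, bpSpan = 1000)
```

If a faster setting returns very few candidate regions on the pilot subset, that is a hint that power may suffer on the full data even though the FDR is still controlled.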

Hope that helps, and please feel free to reach out if you have any more questions or suggestions.

Best, Keegan

lulizou commented 6 years ago

Thanks, these are helpful guidelines. I will try to do some experimentation with these parameters.