Computational efficiency for continuous testCovariates

lulizou commented 6 years ago

Hi,

Thank you for adding in the ability to deal with many categories/continuous covariates! I'm wondering what might be some "best practice" strategies for adjusting minInSpan, bpSpan, maxGapSmooth, and maxGap for continuous testCovariates. The defaults and the recommendations for block = T from the vignette seem to work great for categorical covariates, but for continuous ones it sometimes takes ~6 hours to score regions on one chromosome. Or, is this lengthy computational time unavoidable? Also, if I lower these parameters arbitrarily, will it affect the accuracy of the regions/inference?

Thanks!

kdkorthauer commented 6 years ago

Hi @lulizou,

Thanks for your question. The variation in computation time is, unfortunately, largely unavoidable. As you have observed, the smoothing parameters, the block parameter, and the type of covariate all influence how long inference takes. However, you can adjust these parameters to favor lower computation time, as long as the resulting settings still allow you to answer your biological question. For example, if you are interested in finding shorter 'local' regions, I would not advise increasing the smoothing parameters and using the block setting, since that combination is aimed at detecting large-scale differences.

Here are some guidelines to keep in mind about how changing the parameters will affect computation time. In general, timing is influenced to the largest extent by the following factors:

- Smoothing parameters minInSpan, bpSpan, maxGapSmooth and maxGap: increasing these parameters results in more smoothing, which lets you detect larger regions of methylation difference (if they exist) by smoothing over noisier, shorter segments. While this may result in fewer detected regions, the regions detected may be larger, so fitting each region takes longer. This is particularly pronounced for extremely large regions containing many hundreds of CpGs. In that case, it is recommended to set block=TRUE (see next bullet).
- Block-finding parameter block: setting this to TRUE speeds up fitting of very large regions. In addition, this setting ignores any regions that span fewer than blockSize basepairs (default=5kb). But again, this setting should only be used if it aligns with the biological question of interest.
- Cutoff parameter cutoff (default 0.10): lowering it tends to find more and longer candidate regions, but increases computation time. Raising it lowers computation time, but keep in mind that power to detect differential regions with small methylation differences will be reduced.
- Permutation parameter maxPerms: restricting the number of permutations reduces computation time, especially for continuous covariates or categorical covariates with a larger sample size. Often just a few permutations give enough null candidate regions to perform accurate inference. If, however, some permutations find no or very few candidate regions, this parameter should be increased.
- Multiple cores: if you have them available, using multiple cores also speeds up computation, since the work is spread out over multiple processes.
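
As a rough illustration of the knobs listed above, here is a minimal sketch of a `dmrseq()` call tuned for lower run time. It assumes `bs` is a BSseq object you have already loaded and that `"age"` is a continuous column in `pData(bs)`; both names are hypothetical, and the numeric values are illustrative rather than recommendations.

```r
library(dmrseq)
library(BiocParallel)

# Assumptions (not from the thread): `bs` is an existing BSseq object and
# "age" is a hypothetical continuous covariate stored in pData(bs).

# Spread the work over several processes. Four workers is an arbitrary
# choice; MulticoreParam relies on forking, so on Windows a different
# BiocParallel backend (e.g. SnowParam) would be needed.
register(MulticoreParam(workers = 4))

# Local-region search with fewer permutations to cut run time.
regions <- dmrseq(
  bs            = bs,
  testCovariate = "age",
  cutoff        = 0.10,      # raising this finds fewer candidates (faster, less power)
  maxPerms      = 5,         # illustrative; increase if permutations yield few null regions
  BPPARAM       = bpparam()  # uses the backend registered above
)

# Only if large-scale blocks are the biological target: turn on block mode
# together with wider smoothing spans (illustrative values; see the vignette
# for its recommended block settings).
blocks <- dmrseq(
  bs            = bs,
  testCovariate = "age",
  block         = TRUE,
  minInSpan     = 500,
  bpSpan        = 5e4,
  maxGapSmooth  = 1e6,
  maxGap        = 5e3,
  maxPerms      = 5,
  BPPARAM       = bpparam()
)
```

Timing a run like this on a single small chromosome first gives a cheap estimate of how each setting scales before committing to the whole genome.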

The good news is that adjusting these parameters will not affect the accuracy of inference. Your error rate in the form of the False Discovery Rate will still be controlled at the region level (i.e. you aren't going to see an increase in false positives as a result of tuning the settings). However, loss of power could occur if, for example, not enough candidate regions are detected.

Hope that helps, and please feel free to reach out if you have any more questions or suggestions.

Best, Keegan

lulizou commented 6 years ago

Thanks, these are helpful guidelines. I will try to do some experimentation with these parameters.

kdkorthauer / dmrseq

Computational efficiency for continuous testCovariates #8