processing many samples and setting the number of permutations

jtomah commented 6 years ago

Hi

Thank you for making this package public. I used dmrseq on multiple case-control datasets and it worked well. More recently I tried to run dmrseq with more samples (20 vs 20) without any covariate and I got the following error:

Error in matrix(r, nrow = len.r, ncol = count) :
invalid 'ncol' value (too large or NA)
Warning message:
In combn(seq(1, nrow(design)), min(sampleSize)) :
  NAs introduced by coercion to integer range

If I understand correctly, there are too many samples and the code tries to create a permutation matrix so big that it can't be done. Then I tried 10 vs 10 and I stopped the run after 4 days because it was still at the beginning where it computes:

z <- lapply(pdx, function(p) apply(perms, 2,
  function(x) all(x == idx[-which(idx %in% perms[, p])])))
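For scale, a quick base-R check of how many group assignments a full enumeration has to hold. This is only arithmetic on the sample sizes mentioned above, not dmrseq code, and it is consistent with the combn() warning about the integer range:

```r
# Number of distinct ways to choose one group out of all samples
n_20v20 <- choose(40, 20)   # 20 vs 20: 137,846,528,820 combinations
n_10v10 <- choose(20, 10)   # 10 vs 10: 184,756 combinations

# combn() stores one column per combination, and matrix dimensions
# must fit in R's integer range
exceeds_int <- n_20v20 > .Machine$integer.max

format(n_20v20, big.mark = ",")
format(n_10v10, big.mark = ",")
exceeds_int
```

With 10 vs 10 the matrix itself fits in memory, but the step quoted above scans every column of perms once per element of pdx, which would plausibly account for the very long runtime.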

Finally, I tried 20 samples with a continuous test covariate and the default "maxPerms" parameter (20). In that case it runs to the end, but I don't know the statistics well enough to tell whether that is enough permutations. Is there a criterion or some kind of rule of thumb that would help with setting the number of permutations?

Thank you.

LKremer commented 6 years ago

The default value for maxPerms is 10 (even though the help text of dmrseq() says it's 20; a small documentation error, @kdkorthauer). So 20 permutations should be more than enough: many people only have 2 replicates per condition and still draw conclusions. I think you're good to go, @jtomah.

kdkorthauer commented 6 years ago

Hi @jtomah,

Thanks for the detailed report and for testing out the package using so many samples. That's great that you have so many replicates! A couple of comments/questions:

What do you mean by running dmrseq "without any covariate"? A covariate (specified through the argument testCovariate) is required to run the dmrseq function. Do you mean without an adjustCovariate?
The error with 20 versus 20 and the slow behavior with 10 versus 10 - is this for a two-group covariate (i.e. case-control)? I think both could be due to trying to enumerate and subset on the total number of possible permutations, which is extremely large in this case. This subsetting is useful in the case of small sample sizes, since it is rather easy to get very imbalanced permutations with only a few samples in each group (a common occurrence). However, this shouldn't be necessary with many samples. Since it's also impractical, I'll add a condition to skip this step and simply perform unrestricted permutation when sample size is large. Thanks for bringing this to my attention!!
To answer your question about how many permutations is enough, @LKremer is correct (thanks!) that 20 permutations should be plenty in most cases. What you're ultimately after is obtaining enough null candidate regions from the permutations. This will depend on the level of signal in your data as well as other parameter settings, so it's hard to give an exact number that fits all cases. I would start with 10-20 permutations, and increase if the number of null candidate regions (total from all permutations) is not at least of the same order of magnitude as (and preferably larger than) the number of observed candidates. Adding more permutations will increase the resolution of the estimate of the FDR.
Yes, the default value of maxPerms is 10 - thanks for catching the documentation error @LKremer
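
The rule of thumb above (pooled null candidates versus observed candidates) can be written down as a small check. This is a sketch under stated assumptions: `enough_null_candidates` and its `min_ratio` threshold are hypothetical and not part of dmrseq, the counts are made up, and the `(b + 1) / (n_null + 1)` form is the generic pooled permutation p-value estimator, which may differ from what dmrseq computes internally.

```r
# Hypothetical helper (not a dmrseq function): compare the pooled number of
# null candidate regions with the number of observed candidate regions.
enough_null_candidates <- function(n_observed, n_null_per_perm, min_ratio = 1) {
  n_null <- sum(n_null_per_perm)          # total over all permutations
  ratio <- n_null / n_observed
  list(
    n_null = n_null,
    ratio = ratio,
    enough = ratio >= min_ratio,          # min_ratio = 1 is an assumed threshold
    # smallest p-value attainable with a generic (b + 1) / (n_null + 1) estimator
    min_pvalue = 1 / (n_null + 1)
  )
}

# Made-up counts: 5000 observed candidates, 10 permutations with 50 null regions each
check <- enough_null_candidates(n_observed = 5000, n_null_per_perm = rep(50, 10))
check$ratio    # 0.1, an order of magnitude too few
check$enough   # FALSE, so more permutations would be advisable
```

The `min_pvalue` field illustrates the point about resolution: the more null regions are pooled, the finer the grid of attainable p-values.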

I'll make these changes and report back here when they are complete. Thanks again!

Best, Keegan

kdkorthauer commented 6 years ago

Hi @jtomah,

I've made the necessary changes. The comparisons with large sample sizes should no longer hang. Please try it out and let me know how it goes. Don't hesitate to reach out if you have any other questions.
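
For readers wondering what unrestricted permutation without enumeration can look like, here is a minimal base-R sketch. It is not the actual change made in dmrseq; `draw_label_perms` is a hypothetical helper that shuffles the group labels at random and discards any draw equal to the observed labelling or to a previous draw.

```r
# Hypothetical helper (not dmrseq code): draw random label permutations
# directly instead of enumerating every combination with combn().
draw_label_perms <- function(labels, n_perms) {
  perms <- list()
  # Keys of labellings that must not be returned again (starts with the observed one)
  seen <- paste(labels, collapse = "|")
  attempts <- 0
  while (length(perms) < n_perms && attempts < 100 * n_perms) {
    attempts <- attempts + 1
    candidate <- sample(labels)                  # unrestricted shuffle
    key <- paste(candidate, collapse = "|")
    if (!(key %in% seen)) {
      seen <- c(seen, key)
      perms[[length(perms) + 1]] <- candidate
    }
  }
  perms
}

set.seed(1)
labels <- rep(c("case", "control"), each = 20)   # 20 vs 20
perms <- draw_label_perms(labels, n_perms = 10)
length(perms)
```

The cost depends only on the number of permutations requested, not on the total number of possible labellings, which is why large groups stop being a problem.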

Best, Keegan

jtomah commented 6 years ago

Thank you @LKremer and @kdkorthauer

It took me a while to test and respond; I had problems upgrading to R 3.5 on my cluster.

@kdkorthauer indeed, I meant using a test covariate without an additional covariate.

So I have tested with these changes (version 1.1.4) and I was able to process multiple datasets in different configurations: 15 vs 15 without an additional covariate, and 30 samples using a continuous test covariate without an additional covariate. I was going to report the missing "drop=FALSE" when assigning pData(), which was causing an error in the second scenario, but you have corrected this in a recent commit.
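
As background on why a missing drop=FALSE matters when there is only one covariate: row-subsetting a one-column table in R collapses it to a plain vector by default. The sketch below uses a base data.frame with made-up values as a stand-in for the sample metadata; it does not reproduce the dmrseq code in question.

```r
# One-column data frame, like sample metadata holding only a test covariate
pd <- data.frame(age = c(31, 45, 52, 60), row.names = paste0("s", 1:4))
perm <- c(3, 1, 4, 2)

dropped <- pd[perm, ]               # collapses to a plain numeric vector
kept <- pd[perm, , drop = FALSE]    # stays a data frame with column `age`

is.data.frame(dropped)  # FALSE
is.data.frame(kept)     # TRUE
```

Any later code that expects a table with named columns fails on `dropped`, which fits an error that only shows up when no additional covariate is present.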

@kdkorthauer also thank you for the clarification on permutations.

kdkorthauer commented 6 years ago

Excellent! Thanks for letting me know, @jtomah !

kdkorthauer / dmrseq

processing many samples and setting the number of permutations #9