kdkorthauer / dmrseq

R package for Inference of differentially methylated regions (DMRs) from bisulfite sequencing
MIT License

processing many samples and setting the number of permutations #9

Closed jtomah closed 6 years ago

jtomah commented 6 years ago

Hi

Thank you for making this package public. I used dmrseq on multiple case-control datasets and it worked well. More recently I tried to run dmrseq with more samples (20 vs 20) without any covariate and I got the following error:

Error in matrix(r, nrow = len.r, ncol = count) :
invalid 'ncol' value (too large or NA)
Warning message:
In combn(seq(1, nrow(design)), min(sampleSize)) :
  NAs introduced by coercion to integer range

If I understand correctly, there are too many samples and the code tries to create a permutation matrix so big that it can't be done. Then I tried 10 vs 10 and I stopped the run after 4 days because it was still at the beginning where it computes:

z <- lapply(pdx, function(p) {
  apply(perms, 2, function(x) all(x == idx[-which(idx %in% perms[, p])]))
})
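For readers following along, the arithmetic behind both symptoms can be checked in base R. This is only a sketch of the counting argument, not package code: it assumes the permutation matrix has one column per way of choosing one group's samples from the total, as the combn() call in the warning suggests, and that the lapply/apply step compares every column against every other column.

```r
# Number of ways to pick one group of 20 out of 40 samples (the 20 vs 20 case).
n_combos_large <- choose(40, 20)

# combn() needs this count as an integer column count, but it is far above
# the largest representable integer, hence the NA coercion warning.
exceeds_integer_range <- n_combos_large > .Machine$integer.max

# The 10 vs 10 case fits in memory: one column per combination.
n_combos_small <- choose(20, 10)

# Assumption: if every column is compared against every other column,
# the work grows with the square of the column count (roughly 3.4e10 here).
pairwise_comparisons <- n_combos_small^2

print(n_combos_large)
print(exceeds_integer_range)
print(n_combos_small)
print(pairwise_comparisons)
```

So the 20 vs 20 run fails outright because the matrix cannot be allocated, while the 10 vs 10 run is feasible in principle but extremely slow.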

Finally, I tried 20 samples with a continuous test covariate and the default "maxPerms" parameter (20). In that case it runs to the end, but I do not know enough about the statistics to judge whether that is enough permutations. Is there a criterion or some rule of thumb that would help in setting the number of permutations?

Thank you.

LKremer commented 6 years ago

The default value for maxPerms is 10 (even though the help text of dmrseq() says it's 20; a small documentation error, @kdkorthauer). So using 20 permutations should be more than enough. Many people only have 2 replicates per condition and they still draw conclusions. I think you're good to go, @jtomah.

kdkorthauer commented 6 years ago

Hi @jtomah,

Thanks for the detailed report and for testing out the package using so many samples. That's great that you have so many replicates! A couple of comments/questions:

I'll make these changes and report back here when they are complete. Thanks again!

Best, Keegan

kdkorthauer commented 6 years ago

Hi @jtomah,

I've made the necessary changes. The comparisons with large sample sizes should no longer hang. Please try it out and let me know how it goes. Don't hesitate to reach out if you have any other questions.

Best, Keegan
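The thread does not say how the fix was implemented. One common way to avoid enumerating every relabeling is to draw a capped number of distinct random relabelings directly. The base R sketch below illustrates that general idea only; `sample_group_perms` and its arguments are hypothetical names, not part of dmrseq, and the sketch does not exclude the observed labeling.

```r
# Hypothetical helper: draw up to `max_perms` distinct random choices of
# `n_group` sample indices out of `n_total`, without building the full
# combn() matrix.
sample_group_perms <- function(n_total, n_group, max_perms = 10, seed = 1) {
  set.seed(seed)
  # Never ask for more distinct relabelings than exist.
  target <- min(max_perms, choose(n_total, n_group))
  seen <- character(0)
  perms <- matrix(integer(0), nrow = n_group, ncol = 0)
  while (ncol(perms) < target) {
    idx <- sort(sample.int(n_total, n_group))
    key <- paste(idx, collapse = ",")
    if (!(key %in% seen)) {   # keep only relabelings not drawn before
      seen <- c(seen, key)
      perms <- cbind(perms, idx)
    }
  }
  unname(perms)  # one column per sampled relabeling
}

perms <- sample_group_perms(n_total = 40, n_group = 20, max_perms = 10)
print(dim(perms))
```

With this approach the cost depends on the number of permutations requested rather than on the total number of possible relabelings.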

jtomah commented 6 years ago

Thank you @LKremer and @kdkorthauer

It took me a while to test and respond, as I had problems upgrading to R 3.5 on my cluster.

@kdkorthauer indeed, I meant using a test covariate without any additional covariate.

So I have tested with these changes (version 1.1.4) and was able to process multiple datasets in different configurations: 15 vs 15 without an additional covariate, and 30 samples using a continuous test covariate without an additional covariate. I was going to report the missing "drop=FALSE" when assigning pData(), which was causing an error in the second scenario, but you have already corrected this in a recent commit.
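For context on why a missing "drop=FALSE" matters when there is only a test covariate: in base R, row-subsetting a one-column data.frame returns a bare vector unless drop = FALSE is given. The sketch below uses a plain data.frame with a made-up `age` column as a stand-in; the actual pData() object and the exact line in dmrseq are not shown in this thread.

```r
# Stand-in phenotype table with a single (test) covariate.
pheno <- data.frame(age = c(31, 45, 52, 60))
shuffled <- c(3, 1, 4, 2)  # an example reordering of the samples

# Default behaviour: the single column collapses to a numeric vector.
dropped <- pheno[shuffled, ]

# With drop = FALSE the result is still a one-column data.frame.
kept <- pheno[shuffled, , drop = FALSE]

print(class(dropped))
print(class(kept))
```

Code that later expects a table with column names will fail on the collapsed vector, which is consistent with the error appearing only in the single-covariate scenario.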

@kdkorthauer also thank you for the clarification on permutations.

kdkorthauer commented 6 years ago

Excellent! Thanks for letting me know, @jtomah !