sartorlab / methylSig

R package for DNA methylation analysis

Problem while running diff_methylsig() #42

Open srialle opened 4 years ago

srialle commented 4 years ago

Dear Raymond,

I am trying to run methylSig on RRBS data and I have a problem with diff_methylsig() function. I have a BSobj object with 5,791,812 cytosines, which I give as input to diff_methylsig(). When I extract the 10,000 first positions the function runs well (in 8 seconds), but when I try to run on the whole object, it runs for hours (more than 6 hours now) and seems to never end. I am working on a linux machine with 192G RAM and 48 available threads.

Here is the code I launched:

bs = filter_loci_by_group_coverage(
    bs = bs,
    group_column = 'Condition',
    min_samples_per_group)
bs

An object of type 'BSseq' with
  5791812 methylation loci
  4 samples
has not been smoothed
All assays are in-memory

pData(bs)

DataFrame with 4 rows and 1 column
         Condition
493-G3         493
493-GM14       493
371-2          371
371-5          371

diff_gr = diff_methylsig(
    bs = bs,
    group_column = 'Condition',
    comparison_groups = comp,
    disp_groups = c('case' = TRUE, 'control' = TRUE),
    local_window_size = 200,
    t_approx = TRUE,
    n_cores = 10)

Do you have an idea why the code doesn't seem to finish?

Thank you,

Regards,

Stephanie

sessionInfo()

R version 3.6.2 (2019-12-12)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 8 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] methylSig_0.99.0            bsseq_1.22.0
[3] SummarizedExperiment_1.16.0 DelayedArray_0.12.0
[5] BiocParallel_1.20.0 matrixStats_0.55.0
[7] Biobase_2.46.0 GenomicRanges_1.38.0
[9] GenomeInfoDb_1.22.0 IRanges_2.20.1
[11] S4Vectors_0.24.1 BiocGenerics_0.32.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3               compiler_3.6.2           XVector_0.26.0
 [4] DSS_2.34.0               R.methodsS3_1.7.1        R.utils_2.9.0
 [7] bitops_1.0-6             tools_3.6.2              DelayedMatrixStats_1.8.0
[10] zlibbioc_1.32.0          lifecycle_0.1.0          rhdf5_2.30.1
[13] lattice_0.20-38 BSgenome_1.54.0 rlang_0.4.2
[16] Matrix_1.2-18 GenomeInfoDbData_1.2.2 rtracklayer_1.46.0
[19] Biostrings_2.54.0 gtools_3.8.1 locfit_1.5-9.1
[22] grid_3.6.2 data.table_1.12.6 R6_2.4.1
[25] HDF5Array_1.14.1 XML_3.98-1.20 limma_3.42.0
[28] Rhdf5lib_1.8.0 GenomicAlignments_1.22.1 scales_1.1.0
[31] Rsamtools_2.2.1 permute_0.9-5 colorspace_1.4-1
[34] RCurl_1.95-4.12 munsell_0.5.0 R.oo_1.23.0

rcavalcante commented 4 years ago

Hi Stephanie,

Thanks for downloading the updated methylSig and giving it a spin. I tried methylSig on one of the larger datasets I have lying around (WGBS with 17M CpGs after filtering) and I downsampled it to 6M to be similar in size to yours.

I got similarly fast results with 10K and 100K downsampling. The 1M downsampling was also fast (would have taken longer to make some coffee). It's looking like the 6M finishes in a reasonable amount of time as well (<20m). Note, I am not using local information.

Using local information does inherently take longer because of the operations needed to figure out which sites are nearby. I did some benchmarking using system.time():

1K CpGs

> system.time(diff_methylsig(
+     bs = bs[1:1000],
+     group_column = 'tissue',
+     comparison_groups = c('case' = 'liver', 'control' = 'blood'),
+     disp_groups = c('case' = TRUE, 'control' = TRUE),
+     local_window_size = 0,
+     t_approx = TRUE,
+     n_cores = 10))
   user  system elapsed 
  1.012   4.329   2.498 
> 
> system.time(diff_methylsig(
+     bs = bs[1:1000],
+     group_column = 'tissue',
+     comparison_groups = c('case' = 'liver', 'control' = 'blood'),
+     disp_groups = c('case' = TRUE, 'control' = TRUE),
+     local_window_size = 200,
+     t_approx = TRUE,
+     n_cores = 10))
   user  system elapsed 
  5.906   7.090   3.149 

10K CpGs

> system.time(diff_methylsig(
+     bs = bs[1:10000],
+     group_column = 'tissue',
+     comparison_groups = c('case' = 'liver', 'control' = 'blood'),
+     disp_groups = c('case' = TRUE, 'control' = TRUE),
+     local_window_size = 0,
+     t_approx = TRUE,
+     n_cores = 10))
   user  system elapsed 
  9.884   7.263   3.681 
> 
> system.time(diff_methylsig(
+     bs = bs[1:10000],
+     group_column = 'tissue',
+     comparison_groups = c('case' = 'liver', 'control' = 'blood'),
+     disp_groups = c('case' = TRUE, 'control' = TRUE),
+     local_window_size = 200,
+     t_approx = TRUE,
+     n_cores = 10))
   user  system elapsed 
 61.449  13.705  10.876 

100K CpGs

> system.time(diff_methylsig(
+     bs = bs[1:100000],
+     group_column = 'tissue',
+     comparison_groups = c('case' = 'liver', 'control' = 'blood'),
+     disp_groups = c('case' = TRUE, 'control' = TRUE),
+     local_window_size = 0,
+     t_approx = TRUE,
+     n_cores = 10))
   user  system elapsed
119.481  18.577  16.162
>
> system.time(diff_methylsig(
+     bs = bs[1:100000],
+     group_column = 'tissue',
+     comparison_groups = c('case' = 'liver', 'control' = 'blood'),
+     disp_groups = c('case' = TRUE, 'control' = TRUE),
+     local_window_size = 200,
+     t_approx = TRUE,
+     n_cores = 10))
   user  system elapsed 
816.593  62.267  92.880 

6M CpGs

> system.time(diff_methylsig(
+     bs = bs[1:6000000],
+     group_column = 'tissue',
+     comparison_groups = c('case' = 'liver', 'control' = 'blood'),
+     disp_groups = c('case' = TRUE, 'control' = TRUE),
+     local_window_size = 0,
+     t_approx = TRUE,
+     n_cores = 10))
    user   system  elapsed 
7017.760  374.039  863.185 

Notice that the increase is non-linear for the same analysis, modulo using local information. Given that 6M CpGs without local information takes 863.185s / 60s = ~14.38m, I'm sure the local-information test will run for hours. (I'm still waiting on mine, but figured I should respond before it's done.)

Since I'm seeing a "reasonable" run time without local information for data about your size, do you mind trying it with local_window_size = 0 and letting me know if you get about the same?

It might be worthwhile to investigate how many CpGs fall within 200bp of another one in your data. I know it's ERRBS, so it'll likely be higher than WGBS, but maybe it's less than you think and so ditching the local information is fine. Also, if the coverage tends to be quite high perhaps that's another reason to skip local information.
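One way to check this is a quick overlap count. A sketch, assuming `bs` is your filtered BSseq object (GenomicRanges is already loaded as a bsseq dependency):

```r
# Sketch: what fraction of loci have at least one other locus within 200 bp?
# Assumes `bs` is the filtered BSseq object from above.
library(GenomicRanges)

gr = granges(bs)
hits = findOverlaps(gr, gr, maxgap = 200)
# Drop self-matches so a locus doesn't count as its own neighbor
hits = hits[queryHits(hits) != subjectHits(hits)]

length(unique(queryHits(hits))) / length(gr)
```

If that fraction is small, most loci have no local information to borrow anyway, and dropping the local window costs little.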

One last thought. I'm noticing, when looking at memory usage, that each child process uses 10-15GB of RAM, and spread over 10 child processes and the parent process, that's 110-165GB total. This is when I only have the bsseq object loaded and nothing else. If you have other large things loaded in your R environment, you may have even more memory usage. You helpfully mentioned that your server has 192GB RAM, and I wonder if perhaps you're running out of memory? Either by this job alone, or with other people running things on your server simultaneously. To remedy this, you'll likely have to use fewer cores.
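For example, the same call as yours above, only with fewer worker processes (`n_cores = 4` here is an illustrative value, not a recommendation):

```r
# Same test, but with fewer child processes to cap peak memory usage.
diff_gr = diff_methylsig(
    bs = bs,
    group_column = 'Condition',
    comparison_groups = comp,
    disp_groups = c('case' = TRUE, 'control' = TRUE),
    local_window_size = 200,
    t_approx = TRUE,
    n_cores = 4)
```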

Thanks, Raymond

srialle commented 4 years ago

Hi Raymond,

Thank you so much for your reply and the tests you made. The test I launched did finally finish: it took almost 8 hours. I just launched the same test without local information (local_window_size = 0) and it was indeed much faster: only 8 minutes! It's also true that another program was running on the server yesterday, so maybe it was running out of memory.

As you suggest, I will look at the distribution of the CpGs to see how many of them fall within 200bp of one another. I was already using local information with previous versions of methylSig, and I remember that it didn't take this long. Is there a change in the algorithm that explains this change of behaviour?

Thanks again,
Stephanie

rcavalcante commented 4 years ago

Hi Stephanie,

It looks like mine finished overnight with total runtime of 10 hours. Yikes.

> system.time(diff_methylsig(
+     bs = bs[1:6000000],
+     group_column = 'tissue',
+     comparison_groups = c('case' = 'liver', 'control' = 'blood'),
+     disp_groups = c('case' = TRUE, 'control' = TRUE),
+     local_window_size = 200,
+     t_approx = TRUE,
+     n_cores = 10))
     user    system   elapsed
330826.53  34952.95  39078.78

Do you recall for which version(s) local information was faster? I'm wondering if it's before or after I transitioned the package to use Bioconductor classes like GenomicRanges (v0.5.0). With that change, the way we find local CpGs changed. I'll play around to see if I can reduce the runtime, but that might take a couple weeks with work responsibilities.

For now, we can leave the issue open and I'll update you when I have something to share.

Thanks, Raymond

srialle commented 4 years ago

Hi Raymond,

I have been using different versions of methylSig, among them 0.5.0 very recently. On the same dataset I get 5,021,511 methylation loci after methylSigReadData(), and it takes about 2 hours for methylSigCalc() to run. But at the end the resulting GRanges object only contains 2,390,674 loci, so maybe the filtering was done differently and only half of the loci were tested?

Thanks,

Stephanie

shawpa commented 4 years ago

I am also trying to update to the new version of methylSig and ran into the same "issue" as Stephanie above. I was using the old window size of 200, since that is what I have been doing consistently since the old version. I believe mine ran for at least 12 hours with n_cores = 4, and it still wasn't done when I checked it this morning. With the older version, I don't think it ever took more than 6 hours. I ran it with a window size of 0 and it finished in about 10 minutes. I was using methylSig 0.4.4 before trying the upgrade. If increasing the window size gives a better estimate of the true methylation value, I wouldn't want to lose that functionality.

I also liked that the old version had a status bar showing that the test was running. Now I just worry that it is frozen. Thanks for the new updates though, I really like the program.

shawpa commented 4 years ago

As an update to what I previously posted, I let the calculation proceed with the window size 200 option and it finally finished. I don't have an exact time, but I am assuming it was well over 16 hours. A warning popped up at the end, and I thought it might be helpful to share in case it is a bug. This warning didn't show up when I used the window size of 0.

Warning message:
In parallel::mclapply(seq_along(gr), function(locus_idx) { :
  scheduled cores 1, 2, 3, 4, 6, 7, 8 did not deliver results, all values of the jobs will be affected

rcavalcante commented 4 years ago

This response is for @srialle in response to:

On the same dataset I get 5,021,511 methylation loci after methylSigReadData(), and it takes about 2 hours for methylSigCalc() to run. But at the end the object only contains 2,390,674 loci in the resulting GRanges object, so maybe the filtering was done differently and only half of the loci was tested?

In previous versions, methylSigReadData() had min/max count parameters that marked loci as having 0 coverage if the coverage was < min or > max. Then, any sites having 0 coverage in all samples would be removed. Further, with methylSigCalc() a minimum number of samples per group had to have non-zero coverage. So that would explain why your starting number of loci isn't the same as your ending number of loci.

In the newer version of methylSig we also mark loci outside of coverage bounds as 0 with filter_loci_by_coverage() and then we remove loci prior to testing with filter_loci_by_group_coverage(). So it's the same functionality, but in explicit functions, rather than part of the reading and testing functions.
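In code, the new flow looks roughly like this (the count thresholds and per-group sample minimums below are illustrative placeholders, and the group labels should match the values in your group column):

```r
# Mark loci with coverage below min_count or above max_count as 0 coverage
bs = filter_loci_by_coverage(bs, min_count = 5, max_count = 500)

# Then remove loci lacking the required number of covered samples per group
bs = filter_loci_by_group_coverage(
    bs = bs,
    group_column = 'Condition',
    min_samples_per_group = c('case' = 2, 'control' = 2))
```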

I'm not sure what else to say about your comment because I don't have enough information. If you shared your code using the older and newer versions of methylSig on the same data, along with the resulting number of loci at each step, we might be able to say something. As it is, I'm not sure how to respond to your observation.

rcavalcante commented 4 years ago

This response is for @shawpa:

Warning message:
In parallel::mclapply(seq_along(gr), function(locus_idx) { :
  scheduled cores 1, 2, 3, 4, 6, 7, 8 did not deliver results, all values of the jobs will be affected

Unfortunately, it's not really possible to tell what happened from this message.

I also liked in the old version that there was a status bar showing you that the test was running. Now i just worry that it is frozen. Thanks for the new updates though, I really like the program.

I agree, this is a nice feature, but I'm not sure I'll be able to prioritize it if it takes too much time. If you're concerned that something might have "frozen", check on the process in Task Manager (Windows), Activity Monitor (Mac), or top (Linux).

rcavalcante commented 4 years ago

I just wanted to ping this thread and mention that I plan to work on this prior to the next Bioconductor release (October).