srialle opened this issue 4 years ago
Hi Stephanie,
Thanks for downloading the updated methylSig and giving it a spin. I tried methylSig on one of the larger datasets I have lying around (WGBS with 17M CpGs after filtering) and I downsampled it to 6M to be similar in size to yours.
I got similarly fast results with 10K and 100K downsampling. The 1M downsampling was also fast (it would have taken longer to make some coffee). It's looking like the 6M run finishes in a reasonable amount of time as well (<20 min). Note that I am not using local information.
Using local information does inherently take longer because of the operations needed to figure out which sites are nearby. I did some benchmarking using system.time():
> system.time(diff_methylsig(
+ bs = bs[1:1000],
+ group_column = 'tissue',
+ comparison_groups = c('case' = 'liver', 'control' = 'blood'),
+ disp_groups = c('case' = TRUE, 'control' = TRUE),
+ local_window_size = 0,
+ t_approx = TRUE,
+ n_cores = 10))
user system elapsed
1.012 4.329 2.498
>
> system.time(diff_methylsig(
+ bs = bs[1:1000],
+ group_column = 'tissue',
+ comparison_groups = c('case' = 'liver', 'control' = 'blood'),
+ disp_groups = c('case' = TRUE, 'control' = TRUE),
+ local_window_size = 200,
+ t_approx = TRUE,
+ n_cores = 10))
user system elapsed
5.906 7.090 3.149
> system.time(diff_methylsig(
+ bs = bs[1:10000],
+ group_column = 'tissue',
+ comparison_groups = c('case' = 'liver', 'control' = 'blood'),
+ disp_groups = c('case' = TRUE, 'control' = TRUE),
+ local_window_size = 0,
+ t_approx = TRUE,
+ n_cores = 10))
user system elapsed
9.884 7.263 3.681
>
> system.time(diff_methylsig(
+ bs = bs[1:10000],
+ group_column = 'tissue',
+ comparison_groups = c('case' = 'liver', 'control' = 'blood'),
+ disp_groups = c('case' = TRUE, 'control' = TRUE),
+ local_window_size = 200,
+ t_approx = TRUE,
+ n_cores = 10))
user system elapsed
61.449 13.705 10.876
> system.time(diff_methylsig(
+ bs = bs[1:100000],
+ group_column = 'tissue',
+ comparison_groups = c('case' = 'liver', 'control' = 'blood'),
+ disp_groups = c('case' = TRUE, 'control' = TRUE),
+ local_window_size = 0,
+ t_approx = TRUE,
+ n_cores = 10))
user system elapsed
119.481 18.577 16.162
>
> system.time(diff_methylsig(
+ bs = bs[1:100000],
+ group_column = 'tissue',
+ comparison_groups = c('case' = 'liver', 'control' = 'blood'),
+ disp_groups = c('case' = TRUE, 'control' = TRUE),
+ local_window_size = 200,
+ t_approx = TRUE,
+ n_cores = 10))
user system elapsed
816.593 62.267 92.880
> system.time(diff_methylsig(
+ bs = bs[1:6000000],
+ group_column = 'tissue',
+ comparison_groups = c('case' = 'liver', 'control' = 'blood'),
+ disp_groups = c('case' = TRUE, 'control' = TRUE),
+ local_window_size = 0,
+ t_approx = TRUE,
+ n_cores = 10))
user system elapsed
7017.760 374.039 863.185
Notice that, for the same analysis, the runtime with local information grows non-linearly relative to the runtime without it. Given that 6M CpGs without local information take 863.185 s / 60 ≈ 14.4 min, I'm sure the local information test will be in the hours. (I'm still waiting on mine, but figured I should respond before it's done.)
Since I'm seeing a "reasonable" run time without local information for data about your size, do you mind trying it with local_window_size = 0 and letting me know if you get about the same?
It might be worthwhile to investigate how many CpGs fall within 200bp of another one in your data. I know it's ERRBS, so it'll likely be higher than WGBS, but maybe it's less than you think and so ditching the local information is fine. Also, if the coverage tends to be quite high perhaps that's another reason to skip local information.
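For a quick check of that, something along these lines works in plain base R. This is just a toy sketch with made-up positions on a single chromosome; for real data you would split the positions by chromosome first:

```r
# Toy base R sketch (made-up positions, one chromosome): for each CpG,
# check whether the nearest neighbor is within 200 bp.
pos <- sort(c(100, 150, 500, 720, 5000, 5190))
gaps <- diff(pos)                                  # distance to the next CpG
# a CpG has a "local" neighbor if the gap to the previous OR next CpG is <= 200
has_neighbor <- c(gaps <= 200, FALSE) | c(FALSE, gaps <= 200)
sum(has_neighbor)    # -> 4 of the 6 toy CpGs
mean(has_neighbor)   # fraction with a neighbor within 200 bp
```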
One last thought. I'm noticing, when looking at memory usage, that each child process uses 10-15GB of RAM, and spread over 10 child processes and the parent process, that's 110-165GB total. This is when I only have the bsseq object loaded and nothing else. If you have other large things loaded in your R environment, you may have even more memory usage. You helpfully mentioned that your server has 192GB RAM, and I wonder if perhaps you're running out of memory? Either by this job alone, or with other people running things on your server simultaneously. To remedy this, you'll likely have to use fewer cores.
Thanks, Raymond
Hi Raymond,
Thank you so much for your reply and the tests you ran. The test I launched finally finished: it took almost 8 hours. I just launched the same test without local information (local_window_size = 0) and it was indeed much faster: only 8 minutes! And it's true that another program was also running on the server yesterday, so maybe it was running out of memory.
As you suggest, I will have a look at the distribution of the CpGs to see how many of them fall within 200 bp of one another.
I was already using local information with previous versions of methylSig, and I remember that it didn't take so long. Is there a change in the algorithm that explains this change in behavior?
Thanks again,
Stephanie
Hi Stephanie,
It looks like mine finished overnight with total runtime of 10 hours. Yikes.
> system.time(diff_methylsig(
+ bs = bs[1:6000000],
+ group_column = 'tissue',
+ comparison_groups = c('case' = 'liver', 'control' = 'blood'),
+ disp_groups = c('case' = TRUE, 'control' = TRUE),
+ local_window_size = 200,
+ t_approx = TRUE,
+ n_cores = 10))
user system elapsed
330826.53 34952.95 39078.78
Do you recall for which version(s) local information was faster? I'm wondering whether it was before or after I transitioned the package to use Bioconductor classes like GenomicRanges (v0.5.0). With that change, the way we find local CpGs changed. I'll play around to see if I can reduce the runtime, but that might take a couple of weeks given work responsibilities.
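For context on why the neighbor lookup can dominate the runtime: with sorted positions, window membership can be computed without comparing every pair of loci. This is just a base R illustration with toy positions, not the package's actual code (which works on GenomicRanges objects):

```r
# Base R sketch: number of loci within +/- 100 bp of each locus, assuming
# sorted positions on one chromosome (boundary handling at exactly 100 bp
# is approximate in this sketch).
pos <- c(100, 150, 500, 720, 5000, 5190)
lo <- findInterval(pos - 100, pos) + 1   # first index with position > pos - 100
hi <- findInterval(pos + 100, pos)       # last index with position <= pos + 100
n_local <- hi - lo + 1                   # loci per window, including the locus itself
n_local                                  # -> 2 2 1 1 1 1
```

In GenomicRanges terms, the analogous operation would be a findOverlaps() between the loci and the loci resized to the window width.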
For now, we can leave the issue open and I'll update you when I have something to share.
Thanks, Raymond
Hi Raymond,
I have been using different versions of methylSig, among them 0.5.0 very recently. On the same dataset I get 5,021,511 methylation loci after methylSigReadData(), and it takes about 2 hours for methylSigCalc() to run. But at the end the resulting GRanges object only contains 2,390,674 loci, so maybe the filtering was done differently and only half of the loci were tested?
Thanks,
Stephanie
I am also trying to update to the new version of methylSig and ran into the same "issue" as Stephanie above. I was using the old window size of 200, since that is what I have been doing consistently since the old version. I believe mine ran for at least 12 hours with n_cores = 4, and it still wasn't done when I checked this morning. With the older version, I don't think it ever took more than 6 hours. I ran it with a window size of 0 and it finished in about 10 minutes. I was using methylSig 0.4.4 before trying the upgrade. If increasing the window size gives a better estimate of the true methylation value, then I wouldn't want to lose that functionality.
I also liked in the old version that there was a status bar showing you that the test was running. Now I just worry that it is frozen. Thanks for the new updates though, I really like the program.
As an update to my previous post: I let the calculation proceed with the window size of 200 and it finally finished. I don't have an exact finishing time, but I am assuming it was well over 16 hours. A warning popped up at the end, and I thought it might be helpful to share in case it is a bug. This warning didn't show up when I used a window size of 0.
Warning message: In parallel::mclapply(seq_along(gr), function(locus_idx) { : scheduled cores 1, 2, 3, 4, 6, 7, 8 did not deliver results, all values of the jobs will be affected
This response is for @srialle, regarding:
On the same dataset I get 5,021,511 methylation loci after methylSigReadData(), and it takes about 2 hours for methylSigCalc() to run. But at the end the object only contains 2,390,674 loci in the resulting GRanges object, so maybe the filtering was done differently and only half of the loci was tested?
In previous versions, methylSigReadData() had min/max count parameters that marked loci as having 0 coverage if the coverage was < min or > max. Then, any sites having 0 coverage in all samples would be removed. Further, with methylSigCalc() a minimum number of samples per group had to have non-zero coverage. So that would explain why your starting number of loci isn't the same as your ending number of loci.
In the newer version of methylSig we also mark loci outside of coverage bounds as 0 with filter_loci_by_coverage() and then we remove loci prior to testing with filter_loci_by_group_coverage(). So it's the same functionality, but in explicit functions, rather than part of the reading and testing functions.
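As a toy base R illustration of that two-step logic (made-up coverage matrix and thresholds, not methylSig's internals):

```r
# loci x samples coverage; two 'case' then two 'control' samples
cov <- matrix(c(12,  8, 0, 600,
                 3, 15, 9,  20,
                 0,  0, 7,  11), nrow = 3, byrow = TRUE)
group <- c('case', 'case', 'control', 'control')
min_cov <- 5; max_cov <- 500; min_samples_per_group <- 1

# step 1: mark out-of-bounds coverage as 0 (cf. filter_loci_by_coverage())
cov[cov < min_cov | cov > max_cov] <- 0
# step 2: keep loci with enough covered samples in every group
# (cf. filter_loci_by_group_coverage())
keep <- apply(cov, 1, function(x)
  all(tapply(x > 0, group, sum) >= min_samples_per_group))
sum(keep)    # -> 1: only the second locus survives
```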
I don't have enough information to say much about your observation. If you showed your code using the older and newer versions of methylSig on the same data, along with the resulting number of loci at each step, we might be able to say something more definite.
This response is for @shawpa:
Warning message: In parallel::mclapply(seq_along(gr), function(locus_idx) { : scheduled cores 1, 2, 3, 4, 6, 7, 8 did not deliver results, all values of the jobs will be affected
Unfortunately, it's not really possible to tell what happened from this message.
I also liked in the old version that there was a status bar showing you that the test was running. Now I just worry that it is frozen. Thanks for the new updates though, I really like the program.
I agree, this is a nice feature, but I'm not sure I'll be able to prioritize if it takes too much time. If you're concerned that something might have "frozen", check on the process in Task Manager (Windows), Activity Monitor (Mac), or top (Linux).
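For example, on Linux or macOS you can do this from a second R session; here this session's own PID (via Sys.getpid()) stands in for your worker's PID. A frozen process will sit near 0% CPU, while a working one keeps accumulating:

```r
# print PID, %CPU, %memory, and elapsed time for a process (here, this R session)
cat(system(paste("ps -o pid,pcpu,pmem,etime -p", Sys.getpid()),
           intern = TRUE), sep = "\n")
```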
I just wanted to ping this thread and mention that I plan to work on this prior to the next Bioconductor release (October).
Dear Raymond,
I am trying to run methylSig on RRBS data and I have a problem with the diff_methylsig() function. I have a BSobj object with 5,791,812 cytosines, which I give as input to diff_methylsig(). When I extract the first 10,000 positions the function runs well (in 8 seconds), but when I try to run it on the whole object, it runs for hours (more than 6 hours now) and seems to never end. I am working on a Linux machine with 192 GB RAM and 48 available threads.
Here is the code I launched:
bs = filter_loci_by_group_coverage(
  bs = bs,
  group_column = 'Condition',
  min_samples_per_group)
bs
pData(bs)
diff_gr = diff_methylsig(
  bs = bs,
  group_column = 'Condition',
  comparison_groups = comp,
  disp_groups = c('case' = TRUE, 'control' = TRUE),
  local_window_size = 200,
  t_approx = TRUE,
  n_cores = 10)
Do you have an idea why the code doesn't seem to finish?
Thank you,
Regards,
Stephanie
sessionInfo()