smithlabcode / methpipe

A pipeline for analyzing DNA methylation data from bisulfite sequencing.
http://smithlabresearch.org/methpipe

Memory Error when calling pmd #179

Closed shashwatsahay closed 1 year ago

shashwatsahay commented 3 years ago

Hi again, I am currently running the pmd submodule on the output of symmetric-cpg, one chromosome at a time. It runs well for every chromosome except chromosome 22, where it fails for all of my samples. As far as I understand, the program is running out of memory (372 GB of it).

I am attaching two files from different samples containing the truncated output (first and last 10000 CpGs) of symmetric-cpg for you to have a look.

testsample1_chr22_CpG.txt testsample2_chr22_CpG.txt
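
For anyone who wants to build a similarly truncated test file, here is a minimal sketch that keeps the first and last N lines of a per-chromosome file. It assumes one CpG site per line; the function name `head_and_tail` and the file name in the usage note are hypothetical and not part of methpipe.

```python
from collections import deque
from itertools import islice


def head_and_tail(lines, n):
    """Return the first n and last n items of an iterable, without duplicates.

    Works in a single pass, so a large file never has to be held in memory.
    """
    it = iter(lines)
    head = list(islice(it, n))   # consume the first n items
    tail = deque(it, maxlen=n)   # keep only the last n of whatever remains
    return head + list(tail)
```

Usage would look something like `head_and_tail(open("sample_chr22_CpG.txt"), 10000)`, with the result written back out using `writelines`. If the input has fewer than 2N lines, every line is returned exactly once.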

andrewdavidsmith commented 3 years ago

@shashwatsahay thanks for this issue. I think I understand the problem, and will hopefully be able to make a fix soon. On my end I'm able to see pmd working for many data sets, but I see the problem with the test data (thanks for that!). I suspect we don't want pmd to actually identify any PMDs in data exactly as you have provided, but at the same time, it definitely shouldn't behave as it currently does.

shashwatsahay commented 3 years ago

Hey @andrewdavidsmith

Do you have any update on whether this was fixed? Also, could you let us know what exactly the problem is?

andrewdavidsmith commented 3 years ago

@shashwatsahay I can't give you a time for a fix, unfortunately, but thanks for the reminder. I can tell you that the problem comes from an inability of the program to find a "good" bin size, which only happens for data sets with high variance in coverage, and generally low coverage. I was able to run the program on several "typical" data sets in the public domain, for mammalian methylomes with PMDs (I realize now I didn't test it recently on data known to not have PMDs). I definitely see the problem, and it amounts to the program continually building data structures as it attempts to find a bin size that "works" but when it cannot, it keeps trying and ends up using that amount of memory you observed. The fix will require me (or someone) to rewrite the portion of the code that searches for a good bin size so that memory is deallocated properly. Doing this might take several hours, and in the next couple weeks (at least) I don't have that time. Please feel free to send me a direct email if you don't hear anything in 3 weeks. Sorry this is taking longer than it should. Even if I can't fix it in the repo by then, I might be able to get you a patch that will circumvent the problem in your specific case.
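
To illustrate the failure mode described above, here is a rough sketch of a bin-size search that is bounded and gives up cleanly. This is not the pmd code: the function names, the doubling schedule, the size cap, and the acceptance test (a minimum fraction of non-empty bins reaching a minimum total coverage) are all assumptions made for the example. The point is only that each candidate's data structure is discarded before the next one is built, and that the search returns `None` when no bin size works instead of trying forever.

```python
def bin_coverage(positions, coverage, bin_size):
    """Sum coverage per fixed-width bin; returns a dict keyed by bin index.

    Only bins that contain at least one site appear in the result.
    """
    totals = {}
    for pos, cov in zip(positions, coverage):
        idx = pos // bin_size
        totals[idx] = totals.get(idx, 0) + cov
    return totals


def find_bin_size(positions, coverage, start=1000, max_size=500_000,
                  min_total=50, min_fraction=0.8):
    """Return the first tried bin size passing the placeholder test, else None.

    The acceptance test is hypothetical: at least `min_fraction` of the
    non-empty bins must have total coverage >= `min_total`.
    """
    bin_size = start
    while bin_size <= max_size:  # hard upper bound, so the search terminates
        # `totals` is rebound on every pass, so the previous candidate's
        # bins become garbage and memory does not accumulate.
        totals = bin_coverage(positions, coverage, bin_size)
        n_good = sum(1 for t in totals.values() if t >= min_total)
        if totals and n_good / len(totals) >= min_fraction:
            return bin_size
        bin_size *= 2  # assumed schedule: double the bin width each time
    return None  # no acceptable bin size; caller should report "no PMDs"
```

With sparse, low-coverage input (for example, two sites ten megabases apart), this sketch returns `None` after a handful of iterations rather than growing without limit.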

andrewdavidsmith commented 2 years ago

@shashwatsahay Would it be possible for you to check whether a test sample works for another chromosome? For example, do the first and last 10000 CpGs on chr21 lead to good results? Using the test data you provided, and hacking the code a bit so it doesn't crash, I don't get output because there simply shouldn't be any. So it's a bit tough for me to know if I'm fixing the problem.

abcoxyzide commented 2 years ago

Hi, I just want to chip in.

I had the same problem, specifically with chr22. I was using version 4.1.1.

It was solved with version 5.0.1, though I have to say compiling it was not easy!