combine_CpG_sites.py issue

svrever commented 7 years ago

Hi, I am trying to analyse Xenopus methylome with the DMRfinder package. Everything worked fine with the .cov files generated (Bismark) for the subset of genome (just one chromosome). When I switched to the whole genome, python script combine_CpG_sites.py fails after about 25-30 hours. It appears it is being terminated by the kernel OOM killer. Memory usage is very high and reaches 97% in about 20 min after script starts (computer has 48G RAM). Interestingly the top process monitor shows the python process to exist in a D state (uninterruptable sleep) all the way till it is killed. Can you please help?

Best, Sergei

jsh58 commented 7 years ago

Sergei,

Thanks for your message and for trying DMRfinder.

I have never experienced excessive times or memory usage with combine_CpG_sites.py, so it is difficult for me to assess the problem without access to the files. Can you send the first few lines of your .cov file(s), as well as the exact command you used to run combine_CpG_sites.py?

Thanks,

John Gaspar

svrever commented 7 years ago

Hi John, thank you for your answer. I have six .cov files each 2.3-2.4G. Here is first 20 lines of the first file:

Scaffold19 423 423 100 1 0 Scaffold19 424 424 75 3 1 Scaffold19 448 448 100 1 0 Scaffold19 449 449 100 4 0 Scaffold19 495 495 100 3 0 Scaffold19 496 496 100 4 0 Scaffold19 501 501 100 3 0 Scaffold19 502 502 100 4 0 Scaffold19 635 635 100 1 0 Scaffold19 636 636 100 2 0 Scaffold19 647 647 100 1 0 Scaffold19 648 648 100 2 0 Scaffold19 661 661 100 1 0 Scaffold19 958 958 100 1 0 Scaffold19 959 959 100 2 0 Scaffold19 968 968 100 1 0 Scaffold19 969 969 100 2 0 Scaffold19 979 979 100 1 0 Scaffold19 980 980 100 2 0 Scaffold19 1018 1018 100 1 0

Xenopus genome has many tens of thousands of unassigned scaffolds (besides chromosomes), and that might substantially complicate the issue. The command I used is

$ python combine_CpG_sites.py -d 150 -x 1000 -o SCtd_XL1-6_meth-reg.csv XL4.cov XL5.cov XL6.cov XL1.cov XL2.cov XL3.cov

Thanks for looking into this, Sergei

svrever commented 7 years ago

Is there a way maybe to limit the memory usage for a python script, especially the swap memory?

Sergei

jsh58 commented 7 years ago

Sergei,

You can limit memory by using e.g. ulimit, but that doesn't solve the problem if combine_CpG_sites.py hasn't finished.

Do you know how many lines your .cov files are right now? I have clustered six samples with ~25 million records each, and it took less than 10 minutes and memory wasn't an issue.

Here are a few suggestions:

Add '-v' to your combine_CpG_sites.py command so it will run in verbose mode, and you can see how far it gets before slowing down.
Merge the counts for the 'C' and the 'G' of each CpG. This can be done by coverage2cytosine in Bismark, or by extract_CpG_data.py in DMRfinder. And it just makes sense to combine the counts anyway -- why have separate counts for one CpG site? It should convert this: Scaffold19 423 423 100 1 0 Scaffold19 424 424 75 3 1 into this: Scaffold19 423 424 80 4 1 This will reduce the size of your .cov files by about 50%, and will help when combine_CpG_sites.py loads them into memory.
Split your .cov files by chromosome/scaffold, cluster them separately, and then combine the results. This will certainly be successful, but it would take a bit of work, so I wouldn't recommend it at this point.

Let me know how it goes.

John Gaspar

svrever commented 7 years ago

Hi John, I tried using different options of ulimit, but combine_CpG_sites.py crushes reaching the set memory limit, so it is of no use. I ran it again on the subset of genome (chr7S) which is 6X ~1.26 million lines. It finished in one minute and the RAM use went up to 2.6G. Whole genome .cov files have ~70 million lines, so by interpolation that would require 140-150G of RAM. I have only 48G on my system, so when I start the script on whole genome memory consumption hits the ceiling of 97% in about 15 minutes, then it enters the uninterruptible sleep state "D". At this stage process starts consuming the swap memory (I have 16G) and intense IO writing starts. After all swap is eaten up (takes about a day) process is shot by the OOM killer.
How many memory did it use on your system to process 25 million records? Is there something wrong with mine?

Sergei

svrever commented 7 years ago

As for the merging C and G counts, I thought Bismark's "comprehensive" mode should have taken care of that. I did not pay attention, Scaffold 19 definitely looks weird. Usually it is more like this:

chr7S 71928589 71928590 25.000000 2 6 chr7S 71928602 71928603 44.444444 4 5 chr7S 71928608 71928609 0.000000 0 8 chr7S 71928635 71928636 50.000000 2 2 chr7S 71928653 71928654 66.666667 4 2 chr7S 71928688 71928689 80.000000 8 2 chr7S 71928758 71928759 70.000000 7 3 chr7S 71928858 71928859 75.000000 3 1 chr7S 71928882 71928883 75.000000 3 1 chr7S 71928901 71928902 100.000000 3 0 chr7S 71928937 71928938 33.333333 1 2 chr7S 71928946 71928947 75.000000 3 1 chr7S 71928951 71928952 60.000000 3 2 chr7S 71928982 71928983 80.000000 4 1 chr7S 71929013 71929014 100.000000 5 0 chr7S 71929088 71929089 80.000000 4 1 chr7S 71929133 71929134 80.000000 4 1 chr7S 71929150 71929151 33.333333 2 4 chr7S 71929195 71929196 100.000000 6 0 chr7S 71929295 71929296 87.500000 7 1 chr7S 71929312 71929313 100.000000 9 0 chr7S 71929321 71929322 100.000000 10 0 chr7S 71929326 71929327 100.000000 10 0 chr7S 71929343 71929344 100.000000 7 0 chr7S 71929405 71929406 100.000000 9 0 chr7S 71929444 71929445 100.000000 6 0 chr7S 71929461 71929462 75.000000 3 1 chr7S 71929475 71929476 100.000000 4 0 chr7S 71929505 71929506 66.666667 2 1

may be there is something in how Bismark processes scaffolds? Anyway, I do not completely understand your recommendation of using "coverage2cytosine" script. According to the manual, "The output considers all Cs on both forward and reverse strands and reports their position, strand, trinucleotide content and methylation state. Every cytosine on both the top and bottom strands will be considered irrespective of whether they were actually covered by any reads in the experiment or not". Theoretically that should add some lines with C's not covered by any reads and still report CpG's from both strands. And the output from "coverage2cytosine" script - - is not compatible as input for combine_CpG_sites.py

Sergei

jsh58 commented 7 years ago

Sergei,

Re: merged counts, I cannot speak to what may have changed with Bismark lately, but in the version I used (0.15.0), coverage2cytosine with --merge_CpG was necessary to produce merged counts (and it did produce an output .cov file that was compatible with combine_CpG_sites.py -- the primary output file was not, as you indicated). I do not think bismark_methylation_extractor treat a "scaffold" differently from a "chr." Again, extract_CpG_data.py in DMRfinder produces merged counts directly, no nonsense.

Re: memory, I do not recall how much memory the process required, but my system was similar to yours (64Gb) and I would have noticed if a ton was being used. What version of python are you using?

I have re-examined my code and see a couple spots where memory could be improved, but probably not enough to solve your problem (if your extrapolation is correct, 150Gb, maybe reduced to 75Gb with merged files, and my edits might save 20% -> 60Gb is still a lot). If you are willing to try it, I'll see if I can write it up in a day or two.

John Gaspar

svrever commented 7 years ago

Thanks for looking into it John! If you can modify the scrit, I would gladly test it. Meanwhile I will see what I can do to merge CpG's if any, and reduce the file sizes. Also I wanted to mention - when running script with the -v option it seems to be stuck on "loading XL4 into the memory" all the way while the RAM fills up. Sergei

On Oct 13, 2017 2:28 PM, "John M. Gaspar" notifications@github.com wrote:

Sergei,

Re: merged counts, I cannot speak to what may have changed with Bismark lately, but in the version I used (0.15.0), coverage2cytosine with --merge_CpG was necessary to produce merged counts (and it did produce an output .cov file that was compatible with combine_CpG_sites.py -- the primary output file was not, as you indicated). I do not think bismark_methylation_extractor treat a "scaffold" differently from a "chr." Again, extract_CpG_data.py in DMRfinder produces merged counts directly, no nonsense.

Re: memory, I do not recall how much memory the process required, but my system was similar to yours (64Gb) and I would have noticed if a ton was being used. What version of python are you using?

I have re-examined my code and see a couple spots where memory could be improved, but probably not enough to solve your problem (if your extrapolation is correct, 150Gb, maybe reduced to 75Gb with merged files, and my edits might save 20% -> 60Gb is still a lot). If you are willing to try it, I'll see if I can write it up in a day or two.

John Gaspar

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/jsh58/DMRfinder/issues/1#issuecomment-336532237, or mute the thread https://github.com/notifications/unsubscribe-auth/AeTbHwJGg4SeHkA-mw65GPphLpQWLC6rks5sr6u-gaJpZM4P3l2z .

jsh58 commented 7 years ago

Sergei,

I have restructured the script. Now, if you specify -b, the program will analyze one chromosome at a time. This uses a lot less memory than loading each cov file. The new version can be downloaded here.

Note that the input files must be in the same sort order, otherwise the script will complain. This may be an issue with Bismark's output, but of course is not an issue with DMRfinder's extract_CpG_data.py.

Please let me know if you find any other issues.

John Gaspar

svrever commented 7 years ago

Hi John, thank you for the new script. I ran it with the -b option on my .cov files and got an "Error! Input files not in same order" message. I resorted all files with _LCCOLLATE=C sort -k1,1 -k2,2n and ran it again, got the same error. Is there any specific sorting order the script is looking for? I would like to use option -e, but I am not sure how I can list there all 402483 scaffolds. I also tried to the use script _extract_CpGdata.py on the Bismark-produced .bam alignment file (48G) to produce merged CpG counts. Script behaviour is similar to _combine_CpGsites.py - saturates RAM, slowly eats up swap and getts into D state. It did not finish yet, will know tomorrow. Is there the -b option for this script too?

Sergei

jsh58 commented 7 years ago

Sergei,

The script does not require a specific order, only that the files are in the same order. If there are multiple samples that are missing counts for multiple chromosomes/scaffolds, it might not be able to discern the correct order. This may be your issue. But if there is at least one sample that is not missing any chrom/scaffolds, or if the SAM/BAM file lists all the chrom/scaffolds in the same order as the cov files, it would be easy to construct a list to give to -e using awk.

There is no memory-saving option with extract_CpG_data.py, since it was designed to be able to analyze an unsorted SAM and produce sorted results.

John Gaspar

svrever commented 7 years ago

Can I feed a list for an -e option as a file? How would the command line look then?

jsh58 commented 7 years ago

Sergei,

If you had a file listing the ordered chromosome/scaffold names, comma-separated (on one line, no spaces), you could probably use command substitution: -e $(< file) or -e $(cat file). Hopefully your shell won't choke on the length of the command...

John Gaspar

svrever commented 7 years ago

Thanks, will try that. You think the list of all scaffolds/chromosomes (from genome .fai file) will work as well?

svrever commented 7 years ago

It was worth a try...

bash: /usr/bin/python: Argument list too long

jsh58 commented 7 years ago

Sergei,

I modified the script so now the -e argument takes a file listing ordered chromosome names. Please let me know if it still doesn't work.

John Gaspar

svrever commented 7 years ago

Thanks John, the new version of the script worked beautifully. In about an hour it generated the regions file of about 3.5 million lines, and memory was not an issue. Now I am trying to get merged .cov files, and unfortunately the script _extract_CpGdata.py is also being shot by OOM kernel killer. It is probably an issue of many scaffolds. Is there a way to modify it in a similar way? Thanks for all of your hard work! :-)

Sergei

jsh58 commented 7 years ago

Sergei,

Great, glad to hear it worked! I will keep the -e option as a file.

The only way I can see to make extract_CpG_data.py more efficient would require that the BAM file be sorted in advance. Sorting a huge BAM file like yours (by samtools sort) may already be an issue on your computer.

I'll try to take a look at my code, but it will be a few days (at a minimum). In the meantime, you may want to take another look at coverage2cytosine with --merge_CpG.

John Gaspar

svrever commented 7 years ago

Thanks, I will try that.

Sergei

jsh58 commented 7 years ago

Sergei,

OK, I will close this issue and open a new one for extract_CpG_data.py.

John Gaspar

jsh58 / DMRfinder

combine_CpG_sites.py issue #1