Closed jsh58 closed 7 years ago
Sergei (@svrever),
I have now revised extract_CpG_data.py
with a memory-saving option -b
. It is available here. It requires that the SAM be coordinate-sorted in advance.
Please let me know if you encounter any difficulties with it.
John Gaspar
Hi John, thanks for the updated version of the script. I started it on the 35G .bam file with the -b option. It is running for over a day now and so far is at ~7%, and the memory usage is really low. I will let you know when it finishes.
Sergei
Sergei,
OK, glad to hear the memory usage is low, but it seems to be running very slowly. It should be producing output as it runs, can you check that and see if it makes sense?
John Gaspar
The output looks fine, although occasionally you can find what looks like an unmerged CpG: . . . . . Scaffold69178 70 71 100.000000 1 0 Scaffold69178 92 93 100.000000 1 0 Scaffold69178 166 167 5.555556 1 17 Scaffold69178 186 187 5.555556 2 34 Scaffold69178 197 198 0.000000 0 41 Scaffold69178 203 204 9.302326 4 39 Scaffold69178 205 206 0.000000 0 1 Scaffold69178 229 230 0.000000 0 2 Scaffold69178 283 284 15.789474 6 32 Scaffold69178 285 286 21.621622 8 29 Scaffold69178 298 299 9.375000 3 29 Scaffold69178 303 304 7.407407 2 25 Scaffold69178 307 308 9.090909 2 20 . . . . . but the occurrence is fairly low. Multi-threading option would help to speed things up?
Sergei
Sergei,
I do not think those are unmerged CpGs. If you check the genome sequence, it will probably be 'CGCG' at those positions.
I tested the script (with -b
) on a 32Gb BAM file and it finished in about 12 hours. This was using samtools 1.5 and python 2.7.12, maybe your versions are different.
John Gaspar
You are right, there are GCGC's. And slower speed is maybe due to the multiple scaffolds, or the hardware? I am using the same python version as you and samtools 1.3.1
Sergei
Sergei,
The multiple scaffolds will cause a little slow down, since each read's reference sequence is verified by the script. But that shouldn't take much time, even over millions of reads. It is probably a hardware issue. Hopefully it will finish in a day or two.
John Gaspar
Hi John,
the script is running for a week now, and it is about 30% through. Probably, another 2-3 weeks and it will be done.
:-)
Sergei
Sergei,
OK, that is rather absurdly slow. I cannot even guess what may be going on with your datasets and your computer. Even if it were running entirely in swap memory, it shouldn't be that slow.
John Gaspar
Computer is dell precision 7910, 8core 2.4Ghz, Oracle VM - it usually runs pretty well almost everything, some scripts do crash it though. 4-5G of RAM and 100% of one CPU core is used, and swap memory is not used at all. Well, I guess if it is running on your setup, something should be wrong with mine.
Sergei
Sergei,
Well, it could be the VM, but I'm not in a position to judge that. Regardless, I have released a new version of DMRfinder (0.3) here on GitHub, with the memory-saving options. I do not think there is more I can do here, so I'll close this issue.
Good luck,
John Gaspar
memory-saving version of the script, when analyzing a sorted SAM